Ann Oper Res (2009) 168: 225–245 DOI 10.1007/s10479-008-0368-4

Chameleon based on clustering feature tree and its application in customer segmentation

Jinfeng Li · Kanliang Wang · Lida Xu

Published online: 12 June 2008 © Springer Science+Business Media, LLC 2008

Abstract  Clustering analysis plays an important role in the field of data mining. Hierarchical clustering has become one of the most widely used clustering techniques, but for most hierarchical algorithms high execution efficiency and high clustering accuracy cannot be achieved at the same time. After analyzing the advantages and disadvantages of existing hierarchical algorithms, this paper puts forward a two-stage clustering algorithm, named Chameleon Based on Clustering Feature Tree (CBCFT), which hybridizes the Clustering Feature Tree of BIRCH with CHAMELEON. By analyzing the time complexity of CBCFT, the paper argues that its running time increases linearly with the number of data points. Experiments on a sample data set demonstrate that CBCFT is able to identify clusters with large variance in size and shape and is robust to outliers. Moreover, the result of CBCFT is very similar to that of CHAMELEON, while CBCFT overcomes the low execution efficiency of CHAMELEON. Although the execution time of CBCFT is longer than that of BIRCH, its clustering result is much more satisfactory than that of BIRCH. Finally, through a customer segmentation case at the HUBEI branch of the Chinese Petroleum Corporation, the paper demonstrates that the clustering result of the case is meaningful and useful.

Keywords  Data mining · Clustering · Hierarchical clustering · Customer segmentation

The research is partially supported by National Natural Science Foundation of China (grants #70372049 and #70121001).
J. Li · K. Wang (✉) · L. Xu, The School of Management, Xi'an Jiaotong University, Xi'an, 710049, China. e-mail: [email protected]
L. Xu, Department of Information Technology and Decision Science, Old Dominion University, Norfolk, VA 23529, USA
L. Xu, College of Economics and Management, Beijing Jiaotong University, Beijing 100044, China


1 Introduction

Data mining aims to extract useful knowledge from large databases. Clustering analysis, one of the most important data mining techniques, aims to discover a set of clustering rules and group the data into several clusters. Clustering analysis is currently attracting a great deal of attention from management researchers, especially in the field of customer segmentation (Hu and Sheu 2003; Chicco et al. 2006; Wang et al. 2004).

Clustering algorithms are mainly classified into five categories (Han and Kamber 2001; Chen et al. 1996): partitional clustering (Han and Kamber 2001; Ng and Han 1994), density-based clustering (Ester et al. 1996; Ankerst et al. 1999; Hinneburg and Keim 1998), grid-based clustering (Wang et al. 1997; Sheikholeslami et al. 1998), fuzzy clustering (Li et al. 2004) and hierarchical clustering (Zhang et al. 1996; Karypis et al. 1999; Jain et al. 1999; Guha et al. 1998, 1999). Most partitional algorithms, such as k-means and k-centroid (Chen et al. 1996), are vulnerable to outliers and not suitable for grouping data of non-convex shape (Halkidi et al. 2001). Moreover, these algorithms require users to specify the number of clusters beforehand, except for CLARANS (Ng and Han 1994); CLARANS in turn requires users to specify the number of neighbor points of each point and the lower bound on the number of points per cluster (Ng and Han 1994). PAM and CLARA (Han and Kamber 2001) rely on sample points of the data set; if these sample points are selected inappropriately, the accuracy of the clustering result decreases. Most density-based algorithms, such as DBSCAN (Ester et al. 1996), OPTICS (Ankerst et al. 1999), and DENCLUE (Hinneburg and Keim 1998), can recognize outliers and find clusters of arbitrary shape (Wan et al. 2005), but they require users to specify the neighborhood radius of each point, the number of nearest neighbor points, and density parameters, which makes them sensitive to this initial information. Low time complexity is the most significant advantage of grid-based algorithms, but their clustering results are not satisfactory (Wan et al. 2005; Halkidi et al. 2001). Different from conventional clustering methods, fuzzy clustering removes the hard boundary between clusters, so that a point does not fall entirely into a single cluster; in other words, each point is assigned to several clusters with different membership degrees.

All hierarchical clustering algorithms face a dilemma: high execution efficiency and high clustering accuracy cannot be achieved at the same time. Researchers have tried to improve clustering accuracy by adding variables to hierarchical clustering (Jung and Kim 2001; Qian et al. 2002) or to decrease time complexity by hybridizing hierarchical clustering with non-hierarchical clustering (Lin and Chen 2005). These improved algorithms do achieve better clustering accuracy with a satisfactory time complexity, but none of these studies applies the algorithm to a practical case, so they do not demonstrate that their algorithms can solve practical problems or that the clustering results are meaningful and useful.

In this paper, we propose a two-stage clustering algorithm named CHAMELEON Based on Clustering Feature Tree (CBCFT), which hybridizes the Clustering Feature Tree (CFT) with CHAMELEON.
In the first stage, CBCFT preprocesses the large data set with the CFT, grouping it into a large number of sub-clusters at high speed. In the second stage, the centroid of each sub-cluster is taken to represent that sub-cluster, and CHAMELEON is used to group these sub-clusters. CHAMELEON has been shown to be a highly effective hierarchical clustering algorithm: it can identify clusters of different shapes and sizes and is robust to outliers, but it suffers from low


running efficiency: the excellent effectiveness of CHAMELEON comes at the cost of running time. On the other hand, the CFT gives BIRCH a very high running speed, but BIRCH's clustering effectiveness is not satisfactory, since it is only suitable for finding clusters of spherical shape. Thus, by hybridizing the CFT with CHAMELEON, we show that CBCFT can usually find the right clusters at a high running speed. From the time complexity of CBCFT, we argue that its running time increases linearly with the volume of data.

In the rest of the paper, experiments on a sample data set demonstrate that CBCFT can find the right clusters and identify different shapes and sizes. Furthermore, CBCFT is not sensitive to outliers; it produces a better clustering result than BIRCH, and its result is very similar to that of CHAMELEON. By experimenting on various numbers of sample points, we also show that the execution time of CBCFT grows far more slowly than that of CHAMELEON as the number of points increases. In the last part of the paper, we give an application of CBCFT and show that it can solve a practical problem successfully; the clustering result of this application is more accurate than that of BIRCH, almost as accurate as that of CHAMELEON, and the execution efficiency of CBCFT is far higher than that of CHAMELEON.

The rest of the paper is organized as follows. In Sect. 2, we analyze several commonly used hierarchical clustering algorithms and summarize their advantages and disadvantages in running speed and clustering accuracy. In Sect. 3, we put forward the two-stage clustering algorithm, called Chameleon Based on Clustering Feature Tree (CBCFT), and analyze its computational efficiency. In Sect. 4, experiments on a sample data set compare the performance of the three algorithms. In Sect. 5, we apply CBCFT to the customer segmentation of the HUBEI branch of the Chinese Petroleum Corporation.

2 Related work on hierarchical clustering algorithms

Hierarchical clustering algorithms group data into several hierarchies and output a tree of clusters. Commonly used hierarchical clustering algorithms include Single-Link (Jain et al. 1999), Complete-Link (Jain et al. 1999), BIRCH (Zhang et al. 1996), CURE (Guha et al. 1998), ROCK (Guha et al. 1999), and CHAMELEON (Karypis et al. 1999).

Taking the distances between all pairs of points into account, Single-Link (Jain et al. 1999) merges the two clusters whose closest pair of points is nearest, using those two points as the representatives of the two clusters. Single-Link can find clusters of arbitrary shape and different size (Karypis et al. 1999). However, it is not suitable for grouping large volumes of data because of its high time complexity, O(n^2), where n is the number of points. Furthermore, this algorithm represents a cluster by a single point; if this point is an outlier, the clustering accuracy decreases. Complete-Link (Jain et al. 1999) is almost the same as Single-Link; the only difference is the way the distance between a pair of clusters is calculated. Complete-Link not only reduces the influence of outliers and noise on the clustering result but also produces a better result than Single-Link (Han and Kamber 2001). Unfortunately, Complete-Link also has a high time complexity, O(n^2 log n).

By using several representative points per cluster instead of a single point, CURE (Guha et al. 1998) excels at grouping data of non-spherical shape. Moreover, CURE adjusts the locations of the representative points with a shrinking factor, so it is robust to noise and outliers.

ROCK (Guha et al. 1999) groups points described by discrete, categorical attributes effectively. Differences between clusters are identified by shared neighbor points rather than by distance


or the Jaccard coefficient. Because the inter-connectivity between a pair of clusters is measured over many points, no cluster depends on a single point; ROCK can therefore recognize the similarity between a pair of clusters effectively and is robust to outliers. The time complexity of ROCK is O(n^2 + n m_m m_a + n^2 log n) and its space complexity is O(min{n^2, n m_m m_a}), where m_m is the maximum number of neighbor points and m_a is the average number of neighbor points, so ROCK has a relatively high computational complexity.

BIRCH (Zhang et al. 1996) adopts the Clustering Feature (CF) and the Clustering Feature Tree (CFT), which allow it to condense a large volume of data and keep the condensed information in memory. BIRCH can therefore read the useful information directly from memory, which increases I/O efficiency. Moreover, BIRCH introduces the concept of global clustering: the sub-clusters produced by the CFT are grouped by a robust agglomerative hierarchical algorithm, which increases the accuracy of the clustering result. BIRCH has a low time complexity, O(n). However, BIRCH is only suitable for finding clusters of spherical shape, because the CF is defined through a radius or diameter.

CHAMELEON (Karypis et al. 1999) is a two-stage algorithm. In the first stage, it partitions a sparse k-nearest-neighbor graph into several separate sub-clusters by using a graph-partitioning algorithm. In the second stage, it repeatedly merges these sub-clusters with an agglomerative hierarchical algorithm. CHAMELEON considers not only the connectivity and closeness between a pair of clusters but also the connectivity and closeness inside each cluster. Therefore, CHAMELEON can identify the internal features of a cluster automatically, discover clusters of arbitrary shape and various densities, and overcome the shortcomings of measuring the similarity between a pair of clusters only through distance or the Jaccard coefficient. CHAMELEON performs better than Single-Link, Complete-Link, ROCK, CURE and BIRCH, because the latter algorithms neglect the features inside the clusters. However, CHAMELEON is not suitable for grouping large volumes of data: its time complexity is O(nm + n^2 + n log n + m^2 log m) for high-dimensional data and O(nm + 2n log n + m^2 log m) for low-dimensional data, where m is the number of sub-clusters into which hMETIS partitions the k-nearest-neighbor graph.

3 Chameleon based on clustering feature tree

By adopting the Clustering Feature Tree (CFT), BIRCH runs much faster than other hierarchical clustering algorithms, so it is suitable for grouping large volumes of data. However, the accuracy of its clustering result is not satisfactory, because it can only discover clusters of convex, spherical shape. On the other hand, apart from its relatively low computing efficiency, CHAMELEON performs well both in clustering accuracy and in recognizing outliers. Thus, in order to increase the running speed while keeping a good clustering accuracy, we hybridize the CFT with CHAMELEON and put forward a new algorithm, called Chameleon Based on Clustering Feature Tree (CBCFT).

3.1 Clustering feature tree (CFT)

Clustering feature  Let cluster A contain N d-dimensional points {X_i}, i = 1, 2, 3, ..., N. The Clustering Feature is then defined as the triple (N, LS, SS), where N is the number of points in the cluster, LS = Σ_{i=1}^{N} X_i is the linear sum of the points, and SS = Σ_{i=1}^{N} X_i^2 is the square sum of the points.
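To make the CF summary concrete, the following is a minimal sketch (not the authors' implementation) of the CF triple and its additivity in Python; the class and variable names are illustrative only.

```python
# Minimal sketch of the Clustering Feature (N, LS, SS) and its additivity,
# which is what lets BIRCH/CBCFT summarize many points in constant memory.
import numpy as np

class ClusteringFeature:
    def __init__(self, dim):
        self.n = 0                    # number of points absorbed
        self.ls = np.zeros(dim)       # linear sum of the points
        self.ss = 0.0                 # sum of squared norms of the points

    def add_point(self, x):
        x = np.asarray(x, dtype=float)
        self.n += 1
        self.ls += x
        self.ss += float(x @ x)

    def merge(self, other):
        # CF additivity: the CF of the union of two clusters is the
        # component-wise sum of their CFs.
        self.n += other.n
        self.ls += other.ls
        self.ss += other.ss

    def centroid(self):
        return self.ls / self.n

    def radius(self):
        # square root of the average squared distance to the centroid
        c = self.centroid()
        return float(np.sqrt(max(self.ss / self.n - c @ c, 0.0)))

cf = ClusteringFeature(dim=2)
for p in [(1.0, 2.0), (2.0, 2.0), (1.5, 3.0)]:
    cf.add_point(p)
print(cf.n, cf.centroid(), cf.radius())
```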


Clustering feature tree (CFT)  A CFT is a height-balanced tree with two parameters: the branching factor B and the threshold radius. Each non-leaf node contains at most B entries of the form [CF_i, child_i] (i = 1, 2, 3, ..., B), where child_i is a pointer to the ith child node and CF_i stores the sum of the CFs of that child node, i.e., CF_i summarizes the clustering feature information of its child node. Each leaf node contains at most T entries of the form CF_i (i = 1, 2, 3, ..., T); all leaf nodes share a common threshold radius, which restricts the maximum radius or diameter of the cluster summarized by each CF_i. Each leaf node also has two pointers, prev and next, which link all the leaf nodes together and facilitate access to the leaf-node information. Because of the Radius restriction in the CFT, two points that ought to be grouped into the same leaf node may be distributed into two different leaf nodes; this is the main shortcoming of the CFT.

3.2 CHAMELEON

CHAMELEON has three main clustering stages: (1) construct the k-nearest-neighbor graph, (2) partition the k-nearest-neighbor graph into several sub-clusters, and (3) merge these sub-clusters repeatedly to obtain the final clustering result.

Construct the k-nearest-neighbor graph  CHAMELEON constructs a k-nearest-neighbor graph G_k = (V, E) by using a k-d tree (Weiss 2004), where each point v is a vertex (v ∈ V). If v_j is one of the k nearest neighbors of v_i, then there is a weighted edge e(v_i, v_j) between v_i and v_j. The weight of each edge of G_k represents the similarity between the pair of points.

Partition the k-nearest-neighbor graph into sub-clusters  CHAMELEON min-cuts G_k (i.e., separates G_k into two nearly equal parts) and, by repeating this step until certain conditions are satisfied, partitions G_k into a number of unconnected sub-graphs, which are viewed as the initial sub-clusters. hMETIS (Karypis and Kumar 1998) provides the relevant graph-partitioning algorithms.

Group the sub-clusters  After the graph has been partitioned into sub-graphs, CHAMELEON finds the most similar sub-clusters by means of the Relative Inter-Connectivity RI(C_i, C_j) and the Relative Closeness RC(C_i, C_j), and then merges them. CHAMELEON offers two merging schemes:

    RI(C_i, C_j) ≥ T_RI  and  RC(C_i, C_j) ≥ T_RC        (1)

    RI(C_i, C_j) · RC(C_i, C_j)^α > Pa                   (2)

T_RI, T_RC, α and Pa are parameters that must be specified. In CBCFT, we choose the second scheme as our merging criterion (see Fig. 1).

Fig. 1 The general process of CBCFT
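Going back to stage (1) above, the k-nearest-neighbor graph can be sketched in a few lines of code. The following is a minimal illustration, assuming SciPy is available; following the second stage of CBCFT, the reciprocal of the Euclidean distance is used as the edge weight (similarity), and all names are illustrative only.

```python
# Minimal sketch of building a k-nearest-neighbour graph with a k-d tree.
import numpy as np
from scipy.spatial import cKDTree

def knn_graph(points, k):
    """Return {i: [(j, similarity), ...]} with the k nearest neighbours of each point."""
    pts = np.asarray(points, dtype=float)
    tree = cKDTree(pts)
    # query k+1 neighbours because the nearest neighbour of a point is itself
    dists, idxs = tree.query(pts, k=k + 1)
    graph = {}
    for i, (drow, irow) in enumerate(zip(dists, idxs)):
        edges = []
        for d, j in zip(drow[1:], irow[1:]):          # skip the point itself
            sim = 1.0 / d if d > 0 else float("inf")  # reciprocal distance as similarity
            edges.append((int(j), sim))
        graph[i] = edges
    return graph

rng = np.random.default_rng(0)
g = knn_graph(rng.normal(size=(100, 2)), k=6)
print(len(g[0]))   # 6 neighbours per point
```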


3.3 Chameleon based on clustering feature tree (CBCFT)

CBCFT is a hybrid two-stage clustering algorithm that combines the CFT with CHAMELEON. In the first stage, CBCFT uses the CFT to group a large volume of data into a number of sub-clusters at high speed. This stage is not merely an initial clustering step; it also reduces the amount of data to be grouped in the second stage, which allows CBCFT to overcome the low computing efficiency of CHAMELEON. In the second stage, we take the LS of each leaf entry of the CFT to represent the corresponding sub-cluster and use CHAMELEON to group these sub-clusters. In this way we not only overcome the shortcoming of the CFT (because of the Radius restriction, two points that ought to be grouped into the same leaf node may fall into two different leaf nodes), but can also identify clusters of arbitrary shape and increase the clustering accuracy. CBCFT has seven parameters: the branching factor B, the threshold Radius, the number of entries per leaf node T, the value k of the k-nearest-neighbor graph, the number of sub-clusters sn required by hMETIS, and the values α and Pa required by CHAMELEON as the merging criterion. In Sects. 4 and 5 we demonstrate, through experiments and a practical case, that CBCFT combines a high running speed with a satisfactory clustering accuracy (a minimal code sketch of this two-stage scheme is given at the end of this section).

The procedure of CBCFT is as follows (see Fig. 2). Step 1 constructs a CFT according to the parameters B, T and Radius. Steps 2–6 construct a new data set LEAF from the LS of each leaf entry of the CFT and take LEAF as the input of CHAMELEON. Step 7 constructs a k-d tree from the data set LEAF. Steps 8–14 construct a link table for each point from the k-d tree, in which each link table stores the k nearest points of that point; Steps 10–12 store the similarity of each point to its k nearest neighbors, taking the reciprocal of the distance as the similarity measure. Step 15 partitions the resulting k-nearest-neighbor graph into sn sub-clusters by using hMETIS. Steps 16–22 view each sub-cluster as a single cluster, calculate RI_{i,j} and RC_{i,j} between every pair of sub-clusters, and build a heap MaxHeap keyed on RI_{i,j}, whose elements are ordered non-increasingly from the root to the leaves. Step 23 traverses the elements of MaxHeap until the heap is empty. In Step 24, the key of the root of MaxHeap is the maximum key among all elements; if the key of subCluster_i and subCluster_j is this maximum, Steps 25–28 merge subCluster_i with subCluster_j. Step 29 deletes the root of MaxHeap and rebuilds the heap.

3.4 Time complexity

The first stage of clustering  The structure of the CFT is similar to a B+ tree (Weiss 2004). When a point is inserted into the CFT, a search path from the root to a leaf node has to be found, whose time complexity is O(B log_B M) (Zhang et al. 1996), where M is the number of nodes of the CFT and B is the branching factor. The overall time complexity of this stage is therefore O(nB log_B M).

The second stage of clustering  For low-dimensional data, the k-d tree can construct the k-nearest-neighbor graph rapidly, with time complexity O(n log n) (Lewis and Chase 2004; Friedman et al. 1977). The time complexity of CHAMELEON is O(nm + n log n + m^2 log m) (Karypis et al. 1999), where m is the number of sub-clusters into which hMETIS partitions the k-nearest-neighbor graph.


Fig. 2 The procedure of CBCFT


Summary of the above stages  According to the analysis above, the time complexity of CBCFT is O(nB log_B M + cm + c log c + m^2 log m), where B is the branching factor, n is the number of points, M is the number of nodes of the CFT, c is the number of sub-clusters produced by the CFT, and m is the number of sub-clusters into which hMETIS (Karypis and Kumar 1998) partitions the k-nearest-neighbor graph. For fixed parameter settings, the only term that grows with the data volume is nB log_B M, which is why the running time of CBCFT increases approximately linearly with n.
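To make the two-stage structure of CBCFT concrete, here is a minimal Python sketch under stated assumptions: stage 1 uses scikit-learn's Birch to build a CF tree and expose its leaf sub-cluster centroids; because CHAMELEON itself is not available in scikit-learn, stage 2 uses agglomerative clustering purely as a stand-in for the graph-partitioning and merging stage. All parameter values and names are illustrative, not the authors' settings.

```python
# Two-stage sketch in the spirit of CBCFT (not the authors' implementation).
import numpy as np
from sklearn.cluster import Birch, AgglomerativeClustering

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(500, 2))
               for c in [(0, 0), (4, 0), (2, 3)]])

# Stage 1: CF-tree pre-clustering (analogous to building the CFT).
birch = Birch(threshold=0.5, branching_factor=8, n_clusters=None)
birch.fit(X)
leaves = birch.subcluster_centers_          # one representative per leaf entry

# Stage 2: cluster the sub-cluster representatives
# (CHAMELEON in the paper; agglomerative clustering used here as a stand-in).
stage2 = AgglomerativeClustering(n_clusters=3, linkage="complete")
leaf_labels = stage2.fit_predict(leaves)

# Map every original point to the label of its nearest leaf sub-cluster.
point_labels = leaf_labels[birch.predict(X)]
print(len(leaves), "sub-clusters ->", len(set(point_labels)), "final clusters")
```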

4 Experimental results

In this section, we study the performance of CBCFT and compare its clustering effectiveness with that of BIRCH and CHAMELEON. From our results we conclude that: (1) BIRCH fails to identify clusters with large variance in size and may group clusters incorrectly, so compared with CBCFT its clustering result is poor; (2) CBCFT is able to identify clusters of different sizes and shapes and is robust to outliers; (3) the clustering result of CBCFT is very similar to that of CHAMELEON, which shows that CBCFT achieves a clustering accuracy as high as that of CHAMELEON; (4) the execution time of CBCFT is much less than that of CHAMELEON.

Data set  We experimented on several data sets of two-dimensional points; due to lack of space, only one of them is presented here. The data set is shown in Fig. 3, and the number of points of each shape is given in Table 1. The data set contains five clusters and 70 outliers.

4.1 Qualitative comparison

CBCFT  We need to specify the seven parameters of CBCFT: the branching factor B, the threshold T, the Radius of the leaves of the CFT, the value k of the k-nearest-neighbor graph, the number of sub-clusters sn required by hMETIS, and the values α and Pa used in the criterion that combines RI and RC. The values we chose are shown in Table 2. The clustering result is illustrated in Fig. 4, in which points belonging to the same cluster are shown in the same color. Figure 4 shows that CBCFT succeeds in finding the five clusters and is not very sensitive to outliers; furthermore, it can identify different shapes and large variance in size.

Table 1 The volume of points with different shapes

Data set   Oval 1 (left)   Oval 2 (right)   Circle 1 (big)   Circle 2 (small)   Circle 2 (small)   Outliers
Volume     630             642              3854             654                653                70
Total: 6503

Table 2 Parameters of CBCFT

Parameter   B   T    k   Radius   sn   α     Pa
Value       8   10   6   1.3      60   0.9   1


Fig. 3 Data set

Fig. 4 Result of CBCFT

BIRCH  BIRCH originally had to take the limited memory of early machines into account during clustering; nowadays memory is large enough to store the whole Clustering Feature Tree, so we ignore the memory constraint while grouping the points. In the global clustering stage, BIRCH uses a robust agglomerative hierarchical algorithm to group the points; we use Complete-Link clustering for this purpose. The parameters to be specified are the branching factor B, the threshold T, the Radius of the leaves of the CFT, and the value S at which Complete-Link clustering stops agglomerating. The values of these parameters are shown in Table 3. Figure 5 shows the result of BIRCH. It is obvious that BIRCH wrongly merges the two Ovals, which should be identified as two different clusters. Moreover, the big Circle is not


well identified as a single cluster. This shows that BIRCH handles clusters of large size poorly. Compared with CBCFT, the result of BIRCH is less satisfactory.

Table 3 Parameters of BIRCH

Parameter   B   T    Radius   S
Value       8   10   1.3      0.8

Fig. 5 Result of BIRCH

CHAMELEON  The process of CHAMELEON is the same as the second stage of CBCFT, and we again use the second scheme for combining RI and RC. We therefore need to specify the following parameters: the value k of the k-nearest-neighbor graph, the number of sub-clusters sn required by hMETIS, and the values α and Pa. These parameters are shown in Table 4. Figure 6 shows that the five clusters are identified by CHAMELEON and the result is excellent: although there is a line of points between the two Ovals, CHAMELEON still separates the two clusters. Moreover, the result of CHAMELEON is very similar to that of CBCFT, which confirms the clustering effectiveness of CBCFT.

Table 4 Parameters of CHAMELEON

Parameter   k    sn    α     Pa
Value       25   100   0.9   0.2

4.2 Comparison of execution time

In this subsection, our goal is to demonstrate that the execution time of CBCFT lies between those of BIRCH and CHAMELEON and is much less than that of CHAMELEON. We run the three algorithms on the above data set with different densities of each shape; that is, the number of points in each shape is altered, but the size of each shape is unchanged. All the points of each


Fig. 6 Result of CHAMELEON

Fig. 7 Comparison of execution time

shape are sampled randomly. The three algorithms identify most of the clusters correctly on these data sets, so their parameters are kept unchanged in this experiment. The data sets of different densities are described in Table 5.

Figure 7 and Table 6 show the performance of the three algorithms as the number of points increases from 1017 to 6433; the execution time includes all the time consumed by each algorithm. The running time of BIRCH increases very little with the number of points. Compared with BIRCH, CBCFT and CHAMELEON are less efficient; however, beyond about 2000 points the running time of CBCFT is clearly much less than that of CHAMELEON. In particular, when CBCFT runs on the data set


Table 5 Number of points in each data set

             Oval 1 (left)   Oval 2 (right)   Circle 1 (big)   Circle 2 (small)   Circle 2 (small)   Total
Data set 1   101             98               607              103                108                1017
Data set 2   204             208              1202             209                205                2028
Data set 3   315             311              1850             302                308                3086
Data set 4   408             399              2420             404                391                4022
Data set 5   504             510              3035             516                520                5085
Data set 6   630             642              3854             654                653                6433

Table 6 Comparison of execution time

Time (in seconds)   Data set 1   Data set 2   Data set 3   Data set 4   Data set 5   Data set 6
BIRCH               3            4            5            7            8            10
CBCFT               9            19           38           68           122          202
CHAMELEON           42           145          297          552          891          1312

with 6433 points, its running time is about 85% less than that of CHAMELEON (202 seconds versus 1312 seconds). Therefore, we can safely conclude that CBCFT is far more efficient than CHAMELEON.

5 The application of CBCFT in the customer segmentation of the PetroChina HUBEI branch

The PetroChina HUBEI branch is a subsidiary of PetroChina, one of the two largest state-owned enterprises in China. For the past five years, the enterprise has been trying to coordinate and integrate various enterprise resources, such as labor force, finance and customers, by implementing an ERP system. CRM, as one of the most significant components of ERP, deserves particular attention because it determines the effectiveness and efficiency of the ERP implementation (Palomino and Whitley 2007; Olson and Zhao 2007; Warfield 2007). One of the main problems of the ERP of the HUBEI branch is that the segmentation of customers is too subjective: it still depends on the personal judgment of managers. Because of this lack of a standard, customer segmentation is inaccurate and unscientific, which leads the HUBEI branch to spend a large amount of human, material and financial resources on unimportant customers. How to segment customers accurately and identify the valuable customers is therefore the primary problem to be solved. In the rest of the paper, CBCFT is used to group the customers of the HUBEI branch so as to identify the valuable customers.

5.1 Data set

There are many indicators in the database of the HUBEI branch. In order to classify the customers by their importance to the branch, we consulted the director of the branch and chose two indicators to measure the importance of customers: the Average of Frequency (AF) and the Average of Spending (AS). The frequency of purchasing petroleum and the amount spent vary from month to month, so the first indicator (AF) is the average purchase frequency over a year, and the second indicator (AS)

Table 7 Details of indicators

Indicators                   Data type   Range
Average of frequency (AF)    Integer     1–1,627
Average of spending (AS)     Double      695–10,041,034

is the average spending over a year. The details of these two indicators are shown in Table 7. The HUBEI branch has 2064 customers in total, so there are 2064 records to be grouped. Because of the large difference in scale between AF and AS, we standardize the data on these two indicators as follows.

1. Calculate the mean absolute deviation S_f:

   S_f = (1/n) (|X_1f − m_f| + |X_2f − m_f| + |X_3f − m_f| + · · · + |X_nf − m_f|)        (3)

   where X_1f, X_2f, ..., X_nf are the n measurements of f and m_f is the mean value of f, that is,

   m_f = (1/n) (X_1f + X_2f + · · · + X_nf)        (4)

2. Calculate the standardized measurement, or z-score:

   Z_if = (X_if − m_f) / S_f        (5)
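A minimal sketch of the standardization in (3)–(5), together with the |z| > 10 cut-off described in the text for discarding exceptional records, is given below; the column values are toy numbers, not the HUBEI data.

```python
# Standardization with the mean absolute deviation instead of the standard deviation.
import numpy as np

def mad_zscore(x):
    x = np.asarray(x, dtype=float)
    m = x.mean()                       # m_f in (4)
    s = np.abs(x - m).mean()           # S_f in (3)
    return (x - m) / s                 # Z_if in (5)

af = np.array([12, 3, 45, 7, 1600.0])                # toy purchase frequencies
as_ = np.array([800, 9500, 120000, 2500, 9.8e6])     # toy spending values

z = np.column_stack([mad_zscore(af), mad_zscore(as_)])
keep = (np.abs(z) <= 10).all(axis=1)                 # drop records with |z| > 10
print(z.round(2), keep)
```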

After the two indicators have been standardized, the values are not confined to the interval between −1 and 1; if the standardized value of a record is far greater than 1 or far less than −1, it can safely be regarded as an exception or outlier, because such records will certainly not be grouped with the other clusters and would distort the clustering result. We therefore delete the records whose standardized values exceed 10 in absolute value, which leaves 1908 records.

5.2 Grouping the customers of the HUBEI branch

After preprocessing the customer data, we obtain 1908 records; each record stores the identity of the customer and the standardized Average of Frequency (AF) and Average of Spending (AS).

The first stage of clustering—using the CFT to group customers  We first group the customers on AF and AS by using the CFT and eliminate the outliers in the data set. The parameters to be specified are the branching factor B, the threshold T and the Radius of the leaves of the CFT; they are given in Table 8. The resulting CFT has height 7, and taking each leaf entry of the CFT as a single sub-cluster yields 48 sub-clusters. Finally, we take the LS of each leaf to represent its sub-cluster.

The second stage of clustering—using CHAMELEON to group customers  We view the 48 sub-clusters obtained in the first stage as 48 points and take them as the input of CHAMELEON. The parameters to be specified are the value k of the k-nearest-neighbor graph, the number of sub-clusters sn, and the values α and Pa; they are given in Table 9. Table 10 shows that CHAMELEON groups the 48 sub-clusters into four clusters.

Table 8 Parameters of the first stage

Parameter   B   T   Radius
Value       5   8   0.6

Table 9 Parameters of the second stage

Parameter   k    sn   α     Pa
Value       10   15   0.9   0.1

Table 10 The information of customer segmentation

Identity of cluster                         1        2         3        4
The number of customers                     257      111       66       1474
Percentage of all the customers             13.47%   5.82%     3.45%    77.25%
Average frequency of purchasing petroleum   41       4         27       6
Average spending                            25,102   130,182   65,454   27,592

Fig. 8 Customer group—the 1st cluster (“frequent customer”)

Fig. 9 Customer group—the 2nd cluster (“spending customer”)

Figures 8 to 11 show the distributions of the four clusters produced in the second stage. Each cluster has good interior homogeneity and is easy to interpret: the customers of the first cluster ("frequent customer") purchase petroleum quite frequently but spend relatively little on average; the customers of the second cluster ("spending customer") spend a lot each time but do not purchase frequently; the customers of the third cluster ("best customer") purchase petroleum frequently and spend a lot each time; and the customers of the fourth cluster ("uncertain customer") purchase petroleum only occasionally and spend little each time.


Fig. 10 Customer group—the 3rd cluster (“best customer”)

Fig. 11 Customer group—the 4th cluster (“uncertain customer”)

Each line in Figs. 8–11 represents a single sub-cluster obtained in the first stage of clustering; the two ends of each line represent the standardized AF and AS values of that sub-cluster.

5.3 The clustering result

To show the validity of the clustering result, we use ANalysis Of VAriance (ANOVA) to demonstrate that the four clusters are of good quality, i.e., that elements in the same cluster are quite similar while elements from different clusters differ significantly.

Comparison of the general averages  The null hypothesis is that the averages of AF are equal across the clusters, and likewise for AS. The ANOVA results are shown in Table 11, whose last column shows that all significance (Sig.) values are below 0.05. That is, the four clusters—"frequent customer", "spending customer", "best customer" and "uncertain customer"—differ significantly in the averages of both AF and AS. Broadly speaking, significant differences exist between pairs of clusters and elements within the same cluster are highly homogeneous, so CBCFT does differentiate the types of customers.

Comparison of the averages between pairs of clusters  We also compare the averages of AF and AS between every pair of clusters; Table 12 shows the ANOVA results. Table 13 shows that the pairs ("frequent customer", "spending customer"), ("frequent customer", "best customer"), ("frequent customer", "uncertain customer"), ("spending customer", "best customer") and ("best customer", "uncertain customer") differ significantly in the average of AF, while the pairs ("frequent customer", "spending customer"), ("frequent customer", "best customer"), ("spending customer", "best customer"), ("spending customer", "uncertain customer") and ("best customer", "uncertain customer") differ significantly in the average of AS. In other words, on the average of AF, "spending customer" differs insignificantly from "uncertain customer" but significantly from the other two types of customer: "spending customer" and "uncertain customer" form the groups who purchase petroleum infrequently, whereas "frequent customer" and "best customer" form the groups who purchase petroleum frequently. Moreover, "frequent customer" differs insignificantly from "uncertain customer" on the average of AS, but significantly from the other two types of customers: "frequent customer" and "uncertain customer" spend little each time, whereas "spending customer" and "best customer" spend a lot each time.
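For illustration, a one-way ANOVA of this kind can be reproduced with scipy.stats.f_oneway; the groups below are synthetic samples whose sizes and means roughly follow Table 10, not the actual HUBEI records. Pairwise comparisons such as those in Table 12 would additionally require a post-hoc test.

```python
# One-way ANOVA sketch on synthetic AF data for the four customer groups.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
groups_af = [rng.normal(loc=mu, scale=1.0, size=n)
             for mu, n in [(41, 257), (4, 111), (27, 66), (6, 1474)]]

f_stat, p_value = stats.f_oneway(*groups_af)
print(f"F = {f_stat:.1f}, p = {p_value:.3g}")   # p < 0.05 -> cluster means differ significantly
```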


Table 11 ANOVA table—comparison of the general averages

                       Sum of squares   df     Mean square   F          Sig.
AF   Between groups    1477.967         3      492.656       933.061    .000
     Within groups     1005.583         1904   .528
     Total             2483.550         1907
AS   Between groups    434.833          3      144.944       1132.375   .000
     Within groups     243.821          1904   .128
     Total             678.654          1907

Table 12 ANOVA table—comparison of the averages between pairs of clusters

Dependent variable   (I) type   (J) type   Mean difference (I−J)   Std. error   Sig.
AF                   1.00       2.00        2.5455 a                .07530       .000
                                3.00        1.0651 a                .09727       .000
                                4.00        2.4280 a                .04638       .000
                     2.00       1.00       −2.5455 a                .07530       .000
                                3.00       −1.4804 a                .10719       .000
                                4.00        −.1175                  .06466       .069
                     3.00       1.00       −1.0651 a                .09727       .000
                                2.00        1.4804 a                .10719       .000
                                4.00        1.3629 a                .08929       .000
                     4.00       1.00       −2.4280 a                .04638       .000
                                2.00         .1175                  .06466       .069
                                3.00       −1.3629 a                .08929       .000
AS                   1.00       2.00       −1.8085 a                .03708       .000
                                3.00       −1.0330 a                .04790       .000
                                4.00        −.0387                  .02284       .090
                     2.00       1.00        1.8085 a                .03708       .000
                                3.00         .7755 a                .05278       .000
                                4.00        1.7698 a                .03184       .000
                     3.00       1.00        1.0330 a                .04790       .000
                                2.00        −.7755 a                .05278       .000
                                4.00         .9943 a                .04396       .000
                     4.00       1.00         .0387                  .02284       .090
                                2.00       −1.7698 a                .03184       .000
                                3.00        −.9943 a                .04396       .000

a Represents the pair of clusters whose significance value is below 0.05

Table 13 The analysis of differences in the averages of the clustering variables

           Type of cluster
Variable   1        2         3        4        Have significant difference (α = 0.05)
AF         41       4         27       6        (1, 2), (1, 3), (1, 4), (2, 3), (3, 4)
AS         25,102   130,182   65,454   27,592   (1, 2), (1, 3), (2, 3), (2, 4), (3, 4)


5.4 Comparison among CHAMELEON, BIRCH and CBCFT

In this subsection, by studying the performance of CBCFT, BIRCH and CHAMELEON, we demonstrate that the result of CBCFT is more accurate than that of BIRCH and very similar to that of CHAMELEON; furthermore, the running time of CBCFT is much less than that of CHAMELEON, and the result of CBCFT is more valuable than that of BIRCH in this case. In the rest of the paper, we use a Dunn-like index to measure the accuracy of the clustering results.

The index for evaluating the clustering result  The Dunn-like index describes the similarity of elements inside each cluster and the difference between pairs of clusters: the larger the Dunn-like index, the more similar are the elements inside each cluster and the more different are the elements of different clusters. The Dunn-like index is defined as

   D = min_{i=1,2,...,c} min_{j=1,2,...,c, j≠i} [ d(c_i, c_j) / max_{k=1,2,...,c} diam_MST(c_k) ]        (6)

where c is the number of clusters and d(c_i, c_j) = min_{x∈c_i, y∈c_j} d(x, y). For each cluster k, a graph G_k = (V, E) is built from the adjacency matrix, where V contains all points of cluster k and the edge weights are the distances between pairs of points; a minimum spanning tree (MST) can be constructed with Kruskal's algorithm. Let W be the set of weights of the MST edges; then diam_MST(c_k) = max_{w∈W} w.

The result of BIRCH  Because the first stage of CBCFT uses the CFT, which is identical to the first stage of BIRCH, the number of sub-clusters produced by BIRCH is the same as that produced by CBCFT. We view these 48 sub-clusters as 48 points and set S = 0.8 to stop the agglomeration of the Complete-Link algorithm. We finally obtain the six clusters shown in Table 14. From Table 14, cluster 1, cluster 2 and cluster 3 should belong to "frequent customer", because their AF values are relatively high and their AS values are relatively low; cluster 4 and cluster 5 should belong to "spending customer", because their AF values are relatively low and their AS values are relatively high; and cluster 6 should belong to "uncertain customer", because both its AF and AS values are relatively low.

Table 14 The clustering result of BIRCH

Type        AF      AS      Number of customers
Cluster 1   2.44    −0.54   76
Cluster 2   5.11    −0.61   18
Cluster 3   8.11    −0.55   10
Cluster 4   −0.63   1.33    49
Cluster 5   −0.70   4.34    8
Cluster 6   −0.43   −0.52   1747
                            Sum: 1908
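As a side note, the Dunn-like index of (6) can be sketched in code as follows: the smallest single-link distance between two clusters divided by the largest MST edge weight within any cluster. The sketch assumes SciPy and uses a small synthetic data set; it is not the authors' implementation.

```python
# Dunn-like index: min between-cluster single-link distance / max within-cluster MST diameter.
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def dunn_like(clusters):
    """clusters: list of (n_i, d) arrays, one array of points per cluster."""
    # largest MST edge weight within any single cluster (denominator in (6))
    diam = 0.0
    for pts in clusters:
        if len(pts) < 2:
            continue
        mst = minimum_spanning_tree(squareform(pdist(pts)))
        diam = max(diam, mst.data.max())
    # smallest single-link distance between any two different clusters
    sep = min(cdist(a, b).min()
              for i, a in enumerate(clusters)
              for j, b in enumerate(clusters) if j > i)
    return sep / diam

rng = np.random.default_rng(3)
clusters = [rng.normal(loc=c, scale=0.2, size=(50, 2)) for c in [(0, 0), (3, 0), (0, 3)]]
print(round(dunn_like(clusters), 4))
```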


Table 15 The parameters of CHAMELEON

Parameter   k    sn   α     Pa
Value       10   15   0.9   0.1

Table 16 The clustering result of CHAMELEON

Type        AF      AS      Number of customers   Percentage of all customers
Cluster 1   1.65    −0.54   188                   9.9%
Cluster 2   4.01    −0.45   35                    1.83%
Cluster 3   6.74    −0.62   16                    0.84%
Cluster 4   −0.82   1.45    77                    4.04%
Cluster 5   −0.25   2.37    22                    1.15%
Cluster 6   −0.83   5.09    4                     0.21%
Cluster 7   0.92    0.22    54                    2.83%
Cluster 8   3.83    0.53    5                     0.26%
Cluster 9   −0.55   −0.51   1507                  78.93%

Table 17 Comparison between CHAMELEON and CBCFT

CHAMELEON:
Type        AF      AS      Number of customers
Cluster 1   1.65    −0.54   188
Cluster 2   4.01    −0.45   35
Cluster 3   6.74    −0.62   16
Cluster 4   −0.82   1.45    77
Cluster 5   −0.25   2.37    22
Cluster 6   −0.83   5.09    4
Cluster 7   0.92    0.22    54
Cluster 8   3.83    0.53    5
Cluster 9   −0.55   −0.51   1507
                            Sum: 1908

CBCFT:
Type        AF      AS      Number of customers   Common customers (Cus1 ∩ Cus2)
Cluster 1   1.80    −0.58   257                   212
Cluster 2   −0.68   1.82    111                   105
Cluster 3   0.92    0.33    66                    50
Cluster 4   −0.58   −0.57   1474                  1438
                            Sum: 1908             1805

The result of CHAMELEON  We take the 1908 records as the input of CHAMELEON and adopt the second scheme for combining RI and RC. The parameter values are shown in Table 15, and Table 16 shows the result of CHAMELEON.

Comparison between CHAMELEON and CBCFT  Table 17 shows that clusters 1, 2 and 3 of CHAMELEON are similar to the first cluster of CBCFT ("frequent customer") in both AF and AS; clusters 4, 5 and 6 of CHAMELEON are similar to the second cluster of CBCFT ("spending customer"); clusters 7 and 8 are similar to the third cluster of CBCFT ("best customer"); and cluster 9 is similar to the fourth cluster of CBCFT. We therefore merge clusters 1, 2 and 3 of CHAMELEON into one cluster, clusters 4, 5 and 6 into a second, and clusters 7 and 8 into a third, so that the results of CHAMELEON and CBCFT can be compared directly.

Table 18 Dunn-like index on the 1908 records

Algorithm             BIRCH     CBCFT     CHAMELEON
Value of Dunn-like    0.00238   0.00937   0.01143

Table 19 Running time on the 1908 records

Algorithm    Stage 1 (sec)   Stage 2 (sec)   Total (sec)
BIRCH        3               1               4
CBCFT        3               8               11
CHAMELEON    131             –               131

The last column of Table 17 shows, for each customer type, the number of customers common to CHAMELEON and CBCFT, which indicates how similar the two clustering results are. The last row of Table 17 shows that there are 1805 common customers in total, accounting for 94.6% of all customers; thus the clustering result of CBCFT is very similar to that of CHAMELEON.

Comparison between BIRCH and CBCFT  BIRCH recognizes only three types of customer groups; the "best customer" group is not found by BIRCH, even though these customers are extremely important for the HUBEI branch because of their high AS and AF values. BIRCH therefore fails to mine the most valuable customers for the HUBEI branch. CBCFT, on the other hand, recognizes the "frequent customer", the "spending customer" and the "uncertain customer" groups and mines the "best customer" group as well; furthermore, ANOVA confirms that these four types of customers differ significantly. CBCFT is therefore more valuable than BIRCH for this practical case.

Comparison of the accuracy of the results of the three algorithms  We use the Dunn-like index to measure the accuracy of the results. To make the three algorithms comparable on the Dunn-like index, we merge the nine clusters of CHAMELEON (Table 16) into four clusters and the six clusters of BIRCH (Table 14) into four clusters, as described above. The values of the Dunn-like index on the 1908 records are shown in Table 18. The Dunn-like value of CBCFT is larger than that of BIRCH, which means that the points inside the clusters of CBCFT are more similar and the points in different clusters are more distinct than for BIRCH; hence the result of CBCFT is more accurate than that of BIRCH. Similarly, the Dunn-like value of CBCFT is smaller than that of CHAMELEON, so the result of CBCFT is somewhat less accurate than that of CHAMELEON.

Comparison of the running time of the three algorithms  The running times of the three algorithms on the 1908 records are given in Table 19. The running speed of CBCFT is much faster than that of CHAMELEON in this case, while the running time of CBCFT is only slightly more than that of BIRCH.

6 Conclusion

After analyzing the pros and cons of various hierarchical clustering algorithms, we conclude that the CFT runs very fast and that CHAMELEON can always find the right clusters.


We therefore hybridize the CFT with CHAMELEON and put forward a new two-stage clustering algorithm, named Chameleon Based on Clustering Feature Tree (CBCFT). We argue that the time complexity of CBCFT increases linearly with the number of data points. Through experiments on a sample data set, we conclude that CBCFT is able to identify clusters with large variance in size and shape and is robust to outliers. Compared with BIRCH, CBCFT can always find the right clusters; compared with CHAMELEON, the execution efficiency of CBCFT is much higher. In the last part of the paper, we apply CBCFT to segmenting the customers of the HUBEI branch of the Chinese Petroleum Corporation. We demonstrate that CBCFT can solve a practical problem successfully: the result of this application is more accurate than that of BIRCH and almost as accurate as that of CHAMELEON. The clustering result of CBCFT is very similar to that of CHAMELEON, while CBCFT overcomes CHAMELEON's low execution efficiency; although the execution time of CBCFT is longer than that of BIRCH, its clustering result is much more satisfactory. Moreover, the clustering result shows that the customer segmentation is reasonable and easily understandable, which demonstrates that CBCFT is able to recognize various types of customer groups. These customer segments help the managers of the HUBEI branch to make marketing decisions and implement customer relationship management.

Acknowledgement  The authors are grateful to anonymous referees for their helpful comments and suggestions.

References

Ankerst, M., Breunig, M., Kriegel, H. P., & Sander, J. (1999). OPTICS: Ordering points to identify the clustering structure. In Proceedings of the 1999 ACM-SIGMOD international conference on management of data (SIGMOD'99), Philadelphia, June 1999 (pp. 49–60).
Chen, M. S., Han, J., & Yu, P. S. (1996). Data mining: an overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering, 8(6), 866–883.
Chicco, G., Napoli, R. C., & Piglione, F. (2006). Comparisons among clustering techniques for electricity customer classification. IEEE Transactions on Power Systems, 21(2).
Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the second international conference on knowledge discovery and data mining (pp. 226–231).
Friedman, J. H., Bentley, J. L., & Finkel, R. A. (1977). An algorithm for finding best matches in logarithmic expected time. ACM Transactions on Mathematical Software, 3, 209–226.
Guha, S., Rastogi, R., & Shim, K. (1998). CURE: An efficient clustering algorithm for large databases. In Proceedings of the ACM-SIGMOD international conference on management of data (pp. 73–84).
Guha, S., Rastogi, R., & Shim, K. (1999). ROCK: A robust clustering algorithm for categorical attributes. In Proceedings of the 15th international conference on data engineering (p. 512).
Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2001). Clustering algorithms and validity measures. In Proceedings of the thirteenth international conference on scientific and statistical database management (Vol. 3, pp. 1099–3371).
Han, J., & Kamber, M. (2001). Data mining: concepts and techniques. Beijing: Higher Education Press.
Hinneburg, A., & Keim, D. A. (1998). An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the 1998 international conference on knowledge discovery and data mining (KDD'98), New York, August 1998 (pp. 58–65).
Hu, T. L., & Sheu, J. B. (2003). A fuzzy-based customer classification method for demand-responsive logistical distribution operations. Fuzzy Sets and Systems, 139, 431–450.
Jain, A. K., Murty, M. N., & Flynn, P. J. (1999). Data clustering: a review. ACM Computing Surveys, 31(3), 264–323.
Jung, S. Y., & Kim, T. S. (2001). An agglomerative hierarchical clustering using partial maximum array and incremental similarity computation method. In First IEEE international conference on data mining (p. 265).
Karypis, G., Han, E. H., & Kumar, V. (1999). Chameleon: Hierarchical clustering using dynamic modeling. IEEE Computer, 32(8), 68–75 (Special issue on data analysis and mining).
Karypis, G., & Kumar, V. (1998). hMETIS 1.5: A hypergraph partitioning package. Technical report, Department of Computer Science, University of Minnesota. http://www.cs.umn.edu/~metis.
Lewis, J., & Chase, J. (2004). Data structures, Java edition. Englewood Cliffs: Prentice-Hall.
Li, C., Becerra, V. M., & Deng, J. (2004). Extension of fuzzy c-means algorithm. In Proceedings of the IEEE conference on cybernetics and intelligent systems, Singapore, 1–3 December 2004.
Lin, C. R., & Chen, M. S. (2005). Combining partitional and hierarchical algorithms for robust and efficient data clustering with cohesion self-merging. IEEE Transactions on Knowledge and Data Engineering, 17(2).
Ng, R., & Han, J. (1994). Efficient and effective clustering method for spatial data mining. In Proceedings of the international conference on very large data bases (VLDB'94), Santiago, Chile, September 1994 (pp. 144–155).
Olson, D. L., & Zhao, F. (2007). CIOs' perspectives of critical success factors in ERP upgrade projects. Enterprise Information Systems, 1(1), 129–138.
Palomino, M. A., & Whitley, E. A. (2007). The effects of national culture on ERP implementation: a study of Colombia and Switzerland. Enterprise Information Systems, 3(1), 301–325.
Qian, Y. T., Shi, Q. S., & Wang, Q. (2002). CURE-NS: A hierarchical clustering algorithm with new shrinking scheme. In Proceedings of the first international conference on machine learning and cybernetics, Beijing, 4–5 November 2002.
Sheikholeslami, G., Chatterjee, S., & Zhang, A. (1998). WaveCluster: A multi-resolution clustering approach for very large spatial databases. In Proceedings of the international conference on very large data bases (VLDB'98), New York, August 1998 (pp. 428–439).
Wan, L. H., Li, Y. J., Liu, W. Y., & Zha, D. Y. (2005). Application and study of spatial cluster and customer partitioning. In Proceedings of the fourth international conference on machine learning and cybernetics, Guangzhou, 18–21 August 2005.
Wang, W., Yang, J., & Muntz, R. (1997). STING: A statistical information grid approach to spatial data mining. In Proceedings of the 1997 international conference on very large data bases (VLDB'97), Athens, Greece, August 1997 (pp. 186–195).
Wang, Z., He, P. L., Guo, L. S., & Zheng, X. S. (2004). Clustering analysis of customer relationship in securities trade. In Proceedings of the third international conference on machine learning and cybernetics, Shanghai, August 2004 (pp. 26–29).
Warfield, J. N. (2007). Systems science serves enterprise integration: a tutorial. Enterprise Information Systems, 2(1), 235–254.
Weiss, M. A. (2004). Data structures and algorithm analysis in Java. Englewood Cliffs: Prentice-Hall.
Zhang, T., Ramakrishnan, R., & Livny, M. (1996). BIRCH: An efficient data clustering method for very large databases. In Proceedings of the 1996 ACM-SIGMOD international conference on management of data, Montreal, Quebec (pp. 103–114).