
IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS, VOL. 42, NO. 3, JUNE 2012

Fast Graph-Based Relaxed Clustering for Large Data Sets Using Minimal Enclosing Ball

Pengjiang Qian, Fu-Lai Chung, Shitong Wang, and Zhaohong Deng

Abstract—Although graph-based relaxed clustering (GRC) is one of the spectral clustering algorithms with straightforwardness and self-adaptability, it is sensitive to the parameters of the adopted similarity measure and also has high time complexity O(N^3), which severely weakens its usefulness for large data sets. In order to overcome these shortcomings, after introducing certain constraints for GRC, an enhanced version of GRC [constrained GRC (CGRC)] is proposed to increase the robustness of GRC to the parameters of the adopted similarity measure, and accordingly, a novel algorithm called fast GRC (FGRC) based on CGRC is developed in this paper by using the core-set-based minimal enclosing ball approximation. A distinctive advantage of FGRC is that its asymptotic time complexity is linear with the data set size N. At the same time, FGRC also inherits the straightforwardness and self-adaptability from GRC, making the proposed FGRC a fast and effective clustering algorithm for large data sets. The advantages of FGRC are validated by various benchmarking and real data sets.

Index Terms—Clustering, large data sets, minimal enclosing ball (MEB), time complexity.

Manuscript received June 6, 2009; revised February 13, 2011 and September 11, 2011; accepted September 26, 2011. Date of publication February 3, 2012; date of current version May 16, 2012. This work was supported in part by the Hong Kong Polytechnic University under Grants 1-ZV5V and G-U296, by the National Science Foundation of China under Grants 61170122 and 60903100, by the Natural Science Foundation of Jiangsu Province under Grants BK2009067 and 2011NSFJS, by the Fundamental Research Funds for the Central Universities under Grants JUSRP21128, JUSRP211A34, and JUDCF09034, by the 2009 and 2011 Postgraduate Student's Creative Research Funds of Jiangsu Province, and by the Research Fund of 2011 Jiangsu 333 Plan. This paper was recommended by Associate Editor H. Kargupta.

P. Qian is with the School of Digital Media, Jiangnan University, Wuxi 214122, China (e-mail: [email protected]). F.-L. Chung is with the Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong (e-mail: [email protected]). S. Wang and Z. Deng are with the School of Digital Media, Jiangnan University, Wuxi 214122, China, and also with the Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong (e-mail: [email protected]; [email protected]).

Digital Object Identifier 10.1109/TSMCB.2011.2172604

I. INTRODUCTION

Spectral clustering refers to a class of techniques that rely on the eigenstructure of a similarity matrix to partition points into disjoint clusters, with points in the same cluster having high similarity and points in different clusters having low similarity. Because it is able to discover clusters with arbitrary shapes and converges to the globally optimal solution, it has become popular in recent years, and abundant theories and approaches based on it have emerged (see, e.g., [1]-[9]). Among them, the normalized cut (NC) is perhaps the most attractive one. It was first proposed by Shi and Malik in [10] and [11] and later improved in many subsequent works (e.g., [4]-[6]). Nowadays, NC is further developed by combining it with other theories, such as mean shift (MS) [3], [8]. However, the lack of straightforwardness and self-adaptability is a critical weakness of NC-based approaches. For example, in [11], it is tedious (not straightforward) to bipartition an image by recursively using the eigenvector corresponding to the second smallest eigenvalue. To overcome this shortcoming of the original NC, K-means is usually selected as a complement for NC [5], [6], [8]. Furthermore, because the result of K-means depends heavily on the initialization, the so-called multiclass spectral clustering algorithm [9] can be adopted. Nevertheless, there exists another problem: all approaches based on NC are not self-adaptable. For example, in [8], one should preassign the number of classes k to implement the subsequent NC after the MS segmentation and after computing the weight matrix W of all regions.

An important improvement of NC is proposed in [4], i.e., the graph-based relaxed clustering (GRC) algorithm. GRC offers two kinds of solution, namely, the quadratic programming (QP) formulation and the analytical solution (AS) based on matrix inversions. Both avoid the approximate iterative methods that have been adopted in many other graph-based approaches. That is, even for multiclass clustering problems, the whole task can be completed by computing only once in GRC, and it is immediately apparent from its solution [the clustering indicator (CI)] which cluster every data point belongs to. Therefore, GRC is straightforward. At the same time, different from many existing clustering approaches (e.g., K-means) [3], [5]-[7], GRC is self-adaptable, i.e., the number of clusters is not required to be set beforehand. However, GRC also suffers from some problems. First, because the time complexity of the matrix inversion operation contained in GRC is O(N^3), GRC is obviously impractical for large data sets. Second, the clustering result (i.e., the CI) of GRC is sensitive to the parameters of the adopted similarity measure and sometimes even unstable. This fact actually implies that we cannot easily determine appropriate parameters for the adopted similarity measure in GRC.

Our work here is primarily motivated by how to circumvent the aforementioned shortcomings of GRC. After an in-depth investigation of GRC, we find that it can be regarded as a special minimal enclosing ball (MEB) problem (i.e., a center-constrained MEB or CCMEB) by introducing certain constraints. Based on this finding, the CCMEB-based generalized core vector machines (GCVMs) [12] can be exploited. By using the core-set-based MEB approximation (CMEBA), one may greatly reduce the time complexity of GRC. Moreover, by introducing a new linear term to the objective function, we can also effectively address the problem that the clustering result of GRC


is sensitive to the parameters of the adopted similarity measure. Thus, a novel GRC algorithm called fast GRC (FGRC) is proposed. The proposed algorithm is distinguished in three aspects: 1) it is straightforward and self-adaptable, inherited from GRC; 2) for large data sets, its asymptotic time complexity is linear with the size N of the data set; and 3) compared with GRC, the robustness of the CI to the parameters of the adopted similarity measure is enhanced. Thus, FGRC is a fast and effective clustering algorithm for large data sets, and we confirm this in our experimental studies.

The rest of this paper is organized as follows. The related works are reviewed in Section II. The proposed FGRC algorithm and its implementation details are described in Section III. The experimental results are reported in Section IV. Some concluding remarks are given in the last section.

II. RELATED WORK

The graph-theoretic formulation is the basis of many spectral clustering approaches, where the set of points in an arbitrary feature space is represented as a weighted undirected graph G(V, E), with V denoting the vertices of the graph (the data points in the feature space), E denoting the edges formed between every pair of vertices, and the weight w(i, j) on each edge denoting the similarity between vertices i and j. We seek to partition the set of vertices into disjoint sets such that, by some measure, the similarities among the vertices in the same set are high and those across different sets are low. According to different measures, several graph-partition criteria have been proposed, e.g., ratio cut [13], minimum cut [14], average cut [15], and NC [10], [11], based on which some other researchers have presented further methods [2], [4]. An efficient improvement of NC has been proposed by Lee et al. [4], where a novel graph-based algorithm called GRC is derived by using the semidefinite programming relaxation trick. Next, we first review NC and GRC, respectively.

A. NC

The NC (Ncut) was proposed by Shi and Malik in [10] and [11]. Its main idea can be described as follows. Suppose a graph G(V, E) can be partitioned into two disjoint sets A and B, with A ∪ B = V and A ∩ B = ∅, by simply removing the edges connecting the two parts. The degree of dissimilarity between A and B can be computed as the sum of the weights of the removed edges. In graph-theoretic language, this is called the cut, defined as

  cut(A, B) = Σ_{u∈A, v∈B} w(u, v)    (1)

where w(u, v) is the similarity weight between nodes u and v. Different from minimizing the cut value as in minimum cut, NC computes the cut cost as a fraction of the total edge connections to all the nodes in the graph. It is a kind of disassociation measure and can be formulated as

  Ncut(A, B) = cut(A, B)/assoc(A, V) + cut(B, A)/assoc(B, V)    (2)

where assoc(A, V) = Σ_{u∈A, t∈V} w(u, t) denotes the total connection from nodes in A to all nodes in the graph, and assoc(B, V) is defined similarly. NC ensures the largest difference among the subsets and also avoids cutting small sets of isolated nodes in the graph, which are favored by minimum cut.

Let x be an N = |V|-dimensional indicator vector, where x_i = 1 if node i is in A and −1 otherwise. Let W be the N × N affinity (similarity) matrix of G(V, E), defined by a certain similarity function, e.g.,

  w(i, j) = exp( −||x_i − x_j||^2 / (2σ^2) ),   σ > 0.    (3)

Let D = diag(d_1, d_2, ..., d_N), where d_i = Σ_j w(i, j) denotes the connection from node i to all other nodes. Now, we can rewrite Ncut(A, B) as

  Ncut(A, B) = ( Σ_{x_i>0, x_j<0} −w_ij x_i x_j ) / ( Σ_{x_i>0} d_i ) + ( Σ_{x_i<0, x_j>0} −w_ij x_i x_j ) / ( Σ_{x_i<0} d_i ).

Given ε > 0, a ball B(c, (1 + ε)r) is an (1 + ε)-approximation of MEB(S) if r ≤ r_MEB(S) and S ⊂ B(c, (1 + ε)r). For the MEB problem, it has been found that solving it on a subset Q of data points from S, called the core-set, can often give an accurate and efficient approximation. More formally, a subset Q ⊂ S is a core-set of S if an expansion by a factor (1 + ε) of its MEB contains S, i.e., S ⊂ B(c, (1 + ε)r), where B(c, r) = MEB(Q). A breakthrough in achieving such an (1 + ε)-approximation was first obtained in [28] and further explored in [12], [20]-[22], and [29]-[31]. These works use a simple iterative scheme: at the t-th iteration, the current estimate B(c_t, r_t) is expanded by including the farthest point outside the (1 + ε)-ball B(c_t, (1 + ε)r_t). This is repeated until all of the points of S are covered by the (1 + ε)-ball.
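To make the core-set idea concrete, the following is a minimal Euclidean sketch of such an (1 + ε)-approximation loop (a plain Badoiu-Clarkson-style iteration; the function name and the simple center-update rule are illustrative assumptions, whereas GCVM and the algorithm proposed below work with a kernel-induced, center-constrained MEB instead):

import numpy as np

def approx_meb(points, n_iter=200):
    """Badoiu-Clarkson-style sketch of a (1 + eps)-approximate minimal enclosing ball.
    After on the order of 1/eps**2 iterations, B(c, r) encloses all points with a
    radius close to the optimum.  Plain Euclidean case only (illustrative)."""
    c = points[0].astype(float).copy()            # start the center at an arbitrary point
    core = [0]                                    # indices of the growing core-set
    for t in range(1, n_iter + 1):
        d = np.linalg.norm(points - c, axis=1)
        far = int(np.argmax(d))                   # farthest point from the current center
        core.append(far)
        c += (points[far] - c) / (t + 1)          # pull the center toward that point
    r = np.linalg.norm(points - c, axis=1).max()  # radius that encloses every point
    return c, r, sorted(set(core))

# Example: c, r, core = approx_meb(np.random.default_rng(0).random((10_000, 2)))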

III. THE PROPOSED CGRC AND FGRC ALGORITHMS

In general, GRC is a spectral clustering algorithm with high performance. Its major advantages include the following: 1) it is suitable for discovering clusters with arbitrary shapes; 2) it is straightforward, in that the whole clustering task can be solved by computing only once; and 3) it is self-adaptable, in that the number of clusters does not need to be predetermined before clustering. However, we also find several shortcomings of GRC in practice. First, as mentioned in the introductory section, both the QP formulation and the AS of GRC have time complexity no less than O(N^3). For large data sets, this severely weakens their usefulness. Second, its clustering result, i.e., the CI, is sensitive and even unstable with respect to the parameters of the adopted similarity function [e.g., σ in (3)], which makes it hard to estimate appropriate values for those parameters. In this section, we aim at coping with these disadvantages of GRC and present a novel algorithm, FGRC. The new algorithm is derived in two steps: by slightly changing the constraint and by introducing a linear term, we first develop a variant of GRC called constrained GRC (CGRC); then, based on CGRC and the CMEBA strategy, we propose FGRC.

A. CGRC

As is well known, various spectral clustering algorithms generally use the CI y = {y_1, ..., y_N} to indicate which cluster each data point belongs to. However, the range of the CI may differ between algorithms; e.g., y_i = 1 or −1 in normalized cuts, whereas y_i is relaxed to an arbitrary real number in GRC. In our study, for the purpose of speeding up the algorithm, we specify y ≥ 0. Additionally, we set ζ = 1 in (10); thus, we get

  max_y   −y'Ly
  s.t.    y'1 = 1,   y ≥ 0    (23)

where 1 = [1, ..., 1]'. In GRC, a good CI should make the variance of the y_i belonging to the same cluster as small as possible [see (24)] and the distances among the different cluster centers as large as possible [see (25)]. Assuming that the number of clusters in the data set S is c, i.e., S = ∪_{i=1}^{c} S_i, and letting the size of every cluster be n_i, i = 1, 2, ..., c, we can define the two clustering criteria, respectively, as

  min ( (1/c) Σ_{i=1}^{c} Σ_{y_j ∈ S_i} ||y_j − ȳ_i||^2 ),   with  ȳ_i = (1/n_i) Σ_{y_j ∈ S_i} y_j    (24)

  min ( − ( Σ_{i=1}^{c} Σ_{k=i+1}^{c} ||ȳ_i − ȳ_k||^2 ) / ( c(c − 1)/2 ) ).    (25)

Let

  J_y = (1/c) Σ_{i=1}^{c} Σ_{y_j ∈ S_i} ||y_j − ȳ_i||^2  −  ( Σ_{i=1}^{c} Σ_{j=i+1}^{c} ||ȳ_i − ȳ_j||^2 ) / ( c(c − 1)/2 ).    (26)
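As a small illustration of how (26) can be evaluated for a given CI, the sketch below assumes the clusters have already been identified as index groups; the function name, this grouping representation, and the sign convention follow the reconstruction of (26) above and are only illustrative:

import numpy as np

def clustering_quality_Jy(y, groups):
    """J_y of (26): average within-cluster spread of the CI values minus the
    average squared distance between cluster means of the CI.
    `y` is the clustering indicator; `groups` is a list of index arrays, one per cluster."""
    y = np.asarray(y, dtype=float)
    c = len(groups)
    means = np.array([y[g].mean() for g in groups])
    within = sum(((y[g] - means[i]) ** 2).sum() for i, g in enumerate(groups)) / c
    between = sum((means[i] - means[j]) ** 2
                  for i in range(c) for j in range(i + 1, c)) / (c * (c - 1) / 2)
    return within - between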

Then, the optimal CI should be the one that minimizes J_y under the optimal parameters of the adopted similarity function. However, in practice, we notice that the relationship curve between J_y and these parameters often appears unstable (this phenomenon is shown in the experimental section below), which prevents us from easily selecting appropriate parameters in GRC. In this paper, we refer to [1] to address this issue. In [1], in order to control the difference between the total weights in different parts of the graph G, Higham and Kibble introduced the extra constraint

  |y' diag(D)| ≤ β    (27)

where β is a balancing threshold. Similarly, y'diag(D) is added to the objective function in (23) as a linear term, and the absolute-value operator is omitted because y ≥ 0. Thus, we obtain

  max_y   μ y'diag(D) − y'Ly
  s.t.    y'1 = 1,   y ≥ 0    (28)

where μ is the coefficient of the linear term. The linear term is introduced to improve the clustering performance, but the clustering work relies primarily on the quadratic term; hence, μ usually takes a small positive real value in this paper.

In summary, we obtain (23) by specifying y ≥ 0 and by setting ζ = 1. Then, by introducing the linear term y'diag(D) into (23), we obtain (28). As variants of GRC, both (23) and (28) are called CGRC and can be denoted as CGRC-I and CGRC-II, respectively. In particular, because CGRC-II inherits all of the properties of CGRC-I and performs better than CGRC-I, we prefer CGRC-II over CGRC-I and refer to CGRC-II simply as CGRC [i.e., (28)] hereafter, unless stated otherwise. Since it originates from GRC, CGRC inherits all of its properties, e.g., straightforwardness, self-adaptability, polynomial time complexity O(N^3), etc. Furthermore, the linear term in CGRC enhances the robustness of the CI of GRC to the parameters of the adopted similarity function. We verify these claims in Section IV.
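For illustration, (28) can be handed directly to a general-purpose solver. The sketch below builds W from (3), forms L = D − W, and solves the resulting QP with SciPy's SLSQP routine; the function name and the solver choice are our assumptions and not the authors' implementation (which, like GRC, still costs O(N^3) and is replaced by FGRC below):

import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def cgrc_indicator(X, sigma, mu=1e-13):
    """Dense sketch of CGRC (28): max  mu*y'diag(D) - y'Ly  s.t.  y'1 = 1, y >= 0."""
    W = np.exp(-cdist(X, X, 'sqeuclidean') / (2.0 * sigma ** 2))   # similarity matrix, Eq. (3)
    D = np.diag(W.sum(axis=1))                                     # degree matrix
    L = D - W                                                      # graph Laplacian
    d = np.diag(D)
    n = len(X)

    def neg_objective(y):              # negative of (28), to be minimized
        return y @ L @ y - mu * (y @ d)

    def neg_objective_grad(y):
        return 2.0 * (L @ y) - mu * d

    cons = ({'type': 'eq', 'fun': lambda y: y.sum() - 1.0},)       # y'1 = 1
    res = minimize(neg_objective, np.full(n, 1.0 / n), jac=neg_objective_grad,
                   bounds=[(0.0, None)] * n, constraints=cons, method='SLSQP')
    return res.x                       # the clustering indicator y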

B. Relationship Between CGRC and CCMEB

In this section, we reveal the relationship between CGRC and CCMEB. Two lemmas are used (see [4], [32], and [33]).

Lemma 1: The matrix M + αI is positive definite if α > 0 and M is a positive semidefinite matrix.

Lemma 2: A kernel matrix is symmetric and positive definite. Conversely, any symmetric and positive definite matrix can be considered as a kernel matrix. In other words, symmetry and positive definiteness are the necessary and sufficient conditions for a kernel matrix.

The Laplacian matrix L is positive semidefinite [4]. Let L̃ = L + αI, α > 0; then, L̃ is positive definite and can be taken as a kernel matrix. We reformulate (28) as

  max_y   y'( diag(L̃) + Δ − η1 ) − y'L̃y
  s.t.    y'1 = 1,   y ≥ 0    (29)

where Δ = −diag(L̃) + η1 + μ diag(D) and η is large enough that Δ ≥ 0.

Comparing (29) with (21), if we let K = L̃ and α = y, we can easily see that they have the same form. Thus, we reach a significant conclusion: CGRC can be taken as a special MEB problem, i.e., a CCMEB problem. Similar to GCVM, CMEBA can therefore be used to speed up CGRC on large data sets. In this paper, we call the CGRC algorithm sped up by CMEBA on large data sets the fast GRC (FGRC) algorithm:

Inputs: a data set S, the approximation parameter ε, the coefficient μ of the linear term, etc.
Outputs: the core-set Q and the clustering indicator y_N.

Iterative procedure of CMEBA:
Step 1: Initialize Q_0, c_0, and r_0. Set the iteration number t = 0.
Step 2: If there is no data point z in the extended feature space such that z falls outside the (1 + ε)-ball B(c_t, (1 + ε)r_t), go to Step 6.
Step 3: Find z* as the data point that falls outside B(c_t, (1 + ε)r_t) and is nearest to c_t in the extended feature space. Set Q_{t+1} = Q_t ∪ {z*}.
Step 4: Find the new CCMEB, i.e., CCMEB(Q_{t+1}), in the extended feature space, and then set c_{t+1} = c_{CCMEB(Q_{t+1})} and r_{t+1} = r_{CCMEB(Q_{t+1})}.
Step 5: According to the distribution of the data points in Q_{t+1}, adjust ε if necessary; increment t by 1 and go to Step 2.
Step 6: Terminate the iterative procedure; return the reference indicator y** = y_t and the core-set Q = Q_t.

KNN procedure: according to y**, classify the data points in S − Q using the KNN method; return the clustering indicator y_N.
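The final KNN step is standard supervised classification. The following sketch, with illustrative names only, propagates the core-set labels (obtained from y**, e.g., by hand or by K-means/FCM, as noted in the details below) to the remaining points in S − Q:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def label_remaining_points(S, core_idx, core_labels, k=5):
    """KNN step of FGRC: propagate core-set cluster labels to the points in S - Q."""
    core_idx = np.asarray(core_idx)
    core_labels = np.asarray(core_labels)
    rest = np.setdiff1d(np.arange(len(S)), core_idx)   # indices of S - Q
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(S[core_idx], core_labels)                  # the core-set acts as the training set
    y_all = np.empty(len(S), dtype=core_labels.dtype)
    y_all[core_idx] = core_labels
    y_all[rest] = knn.predict(S[rest])                 # classify S - Q
    return y_all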


1) Some Details About FGRC:

1) Step 1 of CMEBA: we select γ (γ ≪ N) data points from S = {x_1, x_2, ..., x_N} to form Q_0, find CCMEB(Q_0) from (18), and obtain c_0 and r_0 using (20).

2) Step 3 of CMEBA: as required by the clustering task, FGRC chooses the data point z* that falls outside B(c_t, (1 + ε)r_t) and is nearest to c_t in the extended feature space. As the CCMEB expands, we hope that the data points added to the core-set come from the different clusters as evenly as possible. To achieve this goal, we have investigated the distance between any point z_ℓ = φ(x_ℓ) falling outside the ball and the center c_t and have found the following. According to (22), the squared distance between z_ℓ and c_t can be decomposed as

  d^2 = d1 + d2 + d3 + d4, where
  d1 = Σ_{i,j=1}^{|Q_t|} y_i y_j L̃(i, j),
  d2 = −2 Σ_{i=1}^{|Q_t|} y_i L̃(i, ℓ),
  d3 = L̃(ℓ, ℓ) = diag(L̃)_ℓ,
  d4 = δ_ℓ^2 = −diag(L̃)_ℓ + η + μ diag(D)_ℓ.    (30)

Because L̃(i, ℓ) ≤ 0 for i ≠ ℓ, d2 is directly proportional to Σ_i y_i |L̃(i, ℓ)|. Apparently, the more dissimilar z_ℓ is from Q_t, the smaller |L̃(i, ℓ)| is, the smaller Σ_i y_i |L̃(i, ℓ)| is, and hence the smaller d2 is. Therefore, we can select the data point whose d2 is the smallest as the candidate z*. Moreover, because μ takes a small value in our study, d3 + d4 = η + μ diag(D)_ℓ ≈ η, and d1 is constant at the t-th iteration.
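A sketch of this Step-3 selection rule, under the assumption that the full regularized Laplacian L̃ is available as a dense array and that y holds the current coefficients of the core-set points (names are illustrative):

import numpy as np

def pick_next_core_point(L_tilde, y, core_idx, outside_idx):
    """Step 3 of CMEBA in FGRC: among the points outside the (1+eps)-ball,
    pick the one whose d2 term in (30) is smallest, i.e., (approximately)
    the point nearest to the current center c_t."""
    core_idx = np.asarray(core_idx)
    outside_idx = np.asarray(outside_idx)
    # d2(l) = -2 * sum_i y_i * L~(i, l) for each candidate l outside the ball
    d2 = -2.0 * (y[core_idx] @ L_tilde[np.ix_(core_idx, outside_idx)])
    return int(outside_idx[np.argmin(d2)])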

With all this taken into account, min_ℓ d2 can be chosen as an approximate criterion for determining the candidate z*, i.e., the data point nearest to c_t.

3) Step 5 of CMEBA: different from GCVM, where the data point farthest from c_t is chosen, the nearest data point is selected in our algorithm. This works against the fast convergence of CMEBA. To overcome this problem, the unfixed-precision strategy for ε [22] can be used here. At first, ε is assigned a very small value, which means that CMEBA approximates the optimal solution with very high precision. As the CCMEB expands, we increase ε at an appropriate time, thereby obtaining both high accuracy and fast speed. But when is the appropriate time to adjust ε? We mainly consider two aspects. a) The distribution of the different clusters in the core-set: in Step 3, assume that the number of data points falling outside the ball is n and that we have computed the distances DS_t = {dst_1, dst_2, ..., dst_n} of these n data points from c_t at the t-th iteration. Let diff_ij = |dst_i − dst_j|, i, j = 1, ..., n. As max(diff_ij) gradually decreases, the distribution of the clusters in the core-set becomes more and more uniform.


In our study, when max(diff_ij) is smaller than a threshold, we reassign ε to a suitably large value to speed up the convergence of CMEBA. b) A reasonable size of the core-set for various practical clustering applications: in terms of our experimental experience, the size of the core-set should not be smaller than 200.

4) Step 6 of CMEBA: y_t, obtained after CMEBA, is called the reference indicator and is denoted by y**. It can easily be visualized to determine how many clusters exist in the core-set, and then every cluster can be labeled by hand or by utilizing the classical K-means or fuzzy C-means (FCM) algorithm.

5) The K-nearest-neighbor (KNN) procedure: because only a few data points are added to the core-set during the CMEBA procedure, an auxiliary approach must be adopted to classify the data points in S − Q. Owing to its linear time complexity, the KNN algorithm is adopted in our study.

6) Similar to other spectral algorithms, FGRC and CGRC need to construct the similarity/affinity matrix W. In this paper, we take (3) as the default similarity measure.

2) Parameter Setting: Obviously, both σ in (3) and the coefficient μ of the linear term in (28) are involved in CGRC. Besides the same parameters σ and μ as in CGRC, the approximation precision ε of CCMEB and the parameter K of KNN must also be considered in FGRC. In the following, we discuss how to select appropriate parameters for CGRC and FGRC.

We should point out that the parameter σ in (3) is very important to CGRC and FGRC because the adopted similarity measure directly influences the clustering result and σ is generally data dependent. In our work, the "CCA + empirical fine-tuning by making J_y as small as possible" strategy is adopted to tune σ, where CCA is the correlation comparison algorithm proposed by Yang and Wu in [34]. CCA can be summarized as follows.

CCA:
Step 1: Set m = 1 and the correlation threshold τ = 0.97.
Step 2: Calculate the correlation between the values of J̃_S(x_k)_m and J̃_S(x_k)_{m+1}.
Step 3: IF the correlation is greater than or equal to the specified τ, THEN choose the current m as the estimate; ELSE set m = m + 1 and GO TO Step 2.

The adopted similarity measure in [34] can be represented as

  J̃_S(x_k)_m = Σ_{j=1}^{N} exp( −||x_j − x_k||^2 / (α/(5m)) ),   k = 1, ..., N;  m = 1, 2, 3, ...    (31)

where α denotes the sample variance

  α = (1/N) Σ_{j=1}^{N} ||x_j − x̄||^2,   x̄ = (1/N) Σ_{j=1}^{N} x_j.    (32)

In [34], the CCA algorithm is used to estimate the power parameter m in (31).
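The sketch below implements this CCA loop as reconstructed in (31)-(32) and converts the chosen m into σ through the relation α/(5m) = 2σ^2 that is used in the next paragraph; the function name and the m_max cap are our additions:

import numpy as np

def estimate_sigma_by_cca(X, tau=0.97, m_max=50):
    """Rough sketch of the CCA procedure (31)-(32) for picking the power m,
    and hence sigma via alpha/(5m) = 2*sigma**2."""
    n = len(X)
    alpha = np.sum(np.linalg.norm(X - X.mean(axis=0), axis=1) ** 2) / n   # sample variance (32)
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)            # pairwise squared distances

    def J(m):                                    # total similarity J~_S(x_k)_m in (31)
        return np.exp(-sq / (alpha / (5.0 * m))).sum(axis=1)

    m = 1
    while m < m_max:
        corr = np.corrcoef(J(m), J(m + 1))[0, 1]
        if corr >= tau:                          # curves for m and m+1 are highly correlated
            break
        m += 1
    sigma = np.sqrt(alpha / (5.0 * m) / 2.0)     # from alpha/(5m) = 2*sigma^2
    return m, sigma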


Comparing (31) with (3) and letting α/(5m) = 2σ^2, the CCA algorithm can also be employed to estimate the parameter σ in our work. In terms of our experimental experience, the order of magnitude of σ can be appropriately estimated using CCA. After that, with several empirical fine-tunings aimed at making J_y as small as possible, we can comparatively easily determine an appropriate σ.

In general, in CGRC and FGRC, the larger the coefficient μ of the linear term is, the better the robustness of the CI to the parameters of the adopted similarity measure becomes, but the larger the computational burden becomes as well. Our experimental experience suggests that it is appropriate to set μ = 1E−13 ∼ 1E−14 for most normalized data sets. The relationship between the parameter μ and the clustering performance of CGRC and FGRC is discussed in depth in the experimental section.

As for the approximation precision ε of CCMEB, it has been investigated in depth in [22] and [31]. As revealed in [31], ε is appropriately set to 1E−6. Moreover, since ε changes dynamically during the execution of FGRC, it may be initialized with a value smaller than 1E−6. The KNN procedure embedded in FGRC aims to classify the data samples beyond the core-set according to the reference indicator y**. Therefore, it is a supervised learning procedure, and accordingly, cross-validation can be employed to determine an appropriate K.

C. Time Complexity of FGRC

In essence, the proposed FGRC algorithm is a special case of the CMEBA algorithms. Therefore, in general, the conclusions about the CMEBA algorithms also hold true for FGRC. However, because FGRC selects the data point nearest to c_t at the t-th iteration, the total number of iterations τ may be larger than 2/ε (the maximal number of iterations in CVM and GCVM), although it can still be bounded by O(1/ε^2) [18], [19], [22]. Hence, by an analysis similar to that in [12] and [22], we conclude that the asymptotic time complexity of the CMEBA procedure of FGRC is O(N/ε^4 + 1/ε^8).

Furthermore, as mentioned previously, in the final stage of FGRC, we need to classify the data points in S − Q via the reference indicator y** and KNN, so it is necessary to estimate the time cost of this procedure. Let τ = O(1/ε^2), and assume |Q_0| = γ; since only one data point is added at each iteration, |Q| = τ + γ and |S − Q| = N − τ − γ. The asymptotic time complexity of KNN is

  O((N − τ − γ)(τ + γ)) = O(Nτ − τ^2) = O(N/ε^2 − 1/ε^4).    (33)

Therefore, the total asymptotic time complexity of FGRC is O((1/ε^4 + 1/ε^2)N + 1/ε^8). Obviously, it is still linear with the data size N.

At Step 2 of the CMEBA procedure of FGRC, computing (30) for all data points falling outside the ball becomes very expensive when N is large. An approximate method called probabilistic speedup [20], [22] can be used here.

Fig. 1. Three small synthetic data sets DS1-DS3.

Its idea is to randomly sample a sufficiently large subset V_t from S − Q_t and then to take the data point in V_t that is nearest to c_t as the approximate nearest point over S − Q_t. The effectiveness of this approximate method has been confirmed in [12] and [22]. When the probabilistic speedup is used, the time complexity of the CMEBA procedure of FGRC decreases to O(1/ε^7) [22], independent of the data size N. However, in our study, another approximation strategy is used, i.e., the local nearest data point among

  β|S − Q_t|    (34)

candidates is taken as the approximate nearest point over S − Q_t, where β is a constant coefficient. Now, the time complexity of Step 2 of CMEBA is O((t + γ)^2 + (t + γ)(N − t − γ)β). As finding the new CCMEB in Step 4 takes O((t + γ)^3) = O(t^3), the t-th iteration takes O(tβN + t^3) time in total. The overall time for τ = O(1/ε^2) iterations is Σ_{t=1}^{τ} O(tβN + t^3) = O(τ^2 βN + τ^4) = O(βN/ε^4 + 1/ε^8). Due to the asymptotically linear time complexity of KNN, the total asymptotic time complexity of FGRC is still linear.

IV. EXPERIMENTAL STUDIES

The performance of CGRC and FGRC is reported in this section. It is necessary for us to study CGRC experimentally first: we want to observe whether CGRC loses any of the original performance of GRC after slightly changing its constraint and introducing the linear term. At the same time, as a theoretical basis, the effectiveness of CGRC is a necessary condition for FGRC. We observe the clustering performance of CGRC on small data sets and the time complexity of FGRC on several large data sets, and we verify the relationship between its time cost and the data size via several large data sets. We carried out our experiments using both synthetic and real data sets in the following experimental environment: Intel i5 2.4-GHz CPU, 3-GB DDR RAM, Windows XP SP2, and MATLAB 7.1.

A. Experiments on CGRC

First, we verify whether CGRC still keeps the effectiveness, self-adaptability, and straightforwardness of GRC in clustering the adopted data sets. Moreover, we conduct a comparative study among CGRC, the K-means-based NC algorithm [5], and the FCM algorithm. Three performance indices, i.e., the clustering time, the adjusted Rand index (ARI), and the normalized mutual information (NMI), are used in evaluating the performance of these algorithms in this section.

Fig. 2. CI of four algorithms on DS1-DS3.

The clustering time is measured in seconds. Given a data set S of N data points, suppose that U = {u_1, u_2, ..., u_R} and V = {v_1, v_2, ..., v_C} represent two partitions of these data points (i.e., U consists of the cluster labels and V contains the real class labels) such that ∪_{i=1}^{R} u_i = S = ∪_{j=1}^{C} v_j and u_i ∩ u_{i'} = ∅ = v_j ∩ v_{j'} for 1 ≤ i ≠ i' ≤ R and 1 ≤ j ≠ j' ≤ C. Then ARI and NMI are, respectively, defined as

  ARI = [ Σ_{ij} C(N_ij, 2) − ( Σ_i C(N_i, 2) Σ_j C(N_j, 2) ) / C(N, 2) ]
        / [ (1/2) ( Σ_i C(N_i, 2) + Σ_j C(N_j, 2) ) − ( Σ_i C(N_i, 2) Σ_j C(N_j, 2) ) / C(N, 2) ]    (35)

  NMI = Σ_{i=1}^{R} Σ_{j=1}^{C} N_ij log( N·N_ij / (N_i·N_j) )
        / sqrt( ( Σ_{i=1}^{R} N_i log(N_i/N) ) ( Σ_{j=1}^{C} N_j log(N_j/N) ) )    (36)

where C(N, 2) = N(N − 1)/2, N_ij is the number of agreements between cluster u_i and class v_j, N_i is the number of data points in cluster u_i, and N_j is the number of data points in class v_j. Both ARI and NMI take values in the interval [0, 1]; the higher the value, the better the corresponding clustering performance. Second, we observe the influence of the linear term in CGRC on its clustering result, with (26) adopted as the performance index.

TABLE I. CLUSTERING TIME (SECONDS) OF FCM, NC, GRC, AND CGRC ON DS1-DS3
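In practice these two indices need not be coded by hand; for example, scikit-learn ships both (the geometric-mean normalization below matches the square root in (36)), and the toy labels are only for illustration:

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

labels_true = [0, 0, 0, 1, 1, 2, 2, 2]   # class labels V (toy example)
labels_pred = [0, 0, 1, 1, 1, 2, 2, 0]   # cluster labels U derived from a CI (toy example)

ari = adjusted_rand_score(labels_true, labels_pred)                              # Eq. (35)
nmi = normalized_mutual_info_score(labels_true, labels_pred,
                                   average_method='geometric')                   # Eq. (36)
print(ari, nmi)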

1) On Synthetic Data Sets: Here, we generated the synthetic data sets DS1-DS3 with sizes of 1000, 1690, and 2548, respectively, as shown in Fig. 1. Because they contain more than two clusters, DS1 and DS2 are used to verify the self-adaptability and straightforwardness of CGRC; DS2 and DS3 are used to verify whether CGRC can effectively cope with nonconvex cluster shapes. Gaussian noise was added to every original data set to generate a corresponding noisy data set. In our experiments, the added Gaussian noise in each feature has zero mean and a standard deviation equal to γσ, where σ is the maximal value of the standard deviations of the different features in the original data set and γ = 0.2, 0.12, and 0.05 for DS1-DS3, respectively.
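A small sketch of this noise-injection step (the function name is ours; γ corresponds to the values 0.2, 0.12, and 0.05 above):

import numpy as np

def add_gaussian_noise(X, gamma, seed=0):
    """Zero-mean Gaussian noise whose standard deviation is gamma times the
    largest per-feature standard deviation of the original data, as for DS1-DS3."""
    rng = np.random.default_rng(seed)
    sigma_max = X.std(axis=0).max()
    return X + rng.normal(0.0, gamma * sigma_max, size=X.shape)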


TABLE II. BEST CLUSTERING RESULTS FOR DS1-DS3 BY ARI AND NMI

TABLE III. MAIN PARAMETER SETTINGS OF FOUR ALGORITHMS ON DS1-DS3

We normalized all of the data points in DS1-DS3 before carrying out our experiments. The CIs of FCM, NC, GRC, and CGRC on DS1-DS3 are shown in Fig. 2, and the corresponding clustering times and best clustering results by ARI and NMI, with the means and standard deviations obtained after executing every algorithm 20 times, are listed in Tables I and II, respectively. The main parameters and their settings used in these four algorithms are summarized in Table III. Our analysis can be stated as follows.

As shown in Table I, FCM is usually faster than the others on the adopted small data sets. However, it fails on nonconvex clusters, e.g., DS2 and DS3. Moreover, it requires the number of clusters to be set beforehand, so FCM is not self-adaptable. Since both the computation of the eigenvalue system and the matrix inversion operation are involved in NC, NC has a comparatively heavy computational burden. In addition, the embedded K-means also requires us to preset the number of clusters.

GRC and CGRC both perform well on DS1-DS3. However, their clustering times are both longer than that of FCM as

a result of the corresponding QP calculation. In general, the clustering time of CGRC is slightly longer than that of GRC because of the influence of the linear term in (28). Thus, we have demonstrated that CGRC, originating from GRC, is still straightforward and self-adaptable and can efficiently cope with arbitrarily shaped clusters in a data set.

Next, let us observe how the introduced linear term y'diag(D) improves the clustering performance of CGRC. As stated in Section III, with or without the linear term, we have developed two variants of GRC, namely, CGRC-I and CGRC-II, respectively. Using DS1 and DS2 as the experimental data sets, adopting J_y in the form of (26) as the performance index, and choosing a range for the parameter σ in (3), we executed GRC, CGRC-I, and CGRC-II, respectively, and recorded the values of J_y for different values of σ. In practice, the ranges of σ on DS1 and DS2 were set to 0-0.02 and 0-0.01, respectively. Every range of σ was equally divided into 30 subintervals, and each data set was evenly split into two subsets. After that, the mean of J_y on the two subsets was treated as the final J_y for the current σ. The relationship curves generated by these three algorithms on DS1 and DS2 are shown in Fig. 3. Please note that J_y was assigned infinity, which corresponds to a discontinuous point in Fig. 3, whenever the CI could not effectively present all clusters in our experiments, e.g., when two or more cluster centers overlap. The main parameters of GRC, CGRC-I, and CGRC-II and their settings are listed in Table IV.

1) As shown in Fig. 3, the clustering result (CI) of GRC is sensitive to the parameter σ of the adopted similarity measure, and it sometimes appears unstable. CGRC-I inherits this sensitivity from GRC, and it becomes even worse after specifying y ≥ 0 and setting ζ = 1 (see CGRC-I on DS2). However, the performance curve of CGRC-II, with the linear term, is comparatively smooth, i.e., the robustness of the CI of CGRC-II to the parameters of the adopted similarity measure is notably enhanced. This fact reveals that, with CGRC-II, we are able to estimate the appropriate parameter σ more reliably by using the "CCA + fine-tuning by making J_y as small as possible" strategy. This is the reason why we introduce the linear term into CGRC-II.

2) As demonstrated in the aforementioned experiment, the linear term indeed enhances the robustness of the CI.


TABLE V. CLUSTERING TIME (SECONDS) OF CGRC ON DS1, WITH μ = 2E−13, 1E−14, AND 1E−15

TABLE VI. CLUSTERING TIME (SECONDS) OF CGRC ON DS3, WITH μ = 1E−13, 1E−14, AND 1E−15

Fig. 3. Influence of the linear term in CGRC on the clustering results. (a) On DS1. (b) On DS2.
TABLE IV. MAIN PARAMETER SETTINGS FOR GRC, CGRC-I, AND CGRC-II ON DS1 AND DS2

Fig. 5. Small real data sets. (a) DS4. (b) DS5.
TABLE VII. CLUSTERING TIME (SECONDS) OF FCM, NC, GRC, AND CGRC ON DS4-DS5

Fig. 4. Influence of the coefficient μ of the linear term. (a) On DS1. (b) On DS3.

However, we must consider how to choose an appropriate coefficient μ for the linear term. After setting three different values with different orders of magnitude for the parameter μ (for DS1, μ = 2E−13, 1E−14, and 1E−15, respectively; for DS3, μ = 1E−13, 1E−14, and 1E−15, respectively) and assigning the range of the parameter σ as 0-0.02, we executed CGRC (i.e., CGRC-II) on DS1 and DS3, respectively. The relationship curves between J_y and the parameter σ are shown in Fig. 4, and the corresponding clustering times on DS1 and DS3, with the means and standard deviations obtained after executing every algorithm ten times, are listed in Tables V and VI, respectively. By observing Fig. 4 and Tables V and VI, we can easily find that the larger μ is, the better the robustness, but the longer the clustering time. In order to make a tradeoff between them, our experimental experience suggests that it is appropriate to set μ = 1E−13 ∼ 1E−14 for most normalized data sets.


TABLE VIII. BEST CLUSTERING RESULTS WITH ARI AND NMI ON DS4

Fig. 6. CIs of four algorithms on DS4.

Fig. 7. Segmentation results of four algorithms on DS5.

2) On Real Data Sets: Here, we take two real data sets, DS4 and DS5. DS4 was taken from the optical recognition of handwritten digits data set of the UCI repository [35]. As is well known, it is difficult to distinguish 1 from 7 or 2 from 5, so we selected 60 records for each of the digits 1, 2, 5, and 7 to form DS4, as shown in Fig. 5(a). The dimension of DS4 is 64. DS5 is essentially a human face image from the Olivetti Research Laboratory, as shown in Fig. 5(b), which corresponds to a 39 × 32 gray-level matrix. We extracted three features (abscissa, ordinate, and gray level) from this face image to form DS5 with a data size of 1248.

We also executed FCM, NC (K-means based), GRC, and CGRC on DS4 and DS5, respectively. The clustering times of FCM, NC, GRC, and CGRC on DS4 and DS5, with the means and standard deviations obtained after executing every algorithm 20 times, are listed in Table VII. Table VIII lists

the clustering performance indices ARI and NMI of the four algorithms on DS4, with the means and standard deviations obtained after executing every algorithm 20 times. Fig. 6 shows the CIs of these four algorithms on DS4. The experimental results demonstrate that, although FCM runs quickly, it is invalid for DS4 because it cannot partition the data set into the desired four clusters. Both GRC and CGRC work successfully, with self-adaptability, straightforwardness, and comparatively short clustering times. Although it obtains a good clustering result, NC is the most time-consuming algorithm because of the heavy computational burden of the eigenvalue system and the matrix inversion operation.

The computational costs of FCM, NC, GRC, and CGRC are shown in Table VII. The CIs and the final segmentation results obtained by them are shown in Figs. 6 and 7, respectively. Judging from the outcomes of these algorithms, they all work


TABLE IX. MAIN PARAMETER SETTINGS FOR FCM, NC, GRC, AND CGRC ON DS5


TABLE X. CLUSTERING TIME OF FOUR ALGORITHMS WITH VARIOUS SAMPLED SIZES (IN SECONDS)

successfully on this data set, but unlike FCM and NC, CGRC and GRC are self-adaptable and straightforward, and the number of clusters is not required to be set beforehand. The main parameters of GRC, CGRC-I, and CGRC-II on DS5 and their settings are summarized in Table IX.

Analysis: from the test outcomes on DS4 and DS5, we can conclude that CGRC is still effective and straightforward for small real data sets.

B. Experiments on FGRC

In this section, we aim to validate the performance of FGRC from two aspects: 1) its effectiveness for large data sets and 2) the relationship between its time complexity and the data size. Our experiments are again carried out on synthetic and real data sets, respectively.

1) On a Large Synthetic Data Set: Here, we prepared the data set DS6 with a data size of 52 520, as shown in Fig. 9(a). Besides FGRC, we also ran CGRC, GRC, and K-means-based NC here. We randomly sampled DS6 at different data sizes to produce the data sets for these four algorithms. We executed every algorithm ten times at each data size and list the obtained clustering times, with their means and standard deviations, in Table X. Moreover, the relationships between clustering time and data size are shown in Fig. 8 for NC, GRC, CGRC, and FGRC, respectively. The main parameters of these four algorithms on DS6 and their settings are listed in Table XI.

In this experiment, for FGRC, the precision ε was initialized to 1E−13, and Q_0 consists of 50 random data points. We reassign ε to 1E−2 to dramatically speed up the convergence of FGRC once max(diff_ij) < 0.003 has occurred more than 20 times cumulatively. The clustering results of FGRC with sampled size 52 520 are shown in Fig. 9: the data set DS6 is shown in Fig. 9(a), the final core-set (|Q| = 327) is shown in Fig. 9(b), the reference indicator y** is shown in Fig. 9(c), and the data points in S − Q classified by KNN are shown in Fig. 9(d). The total time of FGRC is about 213.96 s.

Analysis: as revealed in Table X and Fig. 8, with the rapid increase in sample size, the efficiency and effectiveness of FGRC become more and more obvious.

Fig. 8. Relationships between CPU time and data sizes with respect to FGRC, GRC, CGRC, and NC.
TABLE XI. MAIN PARAMETER SETTINGS OF FOUR ALGORITHMS ON DS6


Fig. 9. Clustering results of FGRC with the sampled size 52 520.

Fig. 10. DS8.

Fig. 8 also shows intuitively that the asymptotic time complexity of FGRC is linear with the sample size N. Note here that, because of their heavy computational burden and/or space requirements, NC, GRC, and CGRC are all impractical for large data sets.

2) On Large Real Data Sets: As is well known, clustering real data sets is more difficult than clustering synthetic ones. In this section, we use real data sets to test FGRC. We utilize two large data sets, DS7 and DS8. DS7 originates from the KDD 1999 Network Intrusion Detector Learning database [35], [36]. A large number of samples of three kinds of network intrusion, i.e., ipsweep, neptune, and smurf, were chosen to construct DS7. As the original data size is merely 12 907 after discarding plenty of redundant (repeating) records, we enlarged its size to 78 000 by continuously inserting the original data, with a very small random disturbance, into DS7. On the other hand, DS8 is a data set constructed from a flower image of size 501 × 501, as shown in Fig. 10.

a) On DS7: Here, all data points are 41-D and normalized. FGRC, NC, and GRC are compared on this data set. We carried out every algorithm ten times at each sampled size, and the obtained performance indices, with their means and standard deviations, are listed in Table XII. Clearly, neither NC nor GRC is suitable for this large data set because of their heavy time or space burden. In this experiment, the precision ε is initialized to 1E−12, and Q_0 consists of 30 randomly taken data points. We reassign ε to 1E−2 to sharply speed

up the convergence of FGRC once max(diff_ij) < 0.0003 has occurred more than 20 times cumulatively. The clustering time, ARI, and NMI are listed in Table XII. At the same time, the relationships between the clustering time and the sampled sizes are shown in Fig. 11 for NC, GRC, and FGRC, respectively. Moreover, the running time of CMEBA and KNN in FGRC at each sampled size is further shown in Fig. 12. The main parameters of the three algorithms and their settings in this experiment are also listed in Table XIII.

TABLE XII. PERFORMANCE INDICES OF THREE ALGORITHMS ON DS7
TABLE XIII. MAIN PARAMETER SETTINGS OF THREE ALGORITHMS ON DS7
Fig. 11. Relationships between CPU time of three algorithms and various data sizes of DS7.
Fig. 12. CPU time of CMEBA and KNN of FGRC at each data size of DS7.

Analysis: as revealed in Figs. 11 and 12, the asymptotic time complexity of both CMEBA and KNN is linear, so the asymptotic time complexity of FGRC is certainly linear.

b) On DS8: DS8 is composed of the 501 × 501 = 251 001 feature vectors extracted from the image in Fig. 10, where every feature vector is 3-D: abscissa, ordinate, and gray level. We randomly chose a certain number of vectors from DS8 to form several subsets of various sizes. In this experiment, we attempt to further confirm the asymptotic linear time complexity of FGRC on large data sets, so only the FGRC algorithm was carried out; it was run ten times on every subset, and the average running time is listed in Table XIV. The detailed time costs of CMEBA and KNN of FGRC at each sampled size are shown in Fig. 13.

Fig. 14 shows the clustering results of FGRC when the data size is 180 000. Fig. 14(a) shows the original image corresponding to a data size of 180 000, and Fig. 14(b) shows the reference indicator y** generated by the CMEBA procedure embedded in FGRC. According to the reference indicator y** shown in Fig. 14(b), FGRC roughly partitions the data points of the image into five parts by using KNN, as shown in Fig. 15: Fig. 15(a) shows the black background of the image in Fig. 14(a), Fig. 15(b) shows the edges of the flower, Fig. 15(c) shows the dark parts of the flower, Fig. 15(d) shows the white parts of the flower, and Fig. 15(e) shows the highlights of the flower, respectively. In this experiment, the precision ε is initialized to 1E−13, and Q_0 consists of 20 randomly taken data points. We reassign ε to 1E−2 to drastically speed up the convergence of FGRC once max(diff_ij) < 0.003 has occurred more than 20 times cumulatively. The main parameter settings in this experiment are listed in Table XV.

TABLE XIV. CPU TIME OF FGRC ON DS8 WITH VARIOUS SAMPLED SIZES (IN SECONDS)
TABLE XV. MAIN PARAMETER SETTINGS OF FGRC ON DS8
Fig. 13. CPU time of CMEBA and KNN of FGRC at each data size of DS8.
Fig. 14. Illustration of clustering scene on DS8 with a data size of 180 000.
Fig. 15. Partitioning result on DS8 with a data size of 180 000.

Analysis: as revealed in Table XIV and Fig. 13, the asymptotic time complexity of FGRC is once again demonstrated to be linear with the data size N.

V. CONCLUDING REMARKS

In this paper, we have first presented the CGRC algorithm by slightly changing the constraints of GRC and by introducing a linear term into GRC. We have then proposed our novel clustering algorithm, FGRC, by using the CMEBA approach. In Section IV, we have demonstrated that CGRC and FGRC both perform very well. In general, the contributions of our study can be summarized in the following two aspects.

1) The enhanced version of GRC (CGRC) is proposed to improve the robustness of GRC to the parameters of the adopted similarity measure. Based on CGRC, the FGRC algorithm is accordingly developed to offer a novel fast clustering approach for large data sets, with good applicability and self-adaptability.

2) This paper further enlarges the applications of MEB and core-set theory in pattern recognition and machine learning. In particular, it builds an important link between spectral clustering and the MEB problem.

It should be pointed out that the appropriate definition of the affinity matrix plays a crucial role in our experimental study; how to define an appropriate affinity matrix for a given data set remains a challenging issue. Moreover, we will study this new approach further in the near future, e.g., whether there is a better strategy for adjusting the precision ε, initializing Q_0 optimally, and so on.


ACKNOWLEDGMENT

The authors would like to thank the anonymous reviewers for their insightful comments and valuable suggestions.

REFERENCES

[1] D. J. Higham and M. Kibble, "A unified view of spectral clustering," Dept. Math., Univ. Strathclyde, Glasgow, U.K., Tech. Rep. 02, 2004.
[2] M. Meila and L. Xu, "Multiway cuts and spectral clustering," Dept. Statist., Univ. Washington, Seattle, WA, Tech. Rep. 442, 2004.
[3] U. Ozertem, D. Erdogmus, and R. Jenssen, "Mean shift spectral clustering," Pattern Recognit., vol. 41, no. 6, pp. 1924-1938, Jun. 2008.
[4] C. Lee, O. Zaïane, H. Park, J. Huang, and R. Greiner, "Clustering high dimensional data: A graph-based relaxed optimization approach," Inf. Sci., vol. 178, no. 23, pp. 4501-4511, Dec. 2008.
[5] U. Luxburg, "A tutorial on spectral clustering," Statist. Comput., vol. 17, no. 4, pp. 395-416, Dec. 2007.
[6] A. Ng, M. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an algorithm," in Proc. NIPS, 2001, pp. 849-856.
[7] W. M. Hu, W. Hu, N. Xie, and S. Maybank, "Unsupervised active learning based on hierarchical graph-theoretic clustering," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 39, no. 5, pp. 1147-1161, Oct. 2009.
[8] W. Tao, H. Jin, and Y. Zhang, "Color image segmentation based on mean shift and normalized cuts," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 37, no. 5, pp. 1382-1389, Oct. 2007.
[9] S. X. Yu and J. Shi, "Multiclass spectral clustering," in Proc. IEEE Int. Conf. Comput. Vis., 2003, vol. 1, pp. 313-319.
[10] J. Shi and J. Malik, "Normalized cuts and image segmentation," in Proc. IEEE Conf. Comput. Vis. Pattern Recog., 1997, pp. 731-737.
[11] J. Shi and J. Malik, "Normalized cuts and image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 8, pp. 888-905, Aug. 2000.
[12] I. Tsang, J. Kwok, and J. Zurada, "Generalized core vector machines," IEEE Trans. Neural Netw., vol. 17, no. 5, pp. 1126-1140, Sep. 2006.
[13] L. Hagen and A. B. Kahng, "New spectral methods for ratio cut partitioning and clustering," IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 11, no. 9, pp. 1074-1085, Sep. 1992.
[14] Z. Wu and R. Leahy, "An optimal graph theoretic approach to data clustering: Theory and its application to image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 15, no. 11, pp. 1101-1113, Nov. 1993.
[15] S. Sarkar and P. Soundararajan, "Supervised learning of large perceptual organization: Graph spectral partitioning and learning automata," IEEE Trans. Pattern Anal. Mach. Intell., vol. 22, no. 5, pp. 504-525, May 2000.
[16] L. Xu, J. Neufeld, B. Larson, and D. Schuurmans, "Maximum margin clustering," in Proc. NIPS, 2004, pp. 1537-1544.
[17] M. Heiler, J. Keuchel, and C. Schnorr, "Semidefinite clustering for image segmentation with a-priori knowledge," in Proc. 27th Symp. German Assoc. Pattern Recog., 2005, pp. 309-317.
[18] M. Badoiu and K. L. Clarkson, "Optimal core sets for balls," Comput. Geometry: Theory Appl., vol. 40, no. 1, pp. 14-22, May 2008.
[19] M. Badoiu, S. Har-Peled, and P. Indyk, "Approximate clustering via core-sets," in Proc. 34th Annu. ACM Symp. Theory Comput., 2002, pp. 250-257.
[20] P. Kumar, J. S. B. Mitchell, and A. Yildirim, "Approximate minimum enclosing balls in high dimensions using core-sets," ACM J. Exp. Algorithms, vol. 8, pp. 1-29, 2003.
[21] F. Nielsen and R. Nock, "Approximating smallest enclosing balls," in Proc. Int. Conf. Comput. Sci. Appl., 2004, vol. 3045, pp. 147-157.
[22] I. Tsang, J. Kwok, and P. Cheung, "Core vector machines: Fast SVM training on very large data sets," J. Mach. Learn. Res., vol. 6, pp. 363-392, 2005.
[23] S. Asharaf, M. N. Murty, and S. K. Shevade, "Multiclass core vector machine," in Proc. ICML, 2007, pp. 41-48.
[24] I. W. Tsang, A. Kocsor, and J. T. Kwok, "Simpler core vector machines with enclosing balls," in Proc. ICML, 2007, pp. 911-918.
[25] B. Schölkopf and A. J. Smola, Learning With Kernels. Cambridge, MA: MIT Press, 2001.
[26] D. Tax and R. Duin, "Support vector domain description," Pattern Recognit. Lett., vol. 20, no. 11-13, pp. 1191-1199, Nov. 1999.
[27] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
[28] M. Badoiu and K. L. Clarkson, "Optimal core-sets for balls," Comput. Geom., vol. 40, no. 1, pp. 14-22, May 2008.
[29] M. Badoiu, S. Har-Peled, and P. Indyk, "Approximate clustering via core sets," in Proc. 34th Annu. ACM Symp. Theory Comput., 2002, pp. 250-257.


[30] F. Chung, Z. Deng, and S. Wang, "From minimum enclosing ball to fast fuzzy inference system training on large data sets," IEEE Trans. Fuzzy Syst., vol. 17, no. 1, pp. 173-184, Feb. 2009.
[31] Z. Deng, F. Chung, and S. Wang, "FRSDE: Fast reduced set density estimator using minimal enclosing ball," Pattern Recognit., vol. 41, no. 4, pp. 1363-1372, Apr. 2008.
[32] X. Liu, B. Luo, and Z. Chen, "Optimal model selection for support vector machines," J. Comput. Res. Develop., vol. 42, pp. 576-581, 2005.
[33] G. Lanckriet, N. Cristianini, P. Bartlett, L. El Ghaoui, and M. I. Jordan, "Learning the kernel matrix with semidefinite programming," J. Mach. Learn. Res., vol. 5, pp. 27-72, 2004.
[34] M. S. Yang and K. L. Wu, "A similarity-based robust clustering method," IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, no. 4, pp. 434-448, Apr. 2004.
[35] UC Irvine Machine Learning Repository. [Online]. Available: http://archive.ics.uci.edu/ml/
[36] UCI Knowledge Discovery in Databases Archive. [Online]. Available: http://kdd.ics.uci.edu/

Pengjiang Qian received the Ph.D. degree from Jiangnan University, Wuxi, China, in 2011. He is currently an Associate Professor with the School of Digital Media, Jiangnan University. He has published more than ten papers in international/national authoritative journals. His research interests include data mining, pattern recognition, bioinformatics, and their applications.

Fu-Lai Chung received the B.Sc. degree from the University of Manitoba, Winnipeg, MB, Canada, in 1987 and the M.Phil. and Ph.D. degrees from the Chinese University of Hong Kong, Shatin, Hong Kong, in 1991 and 1995, respectively. He joined the Department of Computing, Hong Kong Polytechnic University, Hung Hom, Hong Kong, in 1994, where he is currently an Associate Professor. He has published widely in the areas of data mining, machine learning, fuzzy systems, pattern recognition, and multimedia in international journals and conferences.


Shitong Wang received the M.S. degree in computer science from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 1987. He visited London University, London, U.K.; Bristol University, Bristol, U.K.; Hiroshima International University, Hiroshima, Japan; Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong; Hong Kong Polytechnic University, Hung Hom, Hong Kong; and Hong Kong City University, Kowloon, Hong Kong, as a Research Scientist, for over five years. He is currently a Full Professor with the School of Digital Media, Jiangnan University, Wuxi, China. He has published about 80 papers in international/national journals and has authored/coauthored seven books. His research interests include AI, neuro-fuzzy systems, pattern recognition, and image processing.

Zhaohong Deng received the Ph.D. degree from Jiangnan University, Wuxi, China, in 2007. He is currently an Associate Professor with the School of Information, Jiangnan University. He visited the Hong Kong Polytechnic University, Hung Hom, Hong Kong, several times and worked on various projects. He has published numerous papers in international journals and conferences, including IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS—PART B: CYBERNETICS and IEEE TRANSACTIONS ON FUZZY SYSTEMS. His research interests include data mining, pattern recognition, and image processing.