Maximum Likelihood Clustering of Large Data Sets Using a Multilevel, Parallel Heuristic

Diploma Thesis

Torsten Reiners¹

TU Braunschweig
Institute of Economics
Department of Business Administration, Business Computer Science, and Information Management

Advisor: Prof. Dr. Stefan Voß
Prof. David L. Woodruff, Ph.D.

September 1, 1998

¹ Written while working as a research scholar in the field of optimization for the Graduate School of Management, University of California at Davis.

Preface

During my studies in Business Computer Science at the Technical University of Braunschweig, I developed an interest in writing my diploma thesis abroad. Working at the Department of Business Administration, Business Computer Science, and Information Management as a student assistant, I asked Prof. Dr. Stefan Voss for help with finding a university where I could work on a research project about which I could write my diploma thesis. Confronted with several offers in Australia and America, I chose the one in California for two reasons: the subject sparked my interest, and Prof. David Woodruff planned to spend his sabbatical year in Braunschweig with Prof. Dr. Voss, so I would have an opportunity to extend the research done for the thesis. Prof. David Woodruff offered me a position as a research scholar with the Graduate School of Management at the University of California in Davis, where I was able to do research as well as write this thesis. I would like to thank everybody who helped and supported me before and during the time in California: Prof. Dr. Stefan Voss for giving me the opportunity to go abroad and gain the experience of working on a research project and writing the diploma thesis in an English-speaking country, Prof. David Woodruff for giving me a position without knowing me in advance, my parents for their financial support and understanding, the whole family of David Woodruff for their generosity, hospitality, and good dinners, and my girlfriend Bettina Bradatsch for her understanding that I wanted to go abroad and the wonderful time we had in Davis. Thanks to Bettina Bradatsch, David Woodruff, and Andreas Fink for proofreading the thesis and correcting my new creations of the English language. Last, I want to mention the Center for Image Processing and Integrated Computing (CIPIC)¹, where I worked in a group with Prof. David Woodruff and Prof. David Rocke.

¹ CIPIC is an interdisciplinary research center of the University of California at Davis. See also http://info.cipic.ucdavis.edu


Acknowledgement

The research in this thesis was supported by grants from the DuPont Corporation.

Declaration of Independence

Herewith I declare on oath that the presented diploma thesis was written independently and unassisted, using only the referenced literature².

Torsten Reiners³

² Hiermit erkläre ich an Eides Statt, daß ich die vorliegende Arbeit selbständig und ohne fremde Hilfe nur unter Verwendung der angeführten Literatur angefertigt habe. (I hereby declare in lieu of oath that I have written this thesis independently and without outside help, using only the cited literature.)
³ For further questions and information, the author can be contacted by e-mail at [email protected].

Contents

Preface
Contents
List of Figures
List of Tables
Symbols
Abbreviations and Terminology

1 Introduction
  1.1 Examples of the Use of Classification
    1.1.1 Biology
    1.1.2 Chemistry and Physics
    1.1.3 Linguistic
    1.1.4 Psychology
  1.2 Different Clustering Methods
    1.2.1 Hierarchical Clustering
    1.2.2 Partitional Clustering
    1.2.3 Probabilistic Clustering
  1.3 Abstract for the Thesis
  1.4 Structure of the Thesis

2 Definition of Data Sets and Distances
  2.1 Data Sets
    2.1.1 Data Sets Used in Literature
    2.1.2 Generated Data Sets
    2.1.3 Real World Data Sets
  2.2 Distances
    2.2.1 Minkowski Distance
    2.2.2 Mahalanobis Distance
  2.3 Problem of Scaling the Data

3 Model for Clustering
  3.1 Maximum Likelihood and Estimator
  3.2 Objective Function
  3.3 Unassigned Points
  3.4 Unassignment Constraint

4 Clustering and Local Search
  4.1 Local Search Strategies
    4.1.1 Descent
    4.1.2 Simulated Annealing
    4.1.3 Reactive Tabu Search
  4.2 Neighborhood Structure and Solution Space
    4.2.1 Neighborhood
    4.2.2 Solution
    4.2.3 Evaluation of the Objective Function
    4.2.4 Up- and Down-Dating of the Objective Function
    4.2.5 Precalculation for Moves Involving Unassigned Points
  4.3 Experiments with Local Search
    4.3.1 Model for the Experiments
    4.3.2 Steepest Descent Evaluation
    4.3.3 Reactive Tabu Search and Simulated Annealing

5 Mining the Data from Experiments
  5.1 Introduction
  5.2 Data Mining on Local Search Strategies
    5.2.1 Data Mining
    5.2.2 Data Mining on Simulated Annealing
    5.2.3 Data Mining on Reactive Tabu Search
    5.2.4 Conclusion

6 Seed Clustering - A New Approach
  6.1 Introduction of Seed Clustering
    6.1.1 Terminology
    6.1.2 Modular Structure of the Algorithm
  6.2 Creation of Seeds
    6.2.1 Based on an Empty Solution
    6.2.2 Based on a Previous Solution
  6.3 Growing
    6.3.1 Growing of Seeds
    6.3.2 Growing of Seeds in an Empty Solution
    6.3.3 Growing of a Single Seed
    6.3.4 Growing used for K-MEANS Algorithm
  6.4 Walking
    6.4.1 Walking of a Single Seed
    6.4.2 Walking of All Seeds
    6.4.3 Walking of All Clusters
    6.4.4 Improvement of Walking
  6.5 Convergence Using Hash and Frequency Tables
    6.5.1 Hash Table for Walking
    6.5.2 Frequency Table
  6.6 Repair Mechanism
  6.7 Seed Points and Seed Metric
    6.7.1 Selection of Seed Points
    6.7.2 Choosing the Seed Metric
  6.8 Seed Clustering Algorithm
  6.9 Possible Improvements

7 Analysis of Performance
  7.1 Analysis of the Algorithm
    7.1.1 Experiments
    7.1.2 Preliminary Parameter Selection
    7.1.3 Parameter Selection
    7.1.4 Limits of Maximum Likelihood Clustering
  7.2 Analysis against Existing Clustering Software
    7.2.1 AutoClass
    7.2.2 Rousseeuw
    7.2.3 First-Improving Descent
  7.3 Analysis against Local Search Strategies

8 Sub-Sampling
  8.1 Sub-Sampling
  8.2 Evaluation of Sub-Sample Solutions on a Data Set
  8.3 Handling of Unassigned Points
  8.4 Interaction between Sub-Samples
  8.5 Algorithm for Sub-sampling
  8.6 Application Using Sub-Sampling: DPOSS Project

9 Parallelization
  9.1 Extended Sub-sampling Model
  9.2 Memory Model
  9.3 Parallel Algorithm
  9.4 Time Behavior

10 Conclusion

A Mathematical Programming Formulation of MINO

B Data Tables
  B.1 FID Data sorted by Type of Flower
  B.2 Results of RRD on FID
  B.3 Results of RRTS1, RSA1, RRD on Generated Data Sets
  B.4 Results of RRTS1, RSA1 on FID
  B.5 Results of RRTS2, RSA2, RSA3 on FID
  B.6 Results of Simulated Annealing
  B.7 DPOSS Data Set Description

C Applications of Clustering
  C.1 Medical Application
  C.2 LANDSAT Imaging Project

D Description of Software
  D.1 Cluster Software
    D.1.1 Installation and Preparation
    D.1.2 Parameter and Files
    D.1.3 Output of Results
  D.2 Viewer for Solutions
    D.2.1 Installation and Preparation
    D.2.2 Structure of Files
    D.2.3 Interface Description

E Source Code Certificate

References
Author Index
Index

List of Figures

1.1 Family tree of man drawn as a European oak tree by Haeckel [Hae74] in 1874.
1.2 Tree of classification problems.
1.3 Dendrogram for the hierarchical clustering.
1.4 Minimum spanning tree for the partitional clustering.
2.1 Visualization of the Fisher Iris Data.
2.2 Visualization to demonstrate the usage of the Mahalanobis distance.
2.3 Effect of scaling the data set.
4.1 Pseudo-code for the "Descent".
4.2 Pseudo-code for "Simulated Annealing".
4.3 Pseudo-code for the function "Get Starting Temperature".
4.4 Pseudo-code for the simple "Tabu Search".
4.5 Pseudo-code for the "Repetition Control".
4.6 Graphical representation for the structure of the experiments.
4.7 RRD on FID, 20 restarts for each iteration.
4.8 Fraction of GSI2 found with RRTS1 and RSA1 in FID.
4.9 Fraction of GSI2 found with RRTS1, RRTS2 and RSA1-3 in FID.
4.10 RSA1, RRTS1, RRD on generated data set DATA1.
4.11 RSA1, RRTS1, RRD on generated data set DATA2.
4.12 RSA1, RRTS1, RRD on generated data set GFID.
6.1 Pseudo-code for the growing to a feasible solution.
6.2 Pseudo-code for the walking of a single seed.
6.3 Example of walking a single seed.
6.4 Example for the growing and repair mechanism.
6.5 Example for the influence of the metric on the selection of data points for a cluster.
6.6 Modules of the seed clustering algorithm.
6.7 Example for a frequency table.
7.1 Visualization of the results of the experiment to determine the refresh rate.
7.2 Visualization of the running time for the experiment to determine w.
7.3 Two dimensional display of the sphere with the enclosed cluster.
8.1 Visualization of the sub-sampling process.
8.2 Overview of the SKICAT plate cataloging process.
9.1 Visualization of the parallelization process using sub-samples.
9.2 Effect of scaling the data set.
C.1 Segmentation of a magnetic resonance image.
C.2 LANDSAT image of Fredrick Township, Michigan.
C.3 Clustering result of image C.2.
D.1 Interface of the viewer for solutions.

List of Tables

1.1 Periodic table by Mendeleyev.
1.2 Distance or similarity matrix of a data set.
4.1 Time needed to get the best solution with a probability of PS.
5.1 Mean, standard deviation, and correlation matrix for the data set SAI.
5.2 Mean, standard deviation, and correlation matrix of one clustering result of SAI.
5.3 Mean, standard deviation, and correlation matrix for the clustering of the good results of data set DATA1.
5.4 Results of two versions of RTS in the data mining section.
5.5 The t statistic between the suggested parameter setting and the Reactive Tabu Search with four parameters.
7.1 Number of data sets where a certain setting of a parameter caused the most results with less than 10% classification errors.
7.2 Percentage of solved instances with refresh rate R.
7.3 Percentage of best solution found with a certain setting of w in the 4D1 data set.
7.4 Average times and objective function values for two different data sets and convergence criteria.
7.5 Application of different clustering algorithms on the sphere data set.
7.6 Breakdown point and time for FAST-MCD and the seed clustering algorithm.
7.7 Time to find the breakdown point using seed clustering.
7.8 Breakdown point and time for FAST-MCD (10000 initial subsets).
7.9 Application of the seed clustering algorithm on three different data sets.
8.1 Experiment for selection of the sub-sample size.
8.2 Mean and standard deviation of a cluster with 4952 data points found in the data set with all stars.
B.1 Petal(P)/Sepal(S) Length(L)/Width(W) of the three flowers described by [Fis36].
B.2 CPU and f* with different numbers of random restarts (RR) on FID.
B.3 Results for the run of RSA1, RRTS1, RRD on different generated data sets measuring time and objective function value.
B.4 Results of 20 runs of RRTS1, RSA1 on FID, RRTS1 with iterations between 150 and 300.
B.5 Results of 20 runs of RRTS1, RSA1 on FID, RRTS1 with iterations between 150 and 4800.
B.6 Results of 100 runs of SA on FID with random parameters.
B.7 DPOSS data set and description of the fields.

Symbols

w         Number of data points in a seed.
          Number of data points in a seed during cluster walking.
B         Between-groups scatter matrix.
c         Temperature for the cooling schedule in Simulated Annealing.
C         Partitioning of the data set, containing the data sets of the clusters and the outlier group.
Ci        Data set containing all data points of partition i.
d²(·,·)   Mahalanobis distance function, parameterized by the covariance matrix.
dI(·,·)   Euclidean distance.
f(·)      Objective function for the minimization problem.
g         Number of clusters to be found in the data set.
h(·)      Result vector of a heuristic.
h         Size of the subset used for the MCD algorithm.
hs        Size of the subset in the sub-sample used for the MCD algorithm.
H         Minimum number of data points in each cluster.
H1        Subset of data points used for the MCD algorithm.
          Mean of an observation.
n         Number of data points in the data set (size).
ni        Number of data points in cluster i.
ns        Number of data points in the sub-sample used for the MCD algorithm.
nT        Size of the tabu list for Reactive Tabu Search.
N(σ)      Set of solutions in the neighborhood N of the solution σ.
p         Number of parameters describing each data point (dimension).
PF        Percentage of best solution found in a certain number of runs.
PS        Probability of finding the best solution in a certain amount of time.

Figure 4.1: Pseudo-code for the "Descent".

the neighborhood N(σ). If the neighborhood contains a solution that is better than the best solution found so far with respect to the objective function value, a descent move in this direction is executed, i.e. the new solution σ' becomes the center of a neighborhood N(σ'). Otherwise the Descent terminates and a local optimum σ* = σ' has been found. This local optimum cannot be taken for the optimal solution, because the heuristic lacks a feature to force itself out of the "valley" it has found and eventually fall into a deeper one. This description refers to a Steepest Descent; other variants of Descent exist, e.g. First Improving Descent, where the first neighbor with a better objective function value is chosen. The pseudo-code is given in Figure 4.1. A stop criterion for Descent can be described as follows:

- The neighborhood contains no better solution than σ*:

    f(σ*) ≤ f(σ)  for all σ ∈ N(σ*)    (4.2)

- An iteration counter has reached a given upper limit. In this case the solution is not necessarily a local optimum.

A drawback of Descent is that it starts from one solution and performs only downhill moves. Therefore, only a small fraction of the whole search space is visited, leaving a large number of solutions untouched and uninvestigated. A simple method to compensate for this, mentioned in [Egl90], is to restart the descent from various random solutions in a loop, later referred to as RRD. The solutions of the individual descents are compared, and the one with the best objective function value is returned as the result. We refer to the number of restarts as the number of iterations performed by the RRD.
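To make the restart idea concrete, the following sketch shows a generic random-restart steepest descent in Python. It is a minimal illustration, not the thesis implementation; the callables random_solution, neighbors, and f stand for the problem-specific routines of Sections 4.2.1-4.2.3 and are assumed to be supplied by the caller.

    def random_restart_descent(random_solution, neighbors, f, restarts=20):
        """Generic RRD sketch: steepest descent from several random starting
        points, keeping the best local optimum found."""
        best, best_val = None, float("inf")
        for _ in range(restarts):
            current = random_solution()
            current_val = f(current)
            while True:
                # steepest descent: evaluate the whole neighborhood
                candidates = [(f(s), s) for s in neighbors(current)]
                if not candidates:
                    break
                cand_val, cand = min(candidates, key=lambda t: t[0])
                if cand_val >= current_val:   # no better neighbor: local optimum
                    break
                current, current_val = cand, cand_val
            if current_val < best_val:        # keep the best over all restarts
                best, best_val = current, current_val
        return best, best_val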


4.1.2 Simulated Annealing

Simulated Annealing (SA) resolves the problem of getting stuck in local optima by allowing moves to inferior solutions. By letting the heuristic accept occasional uphill moves, it can jump out of a local optimum and eventually fall into the next possible "valley", hoping to find the global optimum. The probability of going back (uphill moves) decreases with time, and the heuristic will finally freeze in a (local) optimum. The idea was first adapted to different problems in operations research in [KGV83] and [Cer75] after this behavior had been observed in the field of statistical mechanics: to grow a single crystal from a melt, the temperature has to be lowered very slowly and then kept for a long time in the vicinity of the freezing point. Otherwise, the substance gets out of equilibrium and the crystal will have defects. The probability of accepting bad moves is controlled by a cooling schedule which sets the temperature during the run of the heuristic. A move will be accepted if

    e^(−Δ/c) ≥ R_p    (4.3)

where Δ = f(σ') − f(σ), with σ' being a random solution from the neighborhood N(σ), R_p being a pseudo-random deviate drawn from a uniform distribution on (0,1), and c being the temperature set by the cooling schedule. If a move is accepted, σ' becomes the center of a new neighborhood N(σ'). If f(σ') < f(σ*), then σ* = σ' and a new best solution has been discovered. The cooling schedule, developed and parameterized in [JAMS89] and [JAMS91], will later be used as a starting point when we show an example of an application of clustering. SA as used here is guided by four parameters J = (TF, IP, MP, SF) with the following meaning:

TF  The Temperature Factor is the multiplier for the current temperature when the temperature is reduced.

IP  Based on an abbreviated trial run, a temperature is found at which the fraction of accepted moves is approximately InitProb. This temperature is used as the starting temperature for the SA run.

MP  Whenever a temperature is completed and the percentage of accepted moves is equal to or less than MinPercent, a counter ic is incremented; it is reset to zero each time a new best solution is found. If the counter reaches 5, the annealing run is declared to be frozen. (The maximum of the counter ic could be treated as a fifth parameter, but due to the strong interactions with MP, leaving it fixed at 5 seems reasonable.)

SF  The temperature length L is set to L = SizeFactor times the average neighborhood size, e.g. (n − T − sH)(g − 1) + (n − T)·T.

See Figure 4.2 for a pseudo-code of the SA process and Figure 4.3 for the pseudo-code of the abbreviated trial run that finds the starting temperature based on IP. The idea is to start with an initial temperature and run the SA for L iterations; as long as the rate of accepted moves is too low, the temperature is increased, and it is refined once the rate gets above the target.

    σ ← "Random Solution"
    T ← "Get Starting Temperature"  (the percentage of accepted moves starts out equal to IP)
    while ic < 5 do  (see also the description of the parameter MP)
        for 1 to L do  (loop performed for each temperature, referred to as [LPT])
            σ' ← "Random Neighbor of σ"
            if f(σ') < f(σ*) then σ ← σ'; ic ← 0  (downhill move)
            else if e^(−Δ/c) > R then σ ← σ'  (uphill move)
        end
        ic ← ic + 1
        T ← TempFactor · T
    end

Figure 4.2: Pseudo-code for "Simulated Annealing".

    T ← 1
    do
        T ← T · 10
        perform loop [LPT] from Figure 4.2
    while "percentage of accepted moves < MP"
    do
        T ← T · 1/10
        perform loop [LPT] from Figure 4.2
    while "percentage of accepted moves > MP"
    do
        T ← T · 2
        perform loop [LPT] from Figure 4.2
    while "percentage of accepted moves < MP"

Figure 4.3: Pseudo-code for the function "Get Starting Temperature".

A setting for the vector J is suggested in [JAMS89] with J = (0.95, 0.4, 2, 16).
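The acceptance rule (4.3) and the geometric cooling controlled by TF can be written compactly in code. The sketch below is an illustration, not the thesis implementation: it assumes the same problem-specific callables as before (random_solution, random_neighbor, f), passes the starting temperature and temperature length in directly instead of deriving them from IP and SF, treats MinPercent as a fraction, and uses the usual Metropolis form in which the downhill test is made against the current solution.

    import math
    import random

    def simulated_annealing(random_solution, random_neighbor, f,
                            start_temp, temp_length, temp_factor=0.95,
                            min_percent=0.02):
        """Sketch of the SA loop in the spirit of Figure 4.2."""
        current = random_solution()
        current_val = f(current)
        best, best_val = current, current_val
        T, ic = start_temp, 0
        while ic < 5:                       # frozen after 5 cold temperature steps
            accepted = 0
            for _ in range(temp_length):    # loop [LPT]
                cand = random_neighbor(current)
                delta = f(cand) - current_val
                if delta < 0 or math.exp(-delta / T) > random.random():
                    current, current_val = cand, current_val + delta
                    accepted += 1
                    if current_val < best_val:
                        best, best_val = current, current_val
                        ic = 0              # reset counter on a new best solution
            if accepted / temp_length <= min_percent:
                ic += 1                     # temperature finished with few acceptances
            T *= temp_factor                # geometric cooling (parameter TF)
        return best, best_val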

4.1.3 Reactive Tabu Search

Like Simulated Annealing, Reactive Tabu Search (RTS), developed by Battiti and Tecchiolli [BT94], overcomes the disadvantage of getting stuck in a local optimum. RTS is a modification of the static tabu search introduced by Glover (see [Glo89] and [Glo90]); it uses dynamically changing list sizes and lets the search escape into uninvestigated areas of the search space whenever a diversification appears to be essential. In tabu search, the cooling schedule of SA is replaced by a tabu list TL of size nT storing the attributes of the latest moves. Moves that are "reversals" of moves in the tabu list are considered to be tabu for a certain number of iterations (each iteration is one move in the solution space), depending on the tabu list strategy. Some tabu list managing strategies are e.g. the reverse elimination method (see [DV93]), moving gaps (see [HG94]), and the tabu cycle method (see [GL97]). In each iteration, the solutions in the neighborhood N(σ) are evaluated as long as the move to them is not tabu, and the best solution σ' is selected to become the center of the new neighborhood N(σ'). The attributes of the move are stored in the tabu list. The tabu search runs for a given maximum number of iterations or stops if the actual neighborhood from which to select a move consists only of solutions that are tabu. The pseudo-code for the basic tabu search is shown in Figure 4.4.

    σ ← "Random Solution"
    TL ← ∅; ic ← 0
    while ic ≤ "Max number of iterations" and ¬("All moves are tabu") do
        ic ← ic + 1
        σ ← argmin over σ' ∈ N(σ) of f(σ') with moveto(σ') ∉ TL
        TL ← TL ∪ "Attributes of the move"
    end

Figure 4.4: Pseudo-code for the simple "Tabu Search".

The RTS uses the same basic structure, but allows the tabu list, with a starting size of n/INITK, to change dynamically. For this purpose, the algorithm uses a mechanism (see below for the description of hash tables) to track down repetitions of solutions and reacts to them by increasing the tabu list size. Since more directions can be set tabu, the algorithm is forced to find new areas to explore. On the other hand, the size of the tabu list is decreased over time, precisely when a certain number of moves has been performed since the last decrease. Without the decrease, the search would remain too restricted even after leaving the area that caused the repetitions. Through these changes in the list size, the behavior of the algorithm adapts to the need for a tighter search direction or the need to give the search more room, respectively. The increasing list size might not be enough to let the search leave a certain area, so another mechanism is used to

identify a chaotic trapping (see [BT94]). A hash list is used to store all encountered solutions and, with them, a counter for the repetitions. As suggested by Woodruff and Zemel [WZ93], the hash table can store only the hash values of the solutions instead of the whole solution vector. If the hash table is large enough, the inaccuracy of a solution being interpreted as a different one with the same hash value is small. Furthermore, the fact that the hash table is embedded in a heuristic allows us to make small "mistakes" in determining the repeated solutions. Whenever a hit in the hash table occurs, the solution is said to be rediscovered. Every time the number of solutions repeated more than REP times reaches CHAOS, the search is assumed to be stuck in an area of the search space, and an escape move is performed (diversification move). Originally, a number of random moves, proportional to the average number of moves between repeated solutions, is executed to leave the actual search area and diversify into unexplored regions. In our implementation, we generate a new random solution from scratch, which has the same effect as random moves. There are several strategies that could be used for an escape; e.g., instead of the random generation, we could do an explicit analysis of the tabu list to derive a move towards an interesting and less investigated search area. The size of the tabu list is reset to its initial value n/INITK. RTS is guided by six parameters R = (INITK, CHAOS, REP, DECPER, INCPER, CYCMAX) with the following meaning:

INITK   The TL length is set to the proportion n/INITK at the beginning of the search and after each escape move.

REP     Number of repetitions of a solution before the counter for CHAOS is incremented.

CHAOS   A counter keeps track of the number of solutions repeated at least REP times; whenever this counter is greater than CHAOS, a diversifying escape move is executed.

DECPER  Percentage of reduction of the TL size.

INCPER  Percentage of augmentation of the TL size.

CYCMAX  If the solution σ occurred within the last CYCMAX iterations and σ ∈ TL, the length of the tabu list is increased.

    if σ ∈ TL then
        length ← "Iterations since σ was visited the last time"
        "Update time and repetition counter for σ"
        if rep_counter > REP then
            chaotic_counter ← chaotic_counter + 1
            if chaotic_counter > CHAOS then
                "Reset chaotic counter and execute an escape"
            end
        end
        if length ≤ CYCMAX then
            Size(TL) ← Size(TL) · INCREASE
        end
    else if "steps since the last Size(TL) change > moving average" then
        Size(TL) ← Size(TL) · DECREASE
    end

Figure 4.5: Pseudo-code for the "Repetition Control".

The pseudo-code for the tabu search in Figure 4.4 has to be modified in such a way that the procedure in Figure 4.5 is called for every move performed. This procedure checks for repetitions, performs the escape move if necessary, and updates the TL size. We use the implementation of the class library in [Woo97] as a foundation. We modify it slightly by overloading it with a new Reactive Tabu Search that uses the described escape procedure and by adjusting the INITK initialization to our purpose.
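A minimal sketch of the reaction mechanism: solutions are condensed to hash values as suggested in [WZ93], repetitions are counted, and the tabu list size grows on short cycles and shrinks after a quiet period. The class below is illustrative only; the names (RepetitionControl, the escape callback, the crude moving-average stand-in) are ours and not taken from the class library of [Woo97].

    class RepetitionControl:
        """Illustrative repetition control in the spirit of Figure 4.5."""

        def __init__(self, init_size, rep=3, chaos=3,
                     incper=1.05, decper=0.95, cycmax=450):
            self.size = init_size           # current tabu list length
            self.init_size = init_size
            self.rep, self.chaos = rep, chaos
            self.incper, self.decper = incper, decper
            self.cycmax = cycmax
            self.last_seen = {}             # hash value -> iteration of last visit
            self.repeats = {}               # hash value -> repetition counter
            self.chaotic = 0
            self.last_change = 0
            self.moving_average = cycmax    # crude stand-in for the gap average

        def visit(self, solution_hash, iteration, escape):
            if solution_hash in self.last_seen:
                gap = iteration - self.last_seen[solution_hash]
                self.repeats[solution_hash] = self.repeats.get(solution_hash, 0) + 1
                if self.repeats[solution_hash] > self.rep:
                    self.chaotic += 1
                    if self.chaotic > self.chaos:   # chaotic trapping detected
                        self.chaotic = 0
                        self.size = self.init_size  # reset the list length ...
                        escape()                    # ... and diversify
                if gap <= self.cycmax:
                    self.size *= self.incper        # tighten the search
                    self.last_change = iteration
            elif iteration - self.last_change > self.moving_average:
                self.size *= self.decper            # give the search more room
                self.last_change = iteration
            self.last_seen[solution_hash] = iteration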

4.2 Neighborhood Structure and Solution Space

4.2.1 Neighborhood

The local search strategies use a neighborhood to move within the solution space. We specify one possible neighborhood that will be used in the experiments to compare and analyze the local search strategies. This neighborhood is based on moves caused by exchanging one data point at a time. Another neighborhood, the constructive neighborhood, will be shown with the introduction of the seed walking method in chapter 6. A characteristic of the simple neighborhood used for the local search is the movement of one data point at a time between the clusters. The procedure of the movement itself is described in 4.2.2. For such a neighborhood, the number of neighbors, i.e. the neighborhood size, is (g − 1)·n for the problems MINW and MIND if there are no binding size constraints. If there are s size constraints, in the form of a minimum number H of elements in each cluster, the size of the neighborhood is (n − sH)(g − 1). This neighborhood is used, for example, in [Spa85] and [CW97] in first-improving local search algorithms. For MINO, the unassigned points cause an increase in the size of the neighborhood, because every move to the group of unassigned points involves a move from that group back to a cluster. The size of the neighborhood without size constraints is (n − T)·(g − 1) + (n − T)·T. Adding the size constraints as before, the neighborhood shrinks: for a solution σ, the neighborhood N(σ) with s size constraints has

    (n − T − sH)·(g − 1) + (n − T)·T    (4.4)

solutions.
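As a quick worked example, the FID experiments of Section 4.3 use n = 150 data points and g = 3 clusters; the throw-away helpers below (not part of the thesis software) evaluate the neighborhood sizes given above, once without unassigned points and once with, say, T = 10 unassigned points as a hypothetical setting.

    def neighborhood_size_mino(n, g, T, s=0, H=0):
        # size of N(sigma) for MINO with s binding size constraints, eq. (4.4)
        return (n - T - s * H) * (g - 1) + (n - T) * T

    def neighborhood_size_minw_mind(n, g, s=0, H=0):
        # (g - 1) * n without size constraints, (n - s*H)(g - 1) with them
        return (n - s * H) * (g - 1)

    print(neighborhood_size_minw_mind(150, 3))    # 300 neighbors
    print(neighborhood_size_mino(150, 3, T=10))   # 1680 neighbors with 10 unassigned points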

4.2.2 Solution

The solution σ can be generated from scratch or transformed from an existing solution σ' by applying a special transition function. The difference is briefly described in the next paragraphs.

Randomly Constructed Solutions

The easiest way to construct a solution is to assign the data points randomly to the g clusters or to leave them unassigned. To keep the solution feasible with respect to the constraints, T data points are assigned to the group of unassigned points and the remaining (n − T) to the g clusters. To make sure that each group has at least H data points, each group receives this minimum number first, and the rest is randomly assigned to the clusters afterwards.


Transition Neighborhood

Starting from a feasible solution, another feasible solution is constructed by applying special moves within a neighborhood. This paragraph describes one possible transition, which is used for the local search strategies. The transition from one solution to another falls into two classes:

1. Moving a data point z_j from cluster i to cluster i', where i, i' = 1, …, g and i ≠ i'. The move is feasible if the size constraint holds.

2. Moving a data point z_j from cluster i to the group of unassigned points. After moving z_j, a member j' of the group of unassigned points has to be assigned to a cluster. This assignment can be to the same cluster i or to a different cluster i' with i ≠ i', whichever causes the lower objective function value. The move is feasible if the size and unassignment constraints hold.

Whenever a move of a data point is performed, the solution vector σ has to be modified. This is done by setting σ_ij = 0 for the leaving data point j and σ_i'j = 1 where it is added. If an unassigned point is involved in the move, the following settings are made: σ_ij = 0, σ_i'j' = 1, σ_(g+1)j = 1, σ_(g+1)j' = 0. The size n_i of cluster i is decremented, while n_i' of the receiving cluster i' is incremented by one due to the new element j or j', respectively.
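The bookkeeping for the two move classes can be illustrated with a plain assignment array in which index g plays the role of the group of unassigned points. This is a schematic sketch with invented helper names, not the solution representation used in the implementation.

    def move_point(assign, sizes, j, target):
        """Move data point j to cluster `target` (0..g-1) or to the
        unassigned group (index g); assign[j] holds its current group."""
        source = assign[j]
        assign[j] = target
        sizes[source] -= 1
        sizes[target] += 1

    def move_to_unassigned(assign, sizes, j, j_prime, target_cluster, g):
        """Move class 2: point j leaves its cluster for the unassigned group,
        and unassigned point j_prime enters target_cluster in exchange."""
        move_point(assign, sizes, j, g)              # j becomes unassigned
        move_point(assign, sizes, j_prime, target_cluster)

    # example: n = 6 points, g = 2 clusters, point 5 initially unassigned
    g = 2
    assign = [0, 0, 1, 1, 0, g]      # group index per data point
    sizes = [3, 2, 1]                # cluster sizes plus unassigned group
    move_to_unassigned(assign, sizes, 0, 5, 1, g)
    print(assign, sizes)             # [2, 0, 1, 1, 0, 1] [2, 3, 1]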

4.2.3 Evaluation of the Objective Function

The implementation uses two different kinds of evaluation of the objective function value. The computation of the objective function value for a complete solution σ is done with a method suggested by Hawkins [Haw94], whereas we use update formulas given in [Spa85] during the search for the best solution in the neighborhood. Calculating the objective function value from scratch for each considered move would be too expensive because of the extensive recomputation of the determinants of the covariance matrices. This process is called up-dating and down-dating, according to adding and removing a data point from a cluster, respectively. The evaluation of the objective function value f splits into the calculation of the determinants for all clusters i, i = 1, …, g, and their summation. Each W_i can be calculated in the following way. This algorithm developed by Hawkins [Haw94] allows us to calculate the objective function in an easy and fast way. The data points z from the cluster


i are augmented by a 1, defining

    y_k = (1 : z_k^T)^T,  k ∈ J    (4.5)

with J = (j_1, …, j_{n_i}) being the set of indices of the n_i data points assigned to cluster i. This is done to take the mean out of the observations. Defining Y_i = (y_{j_1}, …, y_{j_{n_i}}) as the set of all data points y_k, k ∈ J, we can calculate the partitioned matrix

    Y_i Y_i^T = ( n_i        n_i z̄_i^T )
                ( n_i z̄_i    C_i C_i^T )    (4.6)

The determinant of A_i = Y_i Y_i^T is given by

    |A_i| = n_i |C_i C_i^T − n_i z̄_i z̄_i^T|
          = n_i |Σ_{k=j_1}^{j_{n_i}} (z_k − z̄_i)(z_k − z̄_i)^T|
          = n_i |W_i|
          = n_i^{p+1} |W_i / n_i|    (4.7)

The objective function 3.13 for the problem MIND can be expressed using A_i as

    Σ_{i=1}^{g} n_i log( |A_i| / n_i^{p+1} )    (4.8)

with

    A_i = Σ_{k=j_1}^{j_{n_i}} y_k y_k^T    (4.9)

This gives us an efficient way to compute the matrix A_i, and therefore also the objective function: for each data point z_j in cluster i, the (p + 1)-dimensional vector y_j is generated by augmenting the data point with a one as described above, and the (p + 1) × (p + 1) matrix y_j y_j^T is computed. The matrix A_i is obtained by summing these matrices over all data points in cluster i. The calculation of the determinant and the objective function is then straightforward by substituting the corresponding values.
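A small NumPy sketch of this computation, assuming the data points of one cluster are the rows of a matrix Z: it forms the augmented vectors y_k = (1, z_k^T)^T, accumulates A_i = Σ y_k y_k^T, and evaluates the contribution n_i log(|A_i| / n_i^(p+1)) from (4.8). This is an illustration of the formulas, not the thesis code.

    import numpy as np

    def cluster_term(Z):
        """Contribution of one cluster to objective (4.8); Z has shape (n_i, p)."""
        n_i, p = Z.shape
        Y = np.hstack([np.ones((n_i, 1)), Z])        # rows are y_k^T = (1, z_k^T)
        A = Y.T @ Y                                  # A_i = sum_k y_k y_k^T
        sign, logdet = np.linalg.slogdet(A)          # log |A_i|, numerically safe
        return n_i * (logdet - (p + 1) * np.log(n_i))

    # check against the direct form n_i * log |W_i / n_i|
    rng = np.random.default_rng(0)
    Z = rng.normal(size=(40, 3))
    Zc = Z - Z.mean(axis=0)
    direct = Z.shape[0] * np.log(np.linalg.det(Zc.T @ Zc / Z.shape[0]))
    print(np.allclose(cluster_term(Z), direct))      # True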

4.2.4 Up- and Down-Dating of the Objective Function

Whenever a move of a data point from one cluster to another cluster is done, the determinants have to be recalculated. This can be done as described in


Section 4.2.3 but it would require the expensive calculation of determinants. As described in [Spa85], up-dating and down-dating formulas can be used to update the determinants without spending time in their calculation. Instead, the amount by which the determinant will change due to being involved in the move of a data point is added or subtracted, respectively.

    |A_i + α y_k y_k^T| = |A_i| (1 + α y_k^T A_i^{-1} y_k)    (4.10)

with α = +1 when the data point y_k is added and α = −1 when it is removed from cluster i. Equation 4.10 takes only the removal or only the addition into consideration. When a data point y_l is removed and a data point y_k is added to the same cluster, the update formula is

    |A_i + y_k y_k^T − y_l y_l^T| = |A_i| [ (1 + y_k^T A_i^{-1} y_k)(1 − y_l^T A_i^{-1} y_l) + (y_l^T A_i^{-1} y_k)^2 ]    (4.11)

Furthermore, the inverse of the matrix A_i has to be maintained and updated to evaluate the next move. The updated inverse is calculated by

    (A_i + y_k y_k^T)^{-1} = A_i^{-1} − (A_i^{-1} y_k y_k^T A_i^{-1}) / (1 + y_k^T A_i^{-1} y_k)    (4.12)

and exists whenever 1 + y_k^T A_i^{-1} y_k ≠ 0. The objective function value is calculated as before using equation 3.13 with the precalculated determinants from the up-dating or down-dating process. Because computational errors accumulate when such formulas are used repeatedly, the covariance matrices are recalculated after a given number of moves; in this work, a recomputation is done after every ten moves.
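Equations (4.10)-(4.12) are the matrix determinant lemma and the Sherman-Morrison update. The short NumPy check below compares them against direct recomputation on random augmented data; it is a sanity-check sketch, not the thesis implementation.

    import numpy as np

    rng = np.random.default_rng(1)
    Y = np.hstack([np.ones((30, 1)), rng.normal(size=(30, 4))])   # augmented points
    A = Y.T @ Y
    A_inv, det_A = np.linalg.inv(A), np.linalg.det(A)
    y_k, y_l = Y[0], Y[1]

    # (4.10): adding y_k (alpha = +1)
    print(np.isclose(np.linalg.det(A + np.outer(y_k, y_k)),
                     det_A * (1 + y_k @ A_inv @ y_k)))

    # (4.11): removing y_l and adding y_k in the same cluster
    lhs = np.linalg.det(A + np.outer(y_k, y_k) - np.outer(y_l, y_l))
    rhs = det_A * ((1 + y_k @ A_inv @ y_k) * (1 - y_l @ A_inv @ y_l)
                   + (y_l @ A_inv @ y_k) ** 2)
    print(np.isclose(lhs, rhs))

    # (4.12): Sherman-Morrison update of the inverse
    upd = A_inv - np.outer(A_inv @ y_k, y_k @ A_inv) / (1 + y_k @ A_inv @ y_k)
    print(np.allclose(upd, np.linalg.inv(A + np.outer(y_k, y_k))))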

4.2.5 Precalculation for Moves Involving Unassigned Points

Whenever we plan to move a data point from a cluster i to the group of unassigned points, we have to consider a move from that group back to a cluster i'. We therefore have to determine which unassigned point is best to move back, and into which cluster, in order to get the best change in the objective function. The change has to be calculated for each of the T unassigned points moving to each cluster. These T·g calculations can be done in advance, because the solution is fixed during the search for the best move. After deciding on the move with the best change in the objective function value, the determinants of the affected clusters have to be recalculated. The


same has to be done for all unassigned points involved in the move; the others do not have to be recalculated. There is an inaccuracy in the prediction of the objective function value whenever an unassigned point is moved back to the cluster from which the latest unassigned point came. The effect is caused by the covariance matrix changing after the removal of the data point, while the precalculated value for the unassigned point still refers to the old matrix. This may cause a sub-optimal choice of the unassigned point to move. Since we are working with a heuristic and dealing only with small inaccuracies, where the chance of actually choosing a sub-optimal move is low, we keep the more efficient approximation instead of spending time on long calculations of determinants.
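One way to organize this precomputation: for a fixed current solution, a T × g table of objective changes is filled once using the determinant update (4.10), and each candidate "point to unassigned" move only has to look up the cheapest re-entry. The sketch below assumes that the per-cluster determinants and inverses are already maintained as in Section 4.2.4; the function names are ours, chosen for illustration.

    import numpy as np

    def precompute_entry_deltas(unassigned_Y, dets, invs, sizes):
        """delta[u, i]: change in n_i*log(|A_i|/n_i^(p+1)) if unassigned point u
        (given as augmented vector) enters cluster i. Sketch only."""
        T, g = len(unassigned_Y), len(dets)
        p1 = invs[0].shape[0]                    # p + 1
        delta = np.empty((T, g))
        for u, y in enumerate(unassigned_Y):
            for i in range(g):
                n_old, n_new = sizes[i], sizes[i] + 1
                det_new = dets[i] * (1 + y @ invs[i] @ y)        # eq. (4.10)
                delta[u, i] = (n_new * (np.log(det_new) - p1 * np.log(n_new))
                               - n_old * (np.log(dets[i]) - p1 * np.log(n_old)))
        return delta

    def best_reentry(delta):
        """Cheapest unassigned point / target cluster pair for a class-2 move."""
        u, i = np.unravel_index(np.argmin(delta), delta.shape)
        return u, i, delta[u, i]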

4.3 Experiments with Local Search

4.3.1 Model for the Experiments

Using different data sets and algorithms, a design phase has to be performed in advance to set up the experiments in such a way that needless runs are prevented and all needed results are obtained. Figure 4.6, derived from [BHH78], shows the structure of the experiments that was used to define and execute the tests. Thinking about what we want to learn from the experiments, we can come up with a hypothesis for a model that can be built for the runs. A hypothesis for our experiments could be one of the following:

- Runs of SA on the results of SA runs will rediscover the parameter setting suggested by Johnson et al.

- The number of parameters for RTS can be decreased without getting a worse performance than before.

- RTS has shorter running times than SA while getting at least the same quality in the results.

- SA is equivalent to RTS.

Once the model for the hypothesis is constructed, it leads to the algorithms and data sets used to set up the runs for the experiment. The outputs of the experiment are the results, used as the input for the analysis procedure or as the basis for new data generation. After analyzing the results, the hypothesis can be accepted or rejected, or the need for more results can lead to extended runs with the same model. Based on the results, it might be


possible to discover information for a new hypothesis, which leads to the construction of a new model to verify or refute the latest hypothesis.

Figure 4.6: Graphical representation for the structure of the experiments.

In this part, three different algorithms are used to run a performance test. These three local search strategies are applied in a number of experiments to analyze their behavior on different data sets with different parameter settings. The goal of these experiments is to suggest an algorithm by estimating the amount of time in which a solution can be found with a certain probability P_S. Furthermore, each algorithm will be compared with the others to find a ranking of its usability for clustering. In the following chapter, data mining is used on the results to gain more insight into the algorithms themselves. In particular, we hope to find already known relationships in the test results and also clues for interesting and testable hypotheses. The SA algorithm is useful to demonstrate the clustering method and data mining in particular because of the extensive studies in [JAMS89]. Some of the results are given in appendix B, but all data sets and results of this thesis are enclosed on the CD-ROM or can be obtained from the author. A citation of the data refers to the tables in the appendix. To keep the presentation compact, we used fewer runs in each experiment as a direct example for the thesis, but the same results are seen using more runs. For the performance test we used here a set of 20 different starting points on the FID instead of the 200 in the larger experiments. The settings for the clustering, especially g, H, T, vary in the different experiments. All algorithms have in common that an initial seed value is set; whenever an algorithm is restarted, it creates the next seed based on the initial seed. All runs were done on a Digital DEC Alpha 3000 with an M-700 CPU under the operating system Digital Unix Version 4.0B. The time measured during the runs is the real CPU time.


Figure 4.7: RRD on FID, 20 restarts for each iteration.

4.3.2 Steepest Descent Evaluation

Due to the lack of parameters influencing the behavior of the algorithm, there is no need for an extensive analysis or data mining. The only real influence we can exert is setting the number of times that the algorithm picks a new starting point in the solution space for each run. In this thesis the runs on the FID are done to compare the performance of RRD to the results of SA and RTS. Table B.2 shows the results of 20 runs on the FID with four different numbers of restarts each time, visualized in Figure 4.7. The clustering parameters are set to g = 3, H = 20 and T = 0. As the number of random restarts increases, the standard deviation in f* decreases and the mean moves toward the f* of GSI2. We should note at this point that RRD finished the clustering for only 8 out of the 20 runs after an average time of 22.1 seconds, whereas RTS was able to solve all of them after an average time of 12.58 seconds; see Table B.4 and Figure 4.8. As shown later, RRD can in some cases beat RTS on the generated FID data set (GFID), as shown in Figure 4.12. RRD is a simple algorithm, but it might be a good alternative and will be included in the experiment to find a recommendation of an algorithm to use for clustering. However, it is still a comparison between a purely random-driven algorithm and two "intelligent" ones, SA and RTS. Whenever RRD gets stuck in a local


minimum, it will terminate, whereas the other algorithms can escape from there and go further down in the search space. In other words, RRD is fast because of its simplicity and might be lucky enough to find a good solution, but it is hard to justify against a slower but more intelligent strategy⁵.

⁵ Throwing darts while being blindfolded might be an extreme but good example: if you know the approximate direction, you might hit the bullseye on the first try.

4.3.3 Reactive Tabu Search and Simulated Annealing

Starting with the following hypothesis

Hypothesis 4.1 SA and RTS are equivalent for the clustering process when both use default parameter settings.

we used an experiment in which we did 20 runs of RTS on FID with different starting points. The clustering parameters are kept constant for this experiment with g = 3, H = 20 and T = 0. The number of iterations was chosen during the experiment from the list (150, 300) to obtain the percentage of runs that were solved with the solution GSI2 within a given number of iterations. The parameter setting for all runs is CHAOS = 3, REP = 3, INITK = 7, DECPER = 95, INCPER = 105, CYCMAX = 450, and is referred to as RRTS1. The same experiment was performed with SA using the parameter settings suggested by [JAMS89] with TF = 0.95, IP = 0.4, MP = 2, and SF = 16, referred to as RSA1. After the same 20 runs we noticed the longer running time as the biggest dissimilarity between SA and RTS. As shown in Figure 4.8, RRTS1 solves all runs after only 300 iterations with an average of 12.6 seconds, whereas SA needs an average of 184.84 seconds to solve 90% of the runs. The results suggest that the parameters for SA have been chosen in such a way that the cooling process takes too long and the algorithm runs longer than it has to. This leads to a new hypothesis.

Hypothesis 4.2 SA with a parameter setting leading to shorter runs than the standard setting will perform as well as RRTS1.

To evaluate this hypothesis we execute a new experiment with shorter running times for SA by changing the parameter setting. Based on experiments by [JAMS89], the parameters were chosen as TF = 0.93, IP = 0.3, MP = 2, SF = 8 and TF = 0.9025, IP = 0.2, MP = 2, SF = 8 to reduce the time needed to freeze the algorithm. We refer to these runs as RSA2 and RSA3.


Figure 4.8: Fraction of GSI2 found with RRTS1 and RSA1 in FID over time, 20 runs with different starting points.

Instead of using SA with the suggested parameter setting, we accelerated the cooling process and shortened the search time. The disadvantage of having less time can lead to results that are not as good as before. For a fair experiment, the RTS algorithm also needs a small disadvantage in the form of parameter settings other than the suggested ones. We apply this by using randomly chosen uniform parameter settings instead of the suggested ones, precisely CHAOS and REP uniform on [2,4], INITK uniform on [7,37], DECPER uniform on [60,95], INCPER uniform on [105,120], and CYCMAX uniform on [150,450], referred to as RRTS2. The results of these and the previous runs are shown in Figure 4.9, using a log scale on the abscissa. Surprisingly, RSA3, the setting with the fastest cooling, shows a result of the same quality as RSA1, but obtained after a third of the time. Due to the subject of the thesis, we do not investigate the reason for this. We are mainly interested in the fact that RTS quite obviously is the winner of this experiment. This can be seen even without a statistical analysis of the results: it needs less time and has a larger fraction of the solution GSI2 found by the algorithm. Hypothesis 4.2 is therefore rejected, but we will verify this behavior of the two algorithms by running more tests on generated data. Our experiment is modified in that we use three different generated data sets with different parameters and run RSA1, RRTS1, and RRD on them. In this test we "cheat" in the sense that we assume knowledge about the running time of the other algorithm in advance. The advantage of this cheating lies in the design of the experiment.


Figure 4.9: Fraction of GSI2 found with RRTS1, RRTS2 and RSA1-3 in FID, 20 runs with different starting points.

We run RSA1 first and afterwards let RRTS1 and RRD find a solution of at least the same quality. The time is stopped and used for comparison. This "cheating" can be accepted since our goal is to use these results to show that one algorithm is significantly better than another one. As an initial run we choose a fixed number of iterations for RRTS1 based on n and the data set characteristics (dimension p, number of clusters g, and overlapping clusters in GFID). The values we choose for the iterations are based on practical experience gained during the tests, but not on any other criteria. We used the generator GEN1 with the data input and parameters shown below. The generation and composition of the data sets are described in detail in Section 2.1.2.

- DATA1: Using data DS3 with n = (100, 300, 500), fraction = 1, seed = (2343, 3264)
- DATA2: Using data DS1 with n = (100, 300), fraction = (1, 2), seed = (2343, 3264)
- GFID: Using data FID with n = (100, 200), fraction = (1, 2), seed = (2343, 3264)

We set the running time for RRTS1 to 2·n for DATA1, 5·n for DATA2 and, because of the overlapping clusters in the GFID, to 10·n on the GFID. In our first run, RSA1 was able to find a better solution on two


instances (100,2,2343 and 200,2,2343), but also consumed far more time. Thus, we reran RRTS1 with more iterations to find out whether it can find a better solution than SA and still be faster. Only the final results are given in Table B.3 and visualized in Figures 4.10 to 4.12. Each category symbolizes one data set with the parameter settings as described above. The bar chart⁶ represents the objective function value f*; a higher bar indicates a better result.

⁶ In Figures 4.10 and 4.11, the SA runs for the last data set are not available due to computational errors.

Figure 4.10: RSA1, RRTS1, RRD on generated data set DATA1.

In our tests, the objective function values f* are the same in most of the runs, or RRTS1 finds a lower value than the other algorithms. Only on the GFID data set with the parameters 200,1,2343 did SA arrive at a different solution with a slightly better f*. Nevertheless, this difference in the order of magnitude of the first decimal is insignificant compared to the time difference, where RTS is almost three times faster. As mentioned before, RRD is faster than RRTS1 in some cases. In this experiment RRD was faster on the last instance of DATA1 and on two instances of GFID. On DATA2, it needed a little more time than RRTS1 for the instances with n = 100, but around five times as much for the data sets with n = 500. In the experiment on the GFID, RRD fluctuates between 45 and 230 seconds, with no visible relation between running time and quality; there is also an outlier with almost 1000 seconds. This shows that RRD can compete with RTS in quality and time for small data sets.


Figure 4.11: RSA1, RRTS1, RRD on generated data set DATA2.


Figure 4.12: RSA1, RRTS1, RRD on generated data set GFID.

Using larger data sets, the evaluation of the objective function value gets more expensive, and exploring the solution space rather than going straight down from multiple random starts saves valuable computation time. Compared to RSA1, RRD needs less time on the first two data sets, DATA1 and DATA2. On GFID, RRD was faster in 50% of the cases. The results show that RRD is good on well-separated data sets, but using it on the GFID with overlapping clusters shows a decrease in performance: it can still find the good solution, but needs more restarts and therefore more time. Here we stop analyzing the behavior of the three algorithms and conduct a last experiment to estimate the time needed to find the best solution with a certain probability P_S. We executed the same runs with different data sets


and more instances of them, but the result was the same. We clearly saw the same ordering as above, suggesting a certain ranking of quality for the algorithms. Doing statistics on these results of RRD, SA, and RTS would not tell us much more. We can say that RTS would be the choice if you want the result faster but with at least the same quality as SA, and for just some solution, RRD with only a few random restarts would be best. Therefore, another experiment is done to estimate the time t necessary to find the best known solution with a certain probability P_S for each algorithm.

    P_S = 1 − (1 − P_F)^N    (4.13)
    N = ln(1 − P_S) / ln(1 − P_F)

with P_S being the probability with which the result will be found within t seconds, P_F being the percentage of runs in which the solution GSI1 was found out of 100 runs, and N being the number of runs, i.e. the multiple of the average time t̄, needed to reach the fraction P_S. We have to measure P_F in order to calculate the other values. During the experiment, RRTS1 was executed on the FID 100 times with different starting points for 300 iterations. We get a fraction of P_F = 88% and an average time of t̄ = 17.5 seconds. Table 4.1 shows the time t = t̄ · N for different probabilities P_S.

    Probability P_S      0.5      0.8      0.9      0.95      0.99
    t RRTS1              5.72     13.28    19.00    24.73     38.01
    t RSA1 Shortest      112.25   260.63   372.88   485.13    745.76
    t RSA1 Average       290.6    674.75   965.35   1255.95   1930.71
    t RSA2 Shortest      66.45    154.28   220.73   287.17    441.46
    t RSA2 Average       158.29   367.53   525.82   684.11    1051.64
    t RRD                86.43    200.63   287.10   373.53    574.20

Table 4.1: Time needed to get the best solution with a probability of P_S.

The same runs are done with SA, for both RSA1 and RSA2. Due to the large range of running times for SA, we use the average time, but also the shortest time found, to calculate t for different probabilities P_S. The fraction P_F for RSA1 is 51%, with the shortest run at 115.52 seconds and an average of t̄ = 299.07 seconds. RSA2 has a fraction of 32% and an average time of t̄ = 88.07 seconds, with the shortest run at 36.97 seconds. Finally, 100 runs of RRD with 20 random restarts each are done. The fraction for RRD is 59% and the average is t̄ = 111.17 seconds. Analyzing these results, the usage of RTS is recommended for clustering. Compared to the fastest run of SA, the RTS algorithm needs only 8.6% of


Analyzing these results, the usage of RTS is recommended for clustering. Compared to the estimate based on the fastest run of SA, the RTS algorithm needs only 8.6% of the time. Comparing RTS to the average time of RSA1, RTS would need only a remarkable 1.9% of the time. Despite the simplicity of RRD (or maybe because of it), it shows a better time performance than the average SA and could therefore be a better suggestion. Compared to RTS, however, RRD still needs more time for a probability of 0.5 than RTS needs for a probability of 0.99. It might be important at this point to mention that these results may not be valid in general. The estimated time is calculated for only one problem. We wanted to demonstrate the application of local search strategies to clustering problems before we continue with data mining, not to do a general analysis of performance. All we have done is to show a trend of how well the methods perform. A complete performance experiment would be basically the same, but would use more data sets with more different characteristics to check the behavior of each method. The conclusion of this comparison is the rejection of hypothesis 4.1, because there is no significant evidence that SA could beat RTS in clustering data. We also conclude that SA might be a bad choice for the clustering process. RTS is the best choice if an answer of good quality is needed. RRD is the fastest, and therefore the choice for finding a solution in the shortest time. RTS and RRD have another advantage over SA: it is possible to let each of them run for a certain amount of time and let it report the best result. This may be done with SA as well, but SA will converge based on its cooling schedule. After showing the data mining process in the next section, an algorithm specially developed for the task of clustering will be introduced and run against these three.

Chapter 5

Mining the Data from Experiments on Algorithms Using Maximum Likelihood Clustering

5.1 Introduction

Having introduced a statistical model for the clustering of a data set and the generic local search, we are going to show an application of clustering using the local search strategies. We apply these strategies to a clustering problem to generate a new data set containing the parameters of the local search and its performance. In a data mining process this data set will be analyzed by clustering it again with the same local search strategy, with the goal of gaining insight into the behavior of the algorithm. Before the calculations, we have to derive a problem instance from a generic problem which will be used for the experiments. In our case, the MINO problem formulation is used for the tests. Furthermore, the neighborhood introduced in a previous chapter is used, as well as the evaluation of the objective function value. Before we continue with the problem description, we state the generic hard optimization problem, given by

    min f(θ)                                                          (5.1)
    subject to: θ ∈ Θ

with θ being the decision vector, also called the solution vector, and Θ the set defined by the constraints applied to θ. This formulation of the problem will be referred to


as (P). We will use the mathematical formulation of the problem MINO as shown in Appendix A, with its constraints, as an example. There are no known algorithms that are able to solve this problem optimally in a reasonable amount of time. Therefore, using heuristics allows us to find a good solution, as close as possible to optimality, while using only a fraction of the time an exact algorithm would need. The local search strategies that we have introduced will be used in this chapter for two different experiments. The first analyzes the usability of local search for clustering data sets, whereas the second applies data mining to analyze the results of Simulated Annealing and Reactive Tabu Search by applying them to their own results. The second experiment, the analysis of algorithms, uses the algorithm performance and its results to find out whether new relationships can be discovered; e.g., a correlation between two parameters could lead to a combination of these two into one parameter. We refer to the parameters that influence the operation of the heuristic as π. h(π, P) is defined as the resulting vector of applying a heuristic algorithm with the parameter setting π to the problem instance P. The result vector h will consist of the running time consumed by the heuristic and a measurement of quality. The quality can be expressed as a best objective function value f* and/or by a variety of statistical values, e.g. the number of misclassified elements, the number of repeated solutions, or the point of time where the best solution was discovered. We refer to a data set for the analysis of algorithm results as Z, which is a set of vectors z defined by

    z = (π, h(π, P))                                                  (5.2)

The dimension of Z is p and it consists of n vectors z. This definition is useful because the results of the algorithm are defined as a data set and can therefore be handled as an input for the next run. This fact is most important for the data mining process we will examine in Section 5.2.

5.2 Data Mining on Local Search Strategies

5.2.1 Data Mining

Before using the process of data mining in conjunction with local search strategies, we provide a brief introduction and description of data mining. Data mining is known under different names depending on the group of people using it. Statisticians use the names data mining or clustering, whereas AI and machine learning researchers talk about knowledge discovery in databases (KDD).


Even if there are further names like data archaeology, data segmentation, or information discovery, the process of looking at large data sets to discover new information is common to all of them. In the literature, a distinction is made between KDD and data mining. KDD describes the whole process of preparing the data, analyzing it, and also evaluating the results. Data mining refers to a process within the KDD process and covers the analysis and extraction of data patterns, but not their interpretation. For a complete reference on the KDD terminology see [KZ96]. We use the term data mining, even though we are closer to a KDD process, because we do not use tools or applications, but rather work directly on the raw data. A definition of KDD is also given by [FPSS96], where the process is described as "the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data." We will apply KDD and/or data mining to the results of an algorithm which was applied to a data set to discover clusters. The input data set for the data mining consists of the parameters of the algorithm and the results of its application. The interesting part is that the same algorithm is applied to the newly generated data set to find patterns in its results. Data mining can be seen as another way of doing research. Instead of formulating a problem or hypothesis in advance, the data set itself is analyzed and the results are used to formulate the problem or hypothesis. We do not know what to look for, but hope to find something interesting, without defining interesting here, and maybe some new, unknown ideas for formulating a hypothesis which can be tested afterwards. Data mining is demonstrated in Sections 5.2.2 and 5.2.3, where we analyze the results of the Simulated Annealing and Reactive Tabu Search algorithms to rediscover properties of the algorithms themselves. As said in [GW98], "data mining is performed to discover important patterns in the data. ... The information requirements are defined relatively easily, but what is found can be surprising."

5.2.2 Data Mining on Simulated Annealing

The well analyzed SA algorithm serves as a first example to demonstrate the data mining process on the results of the clustering. All runs are executed on a fixed data set, here the FID, with fixed clustering parameters, g = 3, H = 35, and T = 2, allowing us to vary only the four SA algorithm parameters themselves. During the experiment, the knowledge about the number of clusters is provided, pretending not to know the data


points for each cluster. The parameters of the SA algorithm are selected randomly for each run. The results build a result vector h consisting of the four parameters, the CPU time CPU in seconds, and the objective function value f*. The SA parameter values are chosen in the following way, with a different pseudo-random number stream for each run:

- IP uniform on [0.1, 0.7]
- TF uniform over a log scale on [0.9, 0.99]
- SF uniform over a log scale on [8, 128]
- MP uniform on [3, 8]
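A minimal sketch of how such a random parameter setting could be drawn (our own illustration, not the thesis code; we read "uniform over a log scale" as drawing the logarithm uniformly, and we do not round MP):

    import math
    import random

    def draw_sa_parameters(rng):
        """One random SA parameter setting, following the ranges listed above."""
        ip = rng.uniform(0.1, 0.7)                                  # IP
        tf = math.exp(rng.uniform(math.log(0.9), math.log(0.99)))   # TF, log-uniform
        sf = math.exp(rng.uniform(math.log(8.0), math.log(128.0)))  # SF, log-uniform
        mp = rng.uniform(3.0, 8.0)                                  # MP
        return ip, tf, sf, mp

    # A different pseudo-random number stream for each of the 200 runs.
    settings = [draw_sa_parameters(random.Random(seed)) for seed in range(200)]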

The results of 200 runs are shown in Table B.6. They provide us with a new data set of size n = 200 and dimension p = 6. Each data point consists of the parameters (IP, TF, SF, MP, f*, CPU). We refer to this data set as SAI and use it for the data mining process, trying to find interesting clusters in it. Table 5.1 shows the properties of the data set SAI. Besides the significant positive correlation¹ between CPU and the values of IP and TF, which is expected from experience, the parameters are mostly uncorrelated because of the large range of the parameter variation. In particular, the objective function value f* and the running time CPU are uncorrelated.

In the data mining process, we vary the number of groups g, the number of unassigned points T, and the minimum number of elements H for each run of SA with the setting J to receive a set of diverse solutions. Precisely, g was chosen from the list (2,3,4,5), H from the list (20, 25, 30, 35, 40), and T from the list (0,5,10,20). Due to the amount of results of this clustering process, we need to preanalyze the data. Besides properties like cluster separation, we use the fact that we know the good results. Depending on the objective function value, the standard deviations, and especially the mean of the data within a cluster, we can tell whether it is an interesting cluster formation or not. For example, a mean close to the best solution found, together with a low standard deviation, shows that the cluster is formed by almost only good solutions.

¹ The covariance matrix Σ is transformed into the correlation matrix R by the Pearson correlation coefficient, introduced in 1896 by Pearson [Pea96]. The equation is

    r_lk = σ_lk / √(σ_ll · σ_kk),   l, k = 1, ..., p                  (5.3)

with r_lk being the elements of the correlation matrix and σ_lk the elements of the covariance matrix Σ.
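A minimal sketch of the transformation in equation 5.3 (our own helper, assuming NumPy is available; the data matrix Z below is an arbitrary stand-in):

    import numpy as np

    def correlation_from_covariance(sigma):
        """Pearson correlation matrix R from a covariance matrix:
        r_lk = sigma_lk / sqrt(sigma_ll * sigma_kk), cf. equation 5.3."""
        d = np.sqrt(np.diag(sigma))
        return sigma / np.outer(d, d)

    # Example on the sample covariance of an n x p data matrix Z:
    Z = np.random.default_rng(0).normal(size=(200, 6))
    R = correlation_from_covariance(np.cov(Z, rowvar=False))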


         TF        IP        MP        SF        CPU       f*
μ        0.948     0.389     5.475     40.945    235.315   -1647.390
σ        0.026     0.169     1.665     31.390    306.452   12.761
TF                 0.00      -0.12     0.05      0.63      -0.16
IP                           -0.06     -0.01     0.49      0.10
MP                                     0.10      -0.12     0.37
SF                                               0.01      0.11
CPU                                                        -0.13

Table 5.1: Mean, standard deviation, and correlation matrix for the data set SAI.

We can then check whether the cluster is formed because of similar parameter settings or because of similar CPU times, and how the mean parameter setting differs from the suggested one. This applies particularly in the case where we try to rediscover the parameter setting described in [JAMS89]. Furthermore, the knowledge of the correlations for a well working algorithm will help us in this process. Having the solution of a good clustering in mind, we analyze all clustering results, presorted by properties like separation and size of the clusters. The clustering with the parameters g = 2, H = 30, T = 10 was found to be one with interesting attributes. Table 5.2 shows the correlation matrix of this solution. The third group is the group of unassigned data points, but its size is greater than p, and therefore it can be treated as a real cluster. This cluster is also the most interesting one, because its means are close to values that suggest a good performance, as proposed by Johnson et al. Before we continue, we should mention that we call all solutions with an objective function value below -1650 good; the others will be called bad. The correlations with time and objective function value are mostly pro-intuitive and almost all significant to at least 0.9.² Before continuing with the cluster of unassigned points, the real clusters will be analyzed. The first cluster, with 87 data points, has the mean of its objective function value just above the value for good clustering, and it contains only a few bad solutions. The average running times are higher than with the standard setting because of the high temperature factor TF and the high values of MP and SF.

² If the correlation is r = 0 and the number of points in the cluster is N, the variate r √((N - 2)/(1 - r²)) has a Student-Fisher t distribution with N - 2 degrees of freedom; this can be used to test the hypothesis that r = 0.


Cluster I - 87 data points
         TF        IP        MP        SF        CPU       f*
μ        0.97      0.36      5.19      41.83     406.23    -1649.87
σ        0.01      0.16      1.70      27.62     305.22    12.83
TF                 0.24      0.02      0.02      0.60      -0.08
IP                           -0.20     -0.04     0.75      0.00
MP                                     0.21      -0.21     0.48
SF                                               -0.04     0.06
CPU                                                        -0.11

Cluster II - 103 data points
μ        0.93      0.41      5.67      36.74     198.88    -1644.76
σ        0.10      0.17      1.63      29.33     105.70    12.58
TF                 0.10      0.02      0.01      0.42      0.09
IP                           0.03      0.02      0.83      0.15
MP                                     0.00      0.07      0.29
SF                                               0.08      0.26
CPU                                                        0.07

Unassigned points - 10 data points
μ        0.97      0.45      5.91      76.50     923.64    -1652.86
σ        0.03      0.16      1.45      55.99     644.75    7.78
TF                 0.41      -0.49     -0.71     0.95      0.38
IP                           -0.47     -0.21     0.51      0.28
MP                                     0.25      -0.30     -0.17
SF                                               -0.76     0.22
CPU                                                        0.25

Table 5.2: Mean, standard deviation, and correlation matrix of one clustering result of SAI with H = 30, g = 2, T = 10.

Many correlations indicate that Simulated Annealing performed well, but there are also many correlations that are insignificant. The second cluster shows even more insignificant correlations, against our expectation of good performance. Due to this, it is understandable that the cluster is formed by bad solutions. The average CPU time is rather low and shows that most of the solutions result from short runs. This is also indicated by the low mean of the temperature factor and the mean of the size factor, which is relatively low compared to the other two clusters. The third cluster, respectively the group of unassigned data points, can


be called the "cluster of good solutions". The mean of the objective function value is the lowest we found and shows that only good solutions are grouped together. The average time indicates that the quality is a result of longer running times. Furthermore, all the correlations are significant and behave the way we would expect them to for a good performance. Even though the means of the parameters show that the solutions are a result of a setting for longer runs, they reflect the suggested settings. Although we analyzed more clustering results with different parameter settings, the one above was the only interesting one. We never discovered a larger group of good solutions than the above described group of unassigned points. This makes us believe that Simulated Annealing performs badly in the clustering process and might not be the best algorithm.

5.2.3 Data Mining on Reactive Tabu Search

Analogous to the experiments with SA on the FID, the RTS algorithm was to be run 200 times on the FID, using the same kind of variation of parameter settings and random number streams. As shown in Section 4.3.3, however, RTS is able to find the optimal solution within 300 iterations in all cases. Therefore, we had to change the experiment to use a different data set which is harder to solve than the FID. The generated data set DATA1 with the parameters n = 400, p = 10, fraction = 1, and seed = 2343 has the characteristic of being hard enough not to be solved every time, while the best solution can still be found within a reasonable time. The RTS algorithm was applied to this generated data set using the following randomly changing parameter setting for all runs:

- CHAOS uniform on [1, 6]
- REP uniform on [1, 6]
- INITK uniform on [3, 40]
- DECPER uniform on [60, 95]
- INCPER uniform on [105, 130]
- CYCMAX uniform on [200, 600]

All values are rounded to the closest natural number. To get a larger variety of parameter settings, we used a positive, a negative, or no correlation between CHAOS and REP, and a negative or no correlation between DECPER and INCPER; one possible way to realize these couplings is sketched below.
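The sketch below draws one RTS setting from the ranges above. The coupling of CHAOS/REP and of DECPER/INCPER is our own simple realization of perfect positive or negative correlation; it is not necessarily the scheme used for the actual experiments.

    import random

    def draw_rts_parameters(rng, chaos_rep=0, dec_inc=0):
        """One RTS parameter setting. chaos_rep: +1, -1 or 0 for a positive,
        negative or no correlation between CHAOS and REP; dec_inc: -1 or 0
        for a negative or no correlation between DECPER and INCPER."""
        chaos = rng.randint(1, 6)
        if chaos_rep > 0:
            rep = chaos                     # perfectly positively correlated
        elif chaos_rep < 0:
            rep = 7 - chaos                 # perfectly negatively correlated
        else:
            rep = rng.randint(1, 6)         # independent
        initk = rng.randint(3, 40)
        if dec_inc < 0:
            u = rng.random()                # one draw moves both in opposite directions
            decper = round(60 + 35 * u)
            incper = round(130 - 25 * u)
        else:
            decper = rng.randint(60, 95)
            incper = rng.randint(105, 130)
        cycmax = rng.randint(200, 600)
        return chaos, rep, initk, decper, incper, cycmax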


The number of iterations was limited to 1200, an experimental value at which the problem can be solved most but not all of the time. In fact, we received 132 solutions from 400 executions without misclassified data points, showing the lowest objective function value f* found during this experiment. Confronted with two different types of solutions, one with zero classification errors and one with a varying number of them, the solutions are immediately split into two data sets. The first one gets all the good solutions assigned, referred to as the good data set, the other only the bad ones, referred to as the bad data set. Each vector of these data sets is constructed out of the six parameters that had been used in the experiment on DATA1, namely CHAOS, REP, INITK, DECPER, INCPER, and CYCMAX. The objective function value f* and the running time CPU can be discarded. In the good data set these values are all the same, and the clustering algorithm would cluster the data on these co-planar values. In the bad data set, these values are not needed anymore, because we are mostly interested in which parameter settings cause the bad performance and in their correlations, rather than in finding out which setting causes which bad performance. Before we continue with the data mining process, our goal should be formulated. There are two data sets of different solution quality, and we assume that this is caused by the different parameter settings. As before for SA, both data sets are clustered with RTS using the settings RRTS1. The clustering parameters are varied as before: g from the list (2,3,4,5), the number of points to keep unassigned T from the list (0,5,10,20), and H from the list (20, 25, 30). After executing all clustering processes, the results were analyzed. For the bad data set, there was no occurrence of an interesting clustering. Nevertheless, with the setting g = 2, T = 10, and H = 30 on the good data set, a clustering with 2 clusters and 10 unassigned points was found to show an interesting behavior. Table 5.3 shows the correlation matrix of this solution. Cluster I shows an interesting correlation between the parameters CHAOS, REP, and INITK. The correlation with INITK is positive, whereas the parameters CHAOS and REP even show perfect correlation. Due to the fact that this occurs in the good data set, we suggest that these three parameters can be combined into one parameter. Furthermore, the correlation between INCPER and DECPER is -1, suggesting that we should vary the percentages of increasing respectively decreasing by the same amount. The other correlations are not significant. The other cluster and the group of unassigned points do not have any interesting correlations, so we restrict our further analysis to the first cluster.


Cluster I - 55 data points
         CH        RP        IK        DC        IC        CM
μ        3.73      3.73      20.75     80.30     119.71    433.48
σ        1.91      1.91      12.12     8.77      8.77      118.43
CH                 1.00      0.36      0.00      -0.00     0.06
RP                           0.36      0.00      -0.00     0.06
IK                                     0.14      -0.14     0.04
DC                                               -1.00     0.04
IC                                                         -0.04

Cluster II - 72 data points
μ        3.49      3.60      18.51     80.79     117.93    420.57
σ        1.75      1.75      10.95     10.22     7.88      121.76
CH                 -0.28     0.07      0.20      0.09      0.17
RP                           -0.02     -0.28     0.13      0.07
IK                                     0.04      -0.32     -0.16
DC                                               -0.02     -0.01
IC                                                         0.15

Unassigned points - 10 data points
μ        4.80      3.61      25.91     75.40     120.71    415.30
σ        1.87      2.07      14.15     14.02     10.57     166.31
CH                 -0.22     -0.14     0.33      -0.13     0.01
RP                           0.01      -0.10     -0.00     -0.20
IK                                     0.15      0.16      0.22
DC                                               -0.01     -0.49
IC                                                         0.39

Table 5.3: Mean, standard deviation, and correlation matrix for the clustering of the good results of data set DATA1 with H = 30, g = 2, T = 10. We use abbreviations for the parameters of the Reactive Tabu Search.

Note that for the first two clusters the means of the parameter settings are close to the suggested ones. The result of this data mining process is the possible reduction of six parameters to four. There is another possible hypothesis of having an algorithm with only three parameters by setting the parameter INCPER negatively correlated to DECPER. Not being interested in the actual hypothesis but in finding new ones, we restrict our analysis to the version with four parameters and leave the other hypothesis open for further research. The data mining can only suggest new ideas and hypotheses that otherwise might not


have been discovered; the hypothesis testing still has to be done afterwards. The next experiment will test the following hypothesis.

Hypothesis 5.1 The RTS algorithm with four parameters works as well as with six parameters. The two dropped parameters are set in relation to the others.

The data mining was used as an example for the clustering methods, and therefore the hypothesis testing will be kept short. Using the hypothesis, we can set up a new experiment running RTS on generated data sets. For this example, we used runs of 2000 iterations on the data set DS3 with size = 100, fraction = 1, and seed = 2343. The parameters REP, CHAOS, and INITK are set by

    REP = CHAOS,   INITK = CHAOS · α                                  (5.4)

with α being the factor that sets the value of INITK. CHAOS is uniform random in [2,4], DECPER = 95, INCPER = 105, and CYCLE is uniform random in [200,600]. We refer to this parameter setting as RRTS_α with α in [2, ..., 7]. RTS with six parameters is used with random uniform settings for CHAOS in [1,6], REP in [1,6], INITK in [3,4], CYCLE in [200,600], and fixed DECPER = 95 and INCPER = 105, referred to as RRTS2. The second parameter setting is similar to the RRTS1 run, with a random uniform setting for INITK in [3,30], referred to as RRTS3. In Table 5.4, the statistics in form of mean μ, standard deviation σ, and variance σ² are shown for each parameter setting of the Reactive Tabu Search algorithm, based on 24 different runs. We apply a two-sided t-test (e.g. see [KDKR93] or [BHS78]) to this data to see if the version with four parameters differs from the original one. The t-test is done by calculating the pooled, estimated variance of the difference of means between RRTS2 respectively RRTS3 and the results of RRTS_α:

    σ̂_p² = ((N_1 - 1)·σ_1² + (N_α - 1)·σ_α²) / (N_1 + N_α - 2)        (5.5)

where N_1 is the degree of freedom for the RRTS2 respectively RRTS3 results and N_α for the RRTS_α results. The meaning of the variances σ_1² and σ_α² is analogous to that of N_1 and N_α. The next step is to estimate the standard error derived from the pooled variance using

    σ̂ = √( σ̂_p² · (1/N_1 + 1/N_α) )                                   (5.6)


         RRTS2     RRTS3     RRTS_α=2  RRTS_α=3  RRTS_α=4  RRTS_α=5  RRTS_α=6  RRTS_α=7
Mean     -549.20   -546.86   -560.28   -550.34   -541.05   -542.93   -539.84   -533.84
σ        36.25     24.80     37.44     23.74     16.36     21.90     19.48     16.74
σ²       1313.92   615.04    1401.67   563.66    267.52    479.45    379.62    280.17

Table 5.4: Results of 24 runs for each algorithm. The mean and standard deviation of the solution quality are used to compare the Reactive Tabu Search with the suggested setting and with the four-parameter setting.

         RRTS_α=2  RRTS_α=3  RRTS_α=4  RRTS_α=5  RRTS_α=6  RRTS_α=7
RRTS2    -1.04     -0.13     1.00      0.73      1.11      1.88
RRTS3    -1.46     -0.50     0.96      0.58      1.09      2.13

Table 5.5: The t statistic between the suggested parameter setting and the Reactive Tabu Search with four parameters.

Finally, the t statistic for a test using two samples is calculated by

    t = (μ_1 - μ_α) / σ̂                                               (5.7)

with μ_1 being the mean of the results of the algorithm RRTS2 or RRTS3, and μ_α being the mean of RRTS_α. The degree of freedom is 24 for each algorithm. All results are shown in Table 5.5. The values of the t distribution for 48 degrees of freedom can be taken from a table (see e.g. [Sac92]); for a two-tailed test at the 5% level they are -2.012 and 2.012. Therefore, with all but one value of our test statistic within this range, we can conclude that there is no credible evidence that the algorithms perform differently. The best results can be found for α in the interval [3, ..., 5]. The experiment of analyzing the behavior of a Reactive Tabu Search with a reduced number of parameters shows a possible usage of data mining and its application in the field of heuristics. This experiment was too small to rely on as a proof for the reduction, but it showed nicely that relations in a data set can be found by applying clustering algorithms.
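The test statistic of equations 5.5-5.7 can be computed directly from the values in Table 5.4; a small sketch (our own helper):

    import math

    def pooled_t_statistic(mean1, var1, n1, mean2, var2, n2):
        """Two-sample t statistic with pooled variance (equations 5.5-5.7)."""
        sp2 = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)   # pooled variance
        se = math.sqrt(sp2 * (1.0 / n1 + 1.0 / n2))                 # standard error
        return (mean1 - mean2) / se

    # RRTS_alpha=2 against RRTS2, 24 runs each (Table 5.4); up to the sign
    # convention this reproduces the first entry of Table 5.5 (about -1.04).
    t = pooled_t_statistic(-560.28, 1401.67, 24, -549.20, 1313.92, 24)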

5.2.4 Conclusion

The data mining section closes the first part of the diploma thesis, in which we started with the introduction of the model and the generic algorithm and continued by demonstrating an application of clustering. We applied three local


search strategies to different data sets to analyze their behaviour. Remember that Tabu Search was our choice if we needed a result as good as possible in a fairly short time, and Random Restart Descent in case we need only a solution but do not care about the quality being the best. We also showed that Simulated Annealing performed badly, and we decided not to suggest that algorithm for clustering. Afterwards, during the application, we used SA one more time. Using Simulated Annealing to cluster data sets, and afterwards its results, we were able to rediscover the parameter settings originally suggested by [JAMS89]. We further discovered that Simulated Annealing might not be the best method. We executed the same experiment with Reactive Tabu Search and discovered an interesting relation in the results of the first clustering. Based on these results, we developed the idea of melding three parameters of the Reactive Tabu Search into one. This idea was supported by a small experiment. Generally, Reactive Tabu Search and Descent are useful for clustering, either for finding a fast solution or for searching in a larger area of the solution space. Nevertheless, the disadvantages of lacking specialization for the clustering process and of not being convergent are the reason why the second part of the thesis was developed.

Chapter 6

Seed Clustering - A New Approach

6.1 Introduction of Seed Clustering

In this chapter, we will introduce an algorithm specially designed for the clustering process. It belongs, as the other algorithms before, to the partitional clustering methods and allows us to find a given number of partitions or clusters in a data set. The algorithm uses an exclusive, intrinsic, and simultaneous partitional clustering. As before, we use the objective function of the problem MINO. As in earlier chapters, we use an extra group Z_{g+1} for the unassigned data points. It is used to protect against noise in the data set: the clustering algorithm can assign data points that would increase the objective function value if they were members of a cluster to the unassignment group. Especially during the seed generation and the growing to real clusters, most data points are not members of any cluster and have to be treated in a special way. The unassignment restriction of the MINO problem stays unchanged, allowing up to T data points not to be assigned. A short description of the method would be the picking of g seed points¹, each growing to a seed of points under a certain metric, and afterwards to a full cluster. Both the seeds and the clusters are transformed by moves. The seed point selection as well as the growing and moving will be described in this chapter and combined in an algorithm for clustering data sets. As shown in later chapters, the algorithm can easily be adapted to sub-sampling (see Chapter 8) of large data sets and to parallelization (see Chapter 9).

¹ Other names for seed points are reference points, centroids, medoids, or center points. The seed point can be a data point of the data set or a new, imaginary point, introduced as a reference.


Other approaches based on the same principle have been done before. The name "K-MEANS" was introduced by McQueen (see [McQ67]) and Ball and Hall (see [BH67]), but other names such as "moving center" in [Mir87], "K-center" in [Hak65], and "K-medoid" in [KR87] are used. The first two methods minimize the error sum of squared Euclidean distances, whereas the others use dissimilarity instead. They either minimize the sum of dissimilarities or minimize the maximum dissimilarity between the data point and the seed point. In the work of [And73] and [DJ80], an extensive discussion of iterative partitional clustering is given, along with the description of the basic algorithm that all methods have in common. The basic algorithm can be divided into two parts, the generation of an initial partitioning and the updating. They mention a third part, the adjusting of the number of clusters afterwards, which we do not use for our algorithm. The first part, the initial partition, is basically the construction of the first partitioning. In [DJ88], the initial partition is created by assigning the data points to the closest seed points or by using a hierarchical clustering and choosing the partitioning with g clusters. The centroid of the data set, which can be a virtual data point, and the g - 1 data points being most distant from each other and from the centroid may serve as seed points. Most of the time the Euclidean distance measurement is used. As mentioned in [DJ88], different initial partitions do not have to lead to the same solutions when running the algorithm. Thus, they suggest running the algorithm with different starting partitions. It is obvious that the initial partition is relevant for a successful clustering using these methods. While generating the first seed points for the seed clustering algorithm, we do not know anything about the structure or metric. The first seed point is selected randomly, but the other g - 1 seed points could be located most distant from the first one. The distance depends on the metric; using the all-data metric gives us a more or less inaccurate distribution of the seed points. A first improvement in the location of the seed points is achieved by extending them to a seed of data points. After growing the seed points, we let the seeds find a better location by reassigning data points to them, keeping the seed size ζ constant. The seed can be seen as a sub-unit of a cluster. Note that this solution with the seeds is not valid because of the small size of the seeds, but we will later show how to grow them to clusters and thereby gain a valid solution. After receiving a full solution, the metric of the clusters can be used as an initial metric to find the seed points for the next iteration. Note that K-MEANS uses neither the mechanism of seeds nor a metric other than the Euclidean one.


The second part, the updating of the partition, is called "walking" in the seed clustering algorithm. There are three main ideas that will be important for characterizing the behavior of updating the clusters:

1. The rules for the assignment of data points,
2. The selection of the metric for distance calculations between the seed points and data points, and
3. The frequency of updating the seed points and metrics.

Like most other algorithms, McQueen [McQ67] uses the Euclidean distance metric, and an update of the center is done after each assignment. The same metric, but with an update of the centers after all assignments, is used in [For65]. The Mahalanobis distance, using the changing covariance matrix of the cluster, is more complicated to calculate and has not been used in combination with these methods so far. We will introduce a different updating method that recomputes the covariance matrix after a given number of assignments. The updating is repeated, by generating new seed points out of the solution and redoing the update, as long as no stopping criteria are fulfilled. There is the possibility that an infinite oscillation between two solutions may occur during the updating or walking. A simple procedure that guarantees termination of the algorithm would be the introduction of a maximum number of iterations, see [DJ88]. Its disadvantage lies in the fact that this procedure could result in a non-convergent algorithm. Other methods of either detection or prevention of oscillations and cycles can achieve this convergence. Choices are stopping the updating as soon as either a step does not show an improvement in the objective function value or two successive iterations result in the same solution. The first method may have the disadvantage of stopping at plateaus instead of going over them and falling into the next "valley". The second takes care of this and stops only if there is no change in the solution after an updating step. On the other side, there is the chance of getting trapped in a cycle, which would lead to an infinite updating process. As a prevention of cycles, a hash table could be used to check for previous occurrences of the same solution, at which point the walking could be stopped. In practice, the K-MEANS method does not use special methods but converges to a solution within a few iterations, after which all assignments would be the same after another step. For further material on convergence see [SI84]. We will implement the second method with different convergence criteria and usage of hash tables. As mentioned in [KR90], the number of clusters for K-MEANS can be less than g if the seed point is not an element of the data set. In that case,


it might be possible that no data point is assigned to one of the clusters, causing an empty cluster after the updating. Using only seed points for the seed clustering algorithm that are elements of the data set, each cluster will always have at least one element, the seed point itself. We allow both real and imaginary seed points, but use a growing and repair mechanism, as described later, to assure that each cluster will have at least H data points. The third part, the adjusting of the number of clusters afterwards, can be done by splitting or merging the existing clusters. It does not have to be used in the seed clustering algorithm due to a repair mechanism applied to the clusters during the search process. This can be necessary if g, provided by the user, is not the natural number of clusters. A splitting might be executed if a cluster contains two groups that are actually well separated (this can be detected by another clustering process using other algorithm parameters specified by the user). If two clusters are closer together than a specified minimum distance, they can be merged into one big cluster. See [DJ88] and [BH64] for further details.

6.1.1 Terminology

Before we continue with the description of the algorithm and its components, we will introduce the terminology used in this chapter. We intend to use the same terminology as in the literature, but for some methods we introduce a new name or meaning. Wherever a different term is common, it will be mentioned and a brief description of the differences will be given.

- Seed point - A data point s_i is used as the initial seed point for creating a seed S_i for cluster i. The seed point s_i can either be an element of Z or an arbitrary point in the data space.

    if (ζ > p) then    (Seed must have at least ζ > p data points)
        "Calculate covariance matrix W and mean μ of S_i. The mean μ is used as the seed point."
        for l ← 1 to n do    (Create distance list between all unassigned data points and the seed point)
            if (γ_l = g + 1) then
                DL ← DL ∪ {(d²_W(μ, z_l), l)}
            end
        end
        "Sort DL ascending by distance"
        for l ← 1 to ζ do    (Assign the ζ closest data points to the seed i)
            D ← head(DL)    (Get the first tuple and remove it from DL)
            γ_{D[2]} ← i
        end
    else
        "Error, stop walking"
    end
    if ("Number of hits while storing γ in the hash table" > 0) then
        Walk ← false
    end
end

Figure 6.2: Pseudo-code for the walking of a single seed in the solution space.

The result is a solution that contains the newly generated seed s'_i. Walking is continued as long as every new seed s'_i differs in at least one data point from its predecessor s_i, and as long as no cycle occurs, which is the case when a previous seed is reconstructed. This can be checked by storing all seeds in a list and doing an expensive comparison. We will later describe in more detail the usage of a more convenient hash table to see whether a seed is rediscovered or not. The pseudo-code is shown in Figure 6.2, using a list for the distance calculation and a hash table that will be introduced later. Figure 6.3 shows an example of a single seed walking in the solution space, respectively in the space of unassigned data points. As one can see, the seed finds a position away from its original location.
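A runnable sketch of one way to implement this single-seed walking (our own NumPy version, not the thesis implementation; a Python set of index tuples stands in for the hash table, and the seed must contain more than p points so that the covariance matrix can be inverted):

    import numpy as np

    def walk_single_seed(Z, unassigned, seed, zeta):
        """Repeatedly recompute mean and covariance of the current seed, take the
        zeta closest unassigned points under the squared Mahalanobis distance as
        the new seed, and stop once a seed repeats (cycle) or stays unchanged."""
        unassigned = np.asarray(unassigned)
        seen = set()
        while True:
            S = Z[np.asarray(seed)]
            mu = S.mean(axis=0)
            W_inv = np.linalg.inv(np.cov(S, rowvar=False))
            diff = Z[unassigned] - mu
            d2 = np.einsum("ij,jk,ik->i", diff, W_inv, diff)   # squared Mahalanobis
            seed = tuple(sorted(unassigned[np.argsort(d2)[:zeta]]))
            if seed in seen:                                   # repeated seed: stop walking
                return list(seed)
            seen.add(seed)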


Figure 6.3: Example of walking a single seed. Instead of showing the whole data set, we focus on the area in which the seed is walking. Data points assigned to the seed are red, unassigned data points are shown in a yellow color. The mean of the seed is given by a black square, which does not have to be a data point. Frame A shows the initial seed after growing from the seed point, marked with a black outlined square. Frame B shows the seed after four steps. It moved towards the center of the data points. In Frame C, the seed flipped to the other side of the data points and walked back to the center, where it stops after three more steps.

6.4.2 Walking of All Seeds

Basically, this walking is the same as the one above, but all seeds are moved at the same time instead of keeping all but one constant. Therefore, a competition occurs for which seed gets which data point. A simple method is chosen because of the normally small size of the seeds. As before, the means μ_i and covariance matrices Σ_i with i = 1, ..., g are calculated and kept. All data points are set to be unassigned as described in equation 6.17. The next step is the growing of a new seed solution based on the stored values. The growing algorithm can be parameterized as

    growing(1, γ_∅, {Σ_1, ..., Σ_g}, {s_1, ..., s_g}, Z, ζ)           (6.19)

with γ_∅ being an empty solution where all data points are labelled as unassigned. The result of the growing process is a valid seed solution containing


g seeds of size ζ. For the walking process, the creation of new seed solutions based on the previous one is repeated as long as we do not get the same solution twice. The detection of cycles will be covered in Section 6.5.

6.4.3 Walking of All Clusters

The walking of the clusters can be realized in a similar way as the all-seeds walking algorithm. The final size of the groups must be changed, because a cluster could theoretically grow up to a size of n - T data points, leaving none for the other clusters. To prevent this from occurring, the growing is combined with a further algorithm, the repair. Even though there cannot be a sharp distinction between these two algorithms, the growing is mainly used to guarantee that all groups have at least H data points, and the repair fills them to their final size. Due to this approach, no cluster will end up empty, which would reduce the number of clusters in the data set. Instead of repeating the algorithm from the previous section, we use here a combination of the seed creation algorithm based on a solution described in Section 6.2.2, the growing algorithm in Section 6.3, and the repair mechanism that will follow in Section 6.6. The algorithm has the following structure:

1. Initialize the hash table as described in Section 6.5.1. All counters for the number of hits are set to zero.
2. Generate a seed solution γ_s based on the previous solution γ, using the algorithm in Section 6.2.1 but with the seed size ζ_w instead of ζ. The value ζ_w is given by the user.
3. Apply the growing algorithm to γ_s with the following parameters:

       growing(R, γ_s, Σ_s, s_s, U, H)                                (6.20)

4. Apply the repair mechanism.
5. Update the hash table with γ; if γ is already in the hash table, then stop.
6. Continue with step 2.

Note that the solution γ does not have to be a feasible solution for the MINO problem, but it must have at least g groups of size n_i > p, i = 1, ..., g. As before, the parameters of the growing indexed by s indicate that these values are not explicitly passed but calculated within the function.


This short algorithm describes the walking process that is mainly responsible for moving the clusters to the "correct" location. The starting location of the seeds is relevant for the walking and the time it will spend on it. Therefore, the preselection of the seeds is important. The algorithm does not give any details on how to generate the seeds or how to do the repair. Different methods are possible, and each of them, as well as the different parameter settings, will alter the walking process. The growing and repair process is mostly influenced by the selection of metrics, the seed size ζ_w, the refresh rate R, and the minimum number of data points H in a cluster. We described different possibilities above, but will outline a few more ideas in the next section.

6.4.4 Improvement of Walking

After the description of the basic algorithm, we are going to show some improvements that can be made. Some of these improvements are implemented in the software developed for the thesis, others are ideas for a future version. In our original statement of the algorithm, we allowed every walking method to change all the assigned data points of a group from one step to the next. This could be restricted in a variety of ways. The simplest would be to keep one data point of the group fixed. In particular, the seed point can be used as an anchor for the group to limit the walk to an area. This limitation is caused by keeping one data point fixed and also trying to keep the size as small as possible. Another idea could be to always fix the most distant data point from the seed point, trying to pull the walking in one direction while keeping it in a certain area around the seed point. If we know - and we leave the criterion of knowing out of this discussion - about a larger area in the solution space where we want the seed to be located, we could use something like a leash. The leash would be fixed to one data point, e.g. the first seed point of the actual walking process, and the center of the seed or cluster would not be allowed to move further away than a certain distance, the length of the leash. We were using the mean of the actual group for the determination of the next seed point. In case of only a small change in the mean or in the metric, the growing can result in the same group. The usage of another seed point besides the mean could push the walking in a different direction or keep it alive for a longer time. Some ideas are the usage of the data point that has the most data points within a certain distance, using the metric of the whole group, the data point with the smallest average distance to its closest points, or the


data point of the group with the lowest frequency in the frequency table. The stopping criterion of the walking is another way to influence the process. In the original statement of the algorithm, we stopped whenever there was a hit in the hash table, meaning an already seen solution. Other possibilities could be an increasing objective function value, similar to a Descent algorithm, an approach where we allow the objective function value to grow only by a certain percentage, or seeing a certain number of moves in a row with only increasing values. For each stopping criterion, we have to decide how to pick the final solution of the walking. We can either take the best seen solution, the last one before the objective function value increased, or the solution at which a stopping criterion stopped the walking. In the case where the hash table is used, there are a number of possible reasons for a stop of the walking. First, the same solution can be rediscovered, or a solution that merely has the same hash value. Assuming that we got the same solution, we are not able to tell whether we found a better one before or whether the one causing the hit is a good solution at all. Therefore, we can keep track of the solutions on the path and decide on other criteria like the objective function value, the size of the group, or the separation from other groups. In the case where we use the hash table, the last solution can either be a solution at a location it cannot walk away from, or we walked into a solution we saw before. There is no guarantee that this is a good solution or that we did not see a better one on our way through the solution space. The walking process can be combined with other methods and heuristics. We will implement an approach that uses the Random Restart Descent (RRD) from Section 4.1. It is applied to the group that was found by the walking process, with the goal of improving its quality. Instead of walking, RRD will exchange single data points to find a lower objective function value. Other heuristics like SA or RTS could be used.

6.5 Convergence Using Hash and Frequency Tables

In the first part of the thesis, we used three algorithms, of which two were based on an iteration counter. Both algorithms, Reactive Tabu Search and Random Restart Descent, have to be told in advance how long they should run. The walking algorithm, like SA, does not use an iteration counter, but in connection with two more or less complex "memories", a hash table (HT) and a frequency table (F), it is guaranteed to be convergent. The time that is needed to reach convergence is not predictable, mainly because it is


based on the structure of the data set and the settings of the parameters. Nevertheless, the two kinds of memories can be used to influence the running time distinctly. In addition, the parameter settings influence the time, as described in Section 6.8 and shown later in the experiments.

6.5.1 Hash Table for Walking

During the walking process we have to guarantee that we do not cycle in the solution space, i.e. one point flipping back and forth between two seeds. As described for the RTS, we use a hash table in which not the whole solution is stored; instead, a counter at the position of the hash value is increased. Whenever we retrieve a counter greater than zero, we assume that we saw that solution before and stop the walking process. As before, there is a certain inaccuracy, but this is acceptable because the algorithm is not exact but a heuristic. The inaccuracy is due to the possible misinterpretation of two different solutions which happen to have the same hash value. The hash function we use here is taken from [WZ93]:

    h = ( ∑_{i=1}^{n} γ_i · N_i )  mod  |HT|                          (6.21)

with h being the hash value, γ_i the i-th component of the solution vector, N_i the i-th element of a precomputed vector of uniform pseudo-random numbers, and |HT| the size of the hash table. The modulo with the size of the hash table is carried out to prevent an overflow of the table in case of large hash values. The counters for the repetition of a solution vector can be set to zero at different times of the algorithm. It can be done after each walking, presenting a clean and empty hash table for the next walking. Or we can initialize the hash table and its counters only at the beginning of a run and not clean it for the whole running time. The second possibility takes care of the case where the walks from two different seeds end up on the same path after a few moves. Whenever the solutions are the same at one point, the rest of the path will be the same, too. There is no reason to perform the same calculations twice, and therefore the running time can be shortened by truncating the rest of the calculation. On the other hand, there is the disadvantage that two totally different walks are seen as one just by having the same hash value for a single solution vector, but not the same path.
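A sketch of this hashing scheme (our own version; we use integer pseudo-random weights N_i so that the modulo operation is exact, whereas the thesis speaks of uniform pseudo-random numbers):

    import numpy as np

    TABLE_SIZE = 100003                                   # size |HT| of the hash table
    WEIGHTS = np.random.default_rng(42).integers(1, 1 << 20, size=10000)   # N_i, enough for n <= 10000
    COUNTERS = np.zeros(TABLE_SIZE, dtype=int)

    def solution_hash(gamma):
        """h = (sum_i gamma_i * N_i) mod |HT|, cf. equation 6.21."""
        g = np.asarray(gamma)
        return int(g @ WEIGHTS[: len(g)]) % TABLE_SIZE

    def seen_before(gamma):
        """Increase the counter at the hash position and report a repetition."""
        h = solution_hash(gamma)
        hit = COUNTERS[h] > 0
        COUNTERS[h] += 1
        return bool(hit)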


6.5.2 Frequency Table

As mentioned before, the decision on the seed points has the biggest influence on the algorithm. There are certain ways of selecting these starting points and of controlling the number of seed points used as a first seed point before stopping the algorithm. The frequency table F keeps track, depending on its strategy, of which data points are used for or in a seed by incrementing a counter F_l, l = 1, ..., n. The size of F is n, and at the beginning of the execution of the algorithm all counters have to be set to zero. We can formulate different stopping criteria for the algorithm that are based on the amount of data to be written into the frequency table. We distinguish between the following strategies: "every data point", "single seed point", and "complete seed solution". These are only a small selection of many possible strategies. Convergence is guaranteed in all three cases.

Every Data Point

The first strategy starts the search with every single data point z ∈ Z as the seed point s_1 for the first seed in the seed solution. Therefore, the algorithm will store n seed points in the frequency table, which is equivalent to executing n searches with different data points as the first seed. Afterwards, the table is full and the algorithm is stopped. The strategy is similar to the next one, single seed point, using a threshold t_F = 1 and incrementing only the counter F_{s_1} instead of the counters for all seed points. A standard implementation of this criterion is a loop from 1 to n.

Single Seed Point

For every seed point s_i with i = 1, ..., g, we increment the counter F_{s_i} by one. This procedure can be compared with putting a penalty on the seed points, so that data points with a lower frequency are preferred over the latest seed points as the next seed points. We thereby guarantee that every data point will be used as a seed point for at least one seed. We can also guarantee convergence by having the algorithm pick at least one data point with the lowest frequency in the table F as a seed point for every construction of an initial seed solution, and by having it stop as soon as every frequency is greater than or equal to a threshold t_F:

    F_{s_i} ← F_{s_i} + 1,   i = 1, ..., g
    STOP, if ∀ F_i ≥ t_F,   i = 1, ..., n                             (6.22)
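A small sketch of this bookkeeping (our own helper), combining the counter update of equation 6.22 with the threshold check:

    import numpy as np

    def update_frequency(freq, seed_points, t_f):
        """Increment F_{s_i} for every seed point of the current seed solution
        and return True once all n counters have reached the threshold t_F."""
        for s in seed_points:
            freq[s] += 1
        return bool((freq >= t_f).all())

    freq = np.zeros(1000, dtype=int)   # one counter per data point (n = 1000 here)
    # stop = update_frequency(freq, seed_points_of_current_solution, t_f=1)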


Complete Seed Solution

Instead of storing only the seed point, the whole seeds S_i with i = 1, ..., g are stored after assigning data points to each of them. In this case, not every data point might become a seed point, but every data point has at least participated in a seed. The strategy can be varied by, e.g., accumulating the counters of the frequency table only for the data points in the first seed S_1 or for the data points in all seeds:

    F_l ← F_l + 1,   {l | z_l ∈ S_1 ∨ z_l ∈ S_i, i = 1, ..., g}       (6.23)

As before, convergence is guaranteed by using a threshold t_F for the frequency before the algorithm stops, see equation 6.22, and by letting the algorithm pick the next seed point with the lowest frequency in the frequency table F. It is almost needless to point out that the convergence criterion with the accumulation of the counters in the frequency table F for all data points in all seeds has the fastest convergence, due to the fact that it increases the largest number of counters, g · ζ each time.

6.6 Repair Mechanism

A valid seed solution does not have to be a feasible solution for the MINO problem and might need some repairing. However, if the restrictions given in the equation set A.2 hold, the solution does not need a repair process. The repair mechanism can be seen as a special variant of the growing where the focus is on the feasibility of the final solution. After having introduced the growing process in Section 6.3, we have methods to generate seed solutions and to let the seeds grow to the size H. The cluster size restriction then holds, but we do not yet have a feasible solution after the growing. The group of unassigned data points will have at least T data points, but for a feasible solution the number of unassigned data points has to be exactly T. The repair mechanism is used to assign these data points to the clusters until the unassignment restriction holds. As described before, the distances between all unassigned data points and all clusters are calculated using the metric of the respective cluster and stored in a distance list. Afterwards, the list is sorted by ascending distance, and the data points, starting from the head of the list, are assigned to the closest cluster. The process is stopped as soon as there are only T unassigned data points and the solution becomes feasible according to the


MINO problem. From the algorithmic point of view, the repair mechanism works the same way as the growing, so we do not show a mathematical description; a small sketch of the assignment loop is given after the following example. Furthermore, the example is not split into the two algorithms but shows the whole process of going from a valid seed solution to a feasible solution. Finally, applying the growing and repair mechanism to a valid seed solution to gain feasibility is shown as an example in Figure 6.4. The data set has 300 data points in dimension two with three clusters which are easy to recognize by just looking at them, and consequently the steps are simple to follow. The parameter setting for the clustering is H = 60, g = 3, T = 0, ζ = 5, and R = 20. The initial solution in frame A shows a valid seed solution which contains three seeds with five data points each. The first steps are the assignment of the closest points to their corresponding seed. After the first seven steps, the growing turns out to be interesting in the sense that the red cluster got more than H data points assigned to it but loses them in a later step to the blue seed. The growing above the size of H is caused by the refresh rate of R = 20. The recalculations are not done right at the moment a seed becomes a cluster, and this way it can grow larger during this step. On the other side, after the recalculation the blue seed is allowed to pick data points out of that cluster unless it would become too small. The crossing of the blue seed in frame C is caused by the fact that no cluster is allowed to shrink below H if it started out as a cluster in that step. The blue group is still not a cluster and assigns more data points during the next step. All data points in the red and the green cluster are tabu, because these clusters are of the size H, and the only direction left is across the visible and distant groups in the data set. Frame E shows the status where all groups are clusters of a size greater than or equal to H. The red cluster, trapped in the top left corner by the blue cluster, is not able to grow, and the blue and green clusters take advantage of this and assign the remaining data points until there are T = 0 unassigned data points.
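A minimal sketch of the assignment loop of the repair mechanism (our own simplification: the cluster metrics are passed in once and, unlike in the algorithm above, are not refreshed every R assignments):

    import numpy as np

    def repair(Z, gamma, means, inv_covs, T, g):
        """Assign unassigned points (group g + 1) to their closest cluster in
        ascending order of squared Mahalanobis distance until exactly T points
        remain unassigned."""
        unassigned = [l for l in range(len(Z)) if gamma[l] == g + 1]
        dist_list = []
        for l in unassigned:
            for i in range(1, g + 1):
                d = Z[l] - means[i - 1]
                dist_list.append((float(d @ inv_covs[i - 1] @ d), l, i))
        dist_list.sort()                                   # ascending distances
        still_unassigned = set(unassigned)
        for _, l, i in dist_list:
            if len(still_unassigned) <= T:
                break
            if l in still_unassigned:
                gamma[l] = i                               # assign to cluster i
                still_unassigned.remove(l)
        return gamma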

6.7 Seed Points and Seed Metric

Throughout the description of the algorithm, the importance of the seed metric and the seed points is mentioned. With knowledge of the correct seed points and the metrics of the clusters, the whole clustering process would be trivial, but neither is available in the beginning. The only available information is the location of the data points, of which each could be one of the correct seed points, and the all-data metric, which is an arbitrary decision used only because there is no better possibility.


Figure 6.4: Example to demonstrate the behavior of the growing and repair mechanism. The two algorithms are shown at the same time and no distinction is made. Frame A shows the seeds of the seed solution after the seeds walked to these locations, where no further improvement through the walking algorithm can be made. Frame B shows the seeds after three steps, frame C after four more steps. The rectangle in frame C focuses on one area of interest. The red cluster has more than H data points whereas the blue has less. Therefore, it is visible in frame D that the blue seed "stole" a few data points from the red cluster, causing the red one to be of size H, but the blue still below H. Frame E presents the status after one more step. Because the red and green clusters have exactly H data points, the blue seed receives the last data points to become a cluster. The "jump" across the red cluster is caused by these circumstances. The last frame, F, shows the final constellation after seven more steps. The red cluster kept its size due to the surrounding blue cluster, but the green one was able to increase its size after all clusters reached the size of H.


This "hopeless" situation is resolved by starting with the all-data metric and a random selection of the seed points. After the application of the seed clustering algorithm, we can use the solution to extract a better estimate of the metric as well as locations for the seed points. Note that a clustering does not have to result in an improvement of the needed information. The next two sections describe the methods used in the seed clustering algorithm and suggest further ideas for improving the algorithm which are not implemented.

6.7.1 Selection of Seed Points

The seed points are defined in the section on terminology as the starting points of the seeds. So far, we assumed the locations of the seed points to be known without presenting an algorithm for their selection. As before, the first seed point has to be selected randomly because we have no information about any possible location. Ideas like using a dense area or a location with a certain number of data points within a defined distance seem appealing. The problem is that we cannot say anything about the metric, and therefore the necessary distance measurements for the dense area cannot be done. Using the all-data metric could give us an estimate, but it is impossible to check its quality in advance, so a random selection is as good as an expensive calculation. The first seed point can be selected randomly, as said before, or it can later be based on previous runs. As long as we are not able to determine a data point that has to be a seed point to obtain the optimal solution, we have to select the first seed point of each run by a certain rule. We select the seed point in two different ways: the first method selects a data point that was never a seed point before, the second method uses a data point that was never part of a seed before. Obviously, the second method limits the selection more than the first method, because each time a seed solution is created, more data points are excluded from further selections. There is a third possibility, which uses every data point as a first seed point. At this point, we should recall the criterion for stopping the algorithm at the moment where all counters in the frequency table have reached at least the threshold TF. Therefore, the last selection rule has the longest running time.

Having the first seed point, we have to make a decision for the other ones. Our goal is a solution with g clusters that are well separated, and therefore we try to locate the seed points such that they are as far away from each other as possible.
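To make the selection rules and the stopping threshold TF concrete, a minimal sketch could look as follows; the function and variable names (and the use of numpy) are illustrative assumptions, not the interface of the actual implementation.

```python
import numpy as np

def select_first_seed_point(rng, n, freq_seed_point, freq_in_seed, rule="never_seed_point"):
    """Hypothetical helper: pick the first seed point for a new run.

    rule = "never_seed_point": prefer points never used as a seed point before.
    rule = "never_in_seed":    prefer points never contained in any seed before.
    Falls back to a uniform random choice once every data point has been used.
    """
    if rule == "never_seed_point":
        candidates = np.flatnonzero(freq_seed_point == 0)
    else:
        candidates = np.flatnonzero(freq_in_seed == 0)
    if candidates.size == 0:
        candidates = np.arange(n)
    return int(rng.choice(candidates))

def converged(freq_in_seed, TF):
    """Stopping criterion: every counter in the frequency table reached TF."""
    return bool(np.all(freq_in_seed >= TF))

# usage sketch
rng = np.random.default_rng(0)
n = 300
freq_seed_point = np.zeros(n, dtype=int)  # times each point was used as a seed point
freq_in_seed = np.zeros(n, dtype=int)     # times each point was a member of a seed
first = select_first_seed_point(rng, n, freq_seed_point, freq_in_seed)
```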


The metric used for the distance calculations is assumed to be known here; we refer to the description of the selection of the metric in Section 6.7.2. We have to consider the distance of each data point to all seeds at the same time and therefore use the following equation:

\[
\operatorname{argmax}_{j \in O} \; \prod_{k=1}^{s_n} \sqrt{\prod_{z' \in Z_k} d_k^2(z_j, z')} \tag{6.24}
\]

with O being the set of data points from which the seed point can be selected, s_n being the number of seed points already in the solution, and Z_k the data points in seed k. This distance calculation takes the shape of the seeds to which the distances are calculated into consideration. The multiplication is used to reward large distances more than small ones. The result is a data point with the largest product of distances from itself to all data points in the other seeds. We could use a simpler calculation based only on the distances between the data point and the seed points, but then we would not include the shape of the seeds in the calculation. Afterwards, the number of seed points has to be increased by one, s_n ← s_n + 1.

The selection of the seed points can be limited to a set O of data points. This is done by using a frequency table in which counters record the number of times a data point was used as a seed point or as a member of a seed. Whenever a new seed point has to be selected, not every data point but only a smaller set of them has to be considered. For example, suppose we have decided on the first seed point and have to find the next one at a location that is as far away from it as possible. Instead of calculating the distance to all data points, we can decide that we want a data point that has not been used in a seed as often as others. This saves expensive calculation time. Another reason to make a selection among all data points could be a diversification of the starting points to obtain more variety in the final solutions. Conversely, excluding data points that were rarely part of a seed rewards the areas whose data points were in a seed before and which might be areas of attraction. With every solution we gain better insight into the structure of the data set. Seeds might be attracted to the same area, and after finishing the clustering algorithm the frequency table can be checked for the regions of attraction (see also Section 6.9). After the generation of a seed solution or a feasible solution, we can use the seeds or clusters to extract useful values for the next start of the search on the same data set. Next to the metric, the location of a cluster, especially its mean, might be a good choice to be picked as a seed point.
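A sketch of this selection rule, under the assumption that every existing seed carries the inverse of its current metric, might look like the following; the helper names are hypothetical and the product is accumulated in log space so that it does not overflow.

```python
import numpy as np

def next_seed_point(Z, candidate_idx, seeds, seed_metric_invs):
    """Pick the next seed point in the spirit of equation (6.24): the candidate
    with the largest product of distances to all points already placed in seeds.

    Z                : (n, p) data matrix
    candidate_idx    : indices forming the admissible set O
    seeds            : list of index arrays, one per existing seed Z_k
    seed_metric_invs : inverse of the metric used for distances to seed k
    """
    best_j, best_score = None, -np.inf
    for j in candidate_idx:
        score = 0.0
        for members, m_inv in zip(seeds, seed_metric_invs):
            for m in members:
                diff = Z[j] - Z[m]
                d2 = float(diff @ m_inv @ diff)
                # log of sqrt(d^2); a candidate coinciding with a seed member
                # contributes a very negative term and is effectively excluded
                score += 0.5 * np.log(max(d2, 1e-300))
        if score > best_score:
            best_score, best_j = score, int(j)
    return best_j
```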


As pointed out in an earlier chapter, the actual search can be stopped if a seed point results in a seed that was seen before. The repetition can be checked with a hash table. The algorithm can be stopped in this case because there are no random steps in the algorithm after the selection of the first seed point.
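A minimal sketch of this repetition check, using a Python set as the hash table and treating a seed as the unordered set of its member indices (an assumption made for the sketch):

```python
def seed_signature(seed_members):
    """Order-independent signature of a seed: the set of its member indices."""
    return frozenset(int(i) for i in seed_members)

seen_seeds = set()   # plays the role of the hash table

def seen_before(seed_members):
    """Return True if exactly this seed was already produced from an earlier
    seed point; in that case the deterministic remainder of the run would
    repeat itself and the search can be stopped or the seed point skipped."""
    sig = seed_signature(seed_members)
    if sig in seen_seeds:
        return True
    seen_seeds.add(sig)
    return False
```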

6.7.2 Choosing the Seed Metric

The previous section covered the selection of the seed points. To decide on good locations for the g seed points, we have to calculate distances as described before. Knowing the seed points, the seeds and later the clusters can grow around each of them. Therefore, we have to calculate distances to determine which of the remaining data points belongs to which seed or cluster. The distance calculation again needs a metric. In this section, we discuss the problems that can occur with the wrong metric and how to find a metric that is a better representation of the cluster shape. As a shorthand notation we use the word "metric" to refer to the covariance matrix of the cluster, which is used to calculate Mahalanobis distances.

Having the wrong metric during the repair process, where the seeds are grown to full clusters, would cause the assignment of the wrong data points to them. The example displayed in Figure 6.5 demonstrates the effect of a metric that does not have the shape of the cluster we try to detect in the data set. Frame A shows two clusters, but both seeds are in the same one. During the repair process, data points are assigned to the seeds, and after each step the metric is recalculated. Frame B shows the two seeds after eight steps. Note the long and stretched form of the red seed, whereas the blue seed is close to the form of the cluster. From the view of the red cluster, using its own metric, data points along the dashed line are closer than data points orthogonal to it. This is the reason why the red cluster gets data points in the next step from the group in the lower right corner instead of the (under a Euclidean metric) closer data points marked with a grey square. Frame C shows the result of this step, causing a stretched cluster shape. The next steps assign the rest of the group from the lower right corner to the red cluster and grow the blue one by assigning most of the remaining data points in the top group to it.

In the beginning of the calculation, we have the data set and know the number of clusters g to look for. We have neither hints about a possible metric nor do we know the metrics and means of the clusters. In spite of the lack of information, we have to decide on an arbitrary metric to start the calculation and receive a better estimation for the means and the metrics of the clusters. Any affine equivariant metric could be used.


Figure 6.5: Example of how the metric influences the selection of data points for a cluster. Frame A is the initial seed solution where the repair mechanism starts to create a feasible solution. After eight steps, the seeds have grown as shown in frame B. The red seed is stretched and pointed along the dashed line whereas the blue one has the shape of the cluster. This elongated shape causes the seed to grow in the direction of the lower right group and to receive data points from there. Frame C shows the result of one more step. To highlight the structure of the red cluster, all data points belonging to it are connected with each other. The last frame, frame D, shows the final result, where blue got more data points from the top left group but was not able to banish the red cluster all the way out of that area.

In practice, it is convenient to take the all-data metric, which is calculated by

\[
\Sigma_Z = \sum_{i=1}^{n} (z_i - \bar{Z})^T (z_i - \bar{Z}) \tag{6.25}
\]

To continue the algorithm description, we introduce an array M that stores the metrics and means of the clusters. It is of size g and each field represents one cluster. We use the notation M_Σ[i] to access the metric of the seed or cluster i and M_μ[i] for the mean of the seed or cluster i.


The initial assignment is given by

\[
M_\Sigma[i] = \Sigma_Z \quad \text{and} \quad M_\mu[i] = \bar{Z}, \qquad i = 1, \dots, g \tag{6.26}
\]

The initialization is done and the algorithm can be started. The first step is the selection of a seed point as described in the last section. We will assume that we have chosen the seed point s_1. Before growing s_1 to a seed, a metric has to be found. We choose the metric from the array M under which the seed point s_1 has the smallest distance to the mean stored for that metric:

\[
\operatorname{argmin}_{i=1,\dots,g} \; d^2_{M_\Sigma[i]}(M_\mu[i], s_1) \tag{6.27}
\]

Using the seed point and the selected metric, we can grow the seed point to a seed as described in Section 6.2. The selection of the remaining g - 1 seed points is described in Section 6.7.1, and the growing to a seed is a repetition of the procedure used for the first seed point. For every new seed point the metric is chosen by using equation 6.27, followed by the application of the algorithm that grows a seed point to a seed. The result of this part is a valid seed solution with g seeds of the chosen seed size. There are two possibilities at this point of the algorithm. We can either update the array M with the metrics of the seeds, or continue with the repair mechanism and then update the array M with the metrics of the clusters. Independent of this choice, the metrics will be updated and might be a better estimate of the clusters than the all-data metric. The whole process is repeated every time we start over with a new seed point after having found a feasible solution, and we do not re-initialize the array M in later iterations.

This basic algorithm can be modified using other ideas for improving the search. The update of the array M takes place every time a solution is found. Except in the case of rediscovering the same cluster, the metric changes and is stored in the array M. This might lead to a different growing of the seeds in the next iteration. On the other hand, there is no need to update all metrics. Keeping a good metric for one cluster constant can be useful to fix that cluster to a certain area by eliminating the chance of receiving different data points due to a new metric. The qualities of metrics can be compared with each other, and the metric is only updated if there is an improvement between two successive iterations. The measurement of quality can be done in different ways. An easy approximation is the determinant of the metric, but the size of the cluster can also be compared to the previously stored one, or, if one cluster overlaps with another cluster, both metrics are excluded from the update procedure.
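The initialization (6.26) and the metric choice (6.27) can be sketched as follows; numpy and the function names are assumptions made for the illustration, and a pseudo-inverse is used so that a singular stored metric does not break the sketch.

```python
import numpy as np

def init_metric_array(Z, g):
    """Initial assignment (6.26): every slot holds the all-data metric and mean."""
    mean = Z.mean(axis=0)
    diff = Z - mean
    sigma = diff.T @ diff          # all-data metric as in (6.25), no 1/n factor
    return [sigma.copy() for _ in range(g)], [mean.copy() for _ in range(g)]

def choose_metric(M_sigma, M_mu, s):
    """Equation (6.27): return the index and metric under which the seed point s
    is closest to the corresponding stored mean."""
    best_i, best_d = 0, np.inf
    for i, (sigma, mu) in enumerate(zip(M_sigma, M_mu)):
        m_inv = np.linalg.pinv(sigma)
        diff = s - mu
        d2 = float(diff @ m_inv @ diff)
        if d2 < best_d:
            best_d, best_i = d2, i
    return best_i, M_sigma[best_i]

# usage sketch
Z = np.random.default_rng(1).normal(size=(100, 4))
M_sigma, M_mu = init_metric_array(Z, g=3)
i, metric = choose_metric(M_sigma, M_mu, Z[0])
```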


Note that the metrics in the array are used only for the initial creation of the seeds. From that point on, the repair algorithm calculates the metrics from the data points in the seeds and updates them according to the refresh rate.

6.8 Seed Clustering Algorithm

One of the main ideas behind the clustering algorithm is its modularity and interchangeability. Each component or module of the algorithm can be replaced by another one if the functionality and the interface are the same. Besides this, the algorithm is intended to be parallelized and used with sub-sampling. Before we continue with the description of the algorithm, we should point out a few things. This chapter covered the description of the components that can be used to build a whole algorithm. We will not describe them here again but rather try to show their relation to each other and how they can work together to find a solution in the data set the algorithm is applied to. Furthermore, we use an abstract form of describing the algorithm and do not explain the interfaces in detail. A more detailed description can be found in the source code, which can be obtained from the author of the thesis. Every component that was described before is also implemented. The functionality of the implementation and its options are shown in Appendix D. We can distinguish between components that are used to perform the search in the data set and others for keeping information or making decisions about how to continue. We start with the second kind:

- Data Set keeps information about the problem and all the data points.

- Solution stores for each data point the number of the cluster it belongs to.

- Frequency counts the number of times a data point was a member of a seed or a seed point. The frequency can be used by other components to store information about the seeds, which will be used for the selection of the seed points and for convergence.

We are going to show the algorithm used in the thesis. Afterwards, alternatives for some steps are shown. The algorithm starts at the search component with an empty solution vector and an empty frequency table.


1. The search component checks the stopping criterion using the frequency table. If it is fulfilled, stop the algorithm and report the best solution found in the data set.

2. Add a seed point to the solution vector. The selection is based on the frequency table and, in case other seeds are already in the solution, on their metrics for the distance calculations, because the location of the new seed point is chosen to be as distant as possible from all other seeds.

3. The seed maker is called to generate a seed. The closest data points, based on a metric from a previous solution or the all-data metric if nothing else is available, are assigned to the seed point.

4. The next two components are optional and used to improve the seed. The seed walking is used to find a better location for the seed in the solution space, whereas further improvement can be achieved by applying another type of algorithm, in our version Random Restart Descent.

5. If the number of seeds is less than g, continue at step 2.

6. Otherwise, the seed solution is valid. Depending on the strategy, the seed points or the data points in the seeds are stored in the frequency table for the next seed point selection.

7. The valid seed solution is grown to a feasible solution.

8. The feasible seed solution can be improved by applying the walking algorithm to it. After the walking, the solution vector contains the final result and has to be stored. The search for a better solution is continued by unassigning all data points and continuing at step 2.

A schematic of the components is shown in Figure 6.6. The order of execution of the components is shown with black arrows, whereas the usage of one component by another is presented with a green dotted line. Every component has to access the data set and the solution, but we did not draw these connections to keep the graph less complex. The frequency update in this version of the algorithm is done by only two components. Other approaches could use the information in the final result to mark data points as unassigned. Next to the frequency table for the seed points, the metric selection is important. As described before, we can use either the metrics of the seeds or those of the clusters.
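The eight steps can be summarized in a compact control-flow sketch. The component names below are invented for this illustration and are assumed to be supplied as callables; they do not correspond to the interface of the actual implementation.

```python
def seed_clustering_search(data, g, params, comp):
    """Control-flow sketch of steps 1-8; every module is assumed to be supplied
    as a callable in the mapping `comp` (names invented for this sketch)."""
    freq = comp["make_frequency_table"](data)
    best = None
    while not comp["converged"](freq):                              # step 1
        seeds = []
        while len(seeds) < g:                                       # steps 2-5
            sp = comp["select_seed_point"](data, freq, seeds)       # step 2
            metric = comp["select_metric"](data, sp, seeds)
            seed = comp["grow_seed"](data, sp, metric, params)      # step 3
            seed = comp["walk_seed"](data, seed, metric, params)    # step 4 (optional)
            seeds.append(seed)
        comp["update_frequency"](freq, seeds)                       # step 6
        solution = comp["grow_to_feasible"](data, seeds, params)    # step 7
        solution = comp["walk_clusters"](data, solution, params)    # step 8
        if best is None or comp["objective"](solution) < comp["objective"](best):
            best = solution                                         # keep best, restart
    return best
```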


Figure 6.6: The graph shows a schematic view of the modules that are used for the seed clustering algorithm. The order of the modules is shown by the black path, whereas the usage of one module by another is shown with a green dotted line. The red dashed line marks the modules that update the frequency table. Almost all modules have to access the solution and the data set, but we decided not to draw all arrows to keep the picture clear.


Before we continue with alternatives, the usage of the other components is illustrated with one example. The seed walking is realized by picking, e.g., the mean of the seed and growing the seed around that new seed point with the previous metric. Therefore, the components seed point selection, metric selection, seed maker, and growing are used. The parameters passed to each of these components are important to obtain the correct functionality (see for example the description of the growing in Section 6.3). If we want to replace the seed walking, we can do this with a method that works the same way but does not select the mean of the seed; instead, a data point in the most dense area of the seed is used. The only thing we have to change is the call for the seed point selection, everything else stays exactly the same. Another example is the replacement of the improving component, where the Random Restart Descent can be replaced by Reactive Tabu Search or Simulated Annealing without performing any other changes; a minimal sketch of this plug-in idea follows the list below. The following list shows a few more examples:

- The convergence criterion can be changed by either adapting the amount written into the frequency table or by changing the rules concerning when to stop the algorithm, e.g., by setting a new threshold for the frequency table.

- The growing can be replaced by methods that grow the seeds one by one to clusters instead of doing this simultaneously.

- Instead of the seed metrics, the cluster metrics of the final result can be stored and then used for the next search.
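As a minimal sketch of this plug-in idea (the names and the no-op placeholder bodies are assumptions for illustration only):

```python
# Every optional improver has the same signature, so one can be exchanged for
# another without touching the caller.

def improve_with_rrd(data, seed, metric):
    """Placeholder for Random Restart Descent: returns the seed unchanged here;
    the real implementation would perform the descent."""
    return seed

def improve_with_nothing(data, seed, metric):
    """Used when the optional improving step is switched off."""
    return seed

def make_seed_improver(strategy=improve_with_rrd):
    def improve(data, seed, metric):
        return strategy(data, seed, metric)
    return improve

improve = make_seed_improver()                        # default: Random Restart Descent
# improve = make_seed_improver(improve_with_nothing)  # switch the module off
```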

The next section presents an outlook on possibilities for further improvements of and research on the seed clustering method.

6.9 Possible Improvements

This section can be seen as a collection of ideas that might improve the algorithm implemented for the thesis. We will not describe them in every detail but give a short introduction to point out the possibilities for further research. During the seed construction, we used the stopping criterion that the walk results in the same solution in two successive steps. Another approach could be the analysis of the determinant of the seed, as used for the clusters. Tests showed that the determinant does not decrease with each step but can increase. This is desirable because we want to escape a local minimum if possible and fall into a deeper one, maybe containing an optimum.


Possible modifications include walking the same way as before but selecting the seed with the smallest determinant, or stopping the walking as soon as the determinant increases.

Another idea involving the seed is to change the growing algorithm. In the current algorithm, the metric is chosen in advance and all distance calculations use that metric. A recalculation of the metric is not possible as long as the size of the seed is below p + 1, but after that point the metric could be recalculated in each step. At the moment the seed reaches the size p + 1 and the new metric is calculated, the seed can update itself by checking whether there are data points that would have been in the seed under this metric instead of the ones currently in the seed. If this is the case, the data points can be replaced.

A further idea concerns the seed growing. At each step, the size can be compared to the size of the previous step. If the change in the covariance determinant grows by more than a certain percentage or a similar measurement, the seed, i.e. the seed point, can be rejected. This is similar to the idea of using only seed points that lie in a dense area assuming a certain metric. The same idea could be used later in the algorithm where we grow the clusters to a size larger than H: whenever the size increases too fast, the cluster can be rejected and the algorithm starts over with another seed point. During the growing process of the clusters, the data points are sorted by distance, and the data point with the smallest distance is assigned first to its cluster, and so on. Instead of using the distance, the list can be sorted such that the next data point assigned to a cluster is the one that increases the size of its cluster by the lowest factor compared to all other assignments.

The most expensive calculations are the growing of the clusters and their walking in the solution space. Therefore, the selection of the "right" seed points is essential to reduce the running times. During the run, the frequency table stores the number of participations of a data point in a seed or as a seed point. This can be used to obtain a histogram as in Figure 6.7. Assuming that the seed points are located in an area of high attraction, we can use the histogram to preselect the data points that may become a seed point. Building this histogram does not require cluster walking; the modified algorithm would only use the faster seed walking to determine the most probable seed points and would not execute the cluster walking for all of them. The cluster walking is only applied to this limited selection of data points, and therefore a large reduction of the running time can be achieved.
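A sketch of this preselection from the frequency table, with k as a hypothetical cut-off chosen by the analyst:

```python
import numpy as np

def preselect_seed_candidates(freq_in_seed, k):
    """Keep only the k data points that took part in a seed most often as
    candidates for the expensive cluster walking."""
    order = np.argsort(freq_in_seed)[::-1]   # most frequent first
    return order[:k]

# usage sketch with a made-up frequency table
freq = np.array([0, 3, 7, 1, 6, 2, 7, 0, 5, 4])
print(preselect_seed_candidates(freq, k=3))  # indices of the three highest bars
```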



Figure 6.7: Example of a frequency table. The height of each bar represents the number of times a data point participated in a seed. This information can be used to obtain an estimate for the selection of the seed points.

The last idea we present here is a modification of the clustering parameters. Instead of looking for g clusters at the same time, we start with the extraction of one cluster and then apply the same algorithm to the remaining data points again. This allows us to obtain g clusters with a certain number of data points and to use their metrics to cluster the data set again with the original number of clusters. It would resolve the problem of having no metric in the beginning besides the all-data metric, because the cluster metrics can be used. This would be particularly effective for situations with homogeneous or nearly homogeneous covariances.

An improvement of the algorithm on a different level would be sub-sampling, where a smaller data set, the sub-sample, is drawn from the original one. The results of the clustering can be projected onto the full data set, with the step of an evaluation sample in between, to receive the final solution. Sub-sampling is covered in Chapter 8. In Chapter 9, the sub-sampling is parallelized by assigning each sub-sample to a different processor. We will discuss further possibilities of communication between the processes and an approach for a more sophisticated algorithm.

Chapter 7

Analysis of Performance

7.1 Analysis of the Algorithm

7.1.1 Experiments

The performance of the algorithm depends on several factors like parameters and input data. Therefore, several experiments have to be done to find the settings for which the algorithm performs well. The experiments are split into two parts: the first one is a preliminary test to estimate the ranges of the parameter settings with the best performance for the algorithm. Having a range of settings, we can run a test where the parameters are examined for a good value and are then fixed to that value. The tests have to be performed on different data sets; therefore we cluster generated data sets. A further advantage of using generated data sets is the knowledge of the "real" solution and the possibility to have more instances of one problem. All experiments were done on an SGI Origin 2000 with 16 R10000 processors.

7.1.2 Preliminary Parameter Selection

Whenever we talk about the parameters, we have to distinguish between the parameters of the clustering problem and those of the seed clustering algorithm. The problem parameters are the number of groups, g, the minimum size of the clusters, H, and the number of points, T, which have to be kept unassigned. These parameters are independent of the algorithm, and therefore the analyst has to find good settings by doing experiments on the data sets. This can be done with experience gained while clustering similar data sets or by doing extensive experiments to find the correct choices. We keep these parameters constant for each individual data set.


The number of groups is passed to the generator, and this knowledge will be used in the tests. The parameter T is usually set to zero so that the seed clustering has to assign every data point to one cluster. The value H has a large influence on the clustering, but it is not algorithm specific. For our preliminary tests, the value of H is set to different values to estimate a connection between this and other parameters. Later, the parameter H is kept fixed to a value specified there. The seed clustering algorithm can be influenced by a large number of parameters. Based on experiments during the development, we separated them into two classes, one containing the parameters that have the largest influence on the clustering results, and another class with less influence. The values of the parameters in the second class are set a priori to values which were found to be good in the experiments during the development of the algorithm. An analyst can, in case he is not able to cluster the data set otherwise, vary these parameters by changing their default values. The first class contains the most influential parameters, namely the refresh rate, the seed size, and the size of a seed during the walking of the clusters. Another parameter, the type of convergence, also has a large influence on the running time because it determines the number of data points to try as a starting point. We will describe a small experiment on its influence afterwards, but first we are going to use the convergence criterion where the frequency of the data points in the seeds is increased and where the seed point is selected from the data points that never participated in a seed before. This convergence criterion seemed to provide the best trade-off between solution quality and time performance.

The preliminary experiment is done on four different kinds of data sets, namely 4D1, 4D3, DS1, and DS2. The size of the data sets is chosen in the range [500, ..., 3000] with a step size of 500. The clustering parameter H does not belong to the algorithm parameters, but it was set to different values to see its influence. The clustering parameters in the group of the most influential parameters are set to the following values:

- seed size ∈ {p + 1, 2p, 3p, 4p}
- walking seed size ∈ {p + 1, 2p, 3p, 4p}
- R ∈ {p + 1, 2p, 3p, 4p, n/3}

Each combination of the seed clustering parameters is used on each of the 24 data sets. After finishing all experiments, the results are analyzed by calculating, for each specific value of a parameter, the percentage of solutions without misclassified data points.


For example, in the 4D1 data set with 1000 data points, the runs with a seed size of p + 1 solved 17% of the runs on that data set, whereas other settings solved less than 11%. Table 7.1 shows the result of the whole experiment. The values in the table show how often the setting with a specific value caused the highest percentage of solutions without misclassification errors. The result of our small preliminary test is that the seed size, the walking seed size, and R should be set as small as possible to receive better solutions, whereas a larger value of H was responsible for more solutions without misclassifications [1]. With these values in mind we can continue with the experiment for the selection of the best parameter setting.

Setting   Seed size   Walking seed size    R
p + 1        16              22            11
2p            5               2             5
3p            -               -             8
4p            3               -             -

Table 7.1: Number of data sets where a certain setting of a parameter caused the most results with less than 10% classification errors.

7.1.3 Parameter Selection

As described in the last section, we perform three sequential tests where we change one parameter at a time. First, we keep all parameters but the refresh rate R constant. After finding a good setting, we fix R to that value as a default for future runs. Next, the seed size is changed while every other parameter is kept constant. The same procedure is repeated with the last parameter, the walking seed size. Every other parameter is set to a default value which was derived from the experiments during the development of the algorithm. The convergence criterion is the storage of the whole seed solution and the selection of the next seed point from the set of data points that were never in a seed before. The problem parameter H is set to 10% of n, the size of the data set. The parameter g is set according to the number of clusters in the data set, which is known because of the usage of a generator. There are no real outliers in the data sets and we do not protect against noise, therefore T is set to zero. We use the seed metrics of the last run for the distance calculations.

[1] The results can be found on the CD-ROM in the data files for these experiments.


Furthermore, we do not run an improving algorithm on the seed solutions in the form of a descent or another local search strategy. The data sets DS1, DS2, 4D1, 4D3, and GFID are used to generate different data sets for the experiment. From each input file, we generate three different instances with a changing seed for the random number stream. The sizes are chosen relative to the dimension. In dimension 10 we start with 400 data points, going up to 1000 with a step size of 200. For the data sets with a lower dimension, we start with 200 data points and stop at a size of 800, also using a step size of 200. The small sizes are not generated in the high dimension because the number of data points would be too small for an appropriate detection of the clusters. Furthermore, we used the two different fractions as described in Section 2.1.2. The generator creates eight different configurations for the data sets (four sizes with two fractions), making a total of 120 different data sets (five data sets and three generated instances of each). As before, the complete results can be found on the CD-ROM, while we show here only a summary of them.

The clustering parameter H was set to 10% of the size n, T was set to zero, and g to the number of clusters, which is known in our case. The seed size and the walking seed size are set to p + 1 as the preliminary test suggested. The refresh rate R is set to six different values, 2^i with i = 1, ..., 6, to analyse its influence and to decide on a "good" setting. The refresh rate is not set in relation to p in this experiment; see below for an explanation. The seed clustering algorithm was able to find the correct solution for three of the five types of data sets (DS1, DS2, 4D3) for all instances, and therefore we use these results only for a time analysis. We reran the seed clustering on more instances of the data set GFID in dimension four and 4D1 in dimension ten. This was done to have more results on data sets which are hard enough that the "correct" solution is not found every time. We use the solution quality measured by the objective function value for the selection of R. The refresh rate is the number of assignments of data points to a cluster without refreshing the cluster metrics. Therefore, the metric calculation is performed more often for smaller refresh rates. This can be seen in the results, where the running time decreases with an increasing R. Figure 7.1 shows the time behavior as well as the quality of the solution for five instances. The time decreases exponentially with a growing R, which is the same result that was seen with all data sets. Therefore, we can influence the running time with a larger refresh rate, whereas the percentage gain of time gets smaller with an increasing R. The refresh rate does not seem to have as strong an influence on the solution quality as it had on the time. For the GFID data set, it is noticeable that the solution quality first improves with an increasing refresh rate, up to a point where the refresh rate gets too large and the solution quality deteriorates.



Figure 7.1: Results of the experiment to determine the refresh rate R. The objective function value (red) and the time (yellow) are shown for different settings of R; the number of displayed instances was limited to five for a better visualization. The black dotted line shows the regression function for the objective function value. Other instances showed the same results as demonstrated here.

R   2   4    8     16    32    64
%   0   0   0.07  0.33  0.39  0.21

Table 7.2: Percentage of solved instances with refresh rate R.

Table 7.2 shows the percentage of best found results with a certain refresh rate. The table is based on 96 instances of GFID. A similar behavior can be found with the 4D1 data set. For our experiment on the GFID data set, a refresh rate between 16 and 32 showed the best performance.


The results on the 4D1 data set showed a similar behavior but with a lower refresh rate. Due to the number of data sets used in the experiment, we do not have enough evidence to show a real relation between R and p, but a hypothesis could be a shrinking refresh rate R with an increasing dimension of the data set. This would cause more recalculations of the covariance matrix as the dimension increases. Another run on generated data sets with up to 10000 data points showed that there is no obvious relationship between R and n, either. Further research has to be done to perform a more extensive experiment that includes a larger variety of data sets [2]. According to the small experiments performed here and during the development, the parameter has to be chosen relative to the data set. Only in cases of well separated clusters in a data set that has enough data points for its dimension does the parameter setting seem not to matter. The refresh rate has a large influence on the seed clustering algorithm in terms of solution quality and time. Therefore it is considered a real algorithm parameter which must be provided by the analyst. Since we are not able to suggest a particular rule for the refresh rate like, e.g., 3p or n/8, we are going to use an arbitrary value that showed a good performance in earlier experiments. We set the refresh rate to R = 16 for all following tests.

Next, we redo the same experiment, but instead of keeping the seed size and the walking seed size constant, we use a constant refresh rate of R = 16 and a constant walking seed size of p + 1. The seed size is varied by setting it to a value from the list (p + 1, 2p, 3p, 4p). The data sets DS1, DS2, and 4D3 are, as before, always solved without misclassifications, and therefore we use the GFID and 4D1 data sets for this experiment. We use five instances of each data set, of which the size and fraction are varied as in the previous runs. All result files can be found on the CD-ROM. Analyzing the time behavior for different seed sizes, we discover that the running time on the data set GFID decreases with an increasing seed size, whereas the running time for the data set 4D1 increases. Apart from this opposed behavior on the two data sets, we could not discover any relation between the solution quality and the seed size. It does not seem to matter what the seed size is set to as long as it is smaller than 3p. A large seed size seems to cause the seed clustering to find solutions that are not as good as with a smaller one (the objective function values were less than 50% of the best found on these instances).

[2] The extended experiment would go beyond the scope of the thesis and will be the subject of following research.


An explanation for this behavior is that a large seed does not walk as much as a small seed and therefore cannot improve its location as well. Based on these tests, we cannot suggest a definite value, but p + 1 seems to be a good choice, especially since it has been used before in several other algorithms (e.g., see [RvD97], [Mir96]). The seed size does not have a large influence on the solution quality and time, and it does not show a relation to any other parameter either. Due to these experiments we no longer consider the seed size as important an algorithm parameter as the refresh rate. The only suggestion for the seed size is a small value based on p, such as p + 1, the size of the smallest possible seed.

The last experiment concerns the third parameter, the walking seed size. It determines the size that a cluster shrinks to during its walking. We repeat the same experiments with a fixed refresh rate R = 16 and seed size p + 1. All other settings are kept the same, and, as before, the experiment is only executed on instances of two data sets, namely 4D1 and GFID, because the other ones are solved without misclassifications, independent of the parameter setting. Figure 7.2 shows the running time for different settings of the walking seed size on the GFID data set. For each size n of the data set, the results of six instances are displayed. With an increasing walking seed size, the running time of the seed clustering decreases. The same behavior can be seen on the instances of the 4D1 data set. Due to the larger seed size, there is a higher probability of arriving at the same solution in the next step, and the convergence criterion is reached in fewer steps. This causes a shorter running time, but it is also a disadvantage for the solution quality because fewer locations are examined for possible local or global minima. The time is influenced by the setting of the walking seed size, but the solution quality is affected by this parameter as well. A large value gives shorter running times, but the solution quality gets worse because of the shorter walking. This cannot be generalized, because the cluster walking is the last component to be executed on the data set and relies on the previous quality of the seeds. If the seeds are generated such that their locations correspond to the "real" locations of the clusters, the cluster walking will result in the same solution after only one step. Otherwise, if the seeds are positioned "badly", the cluster walking might perform more steps to improve the solution. Therefore, we suggest a small setting that allows the cluster walking to improve the solution and does not rely on the previous seed algorithm providing a good location, but rather performs some steps to reach a better quality. For our experiment, we received the results shown in Table 7.3. With an increasing walking seed size, the percentage of finding the best solution with a certain setting gets lower. The results indicate a value of 3p as the breakdown point, but further experiments on more data sets have to be done to verify that.


Figure 7.2: Results of the experiment to determine the walking seed size, i.e. the seed size used during the walking process. The plot shows the running time in seconds against the walking seed size; all instances of a certain size of the original data set (200, 400, 600, and 800 data points) are displayed in a different color.

Walking seed size          p + 1    2p    3p   4p
Percentage of best found    63.2   31.4   5.4   0

Table 7.3: Percentage of best solutions found with a certain setting of the walking seed size on the 4D1 data set.

We will use a walking seed size of p + 1 in further applications. We do not consider this parameter as important, because the amount of time spent on the cluster walking can be decreased by finding better algorithms for the creation of the seed solution. If we could estimate the correct or nearly correct seeds, the cluster walking would degenerate to a growing of the seeds to clusters, plus maybe some steps for small adjustments.


The limited experiments show that the seed clustering algorithm has one real parameter, the refresh rate R, and a number of parameters that can be used to adjust the algorithm for certain data sets. Both seed sizes used during the different kinds of walking seem to have an influence on the running time and the solution quality, but without showing a relation to other parameters. The settings we suggested in this section can be seen as defaults which give a good performance of the algorithm in terms of solution quality and running time. The algorithm is designed so that the analyst can interact with the program to influence its behavior. Possible options are, e.g., changing the walking seed size to spend more or less time on the cluster walking, lowering the refresh rate for a better estimation of the covariance matrix during the growing, or changing the convergence criterion.

Table 7.4 shows the difference in running time and solution quality when the convergence criterion is changed from the previous one, where we stored the whole seed solution in the frequency table, to a convergence criterion where only the seed points are used to increase the counters in the frequency table. The experiment uses the data set 4D1 and a later introduced data set "universe" with 607438 data points in dimension 15. The average times and objective function values of 64 (4D1) and 48 (universe) runs, respectively, show that, as expected, the storage of the complete seeds in the frequency table uses less time than storing only the seed points. For the 4D1 data set it was only 40% of the time, whereas the universe data set was clustered in 14% of the time. Using less time does have an influence on the solution quality: in the case of the 4D1 data set the objective function value is about 25% worse, whereas we get almost the same solution quality for the universe data. We use the convergence criterion with the complete seed solution for the further experiments. The other approach might find a better objective function value, but it also needs more time. Confronted with the decision to sacrifice quality or time, we prefer using less time. In cases where we need a better solution quality than the one we found, we can rerun the algorithm with different convergence criteria or parameter settings. Otherwise, we save time while receiving a satisfactory solution quality.

The following section shows an example where the seed clustering algorithm cannot find the "correct" solution, mainly because of its objective function. Afterwards, the seed clustering algorithm is compared to other software packages to analyze its performance against established methods that are used in applications.


Data Set    Time CS   Time SP     Obj CS      Obj SP   % Time  % Obj
4D1          613.17   1551.85   -2266.95    -2992.41     0.39   0.75
Universum    306.99   2177.83  -50928.54   -50932.33     0.14   0.99

Table 7.4: Average time and objective function value for two different data sets. The convergence criteria are the storage of the complete seeds (CS) and the seed points only (SP). The last two columns show the percentage differences.

Algorithm     Obj      Cluster Sizes   Parameter Settings
Seed        2488.54      153 / 317     SS=4, CS=4, RE=117
RRD         2487.04      151 / 319     IK=7, CH=3, REP=3
RTS         2487.04      151 / 319     DEC=95, INC=105, CYC=200
SA          2526.36      344 / 126     TF=0.95, IP=0.4, MP=2, SF=16

Table 7.5: Application of different clustering algorithms on the sphere data set.

7.1.4 Limits of Maximum Likelihood Clustering

The seed clustering algorithm is able to find clusters in a data set, but so far we used only data sets with a certain structure of the clusters: they were either well separated or overlapping. We did not have any example of clusters where one cluster is surrounded by another cluster. An example of such a configuration could be a three dimensional ball of 70 data points located at the origin, surrounded by a sphere of 400 data points that is larger in extent than the ball but has the same center. The solution corresponding to the "correct" clustering has an objective function value of 2863.84 (MINO). Next to the seed clustering, we applied RRD, SA, and RTS to compare the results; they are shown in Table 7.5. The most striking result of this experiment is that the found objective function values are lower than that of the "correct" solution. The maximum likelihood clustering can find clusters that are well separated or overlapping, but it has problems with embedded clusters. In this case, the objective function value of a solution with more misclassified data points is lower than that of the "right" answer. The objective function value of the seed algorithm is slightly larger than the one of the RRD and RTS runs, but as before Simulated Annealing shows the worst result. In Figure 7.3, the solution found by the seed clustering algorithm is shown.



Figure 7.3: Two dimensional display of the sphere with the enclosed cluster. The clusters are shown in two different colors.

We reduced the number of data points and also limited the display to the first two dimensions. The reduction of data points does not affect the visualization of the clusters but makes the whole picture easier to understand. The red cluster contains one third of the data points (on the right side), whereas the rest of the data points are in the blue cluster. The original structure is visible in this display. The algorithm tries to find small clusters, and therefore the big sphere is not preferred over a solution with two smaller but "wrong" clusters.
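For reference, a configuration of this kind can be generated with a few lines of code; the radii, the shell thickness, and the random seed below are illustrative assumptions and not the values used for the experiment.

```python
import numpy as np

rng = np.random.default_rng(42)

def points_in_ball(n, radius):
    """Uniform points inside a 3-d ball of the given radius, centred at the origin."""
    x = rng.normal(size=(n, 3))
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # random directions
    r = radius * rng.random(n) ** (1 / 3)           # radii for a uniform ball
    return x * r[:, None]

def points_on_shell(n, radius, thickness):
    """Points in a thin spherical shell around the origin."""
    x = rng.normal(size=(n, 3))
    x /= np.linalg.norm(x, axis=1, keepdims=True)
    r = radius + thickness * (rng.random(n) - 0.5)
    return x * r[:, None]

ball = points_in_ball(70, radius=1.5)                    # enclosed cluster
shell = points_on_shell(400, radius=6.0, thickness=1.0)  # surrounding sphere
data = np.vstack([ball, shell])
labels = np.array([0] * 70 + [1] * 400)                  # the "correct" clustering
```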

7.2 Analysis against Existing Clustering Software

There are many software packages on the market for the purpose of clustering. The method developed in this thesis is affine equivariant, and we compare it only against methods that claim to be affine equivariant themselves. We use the word claim because there are methods whose behavior in practical applications differs from the theory. For many data sets the Euclidean distance is valid, and that knowledge could be used to "cheat" in terms of finding a solution faster. We compare our method against three products: AutoClass by Cheeseman, an algorithm by Rousseeuw, and a first-improving algorithm similar to the one suggested by Spath.


7.2.1 AutoClass

The program AutoClass by Cheeseman [CS96] is based on the Bayesian model. The theory was founded by Bayes [Bay53] in the eighteenth century; works in this field include [Bre94], [HSC91], and [CS96]. An advantage of this approach is that the number of clusters can be determined by the software automatically. In our experiments this did not prove to be a significant advantage because AutoClass tended toward an excessively large number of clusters (e.g., it found 10 where 3 were generated). Other features of the software are natural clustering, affine equivariance, the usage of discrete and real valued data, the handling of missing data, a linear time increase with the data size, and probabilistic class membership. The software was chosen because of its performance and the quality of its solutions. After some experiments it was obvious that AutoClass would be a strong competitor, but a different test showed an interesting behavior that led to the disqualification of the program. We used a data set of 2000 data points, referred to as AUTODAT, with three clusters in dimension 10 and fixed the number of classes to three. Otherwise, there would have been no guarantee that the program would look for three clusters rather than a number that seems more appropriate to it. Because the clusters were generated, we know the "correct" solution of two clusters with 666 and one with 668 data points. The solution found by AutoClass was

Cluster 1: 669   Cluster 2: 668   Cluster 3: 663

with only three misclassification errors. The next step was the creation of a new data set, referred to as AUTOSTDDAT, by standardizing the data set AUTODAT, and a restart of the software. Being affine equivariant, we should have seen the same result as before. Allowing the same time for the computation, we received the following clusters as the result of clustering the standardized data set:

Cluster 1: 1638   Cluster 2: 312   Cluster 3: 50

This is obviously not even close to the result for the original data set, and we have to conclude that the software is "cheating" by not being affine equivariant but using assumptions to receive a better performance. We compared our algorithm with two other approaches, the First Improving Descent by Spath and an outlier detection algorithm by Rousseeuw. The next sections will show whether the seed algorithm performs better than these two by applying all algorithms to the same instances of a number of data sets.
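The standardization check used in this comparison can be sketched as follows; cluster() stands for whichever clustering method is under test and is not part of the sketch.

```python
import numpy as np

def standardize(X):
    """Column-wise standardization, an affine transformation of the data."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def same_partition(labels_a, labels_b):
    """True if two label vectors induce the same partition (labels may be permuted)."""
    mapping = {}
    for a, b in zip(labels_a, labels_b):
        if mapping.setdefault(a, b) != b:
            return False
    return len(set(mapping.values())) == len(mapping)

# X = ...              load AUTODAT
# labels_orig = cluster(X)
# labels_std  = cluster(standardize(X))
# an affine equivariant method should give: same_partition(labels_orig, labels_std) == True
```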


7.2.2 Rousseeuw

The objective of the algorithm developed by Rousseeuw et al. [RvD97] is the detection of the h data points (out of the n data points in the data set) whose covariance matrix has the lowest determinant. This is in contrast to our approach, where we try to find g clusters. However, their method is affine equivariant and can be used as a competitor for our method when there are fewer than three clusters. Instead of looking for g clusters, we are going to detect only one cluster, leaving n - h data points unassigned. The minimum covariance determinant method (MCD) is modified by [RvD97] using techniques to improve the basic algorithm. The so-called FAST-MCD performs the following steps to find a solution:

1. The default of h is [(n + p + 1)/2], but it can be changed to any value in the range [(n + p + 1)/2] ≤ h ≤ n. A value of h = [0.75n] is suggested by [RvD97] as a compromise between breakdown value and statistical efficiency.

2. In case of h = n the result is the mean and covariance matrix of the whole data set. The algorithm is stopped.

3. In case of p = 1 an exact algorithm can be used to find the best subset (e.g., see [RL87]).

4. In case of n ≤ 600 (value determined by Rousseeuw et al.):

   - Repeat the construction of an initial subset H1 of size h and the application of two C-steps j times, with j being a large number [3]. The construction of the initial subset and the C-steps are described below.
   - On the k results with the lowest determinant of the covariance matrix of subset H1, apply the C-steps until the convergence criterion is reached.
   - The final result is the subset with the lowest determinant of its covariance matrix after all C-steps.

5. In case of n > 600:

   - Create gs sub-samples of size ns without replacement. Each sub-sample is a representation of the data set.
   - As before, the detection of an initial subset H1 of size hs = [ns(h/n)] and two C-steps is repeated j times.
   - For each sub-sample, the best k1 results with the lowest determinant of the covariance matrix are kept.
   - Merge the sub-samples to a larger subset of size nm and apply all results found before (there are k1 · gs results) to it. For each result, perform two C-steps with h = [nm(h/n)].
   - Keep the k2 best results. Perform the C-steps until convergence using each of the k2 results as an origin. The final result is the one with the lowest determinant.

[3] Rousseeuw et al. suggest the value 500, but it has to be adjusted by the programmer if necessary.

The initial subset H1 can be constructed with different algorithms. The simplest one would be the random selection of h data points, but other, more expensive methods can be used to find a better estimate and thereby a better starting location for the following algorithm. The following method by [RvD97] is similar to our growing of a seed and finds a group of h data points which is an approximation of the later solution (a sketch in code follows the description of the C-steps below).

- Select p + 1 random data points and calculate the mean μ and covariance matrix Σ using equations 7.2 and 7.3 below. As long as the determinant of Σ is equal to zero, add more randomly selected data points.

- Calculate the distance between the mean μ and all data points z_i using the Mahalanobis distance

\[
d^2_\Sigma(z_i, \mu), \qquad i = 1, \dots, n \tag{7.1}
\]

- The initial subset is formed by the h data points having the smallest distance to the mean μ.

Starting from one solution for the MCD, the C-steps can be used to find a better approximation of the subset of size h with a lower determinant. The condition where the determinant of the covariance matrix does not change during a C-step is called convergence of the algorithm, because there was no improvement in the result. The following algorithm describes the C-step; a sketch of both the initial subset construction and the C-step follows the list.

- Starting from a solution with a subset H1 of h data points, the mean

\[
\mu = \frac{1}{h} \sum_{i \in H_1} z_i \tag{7.2}
\]

and the covariance matrix

\[
\Sigma = \frac{1}{h} \sum_{i \in H_1} (z_i - \mu)(z_i - \mu)^T \tag{7.3}
\]

are calculated for the further distance calculations.

- In case of a non-zero determinant of Σ, the Mahalanobis distance between the mean μ and all data points in the data set is calculated as in equation 7.1. Afterwards, the new approximation is obtained by forming a new subset H2 with the h data points that are closest to the mean μ. The determinant of the new subset is equal to or less than the determinant of the old subset, and therefore either a better solution was found or the convergence criterion is reached. A proof can be found in [RvD97].
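Both the construction of the initial subset and the C-step iteration can be sketched as follows; this is a minimal illustration assuming numpy, not the Fortran 90 implementation used in the experiments.

```python
import numpy as np

def _mahalanobis_sq(Z, mu, sigma):
    """Squared Mahalanobis distances of all rows of Z to mu under sigma, as in (7.1)."""
    D = Z - mu
    inv = np.linalg.inv(sigma)
    return np.einsum('ij,jk,ik->i', D, inv, D)

def initial_subset(Z, h, rng):
    """Construct an initial h-subset: start from p+1 random points, enlarge the
    set until its covariance is non-singular, then keep the h points closest to
    the subset mean in Mahalanobis distance."""
    n, p = Z.shape
    idx = list(rng.choice(n, size=p + 1, replace=False))
    while True:
        S = Z[idx]
        mu = S.mean(axis=0)
        sigma = (S - mu).T @ (S - mu) / len(idx)
        if np.linalg.det(sigma) > 1e-12:
            break
        extra = int(rng.choice(np.setdiff1d(np.arange(n), idx)))
        idx.append(extra)                            # singular: add another random point
    return np.argsort(_mahalanobis_sq(Z, mu, sigma))[:h]

def c_step(Z, subset_idx):
    """One C-step: mean and covariance of the current h-subset (7.2, 7.3), then
    return the h data points closest to the mean together with det(Sigma)."""
    h = len(subset_idx)
    S = Z[subset_idx]
    mu = S.mean(axis=0)                              # (7.2)
    sigma = (S - mu).T @ (S - mu) / h                # (7.3)
    d2 = _mahalanobis_sq(Z, mu, sigma)               # squared distances as in (7.1)
    return np.argsort(d2)[:h], np.linalg.det(sigma)

def run_c_steps(Z, subset_idx, max_iter=100):
    """Iterate C-steps until the determinant no longer decreases (convergence)."""
    prev_det = np.inf
    for _ in range(max_iter):
        subset_idx, det = c_step(Z, subset_idx)
        if det >= prev_det:                          # no improvement -> converged
            break
        prev_det = det
    return subset_idx
```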

The experiments are done on an SGI Origin 2000 with R10000 CPUs running Irix version 6.4. The FAST-MCD algorithm is written in Fortran 90 and compiled with the MIPSpro 7 Fortran 90 compiler. The subject of the experiments is mainly the comparison of the two algorithms, but also to find their breakdown points. The breakdown point is defined as the percentage of outliers at which the algorithm is no longer able to find the correct solution (see [RL87]). The data sets are generated with the program "Mulcross" by Woodruff and Rocke for testing outlier detection algorithms [4]. The generated data sets of n data points have two groups, where one is seen as an outlier group containing a given percentage of the data points. Furthermore, the dimension of the data set is passed to the program, as is the distance between the two groups. The distance is measured as before with the unit Q (see Section 2.1.2) using a multiplier m. For our experiments, the following parameters were used to generate the data sets. The sizes of the data sets were chosen from the list (200, 400, 600) in dimensions (5, 10, 15, 20). The FAST-MCD is repeated three times on the same data set with different seeds for the random number stream. The seed clustering is only applied once to each data set due to the lack of randomness in the algorithm. To find the breakdown point, the percentage of outliers is increased within all data sets from 5% up to 50% in steps of 5%. For outlier detection, 50% is theoretically not reachable (see [RL87]), and the FAST-MCD does not allow searching for it. The seed algorithm, being a clustering algorithm, can find any splitting of the data set and therefore is able to find the "correct" solution.

[4] The generator is accessible through ftp://schubert.ucdavis.edu/pub/multout


We should mention that both algorithms use the same objective function but different linear factors. The distance between the groups is important for the hardness of the data set and the detection of the correct groups; therefore the multiplier is chosen from the list (2, 4, 6), where the groups at a distance of 2 are slightly overlapping but clearly separated at a distance of 6. We apply the two algorithms to each of the 396 data sets. The only parameter for the FAST-MCD algorithm is the size of the cluster (which also defines the number of outliers), but it is used as a lower limit for the size and the cluster can grow larger in the final solution. The parameter setting for the seed clustering is chosen in the following way:

- The seed size and the walking seed size are set to p + 1.
- The convergence criterion is the storage of the complete seeds.
- The clustering parameters are set to g = 1, H = 40, and T to the number of data points that are declared as outliers of the data set.

Compared to the FAST-MCD, the number of outliers is fixed, which can be seen as "cheating". Later, another experiment is executed where we look for two clusters and no sizes are determined in advance. A preliminary analysis of the results received during the experiments shows an interesting property. The seed clustering either finds the correct solution, where the specified number of outliers is found as well as the cluster itself, or the solution could not be found at all, causing a large number of misclassifications. The FAST-MCD shows the same properties as the seed clustering, but it never found a solution with zero misclassifications: whenever the correct outlier group was detected, there were also some data points of the group marked as outliers. In our experiments, the number of misclassifications never exceeded the 5% mark. Therefore, the first observation of the experiment is the better solution quality of the seed clustering algorithm.

We mentioned before the breakdown point at which an algorithm is no longer able to find the correct solution. Table 7.6 shows the breakdown points of the seed clustering and the FAST-MCD algorithm for all data sets as well as the required running time. Both algorithms have the same breakdown point, or the seed clustering has a higher one. Actually, the FAST-MCD algorithm does not allow looking for a breakdown point of 50%, but with the seed clustering algorithm we can do this. In the table, this is shown with an entry "> 50" where the algorithm found two clusters of equal size, one containing the outliers. The good solution quality is in contrast to the running time of the algorithm: the FAST-MCD is faster than the seed clustering on all data sets.


                                   Breakdown Point
                      p (FAST-MCD)               p (Seed Clustering)
   n   Q          5     10     15     20       5     10     15     20
  200  2         40     20     15      0      40     35     20     20
  200  4         45     30     20     25      50    >50     45     40
  200  6         45     30     25     15      50    >50     50     40
  400  2         45     35     25     15      45     35     30     30
  400  4         50     40     25     25      50    >50     50     45
  400  6         50     40     30     25      50    >50    >50     48
  600  2         48     35     25     20      48     35     25     20
  600  4         50     35     35     30      50    >50    >50     40
  600  6         50     35     35     30      50    >50    >50    >50
                                        Time
  200           3.92  10.90  25.20  44.72    5.70  10.18  24.39  72.32
  400           7.51  23.96  44.37  80.30   34.33  47.30  93.37 183.72
  600           5.93  18.70  37.98  62.70  111.67 145.32 223.74 384.93

Table 7.6: Breakdown point of the FAST-MCD and seed clustering algorithms on different data sets and the time needed to find the solutions.

For the data sets with 200 and 400 data points, the FAST-MCD algorithm used approximately half the time of the seed clustering algorithm. The last size, 600, shows that FAST-MCD uses sub-sampling to find the solution, because otherwise the algorithm could not be faster on a data set containing more data points than in the runs with data sets of size n = 400. The knowledge about the number of outliers was used in the last experiment to set T, and that might have given the seed clustering an advantage over the FAST-MCD algorithm. In the next experiment, we look for two clusters without making any assumption about the clusters in the data set, but let the algorithm look for the natural clustering. This experiment, with the clustering parameters set to g = 2 and T = 0, was executed to perform a further analysis of the behavior of the seed clustering algorithm. Compared to FAST-MCD, the seed clustering is not specialized in finding only one cluster with a low determinant, but can find any number of clusters. Therefore, we decided to look for two clusters, where the second cluster is the group of outliers. Except for the data set in dimension p = 20 with n = 200 data points and a distance of two, the seed clustering found the "correct" solution in every case, even in the data sets with 50% outliers.


The one "failure" is caused by the high dimension of the data set combined with too few data points. The advantage of a good solution quality comes with the disadvantage of an increasing running time to find two clusters. Table 7.7 shows the time for the last experiment.

                 Time, p (Seed Clustering)
   n           5        10        15         20
  200       12.14     75.02    271.54     460.07
  400       77.33    200.78    159.65    1496.66
  600      246.50    575.21   1595.50    2875.10

Table 7.7: Time to cluster the data sets with g = 2 and T = 0 using the seed clustering algorithm.

For a fair comparison of the two algorithms, we should allow FAST-MCD to use the same amount of time to find its solution. We repeated the runs on the data sets of size 200 and 400 in dimension 10. The number of initial subsets was increased from 500 to 10000. Table 7.8 shows the times and breakdown points of this test.

                        Data Sets (p = 10)
                     n = 200            n = 400
   Q                2    4    6        2    4    6
  Breakdown Point  20   30   30       45   40   45
  Time                 197.75            307.56

Table 7.8: Breakdown point and time of the FAST-MCD algorithm using 10000 initial subsets.

Besides a different breakdown point (45% instead of 40%) on the data set of size 400 using a distance of four, we did not get an improvement of the results. Furthermore, we used even more time than the seed clustering needed to cluster the data sets with the clustering parameter g = 2. The number of initial subsets could be increased further, and eventually FAST-MCD would reach better breakdown points, but then it would use much more time than the seed clustering. We stop our analysis here with a remark: a complete comparison of the two algorithms needs more experiments, e.g., a larger variety of data sets and the usage of sub-sampling for the seed clustering algorithm in the case of large data sets. Thus, our experiments show that the seed clustering is competitive with the FAST-MCD by Rousseeuw.


FAST-MCD finds good results in a short amount of time, whereas the seed clustering provides a better solution quality at the cost of a longer running time. The last experiment indicates that FAST-MCD is not able to reach the same solution quality as the seed clustering even if it runs for the same amount of time. Improvements in the seed clustering algorithm, such as selecting fewer data points as initial seed points or using the mentioned analysis of the frequency table (see Section 6.9), where the cluster walking is not done as often as before, might bring the algorithm closer to the running-time performance of FAST-MCD without losing solution quality.

7.2.3 First-Improving Descent

The algorithm developed by Spath [Spa80] uses a Random Restart Descent to detect g clusters in the data set. Apart from the use of a first-improving selection in the neighborhood and the lack of a model for handling unassigned data points, the mathematical background is the same as the one we used for our Descent algorithm. Therefore, we forego a more detailed description of the algorithm and continue with the tests. We do not use the original source code, which is written in Fortran 67, but modify our previous implementation of the Descent strategy to match the algorithm by Spath. We rerun the same experiment as for the comparison between our algorithm and the algorithm by Rousseeuw. The running time of the Descent algorithm is determined by the number of random restarts. For a first trial run we set the number of restarts to 10. All results can be found on the CD-ROM. Using this small value for the number of restarts leads to a running time far below the times shown in Table 7.7. However, the First-Improving Descent only solves around 80% of the data sets correctly. The seed clustering was able to cluster every data set by using more time, and therefore the next run was done with 400 restarts of the Descent. This led to a running time between four and ten times as large as the running time of the seed clustering algorithm. The analysis of the results shows that more data sets are clustered "correctly" than before, but that in dimensions above 10 some data sets (around 27%) were still not solved correctly. We conclude that the First-Improving Descent is an algorithm that can find a solution in a very short time, but on the other hand, letting the algorithm run longer does not seem to help in cases where the solution could not be found. An interesting fusion of these two algorithms could be to execute the First-Improving Descent with two restarts and to use its solution as an initial solution for the seed clustering algorithm.


The simple problems would then be solved by the First-Improving Descent algorithm, and the seed clustering would converge faster because of the "good" estimation of the clusters. On the other data sets, which cannot be solved by the Descent, the seed clustering algorithm will need more walking steps to find the clusters because this estimate is missing.

7.3 Analysis against Local Search Strategies

The first part of the thesis was about local search strategies applied to the clustering problem. After introducing a new algorithm specialized for the clustering process, we now repeat the same experiments from the first part to analyze whether the seed clustering can beat the local search strategies. The parameter setting for the seed clustering algorithm is the one suggested by the experiments in Section 7.1. During the experiments for the local search strategies we had to use a seed for the random number stream. The seed clustering algorithm with a termination based on a convergence criterion does not use any random numbers, and therefore the tests do not have to be repeated with several seeds; one execution on each data set is enough. The results of the local search strategies are shown in Appendix B.3. The same experiments are done using the seed clustering algorithm; the results are shown in Table 7.9. The data set DS3 can always be solved without misclassifications, as expected, because it is in dimension five with two well separated clusters. The seed clustering solves these instances faster than the other three algorithms. On the next data set, DS1, the seed clustering finds a solution that is better than or equal to those of the other algorithms; only on the first instance did Reactive Tabu Search find a better objective function value. In five out of eight cases, the seed clustering is the fastest algorithm. The other algorithms, especially Simulated Annealing, need more time to find a solution, whereas Reactive Tabu Search shows a time performance that is almost as good as the seed clustering; in three cases, Reactive Tabu Search found the solution in a shorter amount of time. The GFID data set seems to be a hard instance for the seed clustering algorithm. The seed clustering needs less time to find a solution, but in contrast to the previous results, it finds the best solution compared to the other algorithms in only two out of eight cases. These last results show that the seed clustering performs better in terms of solution quality on data sets that contain well separated clusters, whereas the local search strategies find good solutions even on the GFID data sets, which have overlapping clusters.


  Data Set     Data Set Para.           Seed Clustering
               Size   Fr.   Seed        Time          f
  DS3           100    1    2343        0.23    -271.85
                100    1    3264        0.25    -185.82
                300    1    2343        4.65    -564.29
                300    1    3264        4.11    -520.71
                500    1    2343       15.20    -902.57
                500    1    3264       16.01    -826.65
  DS1           100    2    3264       55.48    -521.04
                100    1    3264       51.93    -548.68
                100    1    2343       13.65    -555.69
                100    2    2343       56.56    -596.42
                300    1    2343      225.25   -1281.64
                300    2    2343      471.63   -1282.68
                300    1    3264      100.38   -1192.70
                300    2    3264      236.18   -1195.92
  GFID          100    1    2343       10.59   -1035.21
                100    2    2343       11.84   -1050.67
                100    1    3264       10.65    -989.09
                100    2    3264       10.80    -987.80
                200    1    2343       51.13   -2015.02
                200    2    2343       51.17   -2015.02
                200    1    3264       46.19   -1925.33
                200    2    3264       41.60   -1929.31

Table 7.9: Application of the seed clustering algorithm to three different data sets to compare the results to the previous local search runs. The results of the other algorithms can be found in Appendix B.3.

This was the last experiment for the basic seed clustering algorithm. The following chapter will show one more example concerning sub-sampling and its time behavior.

Chapter 8

Sub-sampling to Process Large Data Sets

8.1 Sub-Sampling

A sub-sample can be seen as a randomly chosen subset of the data set Z. The clusters found in the sub-sample, represented by their means and covariance matrices, can afterwards be projected on the data set Z to get the final objective function value. This method is used for large data sets. There is a disadvantage hidden in the method: when we choose a number of data points from the data set as our sub-sample, it can happen that we did not pick enough data points from a small cluster, so this cluster is not detected during the clustering process. A protection against such small clusters in the data set is to allow unassigned points. During the projection of the sub-sample on the original data set, the small clusters are "covered" by the group of unassigned points. In a later phase, the clusters found in this process can be extracted from the data set, and the remaining data points, including the small group of unassigned points, are clustered again to detect the small clusters. Furthermore, the size of the sub-sample has to be chosen as a compromise between time and quality of the solution. The size of the sub-sample depends on the number of clusters g and the dimension p of the data set. We use the following equation to calculate the size of the sub-sample:

    n_s = α · g · p²                                    (8.1)

with α being a factor provided by the user and n_s ≤ n. The factor α should be at least 1 to provide enough data points in the sub-sample, but less than n/(g · p²) so that n_s does not exceed n. The random selection of sub-samples from the data set is repeated to increase the variety of sub-samples and the chance of finding one that is a good representation of the structure in the data set Z.
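As a small illustration of equation (8.1), the following C++ sketch computes the sub-sample size and draws a random sub-sample by row index. The names SubSampleSize and DrawSubSample are illustrative only and do not appear in the thesis implementation, which works on its own Data_Array and Mixed_Solution classes.

#include <algorithm>
#include <random>
#include <vector>

// Equation (8.1): n_s = alpha * g * p^2, capped at n so that n_s <= n always holds.
int SubSampleSize(double alpha, int g, int p, int n) {
    int ns = static_cast<int>(alpha * g * p * p);
    return std::min(ns, n);
}

// Draw a sub-sample of ns distinct row indices from {0, ..., n-1}.
std::vector<int> DrawSubSample(int n, int ns, std::mt19937& rng) {
    std::vector<int> rows(n);
    for (int i = 0; i < n; ++i) rows[i] = i;
    std::shuffle(rows.begin(), rows.end(), rng);
    rows.resize(ns);
    return rows;
}

Repeating DrawSubSample with different random states yields the variety of sub-samples described above.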


Figure 8.1: Visualization of the sub-sampling process. The data set contains n data points (A), out of which a number of sub-samples (B) is drawn. Each sub-sample has n_s data points. Note that the size of the slices corresponds to the number of data points in the clusters. The clustering result of each sub-sample is projected on the evaluation sample (C). The best solution on the evaluation sample, having the lowest objective function value, is projected on the original data set (D).

8.2 Evaluation of Sub-Sample Solutions on a Data Set

After receiving a solution σ_s for a sub-sample, the quality of this solution has to be measured in a certain way. We introduce two methods that can be used. The first uses the result of the sub-sampling, the covariance matrices Σ_i and the means z̄_i of all clusters i, i = 1, ..., g, to generate a feasible solution for the data set Z. The method used to generate a solution vector σ′ based on the covariance matrices and the means is described in Section 6.3.2. The objective function value f(σ′) can then be compared with the quality of solutions found before. The disadvantage of this method is the large number of distance calculations


that are necessary in a large data set. We have to do these calculations at least once to obtain the final result, but we can use a smaller data set, still larger than a sub-sample, to get a better estimate of the final solution quality. This larger sub-sample, called the evaluation sample, is used to project the sub-sample solution σ_s to a solution σ_e. Since it uses more data points than the sub-sample, the solution σ_e gives a more accurate estimate of the quality of the solution and of the clusters found. This step is done after each sub-sample calculation, and we keep the solution σ_e with the best objective function value f(σ_e). The last step is the projection on the data set Z after all sub-samples are calculated. The size of the evaluation sample is calculated using

    n_e = β · g · p²                                    (8.2)

with β being a factor provided by the user. The relation n_e ≤ n must hold in all cases. Figure 8.1 shows the sub-sampling process we used. Drawing a certain number of sub-samples and one evaluation sample from the data set Z and clustering all of them, we receive a final solution after the projection on the original data set Z.
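The projection of a sub-sample solution onto the evaluation sample can be sketched as follows: every point of the evaluation sample joins the cluster that is "closest" under that cluster's own metric, and a per-point score of the form ½(log|Σ_i| + squared Mahalanobis distance) is accumulated. This score is only a stand-in for the maximum likelihood objective of Section 6.3.2, and all names are hypothetical.

#include <cstddef>
#include <limits>
#include <vector>

using Vec = std::vector<double>;
using Mat = std::vector<Vec>;                 // dense p x p matrix

// Squared Mahalanobis distance (x - mu)^T SigmaInv (x - mu), with the inverse
// covariance supplied by the caller.
double Mahalanobis2(const Vec& x, const Vec& mu, const Mat& sigmaInv) {
    const int p = static_cast<int>(x.size());
    Vec d(p);
    for (int i = 0; i < p; ++i) d[i] = x[i] - mu[i];
    double q = 0.0;
    for (int i = 0; i < p; ++i)
        for (int j = 0; j < p; ++j)
            q += d[i] * sigmaInv[i][j] * d[j];
    return q;
}

// Assign every evaluation-sample point to its best cluster and return the
// accumulated score (lower is better).
double ProjectOnSample(const std::vector<Vec>& points, const std::vector<Vec>& means,
                       const std::vector<Mat>& sigmaInv, const std::vector<double>& logDet,
                       std::vector<int>& assignment) {
    const int g = static_cast<int>(means.size());
    double score = 0.0;
    assignment.assign(points.size(), -1);
    for (std::size_t j = 0; j < points.size(); ++j) {
        double best = std::numeric_limits<double>::max();
        for (int i = 0; i < g; ++i) {
            double v = 0.5 * (logDet[i] + Mahalanobis2(points[j], means[i], sigmaInv[i]));
            if (v < best) { best = v; assignment[j] = i; }
        }
        score += best;
    }
    return score;
}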

8.3 Handling of Unassigned Points

As mentioned before, the unassigned points are used as a protection against small clusters that are likely to be undetectable in the sub-sample. Defining a number of unassigned points T for the data set Z requires another calculation for the sub-sample: T has to be reduced to T_s unassigned points for the sub-sample to prevent allowing more points to be unassigned than there are data points available. The method we implemented, which is probably the simplest one, is the construction of a new problem for the sub-sample, keeping the properties of g clusters with at least H data points. The number of unassigned points is determined by the percentage that T specifies for the data set Z, which is also the percentage to look for in the sub-sample:

    T_s = (T / n) · n_s                                 (8.3)

with n being the number of data points in Z and n_s being the number of data points in the sub-sample.
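Equation (8.3) is a one-liner in code; rounding down is an assumption here, since the thesis does not state how the product is rounded, and the function name is illustrative.

// Number of unassigned points allowed in a sub-sample, scaled from the data set.
int UnassignedInSubSample(int T, int n, int ns) {
    return static_cast<int>((static_cast<double>(T) / n) * ns);
}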


8.4 Interaction between Sub-Samples

A naive version of the sub-sampling algorithm generates a certain number of sub-samples and applies the clustering to each of them. Afterwards, the best projection of a solution on the evaluation sample is used to generate the clusters in the original data set. The implementation of this is straightforward but does not use previously gained information to improve the clustering process. Therefore, the result of clustering one sub-sample can be used as the input for the next sub-sample search: the metrics of the clusters can replace the all-data metric, and the selection of the seed points can be done according to the means of the clusters from the last sub-sample. Further strategies to establish communication between the sub-samples in order to direct the search are discussed in Section 9.2 in the context of the parallelization of the clustering process.

8.5 Algorithm for Sub-sampling

The following algorithm shows the sub-sampling that is realized in this thesis. Furthermore, we show the time gain on a few examples when using sub-samples instead of clustering the whole data set. The algorithm has the following structure (a sketch of the control flow follows the list):

1. Generate the evaluation sample of size n_e and one sub-sample of size n_s.

2. Cluster the sub-sample. The initial metrics for the clusters have to be chosen arbitrarily at the beginning (but in a way that is affine equivariant). After finding a solution in the sub-sample, the covariance matrix Σ_i and mean z̄_i of each cluster i, i = 1, ..., g, of that solution can be used as the initial metrics for the next sub-sample. The arbitrary choice for the initial metric can be the all-data metric, which is as good or as bad as any other.

3. Construct a solution on the evaluation sample using Σ_i and z̄_i of all clusters i from the sub-sample solution. The solution is kept in the form of the metrics Σ_i^e and the means z̄_i^e of the clusters i = 1, ..., g.

4. If the objective function value of the evaluation sample solution is the best seen so far, set Σ_i = Σ_i^e and z̄_i = z̄_i^e for i = 1, ..., g.

5. If the number of sub-samples to draw from the data set Z has not been reached yet, generate a new sub-sample and go to 2.

6. Construct a solution on the data set Z using Σ_i and z̄_i, i = 1, ..., g.
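A condensed sketch of steps 1-6 is given below. It reuses the hypothetical DrawSubSample helper from the Section 8.1 sketch and leaves the actual clustering and projection as callables; the Solution struct is only a stand-in for the Location_and_Shape objects of the real implementation.

#include <functional>
#include <limits>
#include <random>
#include <vector>

struct Solution {                                          // cluster means and covariance matrices
    std::vector<std::vector<double>> means;
    std::vector<std::vector<std::vector<double>>> cov;
};

Solution SubSamplingDriver(
        int numSubSamples, int ns, int ne, int n, std::mt19937& rng,
        std::function<Solution(const std::vector<int>&, const Solution*)> clusterSample,
        std::function<double(const std::vector<int>&, const Solution&)> projectScore) {
    std::vector<int> evalSample = DrawSubSample(n, ne, rng);      // step 1
    Solution best;
    const Solution* warmStart = nullptr;                          // all-data metric at first
    double bestScore = std::numeric_limits<double>::max();
    for (int s = 0; s < numSubSamples; ++s) {                     // step 5 controls the loop
        std::vector<int> sub = DrawSubSample(n, ns, rng);
        Solution sol = clusterSample(sub, warmStart);             // step 2
        double score = projectScore(evalSample, sol);             // step 3
        if (score < bestScore) { bestScore = score; best = sol; } // step 4
        warmStart = &best;                                        // reuse metrics (Section 8.4)
    }
    return best;   // step 6: the caller projects 'best' onto the full data set Z
}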


The second sub-sample can use the results from the first sub-sample as the initial metrics instead of the all-data metric. This is a small improvement; a more sophisticated method, a self-tuning of the algorithm, is briefly described in Chapter 9 together with the parallelization of the seed clustering.

The example for the sub-sampling is based on a small experiment that demonstrates the influence of the parameter α on the running time and the solution quality. Furthermore, we clustered a large data set of more than 80000 data points with and without sub-sampling to show the improvement in time. In our experiments we could not find any evidence that the quality of the final solution is worse than the one obtained from the data set clustered without sub-sampling.

The first experiment was executed on the generated 4D1 data set with α = 1, 1.5, 2, and 2.5.

We used the setting SC1 and generated 20 instances of the problem. The results of the 20 runs are shown in Table 8.1.

                   α = 1.0    α = 1.5    α = 2.0    α = 2.5
  Avg. Time        1575.02    3427.03    6094.78   10102.93
  Avg. MisClas.     737.34          0          0          0

Table 8.1: Average time and misclassified data points of 20 runs on the data set 4D1 with different settings for α.

For α = 1 we were not able to cluster the data without misclassified data points, but starting with α = 1.5 the "correct" solution was found. With increasing α, the clustering of a sub-sample needs more time. We applied further tests to different generated data sets (namely 4D3, DS1, DS2, and GFID) and observed the same relation between α, the quality of the solution, and the time.

The second experiment was executed on the star data set used for the DPOSS project. This data set contains 83266 astronomical objects in dimension 15. First, the whole data set was clustered without sub-sampling to find g = 2 clusters. This took 582301 seconds on the SGI Origin 2000 with the R10000 CPU, which is approximately 6.7 days. Nowadays the problems are likely to be of a size like the star data or even larger, so a faster method is essential; an answering time of a few days is not efficient. Sub-sampling can decrease the time needed to find a solution, but it is only an alternative if the quality of the solution


stays the same. We clustered smaller data sets of around 5000 data points (e.g., 4D2, DS1) and were always able to find the same solution quality. We also applied the seed clustering to the star data using sub-sampling with α = 1.5 and got a running time of around 500 seconds for 10 sub-samples. This is less than one per mille of the original time. We used several parameter settings to see whether they have a large influence on the sub-sampling, but out of 100 runs we only received two different solutions. Compared to the solution from the large data set, one sub-sampling process classified 25 objects differently, whereas the other did so with 72 objects. This is less than a per mille of all data points, and therefore we cannot conclude that the sub-sampling has an influence on the quality of the solution. Summarizing, we suggest α = 1.5 to get a good relation between time and quality. A lower setting increases the speed but also the number of misclassifications, and therefore the costs are too high for a gain of 50% in the execution time. The parameter β was kept at a value where the evaluation sample has at least half the size of the whole data set, n/2.

8.6 Application Using Sub-Sampling: DPOSS Project

This example of an application is the DPOSS project, where a part of the sky is photographed in three different bands and stored in a computer readable form. The data is used to find new quasars1, further relations between astronomical objects, and to gain more information about the evolution of galaxies. This large data set, with more than 2 billion objects once everything is stored in the computer, will also be used to learn more about the evolution of the universe. DPOSS is the abbreviation for "Digitized Palomar Sky Survey", carried out by the Palomar Observatory as a survey of the entire northern sky. The sky is photographed using the Oschin Schmidt 48-inch telescope in three different colors. These approximately 900 photographic plates are scanned and analyzed to build an online catalog that can be accessed by everybody. The catalog will be available only via computer networks; there will be no printed versions. The final catalog is estimated to contain 50 million galaxies and 2 billion stars with 100 thousand quasars. Each astronomical object is described by 40 fields, but currently

1 A quasar is a neutron star that emits light in the form of jets. A quasar is seen as a flashing star due to its fast spin.


only 33 of them are used. The first six fields are used for the location of the object, its classification as a star or galaxy, and an identification number. This is followed by three groups, each having nine observations from different band passes on the plates. The first group is from the J plate, sensitive to blue light; the second group is from the F plate, sensitive to red light; and the third group is from the N plate, sensitive to near-infrared light. Appendix B.7 shows a table with the parameters and the ranges of the fields. More information can be found in the paper [ODBG98]. The plates are processed by a software package named "SKICAT" (the Sky Image Cataloging and Analysis Tool) developed by the Jet Propulsion Laboratory, the California Institute of Technology, and the National Aeronautics and Space Administration (NASA) (see [Fay98]). SKICAT is used to process the images, store the data in a database, and apply artificial intelligence classification techniques to the data. The software classifies the astronomical objects automatically without being supervised (see [FDW96]). The researcher can thus concentrate on the analysis of the astronomical objects and their relations instead of doing the primary classification. Figure 8.2 shows the process of cataloging the plates. The current status of the project is that almost all plates are done and nearly half of them are cataloged. From the research aspect, the project led to the discovery of over 30 of the most distant quasars and the beginning of a new catalog of galaxy clusters (see [Djo98]). We used a sample of 648291 objects for an experiment with a large data set in high dimension. This data set had to be adjusted for our needs by eliminating the first six fields of each object, because they only contain data that is not useful for the clustering process. Furthermore, some objects had missing data that cannot be handled by our software, and some objects had measured values outside the range given in the specification file. All these objects were deleted from the data set. The final data set used by us had 607438 objects, of which 83266 are classified as stars and the remaining 524172 as galaxies. All objects are in dimension 15 with the following elements in each band: AREA, IR2, SSBR, CSF, ELLIP. This is in contrast to the earlier mentioned data set, but these 15 dimensions were the only ones we were able to receive for this project. We applied the seed clustering algorithm to the whole data set as well as to a data set containing only the stars and a data set containing only the galaxies. Due to the lack of knowledge about astronomical objects, the interpretation is limited to the numerical and statistical analysis. Further research can be done by analyzing the results from the astronomical view to see if there is a common relation of the objects in one cluster.


Figure 8.2: Overview of the SKICAT plate cataloging process (from [FDW96]).

The clustering parameters were varied in certain ranges during the analysis. The number of clusters was varied in [2, ..., 4]. The parameter T was varied from zero, where all data points have to be assigned to a cluster, up to 20%. The unassigned points were mainly used to allow the clusters to be smaller and to get a separation between them. The parameter H was varied from a very small value of 30 up to 5% of the size of the data set; a large H might leave small clusters undetected. The parameters of the seed clustering were varied to check whether they influence the behavior and therewith the results of the clustering process. The experiment produced a large amount of results, of which we want to show only one example. In all clustering results, we were not able to discover clusters that are separated from each other. The most obvious result was the detection of one cluster with 4952 stars. The cluster was found independently of the clustering parameters or the setting of the seed algorithm. The means and the standard deviations of that cluster are shown in Table 8.2. The interesting part is the fact that all elements are close to zero besides the ones in the near-infrared fields. Therefore, these stars are similar in that they have only a near-infrared spectrum and none in the blue or red. Furthermore, the correlations between the fields in the near-infrared band N are about 0.2


higher than in all the other fields. There might be another possibility, namely that the values for the red and blue band were not available and that the value zero was used as a default, but inquiries at the source did not show any evidence of this.

                    AREA     IR2    SSBR     CSF   ELLIP
  J plate  Mean     0.00    0.01    0.00    0.01    0.01
  J plate  σ        0.00    0.00    0.00    0.00    0.00
  F plate  Mean     0.01    0.00    0.00    0.01    0.01
  F plate  σ        0.00    0.00    0.00    0.00    0.00
  N plate  Mean   216.41  137.72  146.28  101.20    0.25
  N plate  σ      360.01    0.52  179.93    0.44    0.15

Table 8.2: Mean and standard deviation of a cluster with 4952 data points found in the data set with all stars.

The clustering of the star data with the following parameters gave us a solution with another cluster with similar means and standard deviations: H = 30, g = 4, T = 0, a seed size of 17, w = 30, R = 10000. Here, the first cluster is the same as the one mentioned above with 4952 stars in it, whereas the fourth cluster contains 7698 stars. The characteristic is similar, but instead of the near-infrared band, the red band is non-zero. The next step could be the extraction of these two clusters for further interpretation or even further clustering. The remaining data set can be clustered again to see if the other two clusters can be split into smaller clusters. The experiments with the other data sets, the one with all astronomical objects and the one with only galaxies, did not reveal any further information from a statistical point of view, and without external help from an expert the interpretation was not possible. We clustered the data set with all astronomical objects with g = 2 to see if we could rediscover the galaxies and the stars, but without any success. First, there was no information about whether the classification into stars and galaxies was based on the fields from the plates, and second, we never got the data set which contains all nine fields from each plate. This experiment showed that we are able to handle large data sets and that we can find clusters with potentially interesting information. The seed clustering software is not able to find a meaning and interpretation for the clusters, and therefore the experts are still necessary whenever clustering is applied to a problem from a certain discipline.

Chapter 9

Parallelization

9.1 Extended Sub-sampling Model

The previous chapter introduced sub-sampling, which will be used as a foundation for a parallelization of the algorithm. The same idea of using sub-sampling for the parallelization is implemented in the software package CLARA developed by Kaufman and Rousseeuw [KR86]. Sub-sampling is a good starting point for the parallelization method because each sub-sample is independent of all other sub-samples from its creation until the projection of its solution on a common evaluation sample. The simplest parallelization assigns one processor to each sub-sample, executes the clustering, and stores all results. These results are projected on the evaluation sample, and the solution with the best objective function value is taken for the projection on the data set Z. In this approach, there is no communication between the processors. Because the clustering starts at the same time on all processors and no better metric is known, the initial metric is set to the all-data metric for all sub-samples. Most of the time, the number of sub-samples will exceed the number of processors, and therefore a processor can be reassigned to cluster another sub-sample after finishing one. Every result is projected on the evaluation sample and, in case it is better than the previous best solution, it is stored and used for later comparisons. Having a solution from a previous run, we can replace the all-data metric as the initial metric with a better estimation of the expected cluster shape. The following list gives some possible methods to choose the initial metrics for the next sub-sample:

- The metrics of the clusters in the sub-sample solution executed before on the same processor.


- The metrics of the clusters in the best sub-sample solution seen so far.
- The metrics of the clusters in the evaluation sample.

The quality of the solutions is compared, as before, by using the objective function value. The metrics in the evaluation sample might be the best estimation of the cluster metrics, since they are based on more data points than a sub-sample has. Nevertheless, we chose the second criterion in this thesis, using the best sub-sample solution1.

9.2 Memory Model

Besides the initial metrics, the algorithm has several parameters that influence the search progress. Whenever a processor finds a solution σ, it can compare it to the best solution σ* found so far. If the quality of σ is better, σ becomes the new best solution, σ* = σ, and next to the solution the settings of the parameters are stored in a special memory. If a solution of worse quality is found, the parameters for the next search can be modified so that the search is guided towards a better solution. Using this model, the algorithm would try to find parameter settings based on other solutions. This self-tuning approach is not implemented in this thesis and must be seen as a suggestion to continue the research in the discipline of clustering using the approach of seed walking. The following list suggests some ideas for the seed clustering (a sketch of a possible memory record follows the list):

- The seed size influences the walking. If the previous best solution was found with a relatively small seed size, an increase of it would lead to a longer walking time, and the seeds might walk to better locations.

- The sub-sample that caused the best solution can be stored and reclustered with different parameters, assuming that the sub-sample is a good representation of the data set. Especially if the frequency table stored the whole seed during the last clustering, the "best" sub-sample could be reclustered using every data point as a seed point to perform a more extensive search.

- Having found a good sub-sample, extra data points from the data set can be drawn before reclustering it to get a better estimation of the clusters.

1 The parallelization was not the main subject of the diploma thesis, and we did not run extensive experiments on it. It will be the subject of further research.


Figure 9.1: Visualization of the parallelization process using sub-samples. The data set contains n data points (A), out of which a number of sub-samples (B) is drawn. Each sub-sample has n_s data points. As before, the size of the slices corresponds to the number of data points. The first q sub-samples are assigned to one processor each (C), where q might be smaller than the number of sub-samples. Using previous information about metrics or parameter settings, the clustering is started. The results of all sub-samples (D) are stored in the memory and also projected on the evaluation sample (E). The results of the projection are stored to be used for later clustering. Whenever a processor finishes the clustering, a new sub-sample is assigned to it and the clustering is restarted, possibly using updated initial metrics and parameter settings. The best solution on the evaluation sample, having the lowest objective function value, is projected on the original data set (F).

- Having found a solution using the seed algorithm, one processor can be used to apply local search strategies starting from that solution. An exchange neighborhood, starting from a good solution, should terminate after only a few moves because the number of improving moves is expected to be small.

- Modifying the values of H and R to influence the repair mechanism.
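Since the thesis leaves the memory model unimplemented, the following struct is only one possible shape for such a memory record; every field name is illustrative and not taken from the thesis code.

#include <vector>

// One entry of the suggested parameter memory: the best objective value seen so
// far together with the setting that produced it and the sub-sample it came from.
struct MemoryEntry {
    double objective;                 // objective function value of the best solution
    int    creationSeedSize;          // seed size used while creating the seeds
    int    walkingSeedSize;           // seed size used during the cluster walking (w)
    int    H;                         // minimum cluster size
    int    R;                         // repair parameter
    int    refreshRate;               // metric refresh rate
    std::vector<int> subSampleRows;   // row indices of the sub-sample that produced the solution
};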


9.3 Parallel Algorithm

Figure 9.1 shows a possible model for a parallelization of the algorithm. Our realization for this thesis was done without implementing the memory for the parameter settings or applying self-tuning before clustering the next sub-sample. The algorithm can be described as follows (a sketch follows the list):

1. Generate the evaluation sample of size n_e and q sub-samples of size n_s, where q is the number of processors.

2. Assign each sub-sample to one processor.

3. Start the clustering process. The initial metrics for the clusters have to be chosen arbitrarily at the beginning. After finishing one sub-sample, the covariance matrix Σ_i and mean z̄_i of each cluster i, i = 1, ..., g, of its solution can be used for the next clustering of a new sub-sample. As before in the sub-sampling algorithm, the arbitrary choice for the initial metric can be the all-data metric.

4. Project the solution on the evaluation sample using Σ_i and z̄_i of the clusters from the sub-sample solution. During this time, no other processor is allowed to access the evaluation sample. The solution is kept in the form of the metrics Σ_i^e and the means z̄_i^e of the clusters i = 1, ..., g.

5. If the objective function value of the evaluation sample solution is the best seen so far, set Σ_i = Σ_i^e and z̄_i = z̄_i^e, i = 1, ..., g.

6. If the number of sub-samples to cluster is larger than the number of sub-samples that have been done plus the number of sub-samples that other processors are working on at this moment, a new sub-sample is generated. Continue with step 3.

7. Construct a solution on the data set Z using Σ_i and z̄_i, i = 1, ..., g.
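A sketch of the worker loop is shown below using C++ threads; the thesis implementation was written for the SGI Origin 2000 and does not necessarily use this mechanism. The Solution struct, the DrawSubSample helper, and the clusterSample / projectScore callables are the hypothetical ones from the sub-sampling sketch in Section 8.5; the mutex serializes access to the evaluation sample and the best solution, as required by step 4.

#include <functional>
#include <limits>
#include <mutex>
#include <random>
#include <thread>
#include <vector>

void ParallelSubSampling(
        int numSubSamples, int numProcessors, int ns, int ne, int n,
        std::function<Solution(const std::vector<int>&, const Solution*)> clusterSample,
        std::function<double(const std::vector<int>&, const Solution&)> projectScore) {
    std::mt19937 rng(4711);                                   // arbitrary seed
    std::vector<int> evalSample = DrawSubSample(n, ne, rng);  // step 1
    Solution best;
    double bestScore = std::numeric_limits<double>::max();
    int drawn = 0;
    std::mutex lock;                  // guards the rng, the counter, and the best solution

    auto worker = [&]() {
        for (;;) {
            std::vector<int> sub;
            Solution warm;            // private copy of the best metrics seen so far
            bool haveWarm = false;
            {
                std::lock_guard<std::mutex> guard(lock);      // step 6: draw the next sub-sample
                if (drawn >= numSubSamples) return;
                ++drawn;
                sub = DrawSubSample(n, ns, rng);
                if (bestScore < std::numeric_limits<double>::max()) { warm = best; haveWarm = true; }
            }
            Solution sol = clusterSample(sub, haveWarm ? &warm : nullptr);   // step 3, unlocked
            {
                std::lock_guard<std::mutex> guard(lock);      // steps 4 and 5: one processor at a time
                double score = projectScore(evalSample, sol);
                if (score < bestScore) { bestScore = score; best = sol; }
            }
        }
    };

    std::vector<std::thread> pool;                            // step 2: one worker per processor
    for (int q = 0; q < numProcessors; ++q) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
    // step 7: project 'best' onto the full data set Z (not shown)
}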

9.4 Time Behavior

The implementation of the parallel algorithm does not provide any communication between the processors, and therefore this version of the software owes its increase in speed entirely to the fact that more than one sub-sample is clustered at a time. This factor depends on the number of processors that can be used and on how the sub-samples are assigned to the processors. After one processor has found a solution in its sub-sample, the result in the form of means and covariance matrices of the clusters can be used as a starting point for the next sub-sample.


This information can increase the speed of the clustering because a "good" estimation of the locations and shapes of the clusters shortens the time for the walking2.

A short experiment demonstrates the parallelization and shows the time behavior. In theory, the speed-up factor should be the number of processors, because nothing other than clustering the same number of sub-samples is done. It can happen, however, that one processor is still clustering the last sub-sample while the others are finished and have to stay idle. A scenario of this is shown in Figure 9.2. The table shows the time used to cluster each sub-sample; the values are an idealization of an experiment to demonstrate the processor assignment.

  Sub-sample      1    2    3    4    5    6    7    8
  Process Time    9    3    4    7   10   10    5   13

Figure 9.2: Effect of scaling the data on the solution. Shown for g = 2 and g = 3 on a data set Z1 and its scaled counterpart Z2.

The multi-processor run is shown in the upper half. The first four sub-samples are assigned to one processor each, and for the next three time units all processors are busy. Whenever a processor finishes the calculation of one sub-sample, the next one is assigned to it.

2 This is due to the fact that the distances of the data points to the seed point of a cluster are calculated under a metric that is a better estimation than the one from the first run, where we assumed that the all-data metric was valid. Therefore, the walking can converge faster because of a better location to start from.


After nine time units, all sub-samples but one are assigned to a processor, so the last one must be assigned to processor one. After 14 time units, three out of four processors are done and stay idle while the last processor is busy clustering the last sub-sample for another eight time units. The lower half shows the time needed to cluster the same sub-samples on a single-processor system: after clustering one sub-sample, the next one can be generated and clustered. The time needed to cluster the eight sub-samples is higher than in the parallel version by a factor of 2.8. We did not get a speed-up of four due to the idle time of three processors, and therefore the development of a strategy to achieve an equal load on each of the processors is a subject of further research in this field. Assuming an unlimited number of sub-samples, a linear increase in speed by the factor of the number of processors should be seen. An experiment where we clustered 1000 sub-samples on four processors showed an increase by a factor of 3.76.
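The factor of 2.8 can be reproduced with a few lines of code that replay the greedy assignment of Figure 9.2: every sub-sample goes to the processor that becomes idle first. The program below is self-contained and only uses the process times from the table above.

#include <algorithm>
#include <iostream>
#include <vector>

int main() {
    const std::vector<int> times = {9, 3, 4, 7, 10, 10, 5, 13};
    const int processors = 4;
    std::vector<int> freeAt(processors, 0);       // time at which each processor becomes idle
    for (int t : times) {
        auto next = std::min_element(freeAt.begin(), freeAt.end());
        *next += t;                               // assign the sub-sample to that processor
    }
    int serial = 0;
    for (int t : times) serial += t;              // single-processor run: 61 time units
    int parallel = *std::max_element(freeAt.begin(), freeAt.end());   // makespan: 22 time units
    std::cout << "speed-up factor: " << static_cast<double>(serial) / parallel << "\n";  // ~2.8
    return 0;
}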

Chapter 10

Conclusion

This diploma thesis is about clustering algorithms and their application. We began with an introduction to general clustering algorithms and continued with the development of a probabilistic clustering method using the maximum likelihood model. The mathematical formulation of this affine equivariant method was used to create a representation of the solution and a neighborhood, which were used to implement several local search strategies. After running experiments for a performance analysis, we discovered that Simulated Annealing is a bad choice for clustering applications, but the other two tested local search strategies, Random Restart Descent and Reactive Tabu Search, perform well. Some disadvantages of these algorithms, e.g., having no convergence criterion and using random starting points, led to the development of an algorithm that is specialized in the task of clustering. The main idea for this algorithm was the use of components that can be replaced independently with other components. Thus, a component can be replaced by a better one without redesigning the whole algorithm. Furthermore, the algorithm was constructed in a way that makes sub-sampling and parallelization possible without changing the algorithm, but by reusing its components. The performance of the seed clustering algorithm shows that it is competitive with other methods that are used in commercial products. We compared our algorithm to affine equivariant methods like FAST-MCD by Rousseeuw and a First-Improving Descent as suggested by Spath, and we were able to get the same or better solution quality. Due to the more complex algorithm, the other methods are faster, but even when allowed to run longer they were not able to reach the solution quality that we received with the seed clustering. Next to the comparison with other algorithms, we analyzed different parameter settings.


We executed a small experiment where we analyzed the influence of three parameters on the results of the algorithm and came to the following conclusion: the refresh rate is important for the running time as well as the solution quality and has to be seen as a real parameter. The seed sizes during the creation of the seeds and the cluster walking influence the algorithm and therewith the running time and the solution quality, but we could not find a real connection between these parameters and the structure of the data set, such as dimension and size. These parameters can be set to a default value and do not have to be seen as real parameters. The default setting for the algorithm is given by seed sizes of p + 1, with p being the dimension of the data set, and the refresh rate set to the value 16. The last parameter is rather arbitrary, but in our experiments we received good results in a reasonable time and did not have enough evidence for a different setting. The algorithms in the diploma thesis were used in different applications. Simulated Annealing and Reactive Tabu Search were used to show the possible usage of clustering in data mining applications. We used the clustering algorithms to analyze the output of the same algorithm to find hypotheses which can suggest interesting features about the algorithm itself. For Simulated Annealing, we rediscovered the parameter setting suggested by Johnson et al. [JAMS89] by clustering the algorithm results. The data mining of the Reactive Tabu Search results led to the hypothesis that three parameters could be replaced by a single parameter. A subsequent experiment showed that the reduced parameter setting has no significant disadvantage compared to the setting with six parameters. The seed clustering was used to analyze the data set created by a project which is photographing and cataloging the sky (DPOSS). A statistical interpretation showed that we found some interesting solutions, but only an expert in this field can answer the question whether the result is interesting from an astronomical point of view. The seed clustering algorithm developed in this thesis is competitive and able to find good results. Nevertheless, the algorithm structure opens new fields for research. The components can be exchanged with other methods that are better in terms of time or quality. In Section 6.9 we described some ideas that can be implemented to improve the basic algorithm, but the extension of the communication between sub-samples in the parallel version as well as in the single-processor version should also be part of further research. To conclude, we would like to note that the seed clustering algorithm supports the analysis of multivariate data sets by showing a good performance and solution quality in the clustering process.

Appendix A

Mathematical Programming Formulation of MINO

As described in Chapter 3, this appendix shows the mathematical formulation of the minimization problem MINO for the maximum likelihood clustering. Given a data set Z with n data points, z_j ∈ Z, j = 1, ..., n, with z_j in

  > 0) Table_[a]--; }                                        // Decrease the frequency of one element
  // Increase the frequency of the data points in the whole solution
  void Inc_Frequenz_For_Seed_Solution (Mixed_Solution &Solution, int GroupNumber_Of_Outlier);
  int Report_Frequenz (int a) { return Table_[a]; }          // return the frequency of one element
  void Clean_Frequenz_Table ();                              // Set all frequencies to zero
  int Find_Lowest_Frequenz (int &Lowest);                    // Return the element number with the lowest frequency (first find)

  // Return element with lowest frequency (random choice if more than one element has it)
  int Get_Random_Lowest_Field (int &Lowest);
  void Init_Random_List();                                   // Used for random selection, can be reset from outside
  void Print();                                              // Print the frequency table

private:
  int   Size_;                                               // Number of elements
  long  Seed_;                                               // random number stream seed
  int   *Table_;                                             // Table to store the frequencies
  int   *Random_List;                                        // List to select a random element if more than one has the lowest frequency
  int   Random_Pos_;
};
#endif

List.h

// ======================================================================================
// C L U S T E R  -  S E E D   W A L K I N G   A L G O R I T H M
//
// Program is part of a diploma thesis by Torsten Reiners, UC Davis and TU Braunschweig
//
// Name of Module:  List for collecting tuples and sorting them.
//                  Two approaches: array and dynamical list. HEADERFILE
// Objects in file: Point, MyListElem, MyList, ArrayList
// Version:         1.0, Final Version for Diploma Thesis
// Last Update:     25-July-98
// Notes:           Research on Project continues, Release Version not reached yet
// Version Control and Updates done: Not available
// Future Projects: Comments, but only standard list operations are used
// ======================================================================================

#ifndef LIST_H #define LIST_H #include #include #include #include #include #include #include #include #include



#include "vectors.h"

#define SMALLER 0 #define BIGGER 1 // ====================================================================================================) // ====================================================================================================) class Point { public : Point (Real x1, Real x2, Real x3, Real x4) { x_[1]=x1; x_[2]=x2; x_[3]=x3; x_[4]=x4;} Point () { x_[1]=0;x_[2]=0;x_[3]=0;x_[4]=0; } void print () { cout Set_Next(LE); tail_ = LE; } else { header_ = tail_ = LE; } } // ====================================================================================================) // ====================================================================================================) template int MyList::pop (T& Pt) { MyListElem *LPtr = header_; if (header_ != NULL) { Pt = *header_->DataElement(); header_ = header_->Get_Next(); delete [] LPtr; return True; } else return False; } // ====================================================================================================) // ====================================================================================================) template void MyList::insert (MyListElem *LE, int SortDir) { MyListElem *LPtr = header_, *PrevPtr = NULL, *newelem = NULL; newelem = new MyListElem(*LE);

switch (SortDir) { case SMALLER: while ((LPtr != NULL) && (!(*(LE->DataElement()) < *(LPtr->DataElement())))) { PrevPtr = LPtr; LPtr = LPtr->Get_Next(); } break; case BIGGER: while (LPtr != NULL) { if ((*(LE->DataElement()) > *(LPtr->DataElement()))) break; PrevPtr = LPtr; LPtr = LPtr->Get_Next(); } break; } if (header_ == NULL) { newelem->Set_Next(header_);


// Pass the seed point ot the function

  Solution_Processing_Functions &get_SPF () { return *SPF_; }      // return Object that worked with the solution
  Location_and_Shape &Get_SeedLS () { return *Seed_LS_; }          // returns the location and shape of the seeds

  void Set_LS_For_Metric (Location_and_Shape &LS)                  // Pass the metric for the seed generation
  { LS_ = LS; }

  void Set_Seedsize (int a) {
  #ifdef __DEBUG__
      assert (a > Instance().X().p());
  #endif
      SeedSize_ = a;
  }

private:
  Solution_Processing_Functions *SPF_;
  int SeedSize_;                        // Size of the seed to produce
  int Starting_Point_;                  // Seed point of the seed

  // This Data Array represents the metric to do calculations in, default is the All-Data Metric.
  Location_and_Shape &LS_;
  Location_and_Shape *Seed_LS_;
};

// ==================================================================================================== // Generate a seed based on the informations (clusters) in the solution vector (seedpoint closest dp to the mean // ==================================================================================================== class Closest_to_Mean_Seeds_Maker : public Seed_Maker_Base { public: Closest_to_Mean_Seeds_Maker (MinW_Outlier_Problem& Inst, int a); void Set_Closest_Seeds_Size (int CSS) { SeedSize_ = CSS; } void Make_Seed (Mixed_Solution &Seed_Sol); private: int SeedSize_; };


// ==================================================================================================== // Same as the Closest_to_Mean_Seeds_Maker, but is using the virtual data point equal the mean of the cluster // ==================================================================================================== class Mean_Seeds_Maker : public Seed_Maker_Base { public: Mean_Seeds_Maker (MinW_Outlier_Problem& Inst, int a); void Set_Closest_Seeds_Size (int CSS) { SeedSize_ = CSS; } void Set_Cluster_Number (int ClusterNr) { ClusterNr_ = ClusterNr; } void Make_Seed (Mixed_Solution &Seed_Sol); private: int SeedSize_; int ClusterNr_; };

// ==================================================================================================== // Generate a seed solution by passing the location and shapes to this object // ==================================================================================================== class LS_Mean_Seeds_Maker : public Seed_Maker_Base { public: LS_Mean_Seeds_Maker (MinW_Outlier_Problem& Inst, Location_and_Shape **LS); void Set_Closest_Seeds_Size (int CSS) { SeedSize_ = CSS; } void Set_Cluster_Number (int ClusterNr) { ClusterNr_ = ClusterNr; } void Make_Seed (Mixed_Solution &Seed_Sol); private: Location_and_Shape **LS_; int SeedSize_; int ClusterNr_; };

// // // //

==================================================================================================== Repeat the application of a seed maker for a certain amount of iterations, (get the best out of x random seed creations, e.g. ====================================================================================================

class Iterated_Seed_Clusters : public HS_Base { public: Iterated_Seed_Clusters(Seed_Maker_Base &SMin, MinW_Outlier_Problem &MINOin, long &RandomSeed,Mixed_Solution *OldSol=NULL); ~Iterated_Seed_Clusters(); void Go(long Iters); private: Mixed_Solution *SeedSol_;

// Store the starting solution and the iterations of it.

MinW_Outlier_Problem &Inst_;

// Reference to the problem

Grow_Seeds_Cluster *GSC_;

// Storage for the Growing Cluster Algorithm

Seed_Maker_Base &SM_; long &RandomSeed_; };

// // // //

==================================================================================================== Call with solution that is a valid seed_cluster_solution but not feasible. This object will repair the clsuters and produce a feasible solution ====================================================================================================

class Grow_Seeds_Cluster : public HS_Base { public: Grow_Seeds_Cluster (MinW_Outlier_Problem &Inst, Mixed_Solution &Solution); virtual ~Grow_Seeds_Cluster (); void Set_RefreshRate (int r) { RefreshRate_ = r;} void Set_Solution (Mixed_Solution &Solution) {

// Set refresh rate for the metric update

APPENDIX E. SOURCE CODE

192

    delete Sol_;
    Sol_ = new Mixed_Solution (Solution);
  }

  void SetGfxfile (ostream *gf)                 // Activate the output of a gfxfile
  { gfxfile = gf; }

  void Go (long Iters);                                      // Iters is not used

  Mixed_Solution& SeedSoln() { return *Sol_; }               // returns a reference to the original solution
  virtual int IsValidSolution();                             // Checks, if it is possible to repair (otherwise asserts)

  // Add Solution
  virtual MinW_Outlier_Problem &Instance() { return Inst_; }
  virtual void Repair (int Group_Only = -1);

protected:
  void Recalculate_Location_Shape (int NumberOfLS = -1);
  virtual void Recalculate_DistanceList (ArrayList &PointList, int Group_Only = -1);

  int RefreshSteps_;                                         // Statistics

  Solution_Processing_Functions *SPF;
  int NeedsRepair_;                                          // Flag to mark if the solution is valid
  int RefreshRate_;                                          // Value of the time between refreshes, in points added
  int NumberOfGroups_;

private:
  MinW_Outlier_Problem &Inst_;
  Mixed_Solution &SeedSoln_;                                 // Reference to the Seed_Cluster_Solution for the grow & repair
  Mixed_Solution *Sol_;                                      // Copy of the initial solution to work on


Seedminwout.h

// ======================================================================================
// C L U S T E R  -  S E E D   W A L K I N G   A L G O R I T H M
//
// Program is part of a diploma thesis by Torsten Reiners, UC Davis and TU Braunschweig
//
// Name of Module:  Derivative of the MINW_Neighborhood to include the fixed seed point
//                  while improving the seeds using descent
// Objects in file: Seed_MinW_Outlier_Neighborhood
// Version:         1.0, Final Version for Diploma Thesis
// Last Update:     25-July-98
// Notes:           Research on Project continues, Release Version not reached yet
// Version Control and Updates done: Not available
// Future Projects:
// ======================================================================================

#ifndef SEEDMINWOUT_H #define SEEDMINWOUT_H

// minwout.h --- the minw problem and neighborhood with a group of outliers

// System includes #include #include #include #include #include #include #include #include



// Own includes #include #include #include #include #include

"solver.h" "data.h" "matrix.h" "minwout.h" "minw.h"

class Seed_MinW_Outlier_Neighborhood : public MinW_Outlier_Neighborhood { public: Seed_MinW_Outlier_Neighborhood(Mixed_Solution &CS, MinW_Outlier_Problem &Inst); ~Seed_MinW_Outlier_Neighborhood(); void Set_Fixed_Points(Data_Array &Array); // Pass an array with the points fixed in the neighborhood virtual Real Calc_Move_Out_Obj(int Ci, int Cj, int Oi, int Oj) const; // Overload this function private: Data_Array *X_; // Array of fixed data points }; #endif


Spf.h

// ======================================================================================
//
//              C L U S T E R  -  S E E D   W A L K I N G   A L G O R I T H M
//
//   Program is part of a diploma thesis by Torsten Reiners, UC Davis and TU Braunschweig
//
//   Name of Module:   Object to handle solutions and their clusters; a collection of
//                     methods. It mainly collects the data arrays for the clusters and
//                     their location and shape, and takes care of updating and
//                     accessing them. Header file
//   Objects in file:  Solution_Processing_Functions
//   Version:          Version 1.0, Final Version for Diploma Thesis
//   Last Update:      25-July-98
//   Notes:            Research on Project continues, Release Version not reached yet
//   Version Control and Updates done: Not available
//   Future Projects:
//
// ======================================================================================

#ifndef SPF_H
#define SPF_H

// System includes
#include #include #include #include #include #include #include #include



// Own includes
#include "minwout.h"
#include "data.h"
#include "solver.h"
#include "list.h"

// ====================================================================================================
// Object to store a solution and its clusters in the form of data arrays and their location and
// shapes. Furthermore, methods to provide access are implemented. Note that a cluster can also be
// a seed; in general it can be called a group.
// ====================================================================================================

class Solution_Processing_Functions {
public:
    // Constructors, either with the number of groups defined in the problem or passed as a parameter
    Solution_Processing_Functions (Mixed_Solution &Sol, MinW_Outlier_Problem &MINO);
    Solution_Processing_Functions (Mixed_Solution &Sol, MinW_Outlier_Problem &MINO, int NumberGroup);

    ~Solution_Processing_Functions();                      // Destructor

    void Build_Member_Array();                             // Redo the counting of elements in each cluster
    void Build_Data_Arrays();                              // Redo the construction of the data arrays for the clusters
    void Build_Location_Shape();                           // Redo the location and shape of the clusters

    // Move a data point from one cluster to another. Everything is updated besides LS
    void Move_Data_x_to_GRP (int row, int to);
    // Move a data point from one cluster to another. Everything must be updated
    void Move_Data_x_to_GRP1 (int row, int to);

    // Return a value from a cluster, specified by the number of the data point and the dimension
    inline Real da_Value (int Array, int row, int col) { return GroupData_[Array]->Data(row,col); }
    // Return the original data point number of a data point in a cluster
    inline int ds_Value (int Array, int row) { return DataSolution_[Array][row]; }
    // Return the number of elements in a cluster
    inline int Member (int Nr) { return Member_[Nr]; }

    Data_Array &get_DA(int Nr) { return *GroupData_[Nr]; }                                    // Return a reference to a data array of a cluster
    Location_and_Shape &get_LS(int Nr) { Build_Location_Shape(); return *Loc_Shape_[Nr]; }    // Return a reference to the LS of a cluster

    int Get_Highest_Cluster_Number();                      // Return the number of the highest cluster

    inline int Data_Row_in_which_Group (int row) { return Sol_.x_int(row); }                  // Return the number of the cluster for a data point

    int NumberOfGroups() { return NumberOfGroups_; }       // Return the number of groups used for setting up the structures

    int Check_H_T_valid ();                                // Checks if the actual solution has enough members in each group; T might be > Tin

    void Set_Solution (Mixed_Solution &Sol);               // Pass a new solution to the object. All structures have to be updated
    void Set_NumberOfGroups (int a) { NumberOfGroups_ = a; }    // Change the number of groups for the data set
    Mixed_Solution& Get_Solution () { return (Sol_); }          // Return the solution vector

    // A number of methods to create structures of distances between data points.
    // For a more detailed description see the .cc file
    void Create_Distance_Matrix                  (Data_Array &Data);
    void get_Distance_List_For_Point_From_Matrix (ArrayList &ML, int pnt);
    void get_Distance_List_For_Point             (Location_and_Shape &LS, ArrayList &ML, int pnt);
    void get_Squared_Distance_List_For_Outlier   (ArrayList &ML);

    // Return the distance between two data points (from the matrix)
    inline Real get_Distance_Matrix (int x, int y)
      { if (DistanceMatrix_ != NULL) return DistanceMatrix_->m_Value(y,x); else return 0.0; }

protected:
    void Initalize_Variables ();                           // Initialize all variables when creating a new instance

private:
    int                  *Member_;          // Number of members in each group (for G+1 groups)
    Location_and_Shape  **Loc_Shape_;       // Array of location and shape of each group
    Data_Array          **GroupData_;       // Array of Data_Arrays (one for each group)
    int                 **DataSolution_;    // Stores for every entry in GroupData_ the position in the original data array
    int                  *PositionInGrp_;   // Stores for each original data point the position in the data arrays
    int                  *LS_Update_Flag_;  // Flag to indicate that the location and shape has to be updated

    Mixed_Solution       &Sol_;             // Storage for the solution vector
    MinW_Outlier_Problem &MINO_;            // Storage for the problem base
    Matrix               *DistanceMatrix_;  // Storage for a distance matrix for one group

    int NumberOfGroups_,
        NOGA_,                              // Number of groups the arrays are built with
        NOGL_;                              // Number of groups the location and shapes are built with
    int Max;
    int Update_Member, Update_Data, Update_LS;   // Boolean flags to keep track if an update is necessary (TRUE)
};

#endif
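The bookkeeping cycle of Solution_Processing_Functions is easier to see in a small example. The following is a minimal sketch only, not part of the thesis code; it assumes that the solution and problem objects come from solver.h and minwout.h as declared, and the function name as well as the point and group indices are placeholders.

#include "spf.h"

// Illustrative only: move one data point to another group and query the updated structures.
int count_after_move (Mixed_Solution &sol, MinW_Outlier_Problem &problem,
                      int point, int target_group)
{
    Solution_Processing_Functions spf (sol, problem);
    spf.Build_Member_Array();                       // count the members of each group
    spf.Build_Data_Arrays();                        // build one Data_Array per group
    spf.Build_Location_Shape();                     // build the location and shape of each group

    spf.Move_Data_x_to_GRP (point, target_group);   // move the point; location and shape are not updated yet
    spf.get_LS (target_group);                      // get_LS() triggers the rebuild of location and shape

    return spf.Member (target_group);               // number of points now counted in the target group
}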


Walking.h

// ======================================================================================
//
//              C L U S T E R  -  S E E D   W A L K I N G   A L G O R I T H M
//
//   Program is part of a diploma thesis by Torsten Reiners, UC Davis and TU Braunschweig
//
//   Name of Module:   Walking of seeds and clusters in the solution space. Header file
//   Objects in file:  Walking_Seeds
//                     Walking_Seed (also used for clusters)
//   Version:          Version 1.0, Final Version for Diploma Thesis
//   Last Update:      25-July-98
//   Notes:            Research on Project continues, Release Version not reached yet
//   Version Control and Updates done: Not available
//   Future Projects:
//
// ======================================================================================

#ifndef WALKING_H
#define WALKING_H

// System includes
#include #include #include #include #include #include #include #include



// Own includes
#include "data.h"
#include "solver.h"
#include "list.h"
#include "seedcl.h"
#include "spf.h"

// ====================================================================================================
// Moving of seeds in the solution space. The object works on a solution vector.
// ====================================================================================================
class Walking_Seeds : public HS_Base {
public:
    Walking_Seeds (Seed_Maker_Base &SMin, MinW_Outlier_Problem &Inst, Mixed_Solution &Solution);
    ~Walking_Seeds ();

    void Set_RandomSeed (long &Seed)               // Set the seed value for the random number stream. Influences the hash table
      { Seed_ = Seed; delete Seed_Hash_Table; Seed_Hash_Table = new K64_Hash (Inst_.X().n(), Seed_); }

    void Set_Refresh (int Refresh) { GSC->Set_RefreshRate (Refresh); }        // Set refresh rate for growing
    void Set_Solution (Mixed_Solution &Solution) { *SeedSoln_ = Solution; }   // Pass a new solution to use for walking
    void SetDumpFile (ostream *df) { dmpfile = df; }                          // Pass stream to dump information to a file
    void SetGfxFile (ostream *gf) { GSC->SetGfxfile (gf); gfxfile = gf; }

    void Go (long Iters);                          // Start the walking

private:
    MinW_Outlier_Problem &Inst_;
    Mixed_Solution &Sol_;
    Mixed_Solution *SeedSoln_;                     // Reference to the Seed_Cluster_Solution for the grow&repair

    Grow_Seeds_Cluster *GSC;                       // Growing algorithm for the seeds

    Seed_Maker_Base &SM_;                          // Reference to the seed maker passed to the walking

    K64_Hash *Seed_Hash_Table;                     // Pointer to the hash table for convergence of walking

    long Seed_;                                    // Seed value for the random number stream

    ostream *dmpfile;                              // NULL is no dump, otherwise the stream where to dump to
    ostream *gfxfile;

    int Number_Of_Hits_To_Stop;                    // How often to see a solution before stopping
};

// ====================================================================================================
// Object for moving a single seed or cluster in the solution space. Next to the problem, the seed
// size a, the solution with all clusters, and the shrinking factor for the candidate list have to be
// passed. The full solution is passed, the number of the cluster/seed to move is set with the
// according function, and the process is started with Go. The result is a seed of size a, which can
// later be grown to a full-sized cluster.
// ====================================================================================================
class Walking_Seed : public HS_Base {
public:
    Walking_Seed (MinW_Outlier_Problem &Inst, int a, Mixed_Solution &Seed_Sol, Real CLF = 1.0);
    ~Walking_Seed ();

    void Set_Closest_Seed_Size (int CSS) { SeedSize_ = CSS; }              // Change the size of the seed
    void Set_Cluster_Number (int ClusterNr) { ClusterNr_ = ClusterNr; }    // Set the number of the cluster/seed to move

    void Set_RandomSeed (long &Seed)               // Sets the seed for the random number stream, used for the hash table
      { Seed_ = Seed; delete Seed_Hash_Table; Seed_Hash_Table = new K64_Hash (Inst_.X().n(), Seed_); }

    void SetDumpFile (ostream *df) { dmpfile = df; }                       // Extra output in a dump file, default off
    void SetGfxFile (ostream *gf) { gfxfile = gf; }

    Location_and_Shape &get_LS () { return *LS_; }                         // Returns the final location and shape of the cluster that was moved

    Real get_FirstDeter () { return FirstDeter; }                          // The initial value of the determinant of the cluster
    Real get_Deter ()      { return LastDeter; }                           // The final value of the determinant of the cluster

    Solution_Processing_Functions &get_SPF () { return *SPF_; }            // Returns the object used for the processing of the
                                                                           // solution. Can be reused to save calculation time

    void Set_Solution (Mixed_Solution &Solution) { Seed_Sol = Solution; }  // Pass a new solution to the walking object, restart with Go

    void Go (long Iters);                                                  // Start the walking of the specified cluster/seed

private:
    MinW_Outlier_Problem &Inst_;         // Problem to work on
    Mixed_Solution &Seed_Sol;            // Solution with the cluster/seed to move

    K64_Hash *Seed_Hash_Table;           // Hash table for convergence

    long Seed_;                          // Random number stream seed
    int SeedSize_;                       // Size of the seed to grow to
    int ClusterNr_;                      // Number of the seed/cluster to walk

    ostream *dmpfile;                    // Output for extra information
    ostream *gfxfile;

    Real CLF_;                           // Factor for shrinking of the candidate list

    Real LastDeter;                      // Storage for the determinants
    Real FirstDeter;
    int Steps_;

    Location_and_Shape *LS_;             // Final location and shape of the seed/cluster
    Solution_Processing_Functions *SPF_; // Object to handle the solution vector
};

#endif
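A single walking run can be set up as in the following sketch. It is not part of the thesis code; the function name, the random seed and the iteration count are placeholders, and it assumes that walking.h and the types it uses (Real, Mixed_Solution, MinW_Outlier_Problem) compile as listed above.

#include "walking.h"

// Illustrative only: walk one seed/cluster and report whether its determinant improved.
int walk_one_seed (MinW_Outlier_Problem &problem, Mixed_Solution &seed_solution,
                   int seed_size, int cluster_to_move)
{
    long rng_seed = 12345L;                       // placeholder seed for the random number stream

    Walking_Seed walker (problem, seed_size, seed_solution);   // CLF (candidate list factor) defaults to 1.0
    walker.Set_Cluster_Number (cluster_to_move);  // which seed/cluster to walk
    walker.Set_RandomSeed (rng_seed);             // rebuilds the hash table used for convergence
    walker.Go (100);                              // placeholder iteration count passed to Go()

    Real before = walker.get_FirstDeter();        // determinant of the cluster before walking
    Real after  = walker.get_Deter();             // determinant after walking
    return (after < before);                      // 1 if the walked cluster became tighter
}

Walking_Seeds works analogously on the level of the full solution: it additionally receives a Seed_Maker_Base and, according to the member comments above, uses a hash table together with Number_Of_Hits_To_Stop to decide when the walking has converged.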

Analyse.cpp

// ======================================================================================
//
//              C L U S T E R  -  S E E D   W A L K I N G   A L G O R I T H M
//
//   Program is part of a diploma thesis by Torsten Reiners, UC Davis and TU Braunschweig
//
//   Name of Module:   Functionality for analysing a solution, used in the beginning of
//                     the thesis, not updated anymore
//   Objects in file:  Analyse
//   Version:          Version 1.0, Final Version for Diploma Thesis
//   Last Update:      25-July-98
//   Notes:            Research on Project continues, Release Version not reached yet
//   Version Control and Updates done: Not available
//   Future Projects:  Extend and make it usable with interface
//
// ======================================================================================

// Own includes
#include "vectors.h"
#include "solution.h"
#include "data.h"
#include "analyse.h"
#include "utils.cc"

#define PI 3.1415 // ==================================================================================================== // ==================================================================================================== void Analyse::Create_and_Initialize (Data_Array& X, const Mixed_Solution& Sol) { // First Count the Number of Groups in the Soultion Member = new int[Sol.Size()+1]; for (int k=0; kx_Gets (l,GroupData_[col]->Data(j,l)); // Calculate Distance Real Dist = Loc_Shape_[row]->Mahalanobis(*FPVector); // Found smaller Distance ? if ((MinDist == -1) || ((MinDist > Dist) && (MinDist != -1))) MinDist = Dist;


delete FPVector; } } else MinDist = 0; ClosestMatr_->m_Gets (row,col,MinDist); MinDist = -1; } } } // Creating the Distance from the mean to the furthest of same Group

for ( i= 0; ip()); for (int k=0; k< Loc_Shape_[i]->p(); k++) { FPVector->x_Gets (k,GroupData_[i]->Data(j,k)); } Real Dist = Loc_Shape_[i]->Mahalanobis(*FPVector); if ((MinDist == -1) || ((MinDist < Dist) && (MinDist != -1))) MinDist = Dist; delete FPVector; } DistMatr_->m_Gets(i,i,MinDist); MinDist = -1; } } void Analyse::print_all_groups() { for (int i = 0; iPrint(); } }
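// --------------------------------------------------------------------------------------------------
// Illustrative helper, NOT part of the original Analyse.cpp: a minimal, self-contained sketch of the
// squared Mahalanobis distance d^2 = (x - mean)^T * inv_cov * (x - mean) that the calls to
// Loc_Shape_[..]->Mahalanobis() above rely on. All names below are hypothetical.
// --------------------------------------------------------------------------------------------------
#include <vector>

double squared_mahalanobis (const std::vector<double> &x,
                            const std::vector<double> &mean,
                            const std::vector< std::vector<double> > &inv_cov)
{
    const std::size_t p = x.size();
    std::vector<double> diff (p);
    for (std::size_t i = 0; i < p; ++i)
        diff[i] = x[i] - mean[i];              // centre the point at the group mean

    double dist = 0.0;
    for (std::size_t i = 0; i < p; ++i)        // quadratic form with the inverse covariance matrix
        for (std::size_t j = 0; j < p; ++j)
            dist += diff[i] * inv_cov[i][j] * diff[j];
    return dist;
}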

// ====================================================================================================
// ====================================================================================================
void Analyse::Output_Covariance_Matrix (ostream& output)
{
    // Output the covariance matrix to outputfile
    for (int coematr = 0; coematr < NumberOfGroup_; coematr++)
    {
        output