Mining the Data From Experiments on Algorithms

Mining the Data From Experiments on Algorithms using Maximum Likelihood Clustering

David L. Woodruff (1), Ramanpreet Singh (1), and Torsten Reiners (2)

(1) University of California at Davis, Davis CA, USA
(2) Technische Universität Braunschweig, Braunschweig, Germany

Contact: Prof. David L. Woodruff, GSM, U.C. Davis, Davis CA 95616, USA, dlwoodruff@ucdavis.edu

December 1998


Abstract

Data mining represents an exciting information technology frontier. This paper connects statistics, computer science, and operations research using information technology as the catalyst. We examine the use of heuristic search as a basis for data mining algorithms and we demonstrate that data mining can be used to improve heuristic search algorithm performance. In summary, we describe an important data mining problem, provide a neighborhood structure for the problem, and demonstrate its value for heuristic algorithm development.

1 Introduction

Data mining represents an exciting information technology frontier with numerous opportunities to exploit data in new ways (see, e.g., [3, 15]). We examine the use of heuristic search as a basis for data mining algorithms and we demonstrate that data mining can be used to improve heuristic search algorithm performance. The data mining community often refers to the activities of interest to us as data segmentation. The statistics community refers to them as cluster finding (with no a priori metric). The connections between data mining and statistics are fairly clear. For example, Glymour et al. [11] refer to data mining as being "on the interface of computer science and statistics." One of our goals is to highlight potential contributions of and to operations research.

Statistical methods of finding clusters in data are described and applied to data generated by the parameters and performance of algorithms applied to a hard problem: the problem of finding maximum likelihood clusters in data. The analysis of algorithms applied to hard problems gives rise to a new set of hard problems. How can the results of experiments on algorithms be exploited?

In this paper we describe some statistical methods of exploratory algorithm performance analysis. Often, one has a hypothesis concerning algorithm performance and experiments are conducted to test it. Certainly, this is an important activity. However, in this paper we are interested in using algorithm performance data to see if new relationships can be discovered rather than in testing hypothesized relationships.

The self-referential nature of our work requires some modifications to standard notation. A simple example is that the optimization literature typically uses x as a variable while the statistics literature typically employs x to represent data; since we span the fields, we eschew the use of x altogether to avoid confusion. We define the generic hard problem to which algorithms are applied as

$$\min f(\xi) \quad \text{subject to: } \xi \in \Xi \qquad \text{(P)}$$

where the set $\Xi$ is intended to summarize the constraints placed on the decision vector $\xi$. We refer to all data for the problem, that is, the data that specifies the objective function $f(\cdot)$ and $\Xi$, as (P). In some cases it may be convenient to use a vector $P$ that implies, rather than specifies, the problem instance.

For the problem instances of interest to us here, there are no known algorithms that can find provably optimal solutions in a reasonable amount of time, so heuristic algorithms are employed that find reasonably good solutions fairly quickly. We will refer to the heuristic algorithm parameters as $\theta$. We denote by $h(\theta, P)$ the vector giving the results of applying a heuristic algorithm parameterized by $\theta$ to the problem instance specified by (P). The results are a vector because at the very least they will include the best value of $f(\cdot)$ found and some measure of the time or effort required to find it, and in many situations we will be interested in additional statistics.

During experimentation, we execute the algorithm and the result is a tuple

$$(P, \theta, h(\theta, P)).$$

It will be convenient to refer to one such 'record' as a vector $z$ and the (arbitrarily ordered) collection of records from multiple runs as $Z$. We might also refer to a set of records, $Z$, as a dataset (or a sample or a population, depending on the context). To be consistent with the statistics literature on clustering, we will refer to the dimension of each vector as $p$ and the number of vectors as $n$. In this paper, we examine ways of looking for naturally occurring clusters in a dataset, $Z$, of points in