Mining the Data From Experiments on Algorithms

Mining the Data From Experiments on Algorithms using Maximum Likelihood Clustering

David L. Woodruff (1), Ramanpreet Singh (1), and Torsten Reiners (2)

(1) University of California at Davis, Davis CA, USA
(2) Technische Universität Braunschweig, Braunschweig, Germany

Contact: Prof. David L. Woodruff, GSM, U.C. Davis, Davis CA 95616, USA, dlwoodruff@ucdavis.edu

December 1998


Abstract

Data mining represents an exciting information technology frontier. This paper connects statistics, computer science, and operations research using information technology as the catalyst. We examine the use of heuristic search as a basis for data mining algorithms and we demonstrate that data mining can be used to improve heuristic search algorithm performance. In summary, we describe an important data mining problem, provide a neighborhood structure for the problem, and demonstrate its value for heuristic algorithm development.

1 Introduction

Data mining represents an exciting information technology frontier with numerous opportunities to exploit data in new ways (see e.g., [3, 15]). We examine the use of heuristic search as a basis for data mining algorithms and we demonstrate that data mining can be used to improve heuristic search algorithm performance. The data mining community often refers to the activities of interest to us as data segmentation. The statistics community refers to them as cluster finding (with no a priori metric). The connections between data mining and statistics are fairly clear. For example, Glymour et al. [11] refer to data mining as being "on the interface of computer science and statistics." One of our goals is to highlight potential contributions of and to operations research.

Statistical methods of finding clusters in data are described and applied to data generated by the parameters and performance of algorithms applied to a hard problem: the problem of finding maximum likelihood clusters in data. The analysis of algorithms applied to hard problems gives rise to a new set of hard problems. How can the results of experiments on algorithms be exploited? In this paper we describe some statistical methods of exploratory algorithm performance analysis. Often, one has a hypothesis concerning algorithm performance and experiments are conducted to test it. Certainly, this is an important activity. However, in this paper we are interested in using algorithm performance data to see if new relationships can be discovered rather than in testing hypothesized relationships.
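To make concrete what "data generated by the parameters and performance of algorithms" might look like, the following is a minimal sketch and not the experimental setup used in this paper. It assumes a hypothetical toy heuristic (`toy_heuristic`, a simple random search) applied to a hypothetical one-dimensional problem, and it collects one row per run containing a problem descriptor, the algorithm parameters, and the observed results. All function names, parameters, and values are illustrative assumptions.

```python
import random


def toy_heuristic(target, step_size, iterations, seed):
    """Hypothetical random search minimizing (t - target)^2, starting at t = 0."""
    rng = random.Random(seed)
    best_t = 0.0
    best_f = (best_t - target) ** 2
    evaluations = 0
    for _ in range(iterations):
        # Propose a random move around the best point found so far.
        candidate = best_t + rng.uniform(-step_size, step_size)
        value = (candidate - target) ** 2
        evaluations += 1
        if value < best_f:
            best_t, best_f = candidate, value
    # Results: best objective value found and a measure of effort.
    return best_f, evaluations


def run_experiments(targets, step_sizes, iteration_counts, seeds):
    """Run every combination and return one record (row) per run."""
    records = []
    for target in targets:
        for step_size in step_sizes:
            for iterations in iteration_counts:
                for seed in seeds:
                    best_f, evaluations = toy_heuristic(
                        target, step_size, iterations, seed
                    )
                    # Row layout: problem descriptor, parameters, results.
                    records.append(
                        (target, step_size, iterations, best_f, evaluations)
                    )
    return records


records = run_experiments(
    targets=[3.0], step_sizes=[0.1, 1.0], iteration_counts=[10, 100], seeds=[0, 1, 2]
)
print(len(records), "runs with", len(records[0]), "values each")
```

A table of this kind, with one row per run, is the sort of dataset in which one could then look for clusters rather than test a prespecified hypothesis.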

The self-referential nature of our work requires some modifications to standard notation. A simple example is that the optimization literature typically uses x as a variable while the statistics literature typically employs x to represent data; since we span the fields, we eschew the use of x altogether to avoid confusion.

We define the generic hard problem to which algorithms are applied as

$$\min f(\tau) \quad \text{subject to: } \tau \in \Xi \qquad \text{(P)}$$

where the set $\Xi$ is intended to summarize the constraints placed on the decision vector $\tau$. We refer to all data for the problem (the data that specifies the objective function $f(\cdot)$ and $\Xi$) as (P). In some cases it may be convenient to use a vector $P$ that implies, rather than specifies, the problem instance. For the problem instances of interest to us here, there are no known algorithms that can find provably optimal solutions in a reasonable amount of time, so heuristic algorithms are employed that find reasonably good solutions fairly quickly.

We will refer to the heuristic algorithm parameters as $\pi$. We denote by $h(\pi, P)$ the vector giving the results of applying a heuristic algorithm parameterized by $\pi$ to the problem instance specified by (P). The results are a vector because at the very least they will include the best value of $f(\cdot)$ found and some measure of the time or effort required to find it, and in many situations we will be interested in additional statistics.

During experimentation, we execute the algorithm and the result is a tuple

$$(P, \pi, h(\pi, P)).$$

It will be convenient to refer to one such "record" as a vector $z$ and the (arbitrarily ordered) collection of records from multiple runs as $Z$. We might also refer to a set of records, $Z$, as a dataset (or a sample or a population, depending on the context). To be consistent with the statistics literature on clustering, we will refer to the dimension of each vector as $p$ and the number of vectors as $n$. In this paper, we examine ways of looking for naturally occurring clusters in a dataset, $Z$, of points in
