Very Fast Outlier Detection in Large Multidimensional Data Sets


Amitabh Chaudhary¹, Alexander S. Szalay², and Andrew W. Moore³

¹ Dept. of Computer Science, Johns Hopkins University, Baltimore, MD 21218; [email protected]
² Dept. of Physics & Astronomy, Johns Hopkins University, Baltimore, MD 21218; [email protected]
³ Robotics Inst. & School of Computer Sc., Carnegie Mellon Univ., Pittsburgh, PA 15213; [email protected]

Abstract. Outliers are objects that do not comply with the general behavior of the data. Applications such as exploration in science databases need fast interactive tools for outlier detection in data sets that have unknown distributions, are large in size, and are in high-dimensional space. Existing algorithms for outlier detection are too slow for such applications. We present an algorithm, based on an innovative use of k-d trees, that does not assume any probability model and is linear in both the number of objects and the number of dimensions. We also provide experimental results showing that this is indeed a practical solution to the above problem.

1 Introduction

Outliers are the rare or atypical data objects that do not comply with the general behavior or model of the data. Applications such as fraud detection, customized marketing, network intrusion detection, weather prediction, pharmaceutical research, and exploration in science databases require the detection of outliers. There are many known algorithms for detecting outliers, but most of them are not fast enough when the underlying probability distribution is unknown, the size of the data set is large, and the number of dimensions in the space is high.

There are, however, applications that need tools for fast detection of outliers in exactly such situations. Astronomers, for example, discover new kinds of heavenly bodies by studying those that are atypical in the space of some subset of the set of attributes. As a result, users of astronomical databases such as the Sloan Digital Sky Survey (SDSS) require an interactive OLAP filter that automatically presents just the top f fraction of the rarest objects from the large number of objects that satisfy their search queries. SDSS queries typically return about 1,000,000 objects, and the space typically has about 5 dimensions. To be practically useful, the interactive filter needs to detect outliers in under a minute. This implies that the required algorithm is preferably linear in both the number of objects n and the number of dimensions k.

We present a solution to the above problem. Our method uses a k-d tree to partition the data set into groups such that all the objects in a group can be considered to behave similarly with respect to being outliers. We then identify those groups that contain outliers. Our algorithm also assigns to each object an outlierability value that gives a relative measure of how strong an outlier the object is. Section 3 gives more details on our method. Section 4 describes our experiments on real and synthetic data. We proceed by first looking at some existing methods for detecting outliers in the next section.
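Before turning to related work, here is a rough sketch in Python of the partition-then-score idea described above. It is not the authors' algorithm (Section 3 gives the actual method): median splits cycling through dimensions stand in for the k-d tree construction, and inverse leaf density stands in for the outlierability measure. Both are illustrative assumptions.

```python
import numpy as np

def kd_partition(points, max_leaf_size=32, depth=0):
    """Recursively split on the median of one coordinate, cycling
    through the k dimensions; return the list of leaf groups."""
    n, k = points.shape
    if n <= max_leaf_size:
        return [points]
    axis = depth % k
    median = np.median(points[:, axis])
    left = points[points[:, axis] <= median]
    right = points[points[:, axis] > median]
    if len(left) == 0 or len(right) == 0:  # degenerate split: stop here
        return [points]
    return (kd_partition(left, max_leaf_size, depth + 1) +
            kd_partition(right, max_leaf_size, depth + 1))

def rank_groups(groups):
    """Score each leaf by volume per point of its bounding box;
    sparser leaves are treated as more likely to hold outliers."""
    scored = []
    for g in groups:
        extent = np.ptp(g, axis=0) + 1e-12  # bounding-box side lengths
        scored.append((np.prod(extent) / len(g), g))
    return sorted(scored, key=lambda s: -s[0])

rng = np.random.default_rng(0)
data = rng.normal(size=(100_000, 5))  # n ~ 10^5 objects, k = 5 dimensions
ranked = rank_groups(kd_partition(data))
print("sparsest leaf has", len(ranked[0][1]), "objects")
```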

2 Related Work

Many algorithms in the data mining literature find outliers as a by-product of clustering [7, 8, 12, 17]. For these, outliers are objects that do not belong to any cluster. Note, however, that to detect outliers one does not need to know how the rest of the data is clustered. Thus such algorithms are inherently slower than those tuned to exclusively find outliers. The statistics community has studied outlier detection extensively [3, 9]. These methods assume an underlying probability model representing the data and find outliers based on discordancy tests. Most available discordancy tests, however, are for single-dimensional data.
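To make the last point concrete, the following is a minimal example of the kind of single-dimensional discordancy test the statistics literature provides: a z-score check in the style of Grubbs' test. The normality assumption and the threshold of 2.0 are illustrative choices here, not prescriptions from the cited work.

```python
import numpy as np

def discordant(values, threshold=2.0):
    """Flag values more than `threshold` standard deviations from the
    mean, assuming the inliers are roughly normally distributed."""
    values = np.asarray(values, dtype=float)
    z = np.abs(values - values.mean()) / values.std()
    return values[z > threshold]

# 27.5 sits about 2.2 standard deviations from this sample's mean
print(discordant([9.8, 10.1, 10.0, 9.9, 10.2, 27.5]))  # -> [27.5]
```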

In [10] Knorr and Ng define an object O in a data set as a distance-based outlier with respect to the parameters p and D if at least a p fraction of the objects lies at a distance greater than D from O (a naive reading of this definition is sketched at the end of this section). They present an algorithm to detect such outliers that, although linear in the number of objects, is exponential in the number of dimensions. In [14] Ramaswamy et al. present a similar definition of outliers based on the distance from the kth nearest neighbor. Both these definitions require the existence of a metric distance function in the entire space, something that is not always possible in our applications.

Breunig et al. [5] present a definition for outliers based on the densities of the local neighborhood. They too, however, require the definition of a distance function. Further, their algorithm is not linear in the number of dimensions. Aggarwal and Yu [1] present evolutionary algorithms suited for outlier detection in very high dimensions (more than 100). These are not suitable for our purpose, where the number of dimensions is relatively modest but the number of objects is much larger.

None of the algorithms above gives a definition for outliers that is general enough for our purpose. Such a definition is given in [2]. It is of use to us, and we consider it in some detail in the next subsection.
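Returning to the distance-based definition above: a direct, naive reading of DB(p, D) can be written as an O(n²) scan, shown below. This is for illustration only, it is not Knorr and Ng's actual algorithm, and the Euclidean metric and test data are assumptions.

```python
import numpy as np

def db_outliers(points, p, D):
    """Return indices of DB(p, D)-outliers under Euclidean distance:
    objects with at least a p fraction of the data farther than D."""
    n = len(points)
    outliers = []
    for i in range(n):
        # Distances from object i to every object (including itself)
        dist = np.linalg.norm(points - points[i], axis=1)
        if np.count_nonzero(dist > D) >= p * n:
            outliers.append(i)
    return outliers

rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(size=(500, 3)), [[15.0, 15.0, 15.0]]])
print(db_outliers(pts, p=0.95, D=5.0))  # -> [500], the injected point
```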

2.1 Outliers based on Smoothing Factors

In [2] Arning et al. define a measure for detecting outliers (they call them deviations) based on the degree to which a data element causes the "dissimilarity" of the data set to increase. They look for the subset of the data that leads to the greatest reduction in the Kolmogorov complexity for the amount of data discarded. The formal definition follows. Given:

– a set of items I (and thus its power set P(I))
– a dissimilarity function D : P(I) → ℝ≥0
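The formal definition is truncated at this point in the text. Going only by the idea stated above, the following is a minimal sketch under explicit assumptions: variance serves as the dissimilarity function D, and only single-element subsets are considered as removal candidates. Neither choice is claimed to be the exact scheme of [2].

```python
import numpy as np

def best_single_deviation(values):
    """Among single-element subsets, return (gain, index) for the
    element whose removal most reduces the dissimilarity (here:
    variance). An illustrative stand-in, not the definition from [2]."""
    values = np.asarray(values, dtype=float)
    base = values.var()
    gains = []
    for i in range(len(values)):
        rest = np.delete(values, i)
        gains.append((base - rest.var(), i))  # reduction in dissimilarity
    return max(gains)

print(best_single_deviation([2.0, 2.1, 1.9, 2.0, 9.0]))  # index 4 wins
```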