Robustness of Change Detection Algorithms

Tamraparni Dasu¹, Shankar Krishnan¹, and Gina Maria Pomann²

¹ AT&T Labs - Research
² North Carolina State University

Abstract. Stream mining is a challenging problem that has attracted considerable attention in the last decade. As a result, there are numerous algorithms for mining data streams, from summarizing and analyzing to change and anomaly detection. However, most research focuses on proposing, adapting or improving algorithms and studying their computational performance. For a practitioner of stream mining, there is very little guidance on choosing a technology suited for a particular task or application. In this paper, we address the practical aspect of choosing a suitable algorithm by drawing on the statistical properties of power and robustness. For the purpose of illustration, we focus on change detection algorithms (CDAs). We define an objective performance measure, streaming power, and use it to explore the robustness of three different algorithms. The measure is comparable across disparate algorithms, and provides a common framework for comparing and evaluating change detection algorithms on any data set in a meaningful fashion. We demonstrate our approach on real-world applications and on synthetic data. In addition, we present a repository of data streams for the community to use in testing change detection algorithms for streaming data.

1 Introduction

Data streams are increasingly prevalent in the world around us: in scientific, financial and industrial domains, in entertainment and communication, and in corporate endeavors. There is a mind-boggling variety of algorithms for summarizing data streams, for detecting changes and anomalies, and for clustering. However, the focus is typically on proposing, adapting, or improving a stream mining algorithm and comparing it to existing benchmark algorithms. Very little is understood about the behavior of the actual algorithm itself. Sometimes, the suitability of the algorithm is more critical than an incrementally improved efficiency or accuracy. In this paper, we outline a framework for addressing the behavioral properties of algorithms that will help practitioners choose the one most suited for their task. We focus on change detection algorithms for the purpose of illustration and propose a rigorous basis for understanding their behavior.

Change detection algorithms (CDAs) are applied widely and are of interest in many domains. The ability of an algorithm to detect change varies by type of change, and depends on the underlying decision-making methodology. To understand this, consider the following three CDAs that will be utilized throughout the paper. The first one is the rank-based method of Kifer et al. [11], which we call Rank. Their paper defines three tests, based on the Kolmogorov-Smirnov (KS) statistic, the φ-statistic (Phi), and the Ξ-statistic (Xi); we consider all three. It is intended for data in one dimension and uses statistical tests that are based on rank ordering the data. The second method, proposed by Song et al. [13] and which we call Density, uses kernel density estimation; the resulting test statistic has an asymptotic Gaussian distribution. The third one is the information-theoretic algorithm of Dasu et al. [5], which we call KL; it relies on bootstrapping to estimate the distribution of the Kullback-Leibler distance between two histograms. Given their radically different approaches to comparing distributions, each algorithm has its strengths and weaknesses, and a suitable choice depends on the task at hand. In this paper, we explore CDAs in the context of statistical power and robustness. Power measures the ability of a CDA to detect a change in distribution, while robustness refers to the effect of small perturbations in the data on the outcome of the algorithm.

1.1 Statistical Power

The power of a statistical test [4] is a fundamental concept in statistical hypothesis testing. It measures the ability of a test to distinguish between two hypotheses H0 and H1. A test statistic Tn(X) computed from an observed data sample X = {X1, X2, . . . , Xn} is used to decide whether a hypothesis H0 is likely to be true. Let C0, C1 be non-overlapping regions in the domain of Tn(X) such that the hypothesis is declared to be true if Tn(X) ∈ C0, and not true if Tn(X) ∈ C1. C1 is called the critical region, and is specified to meet the criterion

$$P(T(X) \in C_1 \mid H_0) = \alpha, \qquad (1)$$

where α is the Type I error (false positive) probability. For a specified hypothesis H1 and given Type I error rate α,

Definition 1. The power of the test statistic T(X) against the specified alternative H1 is defined to be

$$P_T = P(T(X) \in C_1 \mid H_1). \qquad (2)$$

The power in Equation 2 is estimated using Tn(X), the sample estimate of T(X) based on the data. Figure 1(a) illustrates the power of the test T(X) in distinguishing between the null hypothesis H0 and alternatives H1 and H2. The red curve represents the sampling distribution of the test statistic T(X) when the null hypothesis H0 is true. The vertical line delineates the critical region C1. The power against each alternative hypothesis is given by the area under the corresponding curve in the region C1, as per Equation 2. The area under H0 in C1 is the Type I error α. When the alternative hypotheses (e.g. H1 or H2) depart significantly from the null hypothesis H0, the sampling distribution of T(X) under H0 differs significantly from that under H1 or H2. The power increases, reflecting an increasing ability of the test statistic T(X) to distinguish between H0 and the alternative distributions.



Fig. 1. (a) Static power: Given null hypothesis H0 , power against alternative hypotheses H1 , H2 . (b) Streaming power computation.
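As a concrete textbook illustration of Definition 1 (not taken from this paper), consider Gaussian data with known variance σ², testing H0: μ = 0 against the alternative H1: μ = μ1 > 0 using the standardized mean Tn(X) = √n X̄/σ and critical region C1 = {Tn > z_{1−α}}. The power then has the closed form

$$P_T = P\left(T_n(X) > z_{1-\alpha} \mid H_1\right) = \Phi\!\left(\frac{\mu_1 \sqrt{n}}{\sigma} - z_{1-\alpha}\right),$$

which increases toward 1 as μ1 or n grows, mirroring the behavior sketched in Figure 1(a).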

In Figure 2, we plot the power of the three variants of the Rank test when the reference distribution is the standard Gaussian N(0, 1) and the alternative ranges over the two-parameter Gaussian family. We call the resulting plot the power surface. Blue and green represent regions of low power, while red and brown represent high power. As the true distribution H1 departs from H0, with the mean and variance deviating from the hypothesized values of 0 and 1 respectively, the algorithm is able to detect the change in distributions with greater probability, or power. The surface is obtained empirically; we defer a discussion of the estimation of power to Section 3. Note that the three variants of the Rank CDA have very different power behavior. Rank-KS, based on the Kolmogorov-Smirnov (KS) statistic, shows a very gradual increase in power as the distributions differ, and is less responsive to the change in variance than the other two variants. This is because the KS statistic is based just on the maximum distance between the two cumulative distribution functions, which is influenced more by a location shift (mean change) than by a scale change (different variance). The second statistical concept that is important in determining the utility of an algorithm is robustness.
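As a rough indication of how such a power surface can be computed, the following sketch estimates the power of the two-sample Kolmogorov-Smirnov test (used here as a stand-in for the Rank-KS statistic, not the authors' implementation) over a grid of Gaussian alternatives; the window size, grid, replication count and significance level are illustrative assumptions.

```python
import numpy as np
from scipy.stats import ks_2samp

def empirical_power(mu, sigma, n=500, reps=200, alpha=0.05, rng=None):
    """Fraction of replications in which the two-sample KS test rejects when
    comparing a window from N(0, 1) against a window from N(mu, sigma^2)."""
    rng = rng or np.random.default_rng(0)
    rejections = 0
    for _ in range(reps):
        ref = rng.normal(0.0, 1.0, size=n)        # reference window under H0
        test = rng.normal(mu, sigma, size=n)      # test window from the alternative
        if ks_2samp(ref, test).pvalue < alpha:    # "change" declared
            rejections += 1
    return rejections / reps

# Power surface over a grid of (mean, standard deviation) alternatives.
means = np.linspace(0.0, 0.5, 6)
sds = np.linspace(1.0, 1.5, 6)
surface = np.array([[empirical_power(m, s) for s in sds] for m in means])
```

Plotting `surface` against `means` and `sds` should qualitatively resemble Figure 2(a): power grows gradually and responds more to the mean shift than to the variance change.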

1.2 Robustness

In order to study the stability or robustness of an algorithm to small perturbations in the data, we borrow the notion of the influence function (IF) from robust statistics. Let x be a data point from the sample space χ, and let F0 be the distribution from which the sample is assumed to be drawn. Further, let ∆x be the delta distribution that concentrates all probability mass at x. The IF measures the rate of change of T corresponding to an infinitesimal contamination of the data.



Fig. 2. Decision surface of the Rank algorithms while comparing a standard Gaussian N(0, 1) to the two-parameter family of Gaussian distributions as the mean and standard deviation depart from 0 and 1 respectively. Blue and green represent regions of low power, while red and brown represent high power. (a) Rank-KS, (b) Rank-Xi, and (c) Rank-Phi. Note that the contours are quite spread out, indicating a slow increase in power.

Definition 2. The influence function of a test statistic T is given by

$$IF(x, T, F_0) = \lim_{\epsilon \to 0} \frac{T(\epsilon \Delta_x + (1 - \epsilon) F_0) - T(F_0)}{\epsilon}. \qquad (3)$$

For a statistic T to be robust, the IF should have the following properties:
– Small gross-error sensitivity: The IF should be bounded and preferably small; otherwise a small contamination in the sample can lead to large changes and unpredictable behavior of the statistic. That is, $\max_x |IF(x, T, F_0)|$ should be small.
– Finite rejection point: Beyond a certain point, outliers should have no effect on the statistic T: $IF(x, T, F_0) = 0, \ \forall x : |x| > r$, for some reasonable r > 0.
– Small local-shift error: No neighborhood of any specific value of x should lead to large values of the influence function, because this would result in unexpected behavior in specific neighborhoods. That is, $\max_{x \neq y} \frac{|IF(y, T, F_0) - IF(x, T, F_0)|}{|y - x|}$ should be small.
A detailed discussion is outside the scope of this paper; see [8] for further reference. We use the notion of robustness of streaming power to propose a framework for exploring, evaluating and choosing a CDA.
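As a standard illustration of these criteria from robust statistics (a textbook example, not specific to CDAs), the influence functions of the sample mean and the sample median at a distribution F with mean μ, median m, and density f are

$$IF(x, \bar{X}, F) = x - \mu, \qquad IF(x, \mathrm{med}, F) = \frac{\operatorname{sign}(x - m)}{2 f(m)}.$$

The mean's IF is unbounded, so its gross-error sensitivity is infinite, while the median's IF is bounded by 1/(2f(m)); this is exactly why the median is far more robust to outliers than the mean.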

2 Related Work

Change Detection Schemes: A variety of change detection schemes have been studied in the past, examining static datasets with specific structure [3] and time series data [2, 10], and detecting “burstiness” in data [12, 14]. The definition of change has typically involved fitting a model to the data and determining when the test data deviates from the built model [9, 7]. Other work has used statistical outliers [14] and likelihood models [12].


The paper by Ganti et al. [7] uses a family of decision tree models to model data, and defines change in terms of the distance between model parameters that encode both topological and quantitative characteristics of the decision trees. They use bootstrapping to determine the statistical significance of their results.

Kifer et al. [11] lay out a comprehensive nonparametric framework for change detection in streams. They exploit order statistics of the data, and define generalizations of the Wilcoxon and Kolmogorov-Smirnov (KS) tests to define a distance between two distributions. They also define two other test statistics based on relativized discrepancy using the difference in sample proportions. The first test, φ, is standardized to be faithful to the Chernoff bound, while the second test, Ξ, is standardized to give more weight to the cases where the proportion is close to 0.5. Their method is effective on one-dimensional data streams; however, as they point out, their approach does not trivially extend to higher dimensions. Aggarwal [1] considers the change detection problem in higher dimensions based on kernel methods; however, his focus is on detecting the “trends” of the data movement, and his method has a much higher computational cost.

Given a baseline data set and a set of newly observed data, Song, Wu, Jermaine and Ranka [13] define a test statistic called the density test, based on kernel estimation, to decide if the observed data is sampled from the baseline distribution. The baseline distribution is estimated using a novel application of the Expectation Maximization (EM) algorithm. Their test statistic is based on the difference in log probability density between the reference and test dataset, which they show to exhibit a limiting normal distribution.

Dasu et al. [5] present an algorithm for detecting distributional changes in multi-dimensional data streams using an information-theoretic approach. They use a space partitioning scheme called the kdq-tree to construct multi-dimensional histograms, which approximate the underlying distribution. The Kullback-Leibler (KL) distance between reference and test histograms, along with bootstrapping [6], is used to develop a method for deciding distributional shifts in the data stream.
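To make the flavor of the KL approach concrete, here is a deliberately simplified one-dimensional sketch: it uses a fixed equi-width histogram, Laplace-style smoothing, and a pooled bootstrap for the null distribution. The actual algorithm of Dasu et al. [5] builds kdq-tree multi-dimensional histograms and uses its own bootstrap scheme, so treat this purely as an illustration of the idea.

```python
import numpy as np

def kl_divergence(p_counts, q_counts, smoothing=0.5):
    """KL distance between two histograms given as bin counts; the smoothing
    constant (an illustrative choice) avoids zero-probability bins."""
    p = (p_counts + smoothing) / (p_counts + smoothing).sum()
    q = (q_counts + smoothing) / (q_counts + smoothing).sum()
    return float(np.sum(p * np.log(p / q)))

def kl_change_detected(reference, test, bins=20, n_boot=500, alpha=0.01, rng=None):
    """Declare a change if the observed KL distance exceeds the (1 - alpha)
    quantile of a bootstrap null distribution from the pooled windows."""
    rng = rng or np.random.default_rng(0)
    edges = np.histogram_bin_edges(np.concatenate([reference, test]), bins=bins)
    observed = kl_divergence(np.histogram(reference, edges)[0],
                             np.histogram(test, edges)[0])
    pooled = np.concatenate([reference, test])
    null = []
    for _ in range(n_boot):
        resample = rng.choice(pooled, size=pooled.size, replace=True)
        a, b = resample[:reference.size], resample[reference.size:]
        null.append(kl_divergence(np.histogram(a, edges)[0],
                                  np.histogram(b, edges)[0]))
    return observed > np.quantile(null, 1 - alpha)
```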

3 Our Framework

In this section, we introduce a framework for evaluating CDAs using a mixture model that naturally captures the behavior of a data stream. Consider a multi-dimensional data stream where a data point x = (x1, x2, . . . , xd) consists of d attributes, categorical or continuous. Assume that the change detection algorithm uses the sliding window framework, where a window refers to a contiguous segment of the data stream containing a specified number of data points n. The data stream distribution in each window Wi corresponds to some Fi ∈ F, the distribution space of the data stream. The data distribution Fi in Wi is compared to the data distribution F0 in a reference window W0, each window of width n. Suppose that the data stream's distribution changes over time from F0 to F1. Define the distribution to be tested, Fδ, as

$$F_\delta = \delta F_1 + (1 - \delta) F_0. \qquad (4)$$
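A minimal sketch of this mixture model, assuming one-dimensional Gaussian F0 and F1 and an illustrative drift schedule for δ, is the following; it can be used to generate the kind of drifting window sequence depicted in Figure 1(b).

```python
import numpy as np

def mixture_window(delta, n, sample_f0, sample_f1, rng):
    """Draw one window of n points from F_delta = delta*F1 + (1 - delta)*F0."""
    from_f1 = rng.random(n) < delta                        # point-wise choice of source
    return np.where(from_f1, sample_f1(n, rng), sample_f0(n, rng))

# Example: the stream drifts from N(0, 1) to N(0.5, 1) over 20 windows.
rng = np.random.default_rng(0)
sample_f0 = lambda n, r: r.normal(0.0, 1.0, n)
sample_f1 = lambda n, r: r.normal(0.5, 1.0, n)
deltas = np.clip(np.linspace(-0.5, 1.5, 20), 0.0, 1.0)    # stable, then ramp, then stable
windows = [mixture_window(d, 500, sample_f0, sample_f1, rng) for d in deltas]
```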


This is a natural model for the way change occurs in the sliding window framework, as shown in the example explained in Figure 1(b). The stream's initial reference distribution is F0 (lighter shade), contained in windows W0 and W1. As the stream advances, its distribution starts changing to F1 (darker shade). Window Wi starts encountering the new distribution and contains a mixture of F0 and F1. The mixing proportion changes with time, with the contaminating distribution F1 becoming dominant by window Wk, culminating in δ = 1, when all the data points in the window are from F1. Once the algorithm detects the change, the reference window is reset to the current window.

To compute power, we sample with replacement from each of the windows in the test pair (W0, Wi) and generate B sample test pairs $\{(W_0', W_i')_j\}_{j=1}^{B}$. We run the algorithm on each of the B pairs of windows and gather the set of B binary change outcomes $\{I_j\}_{j=1}^{B}$. This set consists of B i.i.d. Bernoulli trials with probability pi (the subscript refers to the test window Wi) that Ij = 1. Since Ij = 1 if and only if the algorithm detects change, pi represents the probability that the algorithm will detect a change between F0 and Fi, and is the streaming power of the algorithm A. We will define it formally in Section 3.1. We estimate pi using the B replications by computing the proportion of “Change” responses:

$$\hat{p}_i = \frac{1}{B} \sum_{j=1}^{B} I_j. \qquad (5)$$

A high proportion of change responses represents a high ability to discriminate between the data distributions in the two windows of the test pair (W0, Wi), i.e., high streaming power.

A graph of p̂i, the proportion of change responses in B Bernoulli trials, is shown as a blue curve at the bottom of Figure 1(b). When the data in the two windows W0 and Wi of the test pair come from the same distribution F0, we expect the streaming power to be low, shown by the first downward arrow in Figure 1(b). When the data stream distribution starts changing from F0 to F1, the windows reflect a mixture of the old and new distributions, as seen in Wj. The comparison between W0 and Wj should yield a higher proportion of “Change” responses in the B resampled test pairs, i.e., a higher streaming power. This is reflected in the high value of the blue curve, as shown by the second downward arrow in Figure 1(b). When the power is consistently high, change is declared and the reference window is reset from W0 to Wk. Finally, since the windows Wk and Wi share the same distribution F1, the algorithm adjusts to the new distribution F1 and returns to a lower streaming power (third downward arrow).
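A minimal sketch of this resampling procedure for p̂i in Equation 5 is shown below; the `detector` callable stands in for any CDA's binary decision, and the KS-based example rule is an illustrative assumption rather than one of the three algorithms studied in the paper.

```python
import numpy as np
from scipy.stats import ks_2samp

def streaming_power(ref_window, test_window, detector, B=200, rng=None):
    """Estimate p_i: the proportion of B bootstrap-resampled window pairs on
    which the change detector declares a change."""
    rng = rng or np.random.default_rng(0)
    detections = 0
    for _ in range(B):
        ref_boot = rng.choice(ref_window, size=ref_window.size, replace=True)
        test_boot = rng.choice(test_window, size=test_window.size, replace=True)
        detections += int(detector(ref_boot, test_boot))
    return detections / B

# Example detector: a two-sample KS rule at level alpha = 0.01.
ks_detector = lambda ref, test: ks_2samp(ref, test).pvalue < 0.01
```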

3.1 Streaming Power

As discussed in the preceding section, streaming power measures the ability of an algorithm to discriminate between two distributions. Formally,

Definition 3. The streaming power of an algorithm A at time t is defined to be the probability that the algorithm will detect a change with respect to a reference distribution F0, given that there is a change in the data distribution of the stream,

$$S_A^{F_0}(F_t) = P(I_A(t) = 1 \mid F_t), \quad F_t \in \mathcal{F}, \qquad (6)$$

where IA(t) is the indicator function that is equal to 1 if the CDA detects a change and 0 otherwise. For convenience, we drop F0 from our notation for streaming power and denote it as SA(Ft). From Equation 2, it is clear that IA(t) = 1 ⇔ T(X) ∈ C1, where T(X) and C1 are the decision function and critical region (with respect to F0) used by the algorithm to determine the binary response IA(t). Therefore, streaming power can be thought of as a temporal version of power.
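Putting the pieces together, the following self-contained sketch tracks streaming power over a sequence of windows and resets the reference window when a change is declared. It simplifies the scheme described above in two ways that should be kept in mind: the decision rule IA(t) is a plain two-sample KS test rather than one of the three CDAs, and the reference is reset on a single high-power reading instead of waiting for the power to stay consistently high.

```python
import numpy as np
from scipy.stats import ks_2samp

def monitor_stream(windows, alpha=0.01, B=200, power_threshold=0.9, rng=None):
    """Return a list of (t, estimated streaming power) for each window."""
    rng = rng or np.random.default_rng(0)
    indicator = lambda ref, test: int(ks_2samp(ref, test).pvalue < alpha)
    reference, history = windows[0], []
    for t, window in enumerate(windows[1:], start=1):
        hits = sum(indicator(rng.choice(reference, reference.size),
                             rng.choice(window, window.size))
                   for _ in range(B))
        power = hits / B                       # estimate of S_A(F_t)
        history.append((t, power))
        if power >= power_threshold:           # change declared: reset reference
            reference = window
    return history
```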


Fig. 3. (a) Power curves of Rank (3 variants, dotted lines), KL (solid red) and Density (solid blue) tests for the two-parameter 1D Gamma distribution. The X-axis represents the mixing proportion of F0 (Γ(0.5, 0.5)), the reference distribution, and F1 (Γ(0.5, 0.6)), the contaminating distribution. (b) Corresponding sensitivity curves. (c) Power curves of KL (solid red) and Density (solid blue) tests where F0 = N^3(0, 1) (the 3D standard Gaussian) is the reference distribution and F1 = (N^2(0, 1), N(0.2, 1)) the contaminating distribution. (d) Corresponding sensitivity curves.

3.2 Robustness of CDAs

In order to study the robustness of a CDA, we define a function that is analogous to the influence function from Section 1.2.

Definition 4. The rate curve of an algorithm A as a function of δ (the mixing proportion parameter defined in Section 3) is the first derivative of the power curve, denoted by

$$\mu_A(\delta) = \lim_{\epsilon \to 0} \frac{S_A(\delta + \epsilon) - S_A(\delta)}{\epsilon}. \qquad (7)$$

Note that the rate curve is analogous to the influence function from Section 1.2, and measures the rate of change of streaming power when the hypothesized distribution F0 has an infinitesimal contamination from a δ-mixture of F0 and Ft. A CDA should detect change with rapidly increasing power in some region [δ1, δ2] of the mixing proportion δ, and then taper off to become constant. In order to be stable to outliers, δ1 should exceed α, the significance level of the test, which represents the acceptable proportion of false positives. Moreover, δ2 should be considerably less than 1, so that the CDA detects the change with high probability before there is too much contamination. Beyond δ2, the power should be constant and the rate curve should have value 0, analogous to the finite rejection point criterion from Section 1.2.

In addition, to measure how the streaming power of an algorithm increases in relation to its distance from the reference distribution (as measured by the mixing proportion δ), we define the sensitivity curve ηA(δ) = µA(δ)/δ, and further define:

Definition 5. The sensitivity of an algorithm A is

$$\eta_A^* = \max_{\delta} \eta_A(\delta). \qquad (8)$$

Sensitivity is akin to local-shift error from Section 1.2. In the following section, we explore and evaluate the three CDAs using these concepts.
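Given a streaming power curve estimated over a grid of mixing proportions (for example, with a procedure like the streaming_power sketch given after Equation 5), the rate curve and sensitivity of Equations 7 and 8 can be approximated by finite differences; the following is a minimal sketch with an illustrative grid.

```python
import numpy as np

def rate_and_sensitivity(deltas, power):
    """Approximate the rate curve mu_A(delta) by numerical differentiation of
    an estimated power curve, and return it together with the sensitivity
    eta* = max_delta mu_A(delta)/delta."""
    deltas = np.asarray(deltas, dtype=float)
    power = np.asarray(power, dtype=float)
    mu = np.gradient(power, deltas)                                  # rate curve
    eta = np.divide(mu, deltas, out=np.zeros_like(mu), where=deltas > 0)
    return mu, float(eta.max())
```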

4 Investigating the CDAs

In this section, we use experiments on real and simulated data to investigate the three CDAs, Rank, Density and KL. The rank-ordering-based algorithms of [11] are applicable mainly in the 1-D setting and hence are included only in the 1-D experiments. In the multi-dimensional setting, only the Density [13] and KL [5] algorithms are compared. We use the mixture model introduced in Section 3 to perform the study. We describe the results of applying the algorithms to real and “hybrid” data; the data sets are described in detail in Section 4.5. All three algorithms are run with the same α, window size and number of power replications.

Hybrid streams are created by injecting the data segment with changes, alternating a clean baseline segment (no change) with a changed segment, e.g. “Clean-Change1-Clean-Change2-Clean-Change3” and so on. The clean baseline segment is inserted between every change to ensure that each change is always with respect to the original baseline. Changes of a given type, e.g. level shifts, occur consecutively with increasing severity, which enables us to track the extent of change at which an algorithm can detect the given type of change. For example, an algorithm might detect a level shift only if it exceeds, say, 2% of the current mean.

4.1 Experiments

We use univariate and multivariate distributions to investigate the behavior of the CDAs.

One-dimensional Gamma: Figures 3 (a) and (b) show the power and sensitivity curves for the Rank (KS, Phi, Xi), Density and KL algorithms in 1D for the two-parameter family of gamma distributions. F0 is Γ(0.5, 0.5), while F1 is Γ(0.6, 0.5). The X-axis shows δ, the mixing proportion of F0 and F1, as it ranges from 0 to 1. The Y-axis shows the streaming power and sensitivity of the algorithms, respectively, computed using the approach described in Section 3. The number of bootstrap samples is 1000, α = 0.01, and the number of test pair replications is 200. As δ increases, the power of all the algorithms increases. Rank and its variants have low power and do not show the ability to discriminate until δ exceeds 0.6.


The rate curve never hits zero, so the finite rejection point criterion is not satisfied. This is true of Density and KL too, but to a much lesser extent. The KL and Density tests have higher sensitivity, 4.08 and 4.16 respectively, than the Rank algorithms (2.99, 2.87, 3.3), as seen from the maxima of the sensitivity curves, and are able to detect the change at a lower level of contamination.

Multidimensional Gaussian: Figures 3 (c) and (d) show the power and sensitivity, respectively, of the Density and KL algorithms for a 3D family of standard Gaussian distributions. Rank and its variants are not applicable in higher dimensions. Here, F0 is N^3(0, 1), while F1 is (N^2(0, 1), N(0.2, 1)); note that the change is only in a single dimension. The Density algorithm shows a much higher sensitivity (20.36) than the KL test (2.46). This is surprising, since Density and KL are almost identical in one dimension. Since the change in this example is restricted to one dimension, we expect their behavior to be similar. Therefore, we conducted experiments specifically to test the effect of dimensionality on the CDAs.

4.2 Effect of Dimensionality

In order to investigate the effect of dimensionality on change detection, we designed the following experiment. We created a controlled data set from the 5D IP Data and confined the change to just one dimension. We ran the Density and KL algorithms on 1D, 2D, 3D, 4D and 5D versions of the data, where in each case only the first dimension had changes. The results of our experiment are shown in Figure 4 (b). The top gray portion of the figure shows the raw attributes in the dataset, with changes present only in one dimension. The middle portion of the figure shows the power curves for the KL, Density and Rank tests, while the bottom part shows the change points for the various algorithms. Rank (and its variants) could be applied only in 1D. The blue dots, corresponding to change detection in 1D, are the real changes detected by the algorithm.

KL detects fewer and fewer changes as the number of dimensions increases and the power degrades. This is to be expected, since the kdq-tree based multidimensional histograms accommodate the additional dimensions while losing accuracy in the marginal distribution of the first dimension, which contains the change. However, the reverse is true for Density. The number of changes detected increases with dimension (black, cyan, brown) even though there are no changes in the additional dimensions, and conversely, even when the power remains high in 5D (magenta), no changes are declared. A similar plateau-like behavior is exhibited by Density in Figure 4 (a). This behavior is surprising and inconsistent, since with such a high value of power we would expect many more changes to be detected. One possible explanation is that Density is very sensitive to sampling variation, and the resampling done for the power computation results in small perturbations in the data that are picked up by the Density algorithm as a change, even when there is no change.

4.3 Real World Applications

Figure 4 (a) illustrates the results of testing KL and Density on a real data stream. The individual attributes are shown in gray.


Fig. 4. (a) Weather Data-6D: Real data stream with bursty changes resulting in successive change points. Change points that seem to be related to the same big change are shown in gray. Density finds many changes but also exhibits the characteristic “table” shape of sustained high power while detecting no changes. KL finds frequent changes during periods of turbulence. (b) Power behavior of the three algorithms as the number of dimensions increases, where only the first dimension of the 5D data has change (hybrid data stream). Blue=1D, Brown=2D, Cyan=3D, Black=4D, Magenta=5D. As additional dimensions are added, no new changes should be detected, and power should either decrease or remain unchanged. Density detects additional changes as dimensions are added, and its power increases even though there is no change in the additional dimensions. KL detects more or less the same changes, and its power gets diluted as expected. Rank(Xi) is applicable only in 1-D, and detects most of the level and scale shifts but none of the mass shifts. This is true of the other Rank variants as well.

Change points, as detected by a single application of the algorithm to the data stream, are denoted by dots: blue dots for changes detected by Density and red dots for changes detected by KL. Clustered changes that occur in succession are shown in gray. The power curve is computed by resampling and bootstrapping as described in Section 3.

Weather Phenomena: The data stream in Figure 4 (a) is a publicly available 6D weather data stream, described in Section 4.5. It is characterized by spikes, as can be seen in the plot of the raw attributes shown in gray. The power curve of Density, shown in blue, remains high almost all the time, sometimes even while no change is detected by the original algorithm, as evidenced by the absence of a dot. KL detects multiple changes in regions where the stream is undergoing a shift, and the change points are consistent with periods of high power.

When the data stream is undergoing constant change, change detection algorithms generate continuous alarms. This behavior can be seen in Figure 4 (a), where changes are declared continuously during a turbulent phase of the data stream. In order to suppress multiple alarms related to the same root cause, we ignore change points that are within a window length of each other. The initial change point at the beginning of a turbulent period is colored, but the rest are shown in gray.
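A minimal sketch of this alarm-suppression rule, assuming change points are given as time indices (the exact bookkeeping used in the experiments may differ):

```python
def suppress_repeated_alarms(change_points, window_length):
    """Keep only the first alarm in each burst: any change point within one
    window length of the previously kept alarm is attributed to the same
    root cause and suppressed."""
    kept = []
    for t in sorted(change_points):
        if not kept or t - kept[-1] > window_length:
            kept.append(t)
    return kept

# suppress_repeated_alarms([100, 105, 130, 400], 50) -> [100, 400]
```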

4.4 Findings

We describe below our findings from running extensive experiments, only a few of which are described in this paper due to space constraints. Rank offers a low Type I error but low power, and is applicable only in one dimension. Density is not robust in higher dimensions, demonstrating a spuriously high value of power that presumably corresponds to sampling variation, as well as a high local-shift (sensitivity) error and gross-error sensitivity. KL is robust, but its rejection point exceeds δ = 1 in some of our experiments, indicating that there is a small probability that it might miss even egregious changes. The practitioner needs to decide which of these properties is most essential for the task at hand.

4.5 Data DNA: Building Blocks for Test Data Streams

We describe in this section the different “real” and “hybrid” data streams used in our experiments. Most of the data described below (unless noted otherwise) will be made publicly available upon publication of this paper, along with the data description files.

Real Data Segments:
(A) File Descriptor Data (length=168, dimensions=3): File characteristics of three different types of file streams that are generated automatically during the logging of telephone calls. There are interesting correlations that make the data challenging.
(B) IP Data (length=288, dimensions=5): Resource usage from a single network element with a high degree of variability and some periodicity. Figure 4 shows the hybrid data stream built from this segment using controlled changes.
(C) E-commerce Data (length=2016, dimensions=6): Aggregated resource usage for a server farm, with interesting outliers.
(D) Weather Data (length=variable, dimensions=6): A publicly available dataset, available at http://www.ngdc.noaa.gov/stp/GOES/goes.html (see Figure 4).
(E) Data stream with quality issues (length=8632, dimensions=10): This data stream segment is riddled with quality issues. It is perfect for testing an algorithm's ability to deal with missing data, duplicates, and outliers.

Controlled Data Segments: We provide five data sets of four-dimensional video stream segments. Each segment has approximately 3000 data points.
(A-D) Plain background: Video sequences with a plain background, with one, two and three people walking in front of the camera.
(E) Object data: Contains change points at known times corresponding to a person walking through the scene and leaving objects, such as chairs, in the scene.

Types of Change: It is useful to introduce a variety of changes to understand the behavior of algorithms. We list below a few broad categories of changes, noting that a systematically increasing fraction of the data can be subjected to these changes to determine the threshold at which an algorithm can detect the change (a sketch of such change injection follows this list).
(1) Level shift: A data point is moved from xi → xi + a, where a is some constant.
(2) Scale change: xi → bxi, where b is some constant.
(3) Outliers: A fraction of the data points are relocated to the tails of the distribution.
The changes can be injected in one or more dimensions, independently or with complex interrelationships among the changes in each dimension.
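The sketch below illustrates how such controlled changes can be injected into a one-dimensional segment and assembled into a hybrid stream of the Clean-Change form described at the start of Section 4; the parameter names, severities, and the outlier mechanism are illustrative assumptions, not the exact recipe used to build the repository.

```python
import numpy as np

def inject_change(segment, kind, severity, fraction=1.0, rng=None):
    """Inject a controlled change into a 1-D segment: 'level' (x -> x + a),
    'scale' (x -> b*x), or 'outlier' (relocate a fraction of points to the
    tails of the distribution)."""
    rng = rng or np.random.default_rng(0)
    out = np.array(segment, dtype=float)
    idx = rng.choice(out.size, size=int(fraction * out.size), replace=False)
    if kind == "level":
        out[idx] += severity
    elif kind == "scale":
        out[idx] *= severity
    elif kind == "outlier":
        signs = np.sign(rng.standard_normal(idx.size))
        out[idx] = out.mean() + signs * severity * out.std()
    return out

# Hybrid stream: clean baseline alternating with increasingly severe level shifts.
rng = np.random.default_rng(1)
clean = lambda: rng.normal(0.0, 1.0, 500)
stream = np.concatenate([seg for a in (0.5, 1.0, 2.0)
                         for seg in (clean(), inject_change(clean(), "level", a, rng=rng))])
```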


5 Conclusion

We have proposed a framework that uses the novel concept of streaming power in conjunction with robustness to systematically evaluate and compare the behavior of CDAs. Within this framework, we evaluated three change detection algorithms [11, 13, 5] by defining a rate curve and exploring its properties analogous to finite rejection point, gross-error sensitivity and local-shift error of the influence function in robust statistics. In addition, we have provided a mechanism for constructing data streams, along with a valuable repository of test streams.

References

1. C. C. Aggarwal. A framework for diagnosing changes in evolving data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 575–586, 2003.
2. S. Chakrabarti, S. Sarawagi, and B. Dom. Mining surprising patterns using temporal description length. In Proceedings of the 24th International Conference on Very Large Databases, pages 606–617, 1998.
3. S. S. Chawathe and H. Garcia-Molina. Meaningful change detection in structured data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 26–37, 1997.
4. D. R. Cox and D. V. Hinkley. Theoretical Statistics. Wiley, New York, 1974.
5. T. Dasu, S. Krishnan, D. Lin, S. Venkatasubramanian, and K. Yi. Change (detection) you can believe in: Finding distributional shifts in data streams. In IDA, pages 21–34, 2009.
6. B. Efron and R. J. Tibshirani. An Introduction to the Bootstrap. Chapman and Hall, 1993.
7. V. Ganti, J. Gehrke, R. Ramakrishnan, and W.-Y. Loh. A framework for measuring differences in data characteristics. pages 126–137, 1999.
8. P. J. Huber. Robust Statistics. John Wiley, New York, 1981.
9. G. Hulten, L. Spencer, and P. Domingos. Mining time-changing data streams. In KDD, pages 97–106, 2001.
10. E. Keogh, S. Lonardi, and B. Y. Chiu. Finding surprising patterns in a time series database in linear time and space. In KDD, pages 550–556, 2002.
11. D. Kifer, S. Ben-David, and J. Gehrke. Detecting changes in data streams. In Proceedings of the 30th International Conference on Very Large Databases, pages 180–191, 2004.
12. J. Kleinberg. Bursty and hierarchical structure in streams. Data Mining and Knowledge Discovery, 7(4):373–397, 2003.
13. X. Song, M. Wu, C. Jermaine, and S. Ranka. Statistical change detection for multi-dimensional data. In ACM SIGKDD '07, pages 667–676, 2007.
14. Y. Zhu and D. Shasha. Efficient elastic burst detection in data streams. In Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 336–345, 2003.