Correlating burst events on streaming stock market data

Data Min Knowl Disc (2008) 16:109–133 DOI 10.1007/s10618-007-0066-x

Michail Vlachos · Kun-Lung Wu · Shyh-Kwei Chen · Philip S. Yu

Received: 17 May 2006 / Accepted: 24 January 2007 / Published online: 9 March 2007 Springer Science+Business Media, LLC 2007

Abstract We address the problem of monitoring and identifying correlated burst patterns in multi-stream time series databases. We follow a two-step methodology: first we identify the burst sections in our data, and subsequently we store them for easy retrieval in an efficient in-memory index. The burst detection scheme imposes a variable threshold on the examined data and takes advantage of the skewed distribution that is typically encountered in many applications. The detected bursts are compacted into burst intervals and stored in an interval index. The index facilitates the identification of correlated bursts by performing very efficient overlap operations on the stored burst regions. We present the merits of the proposed indexing scheme through a thorough analysis of its complexity. We also demonstrate the real-time response of our burst indexing technique, and show the usefulness of the approach for correlating surprising volume trading events using historical stock data of the NY stock exchange. While the focus of this work is on financial data, the proposed methods and data structures can find applications for anomaly or novelty detection in telecommunication, network traffic and medical data.

Keywords Time-series · Indexing · Burst detection · Correlation

Responsible editor: Chang-shing Perng.

M. Vlachos (✉) · K.-L. Wu · S.-K. Chen · P. S. Yu
IBM T. J. Watson Research Center, 19 Skyline Dr, Hawthorne, NY 10532, USA

1 Introduction

“Panta rhei”, said Heraklitos; everything is ‘in flux’. This famous aphorism by the ancient Greek philosopher is even more valid today. People need to make decisions about financial, personal or inter-personal matters based on observations of various factors. Therefore, since everything is in constant flow, monitoring the volatility/variability of important measurements over time becomes a critical determinant in any decision making process. When dealing with time sequences, or time-series data, one important indicator of change is the presence of ‘burstiness’, which suggests that more events of importance are happening within the same time frame. Therefore, the identification of bursts can provide useful insights about an imminent change in the monitored quantity, allowing the system analyst or individual to make a timely and informed decision.

Monitoring and modeling of burst behavior (see Fig. 1) is significant in many areas:

• First and foremost, in computer networks it is generally recognized that network traffic can be bursty in various time-scales (Leland et al. 1993; Jiang and Dovrolis 2005). Detection of bursts is therefore inherently important for identification of network bottlenecks or for intrusion detection, since an excessive amount of incoming packets may be a valid indication that a network system is under attack (Scott 2004).
• Detection of bursty behavior can also be useful during the auditing of computer system logs, with the goal of spotting problems or system bottlenecks. Additionally, burstiness has been successfully used as a measure of similarity for the analysis of weblogs (Vlachos et al. 2004).
• For applications such as fraud detection it is critical to efficiently recognize any anomalous activity (typically in the form of over-utilization of resources). For example, burst detection techniques can be fruitfully utilized for spotting suspicious activities in large stock trading volumes (Lerner and Shasha 2003) or for identification of fraudulent phone activity (Nguyen and Tjoa 2004).
• In natural sciences, researchers are also interested in unmasking bursty behavior in cosmic radiation, such as gamma-rays (Zhu and Shasha 2003) or sunspot activity, because such measurements can be used as evidence of a forthcoming climatic change. As an example, it has been noted that solar variability greatly affects the earth’s climate, and in fact a rise in the sunspot numbers also suggests an increase in the northern hemisphere land temperatures (Friss-Cristensen and Lassen 1991).
• In epidemiology and bio-terrorism, scientists are interested in the early detection of a disease outbreak. This may be indicated by the discovery of a sudden increase in the number of illnesses or visits to the doctor within a certain geographic area (Widdowson et al. 2003; Wong et al. 2003; Stern and Lightfoot 1999).
• Finally, in medical sciences, discovery of burstiness in certain biometric measures may also suggest a health abnormality. For example, EEG burst patterns can be a valid indication of brain dysfunction (Muthuswamy et al.

Fig. 1 Burst examples in time-series data (panels: Network Data, Weblog, Sunspot)

1999; Laeven et al. 2001). Additionally, in the field of biology and bioinformatics, scientists are interested in discovering and measuring gene coexpression, that is, genes that display similar patterns of expression. In this field, burstiness is typically encountered as ‘up-regulation’ and holds substantial biological significance, because identification of coexpressed genes gives insight into functionally related groups of genes and proteins (Heyer et al. 1999).

Many recent works address the problem of burst detection (Zhu and Shasha 2003; Kleinberg 2002; Shasha and Zhang 2005). However, in many disciplines, more effective knowledge discovery can be achieved by identifying correlated bursts when monitoring multiple data sources. From a data-mining perspective, this task is more exciting and challenging, since it involves the identification of burst ‘clusters’ and it can also aid the discovery of causal chains of burst events, which possibly occur across multiple data streams. Instances of the above problems can be encountered in many financial and stock market applications, e.g., for triggering fraud alarms. Additionally, it has been shown that correlation of burst events can indicate useful connections on weblog data or even on online search patterns (Vlachos et al. 2004; Liu et al. 2006).

Addressing the above issues, this paper presents a complete framework for effective multi-stream burst correlation. Similar to (Vlachos et al. 2004), we represent detected bursts as time intervals of their occurrence. We provide a new burst detection scheme, which is tailored for skewed distributions, such as the financial data that we examine here. Additionally, we introduce a memory-based index structure for identification of overlapping bursts. The new index structure is based on the idea of containment-encoded intervals (CEIs), which were originally used for performing stabbing queries (Wu et al. 2004). Building on the idea of encoded time intervals, we develop new search algorithms that can efficiently answer overlapping range queries. Moreover, we develop an approach to incrementally maintain the index as more recent data values are added. Using this new index structure we can achieve more than three orders of magnitude better search performance for the problem of burst overlap computation, compared to the B+tree solution proposed in (Vlachos et al. 2004).

Our contributions are summarized as follows:

1. We elaborate on a flexible and robust method of burst extraction on skewed distributions.


2. We present a memory-based index structure that can store the identified burst regions of a sequence and perform very efficient overlap estimation of burst regions.
3. Finally, we demonstrate the real-time response of the proposed index and the intuitiveness of the matching results on financial stock data from the NYSE.

The work presented here is an expanded version of the work that appears in (Vlachos et al. 2005). Additional sections include a thorough complexity analysis of the proposed indexing scheme, and a significantly expanded experimental section which better showcases the index performance.

The remainder of the paper is structured as follows. In Sect. 2 we present an overview of our framework. Section 3 deals with burst detection schemes for skewed distributions and with burst summarization strategies. In Sect. 4 we explain how to efficiently organize the extracted burst digests to facilitate fast search, and we provide an analysis of the search algorithm. Empirical validation of the quality and the effectiveness of the proposed index is the focus of Sect. 5. Finally, Sect. 6 concludes the paper and suggests future directions of this work.

2 Problem formulation

Let us consider a database D, containing m time-series sequences of the form $S = s_1 \ldots s_n$, $s_i \in \mathbb{R}$. Fundamental is also the notion of a burst interval $b = [t_{start}, t_{end})$, representing the time-span of a detected burst, with an inclusive left endpoint and an exclusive right endpoint, where $t_{start}$, $t_{end}$ are integers and $t_{start} < t_{end}$. Between two burst intervals q, b one can define a time overlap operator ∩, such that:

$$q \cap b = \begin{cases} 0 & \text{if } t^{q}_{end} \le t^{b}_{start} \\ 0 & \text{if } t^{q}_{start} \ge t^{b}_{end} \\ \min(t^{q}_{end}, t^{b}_{end}) - \max(t^{q}_{start}, t^{b}_{start}) & \text{otherwise} \end{cases}$$

We dissect the burst correlation problem into the following steps:

(i) Burst identification on sequences residing in a database D. The burst detection process will return for each sequence S a set of burst intervals $B_S = \{b_1, \ldots, b_k\}$, of different cardinality k for every sequence. The set containing all burst intervals of database D is denoted as $B_D$.
(ii) Organization of $B_D$ in a CEI-Overlap index I.
(iii) Discovery of overlapping bursts with a query Q given index I, where Q is also a set of burst intervals: $Q = \{q_1, \ldots, q_l\}$. The output of the index will be a set of intervals $V = \{v_1, \ldots, v_r\}$, $v_j \in B_D$, that have non-zero overlap with the query intervals:

$$\sum_{i} \sum_{j} q_i \cap v_j \neq 0$$



(iv) Return of top-k matches [optional]. This step ranks the returned sequences by the degree of overlap between their respective burst intervals and the query intervals. Since this step is merely a sorting of the result set, we do not elaborate on it further in the remainder of the paper.

In Fig. 2 we illustrate the steps that we follow, and in Table 1 we summarize the paper notation.

Fig. 2 Overview of our approach

Table 1 Description of main notation

Symb.    Description
D        Database of time-series
$B_D$    Set of detected bursts for database D
I        Burst index
m        Database size (number of time-series)
$b_i$    Burst interval
$q_i$    Query interval
τ        Burst detection threshold
r        Maximum time in future for indexing
L        Time length per index region

3 Burst detection

The burst detection process involves the identification of time regions in a sequence which exhibit over-expression of a certain feature. In our setting, we consider the actual value of a sequence S as an indication of a burst. That is, if $s_i > \tau$, then time i is marked as a burst. The determination of the threshold τ depends on the distributional characteristics of the dataset. Assuming a Gaussian data distribution, τ could be set as the mean value µ plus three times the standard deviation.

In this work we focus on financial data, therefore we first examine the distribution of their values. In Fig. 3 we depict the volume distribution of traded shares for two stocks (period 2001–2004). Similar shapes were observed for the majority of stock volume measurements. We notice a highly skewed distribution that is also typically encountered in many streaming applications (Cormode and Muthukrishnan 2005). The shape of such a distribution is typically captured using a zipfian model. In this work we use an exponential model to describe the shape of the distribution, because of its simplicity (only one parameter to determine) and the intuitiveness of the produced results. As we will show in the following lines, the desired parameter can be easily determined from the time-series values and is trivially maintained when dealing with streaming data.

The tail probability (complementary CDF) of an exponentially distributed random variable X is given by:

$$P(X > x) = e^{-\lambda x}$$

where the mean value µ of X is $1/\lambda$. Solving for x, after elementary calculations we arrive at the following:

$$x = -\mu \cdot \ln(P) = -\frac{\sum_{i=1}^{n} s_i}{n} \cdot \ln(P)$$
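As a rough illustration of the formula above, the following Python sketch computes the threshold from a list of values and also keeps it up to date from a running sum. The names `burst_threshold` and `RunningThreshold` and the default `p=1e-4` are illustrative assumptions, not part of the paper.

```python
import math


def burst_threshold(values, p=1e-4):
    """Threshold x = -mean(values) * ln(p) under the exponential model."""
    mean = sum(values) / len(values)
    return -mean * math.log(p)


class RunningThreshold:
    """Streaming (aggregate-window) variant: stores only a sum and a count."""

    def __init__(self, p=1e-4):
        self.p = p
        self.total = 0.0
        self.count = 0

    def update(self, value):
        """Add one observation and return the updated threshold."""
        self.total += value
        self.count += 1
        return -(self.total / self.count) * math.log(self.p)
```

A value s_i would then be marked as a burst whenever it exceeds the returned threshold.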

In order to calculate the critical threshold above which all values are considered bursts, we estimate the value of x by looking at the tail of the distribution, hence setting P to a very small probability, i.e., $10^{-4}$. Figure 3 depicts the threshold value and the discovered bursts on two stock volume sequences. Notice that the computed threshold is amenable to incremental computation in the case of streaming time-series (either for a sliding or aggregate window), because it only involves the maintenance of the running sum of the sequence values.

However, setting a global threshold might introduce a bias when the range of values changes drastically within the examined window, i.e., when there is a ‘concept drift’ (Lazarescu et al. 2004; Harries and Horn 1995). Therefore, one can compute a variable threshold, dividing the examined data into overlapping partitions. The distribution in each partition still remains highly skewed and can be estimated by the exponential distribution, due to the self-similar nature of financial data (Lux 1996; Turiel and Perez-Vicente 2003). An example of the modified threshold (for the second stock of Fig. 3) is shown in Fig. 4, where the length of the partition is 200 and the overlap is 100. In the overlapping part, the threshold is set as the average of the thresholds calculated for the two consecutive windows. We observe that in this case we can also detect the smaller burst patterns that were overshadowed by the high threshold value of the whole window (notice that a similar algorithm can be utilized for streaming sequences).

Fig. 3 Two examples of the value distributions of stock trading volumes (panels: DIET, trading volume; ARKR, trading volume). For each stock we depict on the left the volume trading for one year and on the right the respective distribution

Fig. 4 Variable threshold using overlapping subwindows (panel: DIET (stock volume), variable threshold)

After the bursts of a sequence are marked, each identified burst is transcribed into a burst record. Consecutive burst points are compacted into a burst interval, represented by its start and end position in time, such as [m, n), m < n. An isolated burst point at time m is therefore represented by the interval [m, m + 1). In what follows, we explain how these burst regions can be organized into an efficient index structure.

Notice that in this work we focus primarily on the skewed distributions that are prevalent in financial data. For different applications other definitions of bursts (such as the ones in Shasha and Zhang (2005) and Kleinberg (2002)) might be more appropriate. However, as long as the detected bursts are eventually transcribed into intervals, the index that will shortly be described is directly applicable without any modifications.
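The variable threshold and the compaction of burst points into intervals can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the handling of a short final window, and the default parameters are hypothetical choices, not the authors' implementation.

```python
import math


def variable_threshold(values, window=200, step=100, p=1e-4):
    """Per-point threshold from overlapping windows; overlaps are averaged."""
    n = len(values)
    sums = [0.0] * n     # sum of the window thresholds covering each point
    counts = [0] * n     # number of windows covering each point
    for start in range(0, n, step):
        chunk = values[start:start + window]
        thr = -(sum(chunk) / len(chunk)) * math.log(p)
        for i in range(start, min(start + window, n)):
            sums[i] += thr
            counts[i] += 1
    return [s / c for s, c in zip(sums, counts)]


def burst_intervals(values, thresholds):
    """Compact consecutive burst points (value > threshold) into [start, end)."""
    intervals, start = [], None
    for i, (v, t) in enumerate(zip(values, thresholds)):
        if v > t and start is None:
            start = i                       # a new burst begins
        elif v <= t and start is not None:
            intervals.append((start, i))    # the burst ended at i - 1
            start = None
    if start is not None:                   # burst runs to the end of the data
        intervals.append((start, len(values)))
    return intervals
```

With `window=200` and `step=100`, the settings mirror the partition length and overlap quoted for Fig. 4; a single burst point at time m comes out as `(m, m + 1)`.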

4 Index structure

For the fast identification of overlapping burst intervals,¹ we adapt the notion of containment-encoded intervals (CEIs), which were originally utilized for answering stabbing queries (Wu et al. 2004) (CEI-Stab). In this work we present the CEI-Overlap index, which shares a similar structure with CEI-Stab. We introduce a new efficient search technique for identifying overlapping burst regions. Moreover, we present an effective approach for handling the nonstop progress of time.

1 For the remainder of the paper, “burst regions” and “burst intervals” will be used interchangeably.

Fig. 5 Example of containment-encoded intervals and their ID labeling (segment length $L = 2^3$; CEIs carry local IDs 1 to 15)

4.1 Building a CEI-Overlap index

There are two kinds of intervals in CEI-Overlap indexing: (a) burst intervals and (b) virtual construct intervals. Burst intervals are identified as described in Sect. 3. Virtual construct intervals are introduced to facilitate the decomposition of burst intervals and to enable effective search operations. As noted before, burst intervals are represented by their start and end positions in time, and the query search regions are expressed similarly.

Figure 5 shows an example of containment-encoded intervals and their local ID labeling. Assume that the burst intervals to be indexed cover a time-span [0, r).² First, this range is partitioned into r/L segments of length L, denoted as $SS_i$, where i = 0, 1, ..., (r/L − 1), $L = 2^k$, and k is an integer. Note that r is assumed to be a multiple of L. In general, the longer the average length of burst regions is, the larger L should be (Wu et al. 2004). Segment $SS_i$ contains the time interval [iL, (i + 1)L). Segment boundaries can be treated as guiding posts. Then, 2L − 1 CEIs are defined for each segment as follows: (a) define one CEI of length L, containing the entire segment; (b) recursively define two CEIs by dividing a CEI into two halves, until the length is one. For example, in Fig. 5 there is one CEI of length eight, two CEIs of length four, four CEIs of length two and eight CEIs of length one. These 2L − 1 CEIs have containment relationships among them: every unit-length CEI is contained by a CEI of size 2, which is in turn contained by a CEI of size 4, and so on.

The labeling of CEIs encodes these containment relationships. The ID of a CEI has two parts: the segment ID and the local ID. The local ID assignment follows the labeling of a perfect binary tree. The globally unique ID of a CEI in segment $SS_i$, where i = 0, 1, ..., (r/L) − 1, is simply computed as l + 2iL, where l is the local ID. The local ID of the parent of a CEI with local ID l is ⌊l/2⌋, and it can be efficiently computed by a logical right shift by 1 bit.

² Section 4.3 will describe how to handle the issue of choosing an appropriate r as time continues to advance nonstop.

To insert a burst interval, it is first decomposed into one or more CEIs, and then its ID is inserted into the ID lists associated with the decomposed CEIs. The CEI index maintains a set of burst ID lists, one for each CEI.

Figure 6 shows an example of a CEI-Overlap index. It shows the decomposition of four burst intervals, $b_1$, $b_2$, $b_3$ and $b_4$, within a specific segment containing the CEIs $c_1, \ldots, c_7$. $b_1$ completely covers the segment, and its ID is inserted into $c_1$. $b_2$ lies within the segment and is decomposed into $c_5$ and $c_6$, the largest CEIs that can be used for decomposition. $b_3$ also resides within the segment, but its right endpoint coincides with a guiding post. As a result, we can use $c_3$, instead of $c_6$ and $c_7$, for decomposition. Similarly, $c_2$ is used to decompose $b_4$. Burst IDs are inserted into the ID lists associated with the decomposed CEIs.

Fig. 6 Example of CEI-Overlap indexing (burst intervals $b_1, \ldots, b_4$ decomposed into the CEIs $c_1, \ldots, c_7$ of one segment, $L = 2^k$)
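The decomposition and insertion steps described above might be sketched as follows. The function `decompose`, the class `CEIOverlapIndex`, and the greedy "largest aligned CEI first" loop are assumptions made for illustration; the paper does not give this code.

```python
from bisect import insort
from collections import defaultdict


def decompose(start, end, L):
    """Split [start, end) into the largest CEIs; return their global IDs."""
    ids = []
    pos = start
    while pos < end:
        seg = pos // L                  # segment ID i
        offset = pos - seg * L          # position inside the segment
        size = L                        # try the largest CEI first
        # Shrink until the CEI is aligned at `offset` and fits in [pos, end).
        while offset % size != 0 or pos + size > end:
            size //= 2
        local = L // size + offset // size   # perfect-binary-tree label
        ids.append(local + 2 * seg * L)      # global ID = l + 2iL
        pos += size
    return ids


class CEIOverlapIndex:
    """Maps each global CEI ID to a sorted list of burst IDs."""

    def __init__(self, L):
        self.L = L
        self.id_lists = defaultdict(list)

    def insert(self, burst_id, start, end):
        for cei in decompose(start, end, self.L):
            insort(self.id_lists[cei], burst_id)   # keep each ID list sorted
```

For a segment with L = 4, `decompose(1, 3, 4)` yields the local IDs 5 and 6, and `decompose(2, 4, 4)` yields 3, which matches the way $b_2$ and $b_3$ are decomposed in the Fig. 6 discussion if those bursts are taken to be [1, 3) and [2, 4).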

4.2 Identification of overlapping burst regions

To identify overlapping burst regions, we must first find the overlapping CEIs. One simple approach is to divide the input interval into multiple unit-sized CEIs and perform a point search for each of the unit-sized CEIs using the CEI-Stab search algorithm. However, replicate elimination is then required to remove redundant overlapping CEIs. Figure 7 shows an example of identifying CEIs overlapping with an input interval. There are 9 unique overlapping CEIs. Using the point search algorithm of the CEI-Stab index (Wu et al. 2004), there will be 16 overlapping CEIs, four from each upward-pointing dotted arrow. Seven of them are replicates: $c_1$ is reported four times, and $c_2$, $c_3$, $c_5$ and $c_6$ are each reported twice.

Fig. 7 Example of finding CEIs overlapping with an input interval (local IDs $c_1, \ldots, c_{15}$; the CEIs overlapping with the input query burst are highlighted)

Fig. 8 Pseudo code for searching overlap bursts

Eliminating redundant CEIs slows down the search. In this paper, we develop a new search algorithm for CEI-Overlap that does not involve replicate elimination. Figure 8 shows the pseudo code for systematically identifying all the overlapping bursts for an input region [x, y), where x and y are integers, x < y and [x, y) lies between two consecutive guiding posts (other cases will be discussed later). First, we compute the segment ID $i = \lfloor x/L \rfloor$. Then, we compute the local IDs of the leftmost unit-sized CEI, $l_1 = x - iL + L$, and the rightmost unit-sized CEI, $l_2 = (y - 1) - iL + L$, that overlap with [x, y). From $l_1$ and $l_2$, we can systematically locate all the CEIs overlapping with the input interval. Any CEI whose local ID is between $l_1$ and $l_2$ also overlaps with the input. We then move up one level to the parents of $l_1$ and $l_2$. This process repeats until $l_1 = l_2 = 1$, i.e., until the root CEI $c_1$ is reached. Each overlapping CEI is examined only once. Hence, no duplicate elimination is needed. Figure 7 shows the identification of overlapping CEIs, from which the overlapping bursts can easily be found via the CEI index.

Now we discuss the cases where the input interval does not lie between two consecutive segment boundaries. Similar to the decomposition process, the input interval can be divided along the segment boundaries. Any remnant can use the search algorithm described in Fig. 8. A full segment, if any, has all the 2L − 1 CEIs within that segment as overlapping CEIs.

In contrast to CEI-Stab (Wu et al. 2004), there might be duplicate burst IDs in the search results of CEI-Overlap. Even though the search algorithm of CEI-Overlap reports no duplicate overlapping CEIs, it might return duplicate overlapping burst IDs. This is because a burst can be decomposed into more than one CEI, and more than one of them can overlap with an input interval. To efficiently eliminate these duplicates, the burst ID lists are maintained so that the IDs are sorted within individual ID lists. During search, instead of reporting all the burst IDs within each overlapping CEI one CEI at a time, we first locate all the overlapping CEIs. Then, the multiple ID lists associated with these CEIs are merged to report the search result. During the merge process, duplicates can be efficiently eliminated.
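Since the pseudo code of Fig. 8 is not reproduced in this text, the search described above can be sketched in Python as follows. The function names and the `id_lists` dictionary (global CEI ID mapped to a sorted list of burst IDs) are assumptions for illustration; the use of `heapq.merge` is one possible way to merge the sorted lists.

```python
import heapq


def overlapping_ceis(x, y, L):
    """Global IDs of all CEIs overlapping [x, y); each CEI is visited once."""
    result = []
    pos = x
    while pos < y:
        seg = pos // L
        seg_end = min(y, (seg + 1) * L)     # cut at the next guiding post
        l1 = pos - seg * L + L              # leftmost unit-sized CEI
        l2 = (seg_end - 1) - seg * L + L    # rightmost unit-sized CEI
        while l1 >= 1:
            result.extend(l + 2 * seg * L for l in range(l1, l2 + 1))
            l1 >>= 1                        # move up to the parents
            l2 >>= 1
        pos = seg_end
    return result


def overlapping_bursts(x, y, L, id_lists):
    """Merge the sorted ID lists of the overlapping CEIs, dropping duplicates."""
    lists = [id_lists.get(c, []) for c in overlapping_ceis(x, y, L)]
    merged, last = [], None
    for burst_id in heapq.merge(*lists):
        if burst_id != last:                # duplicates arrive adjacent
            merged.append(burst_id)
            last = burst_id
    return merged
```

As a sanity check, a query such as [2, 6) with L = 8 (a hypothetical choice) visits nine distinct CEIs, the same count quoted for the example of Fig. 7.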

4.3 Incrementally maintaining the index

Since time continues to advance nonstop, no matter what initial range [0, r) is chosen, the current time will at some point exceed the maximal range r. Selecting a large r to cover a time-span deep in the future is not a good approach, because the index storage cost will increase (Wu et al. 2004). A better approach is to choose an r larger than the maximum window of burst regions at the moment, and to keep two indexes in memory, similar to the double-buffering concept. More specifically, we start with [0, r). When time passes r, we create another index for [r, 2r). When time passes 2r, we create an index for [2r, 3r), but the index for [0, r) will likely no longer be needed and can be discarded or flushed to disk. Using this approach no false dismissals are introduced, since any burst interval covering two regions can be divided along the region boundary and indexed or searched accordingly.
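The double-buffering idea can be illustrated with the following sketch. It is a simplification under stated assumptions: the class name `RollingIndex` is hypothetical, each region stores a plain list of burst pieces as a stand-in for a full CEI-Overlap index, and old regions are simply discarded rather than flushed to disk.

```python
class RollingIndex:
    """Keeps the regions [k*r, (k+1)*r) for the two most recent k only."""

    def __init__(self, r):
        self.r = r
        self.regions = {}   # region number k -> list of (burst_id, start, end)

    def insert(self, burst_id, start, end):
        """Split [start, end) along region boundaries and store each piece."""
        pos = start
        while pos < end:
            k = pos // self.r
            piece_end = min(end, (k + 1) * self.r)
            self.regions.setdefault(k, []).append((burst_id, pos, piece_end))
            pos = piece_end
        # Discard every region older than the two most recent ones.
        newest = max(self.regions)
        for old in [k for k in self.regions if k < newest - 1]:
            del self.regions[old]
```

Because a burst that spans a region boundary is stored as one piece per region, a search restricted to a single region still finds it.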

4.4 Discussion and limitations

Before analyzing the theoretical performance of the search algorithm, we identify several issues that can be of practical interest in real system implementations of the CEI-Overlap index. First, CEI-Overlap indexing is designed for fast insertions and fast search operations, but not for fast deletions. To delete a burst interval from the index, we first decompose it into a set of CEIs. For each decomposed CEI, we then sequentially scan the associated ID list to remove the burst ID, which on average is less efficient than an insertion. For the application of burst correlation that we consider in this work, no deletions are involved, therefore performance is not compromised. Second, the storage cost of CEI-Overlap indexing can be large if r is large, especially when we need to store a large number of burst intervals. A simple but effective solution for this scenario would be to partition the burst intervals and build a separate CEI-Overlap index for each partition.

4.5 Complexity of the overlap search algorithm

Here, we analyze the complexity of the overlap search algorithm described in Fig. 8. We show that it has an average case complexity of O(L), with a constant factor of 2/3, and a worst case complexity of 2L − 1, when all the CEIs in the segment need to be examined. In contrast, the simple CEI-Stab search algorithm has a complexity of O(L log(L)), since there are (log(L) + 1) CEIs stabbed by each unit-length CEI (see Fig. 7).

We derive a closed-form formula for the average number of CEIs visited over all possible input intervals that are completely inside the same segment, the same setting as in Fig. 8. The complexity for input intervals that cross at least one guiding post can be similarly derived. For any segment of length $L = 2^k$, there is a 1-to-1 mapping between the $\binom{2^k}{2}$ distinct pairs of unit-length CEIs and all the input intervals that fall completely inside the segment. We sum up the numbers of CEIs examined by all the pairs to get the average. We make the following definitions before establishing recurrence relations and deriving closed-form formulas.

Definition 1 Let $N_n$ denote the number of nodes and $F_n$ denote the number of leaf nodes in a perfect binary tree with (n + 1) levels.

$F_n$ and $N_n$ are easy to obtain since they are general properties of a perfect binary tree. For example, we have $F_1 = 2$, $F_2 = 4$ and $F_n = 2^n$. Similarly, we have $N_1 = 3$, $N_2 = 7$, and $N_n = 2^{n+1} - 1$.

Definition 2 If we pick any pair of leaf nodes, or unit-length CEIs with local IDs of x and y, from a perfect binary tree to form an input query interval, there is a unique minimal sub-tree that includes all the leaf nodes with IDs from x to y, and all the ancestors of these leaf nodes. Let $B_n$ denote the sum of the sizes of all the minimal sub-trees for all possible pairs of leaf nodes in a perfect binary tree with (n + 1) levels.

As an example, Fig. 9(a) shows the case of n = 1, where the tree in the box on the left includes only one distinct pair of leaf nodes, or unit-length CEIs. There is only one minimal sub-tree on the right. The solid edges form a sub-tree whose number of nodes contributes to the total sum. Hence, we have $B_1 = 3$. Figure 9(b) shows the case of n = 2, where the tree in the box on the left includes four leaf nodes, and thus $\binom{4}{2} = 6$ distinct pairs of leaf nodes, as represented by the six horizontal lines. The numbers above the individual lines denote the corresponding contributions to $B_2$. On the right, we only show two of the six trees, with solid edges detailing the contributing CEIs to $B_2$.


Fig. 9 Examples for $B_1$ and $B_2$: (a) $B_1 = 3$; (b) $B_2 = 4 + 6 + 7 + 5 + 6 + 4 = 32$

Fig. 10 Examples for $A_0$, $A_1$, and $A_2$: (a) $A_0 = 1$; (b) $A_1 = 3 + 2 = 5$; (c) $A_2 = 7 + 6 + 4 + 3 = 20$; (d) reverse definition for $A_1$

Our goal is to calculate $B_k / \binom{2^k}{2}$, where k = log(L). In order to establish a recurrence relation for $B_k$, we need to define an auxiliary term $A_k$. This term accounts for the cases where the input intervals cross an imaginary vertical line passing through the root of a perfect binary tree with (k + 1) levels.

Definition 3 Let $T_m$ denote a perfect binary tree with (m + 1) levels. We define $A_m$ as the sum of all the minimal sub-tree nodes in $T_m$ that cover those input intervals with one endpoint chosen from the leaf nodes of $T_m$ and the other endpoint fixed at one point outside of $T_m$, on the right or the left.

As an example, Fig. 10(a) shows the case of m = 0, where clearly $A_0 = 1$. Figure 10(b) shows the case of m = 1, where the fixed point is on the right of $T_1$. There are two leaf nodes in $T_1$ that contribute to the total sum of $A_1$, which is five. Figure 10(d) shows an alternative definition for $A_1$ where the fixed point is on the left of $T_1$. Figure 10(c) shows the case of m = 2, where there are four trees corresponding to the cases when one of the four leaf nodes is chosen. They contribute 7, 6, 4, and 3, respectively, to $A_2$. Therefore, we have $A_2 = 20$.

Now, we first establish a recurrence relation for $A_n$ that involves $F_n$ and $N_n$. Then we will set up a recurrence relation for $B_n$ that involves $A_n$ and $F_n$.

Lemma 1 $A_n = 2A_{n-1} + 2F_{n-1} + F_{n-1}N_{n-1}$, and $A_0 = 1$

Proof Figure 11 shows that $A_n$ can be derived from two sub-trees with one fewer level. Thus we have the term $2A_{n-1}$. When we add the root node, we need to count additional nodes for both sub-trees. For the right sub-tree, each leaf node contributes one more for the root node, and there are $F_{n-1}$ leaf nodes. Thus we have the term $F_{n-1}$. For the left sub-tree, each leaf node contributes one more for the root node, plus the whole number of nodes in the right sub-tree, $N_{n-1}$. Thus we have the term $F_{n-1}(N_{n-1} + 1)$. Summing the three terms up, we have the recurrence relation.

Fig. 11 Recurrence relation for $A_n$

By a simple induction, we have the following lemma:

Lemma 2 $A_n = n2^{n-1} + 4^n$

Now we are ready to establish the recurrence relation for $B_n$.

Lemma 3 $B_n = 2B_{n-1} + 2\binom{F_{n-1}}{2} + F_{n-1}^2 + 2F_{n-1}A_{n-1}$, and $B_1 = 3$

Proof By splitting an (n + 1)-level perfect binary tree into three parts, the root node, the left sub-tree and the right sub-tree, we can set up a recurrence relation for $B_n$. Figure 12 shows that $B_n$ can be derived from two sub-trees with one fewer level. We partition the set of input intervals into three classes: those completely in the left sub-tree, those completely in the right sub-tree, and those with one endpoint in the left sub-tree and the other endpoint in the right sub-tree. The root should be counted once for every input interval in the first two classes, and hence we have the first two terms. The third term $F_{n-1}^2$ is due to the cross product of the $A_{n-1}$'s in the two sub-trees, which has a 1-to-1 relation with the third class of the input intervals. Each pair of sub-trees (one from the left sub-tree and one from the right sub-tree) in the cross product and the root node form a unique (n + 1)-level sub-tree. There are $F_{n-1}^2$ such sub-trees in the third class, where each should contribute 1 (the root node) to the count $B_n$. The last term is trickier. We first look at the left sub-tree; the right sub-tree is handled similarly (thus the factor 2). Every sub-tree that contributes to $A_{n-1}$ contributes to the final sum $F_{n-1}$ times, since there are $F_{n-1}$ sub-trees from the right sub-tree. Hence we have the final term $2F_{n-1}A_{n-1}$.

Fig. 12 Recurrence relation for $B_n$

As an example, Fig. 13 shows the case of $B_3$, where 6 is added to $B_2 = 4 + 6 + 7 + 5 + 6 + 4$ for the left and right sub-trees, respectively. Thus we have the amount 2 · (32 + 6) = 76. There is a cross product that includes pairs of input intervals, where each pair of intervals forms a cross sub-tree input interval. The cross product can be nicely arranged into a matrix as shown in the lower part of Fig. 13. Therefore, we have the amount $2 \cdot 4 \cdot 20 + 4^2 = 176$. Summing these terms up, we have $B_3 = 76 + 176 = 252$.

Theorem 1 $B_n = \frac{8^n}{3} + \frac{n4^n}{2} - \left(\frac{n}{2} + \frac{1}{3}\right)2^n$, and $B_1 = 3$

Proof By substituting $F_{n-1}$ and $A_{n-1}$ into the recurrence relation, and by induction.

Theorem 2 The average number of covering CEIs per input interval within two consecutive guiding posts is O(L), with a constant factor of 2/3.

Proof The average number equals $B_k / \binom{2^k}{2} \approx \frac{8^k/3}{2^{2k}/2} = \frac{2}{3} \cdot 2^k = \frac{2}{3}L$, for a large k.
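The closed forms of Lemma 2 and Theorem 1 can be checked numerically against a direct enumeration. The sketch below is only a verification aid with hypothetical helper names; it counts the nodes of each minimal sub-tree level by level, using the perfect-binary-tree local IDs of Sect. 4.1.

```python
from math import comb


def subtree_size(l1, l2, n):
    """Nodes in the minimal sub-tree spanning leaves l1..l2 (n + 1 levels)."""
    return sum((l2 >> j) - (l1 >> j) + 1 for j in range(n + 1))


def brute_A(n):
    last = 2 ** (n + 1) - 1                  # rightmost leaf; fixed point on the right
    return sum(subtree_size(x, last, n) for x in range(2 ** n, last + 1))


def brute_B(n):
    leaves = range(2 ** n, 2 ** (n + 1))
    return sum(subtree_size(x, y, n) for x in leaves for y in leaves if x < y)


def closed_A(n):
    return (n * 2 ** n) // 2 + 4 ** n        # n * 2^(n-1) + 4^n


def closed_B(n):
    # 8^n/3 + n*4^n/2 - (n/2 + 1/3)*2^n, written over the common denominator 6
    return (2 * 8 ** n + 3 * n * 4 ** n - (3 * n + 2) * 2 ** n) // 6


for n in range(1, 8):
    assert brute_A(n) == closed_A(n)
    assert brute_B(n) == closed_B(n)

k = 10
# Average number of CEIs per interval divided by L; tends to 2/3 as k grows.
print(closed_B(k) / comb(2 ** k, 2) / 2 ** k)
```

The enumeration reproduces $A_2 = 20$, $B_2 = 32$ and $B_3 = 252$ from the worked examples.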

Fig. 13 Recurrence relation example for $B_3$ (cross-product term: $F_2^2 + 2F_2A_2 = 176$)

5 Experiments

We evaluate three aspects of the burst correlation scheme: (i) the quality of results (is the burst correlation useful?), (ii) the index response time (how fast can we obtain the results?), and (iii) indexing scheme comparison (how much better is it than other approaches?).

5.1 Meaningfulness of results

Our first task is to assess the quality of the results obtained through the burst correlation technique. To this end, we search for burst patterns in stock trading volumes during the days before and after the 9/11 attack, with the intention of examining our hypothesis that financial and/or travel related companies might have been affected by the events. We utilize historical stock data obtained from finance.yahoo.com, totaling 4793 stocks of length 1000, that cover the period between 2001 and 2004 (STOCK dataset). We use the trading volume of each stock as the input for the burst detection algorithm. Our burst query range is set to the dates 9/7/2001–9/20/2001; note that the stock market did not operate between 9/11 and 9/16. Figures 14–17 illustrate examples of several affected stocks. The graphs display the volume demand of the respective stocks, while on the top right we also enclose the stock price movement for the whole month of September (the price during the search range is depicted by a thicker line). Stocks like ‘Priceline’ or ‘Skywest’, which are related to traveling, experience a significant increase in selling demand, which leads to share depreciation when the stock market

Fig. 14 Volume trading for the Priceline stock. We notice a large selling tendency, which results in a drop in the share price (panel title: [09/07/01 -> 09/20/01], PCLN on 17-Sep-2001, volume 122969400)

Fig. 15 Volume trading for the Skywest stock (panel title: [09/07/01 -> 09/20/01], SKYW on 17-Sep-2001, volume 7025900)

re-opens on Sep. 17. At the same time, the stock prices of ‘NICE Systems’ (a provider of air traffic control equipment) and ‘Mercury Computer Systems’ (a manufacturer of defense electronics) show an increase in value (Figs. 16 and 17). More examples of stocks with burst trends in the stock demand within the requested time frame are presented in Table 2 and in Fig. 20. In general, we notice that stocks related to traveling, air transportation, banking and pharmaceuticals experience a strong depreciation. On the other hand, stocks related to defense electronics and (surprisingly) cinemas (e.g., Carmike) demonstrate an appreciation in stock price, accompanied by a surge in the buying demand.

We demonstrate additional examples at different chronological ranges, with the intention of indicating how powerful a tool burst correlation can be, especially for deducing connections and interactions between companies and events. The first example is a strongly correlated burst on 18th April 2005, between the stock volumes of Adobe (ADBE) and Macromedia (MACR) (Fig. 18). Both companies experience a large buying demand in their stocks. By browsing historical news events, one can see that this was the announcement day of the merger between the two companies. While the connection between these two incidents is simple, it serves as an indication of how burst correlation can be used for inferring the presence of significant news events.

Fig. 16 Volume trading for the stock of Nice Systems (provider of air traffic control systems). In this case, the high stock demand results in an increase of the share price (panel title: [09/07/01 -> 09/20/01], NICE on 18-Sep-2001, volume 965300)

Fig. 17 Volume trading for the stock of Mercury Computer Systems (designer and manufacturer of defense electronics). The stock price rose significantly on Sep-17-2001 (panel title: [09/07/01 -> 09/20/01], MRCY on 17-Sep-2001, volume 3822200)

Our second example is a bit more subtle. We examine which stocks were affected by the release of iPod Photo by Apple on 26th October 2004. Besides the stock of Apple (AAPL), one can observe that there is also a large demand for TTM Technologies (TTMI), which is a provider of printed circuit boards (Fig. 19). At first we cannot derive an immediate connection between the two companies. However, by examining the 10-K annual report form of TTM Technologies, one can see that Apple is mentioned as one of the company’s customers. This example serves as a strong indication of how even indirect connections can be deduced by careful examination of correlated burst events.

5.2 Index comparison

We compare the performance of the new CEI-Overlap indexing scheme with the B+tree approach proposed in Vlachos et al. (2004). In that work the burst index was used to identify similar high demands for a range of keywords posed at the MSN search engine. The B+tree index employed there recorded the start and end positions of each burst range and performed an efficient identification of bursts overlapping with the query range by noting the following: any burst $b$ overlapping with the query burst $q$ should have a start time before the query end time ($t^{b}_{start} < t^{q}_{end}$) and an end time after the query start time ($t^{b}_{end} > t^{q}_{start}$). We note


Fig. 18 Strong correlation is indicated between stock volume movements of Adobe and Macromedia on 18th April 2005, the day their merger was announced

that since our work utilizes an in-memory index, we also use a memory-resident version of the B+tree, so as to provide meaningful experimental comparisons. Other potential indexing techniques that could answer range overlap queries are ‘Interval Skip Lists’ (Hanson and Johnson 1996) and R-trees (Guttman 1984). However, it has been shown that the CEI-based index outperforms these interval indexing schemes (Wu et al. 2004), therefore we do not provide comparisons with these indices. We quantify the index performance using two metrics: the index insertion time and the index response time. Because the STOCK dataset is quite small for indexing purposes, we generate a larger artificial dataset that simulates the burst ranges returned by a typical burst detection algorithm. We generate three instances of the dataset with an increasing number of burst ranges (low density: 250K records, medium density: 500K records, and high density: 750K records). The burst ranges are generated randomly at various positions and cover different time-spans. A small sample of this dataset (corresponding to the bursts of 100 synthetic sequences), along with three query ranges, is depicted in Fig. 21.

5.2.1 Index insertion time

This experiment measures the time required to build a burst range index. We populate the B+tree and a CEI-Overlap index using the artificial dataset containing 750,000 burst ranges. The average insertion time per burst record for both indices is depicted in Fig. 22. One can observe that the B+tree exhibits

Fig. 19 A strong connection is suggested by correlated bursts of Apple and TTM Technologies, during the release of iPod Photo

Table 2 Some of the stocks that exhibited high trading volume after the events of 9/11/2001

Symbol  Name (Description)                                 Price
AVSR    Avistar Communications                             28% ↓
BEAV    Be Aerospace Inc                                   65% ↓
CKEC    Carmike cinemas                                    97% ↑
EMCI    EMC Insurance                                      4% ↑
ESLT    ELBIT Systems LTD (defense electronics supplier)   11% ↑
FKKY    Frankfort First Bancorp                            0.3% ↑
FLYI    Atlantic Coast Airlines Holdings, Inc              35% ↓
HAVNP   Haven Capital Trust                                4.5% ↓
INSU    Insituform Technologies (pipe tunneling)           38% ↓
KEYN    Keynote Systems (e-business services)              11% ↓
LIFE    Lifeline Systems (Medical Emergency Response)      1.5% ↓
MAIR    Mair Holdings (Airline Subsidiary)                 36% ↓
MRCY    Mercury Computer Systems (defense electronics)     44.8% ↑
NICE    NICE Systems (Air traffic Control Systems)         25% ↑
PCLN    Priceline                                          60% ↓
PRCS    Praecis Pharmaceuticals                            41% ↓
SKYW    Skywest Inc                                        61% ↓
STNR    Steiner Leisure (Spa & Fitness Services)           51% ↓
STNJ    Sterling Bank                                      2.5% ↓
TSBK    Timberland Bancorp, Inc.                           7% ↓


Fig. 20 More examples of affected stocks after the 9/11 events

Fig. 21 Artificial dataset and example of three burst range queries (x-axis: Time, 0 to 1000; y-axis: Sequence ID, 0 to 100; query ranges Q1, Q2, Q3)

Fig. 22 Time required to populate the index. The B+tree insertion time is linear in the number of objects, while the CEI-index exhibits constant insertion time (x-axis: Number of Bursts; y-axis: Time (sec), log scale; series: B+tree, Burst Index; inset: zoom-in on the B+tree)

a linear trend in the insertion time, with respect to the database objects already inserted. This is more clearly indicated in the ‘zoom-in’ of the B+tree index, which is also shown at the top of the same figure. The linear performance is to be expected, since any tree-based index incurs a balancing phase, whose cost increases with the database size. In contrast, the CEI-based index requires approximately constant insertion time, irrespective of the database size. This is largely attributed to the fast ‘hash-based’ mechanism of object insertion. No balancing is required, hence the insertion expense remains constant and in the range of 10–20 ms for the experiment of Fig. 22. In general, we note that the index proposed in this paper exhibits three orders of magnitude faster insertion time, compared to a B+tree based approach.
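As a point of reference for the comparison in Sect. 5.2, the overlap condition stated for the B+tree baseline (a burst starts before the query ends and ends after the query starts) can be sketched together with a random burst generator. This is a minimal illustration only: the linear scan below is neither the B+tree nor the CEI-Overlap index, and the function names, the time horizon of 1000 and the maximum burst length of 50 are assumptions chosen for the example rather than the settings used in the experiments.

```python
import random


def overlaps(b_start, b_end, q_start, q_end):
    # Overlap condition from Sect. 5.2: the burst starts before the query
    # ends AND ends after the query starts (both inequalities are strict).
    return b_start < q_end and b_end > q_start


def make_bursts(n, horizon=1000, max_len=50, seed=0):
    # Hypothetical generator: n burst ranges at random positions with random
    # time-spans, returned as (burst_id, start, end) tuples with start < end.
    rng = random.Random(seed)
    bursts = []
    for burst_id in range(n):
        start = rng.randrange(0, horizon - 1)
        end = min(horizon, start + rng.randint(1, max_len))
        bursts.append((burst_id, start, end))
    return bursts


def scan_query(bursts, q_start, q_end):
    # Brute-force baseline: test every stored burst against the query range.
    return [bid for bid, s, e in bursts if overlaps(s, e, q_start, q_end)]


if __name__ == "__main__":
    data = make_bursts(10_000)
    hits = scan_query(data, 200, 260)
    print(len(hits), "bursts overlap the query range [200, 260]")
```

A scan of this kind costs time proportional to the number of stored bursts for every query, which is the cost that an interval index is meant to avoid.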


Fig. 23 B+Tree versus CEI-Overlap runtime (panels: Low Density, 250K Objects; Medium Density, 500K Objects; High Density, 750K Objects; x-axis: Answer set size, ×10^4; y-axis: Running Time (microsec), log scale; series: B+tree, CEI-index)

5.2.2 Index response time

A more critical factor of the index performance is its response time to various queries, or in other words, how much time is required to identify overlapping burst regions for a number of burst query ranges. On both the CEI-Overlap index and the B+tree we probed 5,000 query ranges that cover different positions and ranges. Intuitively, the cost of the search operation is proportional to the number of burst intervals that overlap with a given query. Therefore, we plot the running time of each query with respect to the size of the answer set (more overlaps suggest longer running time). We create a histogram of the running time by dividing the range of the answer set size into 20 bins, and in Fig. 23 we plot the average running time of all the queries that fall in the same histogram bin. We perform this experiment for datasets with different burst cardinalities (250K, 500K, and 750K burst ranges). The results indicate the superior performance of the CEI-based index, which is approximately 2–3 orders of magnitude faster than the competing B+tree approach. We should also note that the running time is reported in µs, which clearly demonstrates the real-time search performance of the proposed indexing scheme.
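The binning step used for Fig. 23 can be sketched as follows. This is an illustrative reading of the procedure described above, assuming equal-width bins over the observed range of answer set sizes; the function name and the choice to report bin centers are assumptions, and empty bins are simply omitted.

```python
def bin_average(answer_sizes, run_times, n_bins=20):
    # Split the observed range of answer set sizes into n_bins equal-width
    # bins and average the running time of the queries falling in each bin.
    lo, hi = min(answer_sizes), max(answer_sizes)
    width = (hi - lo) / n_bins or 1.0  # guard against a zero-width range
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for size, t in zip(answer_sizes, run_times):
        # The largest answer set size is clamped into the last bin.
        idx = min(int((size - lo) / width), n_bins - 1)
        sums[idx] += t
        counts[idx] += 1
    # Return (bin center, average running time) for every non-empty bin.
    return [(lo + (i + 0.5) * width, sums[i] / counts[i])
            for i in range(n_bins) if counts[i]]


if __name__ == "__main__":
    sizes = [0, 1, 2, 3]
    times = [10, 20, 30, 40]
    print(bin_average(sizes, times, n_bins=2))
```

Averaging per bin smooths the per-query noise, so the plotted curve reflects how running time grows with the number of overlapping bursts.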

6 Conclusion

We have presented a complete framework for efficient correlation of bursts. The effectiveness of our scheme is attributed not only to the effective burst detection but also to the efficient memory-based index. The index hierarchically organizes important burst features of time sequences (in the form of ‘burst segments’) and subsequently performs very efficient overlap computation of the discovered burst regions. We have demonstrated the enhanced response time of the proposed indexing scheme, and presented interesting burst correlations that we mined from financial data. In the future, we plan to evaluate the applicability of the proposed index for efficient clustering of detected bursts, as well as for detecting cross-correlation between multiple data-streams based on their burst characteristics.

References

Cormode G, Muthukrishnan S (2005) Summarizing and mining skewed data streams. In Proc of SDM, pp 44–55
Friss-Cristensen E, Lassen K (1991) Length of solar cycle - an indicator of solar-activity closely related with climate. Science 254:698–700
Guttman A (1984) R-trees: a dynamic index structure for spatial searching. In Proc of ACM SIGMOD, pp 47–57
Hanson E, Johnson T (1996) Selection predicate indexing for active databases using interval skip lists. Inform Syst 21(3):269–298
Harries M, Horn K (1995) Detecting concept drift in financial time series prediction. In 8th Australian joint conf on artif intelligence, pp 91–98
Heyer LJ, Kruglyak S, Yooseph S (1999) Exploring expression data: identification and analysis of coexpressed genes. Genome Res 9:11
Jiang H, Dovrolis C (2005) Why is the Internet traffic bursty in short (sub-RTT) time scales? In Proc of ACM SIGMETRICS, pp 241–252
Kleinberg J (2002) Bursty and hierarchical structure in streams. In Proc 8th ACM SIGKDD, pp 91–101
Laeven R, Gielen C, Coenen A, Rijn CV (2001) Principal component analysis and gabor transform in analysing burst-suppression EEG under propofol anaesthesia. In Sleep-wake research in the Netherlands, Vol 12, pp 75–80
Lazarescu M, Venkatesh S, Bui HH (2004) Using multiple windows to track concept drift. Intel Data Analy J 8(1):29–59
Leland WE, Taqqu MS, Willinger W, Wilson DV (1993) On the self-similar nature of ethernet traffic. In Proc of ACM SIGCOMM, pp 183–193
Lerner A, Shasha D (2003) The virtues and challenges of ad hoc + streams querying in finance. IEEE Data Eng Bull:49–56
Liu B, Jones R, Klinkner K (2006) Measuring the meaning in time series clustering of text search queries. In Proc of ACM CIKM, pp 836–837
Lux T (1996) Long-term stochastic dependence in financial prices: evidence from the German Stock Market. Appl Econ Lett 3:701–706
Muthuswamy J, Sherman D, Thakor N (1999) Higher-order spectral analysis of burst patterns in EEG. IEEE Trans Biomed Eng 46(1):92–99
Nguyen TM, Tjoa AM (2004) Grid-based Mobile phone fraud detection system. In Proc of PAKM
Scott SL (2004) A Bayesian paradigm for designing intrusion detection systems. Comput Stat Data Anal (special issue on Computer Security) 45:69–83
Shasha D, Zhang X (2005) Better Burst Detection. NYU, Computer Science Dept, Technical report TR2005-876
Stern L, Lightfoot D (1999) Automated outbreak detection: a quantitative retrospective analysis. Epidemiol Infect 122:103–110
Turiel A, Perez-Vicente C (2003) Multifractal geometry in stock market time series. Physica A 322:629–649
Vlachos M, Meek C, Vagena Z, Gunopulos D (2004) Identification of similarities, periodicities & bursts for online search queries. In Proc of SIGMOD, pp 131–142
Vlachos M, Wu K-L, Chen S-K, Yu P (2005) Fast burst correlation of financial data. In Proc of PKDD, pp 368–379
Widdowson M-A, Bosman A, van Straten E, Tinga M, Chaves S, van Eerden L, van Pelt W (2003) Automated, laboratory-based system using the Internet for disease outbreak detection, the Netherlands. Emerg Infect Dis 9:9
Wong W-K, Moore A, Cooper G, Wagner M (2003) WSARE: what’s strange about recent events? J Urban Health 80:66–75
Wu K-L, Chen S-K, Yu PS (2004) Interval query indexing for efficient stream processing. In Proc of ACM CIKM, pp 88–97
Zhu Y, Shasha D (2003) Efficient elastic burst detection in data streams. In Proc of SIGKDD, pp 336–345