2009 IEEE International Conference on Data Mining Workshops

Mining for Core Patterns in Stock Market Data

Jianfei Wu, Anne Denton, and Omar Elariss
Department of Computer Science and Operations Research
North Dakota State University, Fargo, ND
{jianfei.wu, anne.denton, omar.elariss}@ndsu.edu

Dianxiang Xu
National Center for the Protection of the Financial Infrastructure
Dakota State University, Madison, SD
[email protected]

978-0-7695-3902-7/09 $26.00 © 2009 IEEE   DOI 10.1109/ICDMW.2009.115

Abstract—We introduce an algorithm that uses stock sector information directly in conjunction with time series subsequences for mining core patterns within the sectors of stock market data. The core patterns within a sector are representative groups of stocks for the sector when it shows coherent behavior. Multiple core patterns may exist in a sector at the same time. In comparison with clustering algorithms, the core patterns are shown to be more stable as the stock price evolves. The proposed algorithm has only one free parameter, for which we provide an empirical choice. We demonstrate the effectiveness of the algorithm through a comparison with the DBScan clustering algorithm using data from the Standard and Poor's 500 Index.

Index Terms—core pattern; time series; density histogram; quasi-clique;

I. INTRODUCTION

In this paper, we introduce an algorithm for identifying core patterns within sectors of stock market data. We use the sector data directly in conjunction with the stock time series for finding the core patterns. In contrast to standard machine learning approaches, which use Boolean data in the form of a class label (supervised) or not at all (unsupervised), our algorithm inherently tests the relationship between the Boolean data and the time series, and returns patterns only when the relationship is significant. For the purpose of this paper, a core pattern is a representative group of stocks that show coherent behavior specific to their sector. Multiple core patterns may exist in one sector concurrently. Whether a sector shows coherent behavior is determined through a modified version of the density histogram technique introduced in [1]. The distribution of stocks within a sector is compared with that of the overall data set, and the statistical significance is calculated. Core patterns of a significant, and hence coherent, sector can then be extracted. A detailed discussion is given in Sect. III.

Fig. 1 illustrates the concepts with an example. The top two panels show the preprocessed stock open values in a time window. In the top left panel, the stocks in black highlight a sector of stocks. A randomly selected sample of the same size as the sector is highlighted in charcoal grey in the top right panel. Two histograms (one in black, the other in charcoal grey) in the bottom panel summarize the neighboring relationships between stocks in the sector and in the random sample, respectively. It can be observed that the stocks in the sector have more neighbors on average than those in the random sample. Furthermore, the stocks that are represented by the circled bins of the black histogram have more neighbors than any of the stocks in the random sample. They form a core pattern for this sector.

Fig. 1. The top two panels show the preprocessed stock open values in a time window. The top left panel highlights a sector of stocks. The top right panel shows a randomly selected sample which has the same size as the sector. Two histograms (one in black, the other in charcoal grey) in the bottom panel summarize the neighboring relationships between stocks in the sector and in the sample respectively. A core pattern is identified by comparing these two histograms.

In Fig. 1, only one random sample is shown. In the algorithm, which will be presented shortly, we use multiple random samples and construct a histogram that represents averages over those samples. This histogram acts as a reference for the sector and allows determining whether the sector differs significantly from what would be expected by random chance. The histograms quantify the number of neighbors for a subset of stocks, which can be viewed as a measure of the local density of that subset. We call them density histograms. We refer to the histogram for a sector as the observed density histogram, and to the one averaged over multiple random samples as the expected density histogram.

In this study, the stock data come from the Standard and Poor's 500 Index from 07/30/07 to 07/29/08. We follow the suggestion of [2] and use the first derivatives instead of raw values, i.e., we take the differences between successive trading days of each stock. The evaluation shows that our algorithm is more effective than a comparison algorithm at determining groups of sequences that show coherent behavior in a successive window. The remainder of the paper is organized


as follows: In Sect. II we introduce the related work. In Sect. III we present our algorithm in detail. In Sect. IV we systematically evaluate our algorithm and compare it with the density-based clustering algorithm DBScan. In Sect. V we conclude the paper.

II. RELATED WORK

Many previous studies discuss cluster stability analysis for static data sets [3], [4]. All of them separate the original data set into subsets and then evaluate the consistency between clusters found in different subsets. In [5], Ben-Hur et al. present a method for quantitatively assessing the presence of structure in a clustered data set. The method counts the number of similarities between the labels of the objects common to both subsamples. In [3], Lange et al. present an approach that evaluates the partitions produced by a clustering algorithm by checking whether similar structures are identified under repeated applications of the algorithm. Volkovich et al. present an approach that evaluates the goodness of a cluster by the similarity between the entire cluster and its core [4]. None of these studies discuss stability in a dynamic data set or time series data set.

How stable clusters are expected to be over consecutive time windows depends on the factors that influence the time dependence. For stock data, previous studies [6] have shown that stocks that belong to the same sector tend to move more coherently than stocks from different sectors. From a business point of view this is to be expected, since stocks in the same sector are likely to be influenced by the same external factors, and may also influence each other. This means that coherence that is specific to a sector may be an indication that the stocks are influenced by the same factors and may remain so over longer periods of time. We test this coherence using concepts developed in [1].
If the stocks show a statistically significant pattern with respect to their membership in a sector, we expect the core or cores of that pattern to be stable over time. The term core pattern has previously been used to describe a subset of a frequent itemset with a certain core ratio [7].

III. ALGORITHM

The proposed algorithm has two main steps. In each time window, the algorithm first identifies significant sectors by comparing two density histograms, i.e., for each sector the observed density histogram is compared with the expected density histogram that is constructed from random samples. In a second step, the algorithm extracts core patterns from those significant sectors. In this study, we use stream sliding window concepts. Two adjacent sliding windows are used, namely a training window and an evaluation window. The algorithm detects significant sectors in the training window, and builds core patterns for the significant ones. Evaluation windows are used for testing how stable the core patterns are. In the remainder of the paper, the term "time window" is used to represent either a training window or an evaluation window.

A. Outline of the algorithm

The algorithm iteratively executes the following steps until all training and evaluation windows are processed:

(a) Detecting significant sectors in a training window. Normalization: Stock data are first normalized within a training window, using a row-wise z-score normalization followed by a column-wise z-score normalization. More details will be given in Sect. III-B. Significance test: For each sector, an observed density histogram, together with an expected density histogram, is constructed. Then the statistical significance of the sector is calculated by applying a χ² goodness-of-fit test to the two density histograms. The sectors that are significant at the 0.1% level are used to form core patterns.

(b) Forming core patterns. Extract high-density stocks: By comparing the two density histograms, stocks that have more neighbors than expected are identified. Form core patterns: Core patterns are extracted from the high-density stocks of a sector that is considered to be significant. Details will be discussed in Sect. III-D.

(c) Extracting core patterns in the successive evaluation window. Core patterns in the successive evaluation window are extracted with the same procedures as in the training window.

B. Normalization

In each time window, we first perform a row-wise z-score normalization on each stock vector, then apply a column-wise z-score normalization along each dimension, as (1) and (2) show respectively:

V^{k,L}_{i,j} = ( V^{k,L}_{i,j} - mean(V^{k,L}_i) ) / std(V^{k,L}_i)    (1)

where mean(V^{k,L}_i) = (1/L) \sum_{j=1}^{L} V^{k,L}_{i,j} and std(V^{k,L}_i) = \sqrt{ (1/L) \sum_{j=1}^{L} ( V^{k,L}_{i,j} - mean(V^{k,L}_i) )^2 }.

V^{k,L}_{i,j} = ( V^{k,L}_{i,j} - (1/N) \sum_{i=1}^{N} V^{k,L}_{i,j} ) / \sqrt{ (1/N) \sum_{i=1}^{N} ( V^{k,L}_{i,j} - (1/N) \sum_{i=1}^{N} V^{k,L}_{i,j} )^2 }    (2)

N is the number of stock vectors in a time window; it is equal to the number of stocks in the data set. V^{k,L}_i is the i-th (i ∈ [1, N]) stock vector that starts at the k-th trading day and ends at the (k+L-1)-th trading day; it has dimensionality L. V^{k,L}_{i,j} (j ∈ [1, L]) is the j-th dimension value of the stock vector V^{k,L}_i.

The row-wise z-score normalization was chosen in accordance with the study in [2]. Together with the column-wise z-score normalization, this normalization approach ensures that random data would result in a distribution that closely
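The two normalization steps of (1) and (2) can be sketched in plain Python. This is a minimal illustration rather than the authors' code; the function names and the toy window are our own, and the sketch assumes no stock vector or dimension is constant (otherwise the standard deviation in the denominator would be zero):

```python
import math

def zscore(values):
    """Z-score a sequence: subtract the mean, divide by the population std.
    Assumes the values are not all identical (non-zero std)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]

def normalize_window(window):
    """Row-wise z-score per stock vector (1), then column-wise z-score
    per trading day (2). `window` is a list of N stock vectors of length L."""
    rows = [zscore(row) for row in window]       # (1) row-wise
    cols = [zscore(col) for col in zip(*rows)]   # (2) column-wise
    return [list(row) for row in zip(*cols)]     # transpose back to N x L

# Made-up example: 3 stocks over 4 trading days (first derivatives)
window = [[0.5, -0.2, 0.1, 0.3],
          [0.4, -0.1, 0.2, 0.2],
          [-0.6, 0.8, -0.3, 0.1]]
normalized = normalize_window(window)
```

After this two-step normalization, each dimension (column) has zero mean and unit variance across the stocks, which is what makes the comparison with the theoretical model below meaningful.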


resembles the normal distribution that is assumed in the theoretical model. All stock vectors are further projected to a unit hypersphere, as (3) shows, such that the cosine similarity between two stock vectors can be calculated efficiently:

V^{k,L}_{i,j} = V^{k,L}_{i,j} / |V^{k,L}_i|,  where |V^{k,L}_i| = \sqrt{ \sum_{j=1}^{L} (V^{k,L}_{i,j})^2 }    (3)

C. Significance test

The observed histogram is used to summarize the neighboring relationships between stocks in a sector. If a sector shows coherent behavior, the stocks within this sector have more neighbors than expected. The distribution of expected neighbors is quantified by calculating the neighbors of stocks in random samples. In other words, if stocks in a sector have more neighbors than those in random samples, we conclude that the sector is showing coherent behavior. We apply a χ² goodness-of-fit test to the observed and expected density histograms for each sector. If a sector is significant at the 0.1% level (p-value of 0.001), we consider that sector to be significant and showing coherent behavior.

In this study, cosine similarity is used to determine whether two stock vectors are neighbors of each other. If the cosine similarity of two stock vectors exceeds a threshold τ, they are considered neighbors. The cosine similarity between stock vector X and stock vector Y is defined as:

Sim(X, Y) = \sum_{i=1}^{L} X_i Y_i    (4)

X and Y are two vectors, both having dimensionality L. X_i and Y_i are the i-th dimension values of X and Y, respectively. To build an observed density histogram for a sector, the number of neighbors of each stock vector in the sector is calculated, and a contribution of 1 is added to the corresponding bin of the observed density histogram. To build an expected histogram, T random samples are first drawn. Then the contributions of these random samples are averaged to create the expected density histogram for the sector. In this study, T is set to 30.

Fig. 2 shows an example of the observed density histogram and the expected density histogram, together with the theoretical model, which will be presented shortly, for the energy sector from 07/31/07 to 08/06/07. Notice the difference between the two density histograms in Fig. 2. The mean of the observed density histogram is greater than that of the expected density histogram, which implies that the stock vectors in the energy sector have a higher density than the overall data set. After the two density histograms are constructed, the statistical significance of the sector is obtained using a χ² goodness-of-fit test. Sectors that are significant at the 0.1% level are further used to extract core patterns.

Fig. 2. Density histograms for the energy sector using random sampling and the theoretical model, for the time window from 07/31/07 to 08/06/07.
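The construction of the two density histograms can be sketched in plain Python. The helper names, the threshold value, and the toy vectors below are our own illustration, not the paper's code; after the projection of (3), the cosine similarity of (4) reduces to a plain dot product:

```python
import math
import random
from collections import Counter

def unit(v):
    """Project a vector onto the unit hypersphere, as in (3)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def count_neighbors(subset, tau):
    """Per-vector neighbor counts within a subset: two vectors are
    neighbors if their cosine similarity (4) exceeds tau."""
    u = [unit(v) for v in subset]
    dot = lambda x, y: sum(a * b for a, b in zip(x, y))
    return [sum(1 for j, y in enumerate(u) if j != i and dot(x, y) > tau)
            for i, x in enumerate(u)]

def density_histogram(subset, tau):
    """Observed density histogram: bin b counts the stocks with b neighbors."""
    return Counter(count_neighbors(subset, tau))

def expected_histogram(data, size, tau, T=30):
    """Expected density histogram: bin-wise average over T random samples
    of the same size as the sector."""
    total = Counter()
    for _ in range(T):
        total.update(density_histogram(random.sample(data, size), tau))
    return {b: c / T for b, c in total.items()}
```

A sector would then be tested by a χ² goodness-of-fit test on the observed and expected histograms (e.g., with a library routine such as `scipy.stats.chisquare`), keeping sectors significant at the 0.1% level.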

D. Forming core patterns

Definition 1: High-density stock vectors: the stock vectors that can be extracted from the tail of the observed density histogram until the aggregated density of the expected density histogram is greater than or equal to 1.

We extract these high-density stock vectors for each significant sector. Once the high-density stock vectors are extracted, core patterns can be formed among them. Quasi-clique detection [8], [9] is a state-of-the-art technique for mining dense subgraphs in a large graph. It has been used in many applications, including the functional prediction of uncharacterized genes [10] and the mining of highly correlated stocks [11]. We use a quasi-clique mining technique to extract core patterns from the high-density stock vectors: let G = (V, E) be a graph where V is a set of vertices representing the high-density stock vectors, and E is a set of edges representing the neighboring relationships among them. Unfortunately, mining γ-quasi-cliques in a graph is an NP-hard problem. Multiple definitions of quasi-cliques exist [8], [9]. In this study, we largely follow the definition of quasi-cliques in [9]. However, considering that the graph G is relatively small (remember that we mine quasi-cliques only among high-density stock vectors), we adopt an absolute cutoff for deg_G(v): we consider a threshold Th (an integer) for the number of edges that every vertex in a quasi-clique must possess. In other words, we try to find quasi-cliques in which every vertex has at least Th edges (in our study, Th equals 3). This modification greatly simplifies the quasi-clique mining procedure. Algorithm 1 depicts an algorithm for mining this type of quasi-clique; it has a worst-case time complexity of O(n²).
Algorithm 1 iteratively deletes the vertices that do not belong to a quasi-clique (i.e., the vertices that have fewer than Th edges), until only those that belong to a quasi-clique are left. Fig. 3 illustrates a simple example (Th equals 2 in this example). The original graph has 6 vertices. Before Algorithm 1 proceeds to the while loop, vertices 5 and 6 are deleted by the UpdateMatrix step. After one iteration of the while loop, vertex 4 is deleted, and a quasi-clique which includes vertices 1, 2 and 3 is found.
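This iterative pruning can be sketched in Python. The adjacency-set representation and the example graph below are our own (the example is in the spirit of the Fig. 3 illustration, with Th = 2), not the paper's implementation:

```python
def mine_quasi_clique(adj, th):
    """Iteratively delete vertices with fewer than th edges until every
    remaining vertex has at least th edges. `adj` maps each vertex to the
    set of its neighbors; the input graph is not modified."""
    adj = {v: set(ns) for v, ns in adj.items()}
    while True:
        weak = [v for v, ns in adj.items() if len(ns) < th]
        if not weak:
            return set(adj)          # remaining vertices form the quasi-clique(s)
        for v in weak:
            for u in adj[v]:
                adj[u].discard(v)    # remove edges into the deleted vertex
            del adj[v]

# Made-up example: vertices 1-3 form a triangle, 4 hangs off the
# triangle, and 5-6 form a pendant chain that gets pruned away.
graph = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3, 5}, 5: {4, 6}, 6: {5}}
core = mine_quasi_clique(graph, 2)   # only the triangle {1, 2, 3} survives
```

Each vertex is deleted at most once and each deletion touches only its incident edges, which matches the O(n²) worst-case bound stated above.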


Fig. 3. Illustration of the procedure for extracting quasi-cliques: the original graph, the graph after UpdateMatrix, and the graph after one iteration of the while loop.

Fig. 4. The relationship between τ and the hyper-surface area of the cap.

Algorithm 1: Mining Quasi-Clique

Data: graph_Matrix /* graph matrix in which an entry value of 1 indicates an edge, 0 otherwise */
Data: Th /* threshold for the minimum number of edges */
Result: quasi_cliques
candidate_vertices = findVertices(graph_Matrix, ≥, Th); /* find vertices which have at least Th edges */
graph_Matrix = updateMatrix(graph_Matrix, candidate_vertices); /* update the graph matrix by eliminating entries having value 1 that do not connect vertices in candidate_vertices */
while findVertices(graph_Matrix, <, Th) is not empty do
    candidate_vertices = findVertices(graph_Matrix, ≥, Th);
    graph_Matrix = updateMatrix(graph_Matrix, candidate_vertices);
end while
return the quasi-cliques formed by the vertices remaining in graph_Matrix;