Hindawi Publishing Corporation Mathematical Problems in Engineering Volume 2013, Article ID 179390, 10 pages http://dx.doi.org/10.1155/2013/179390

Research Article

A Novel System Anomaly Prediction System Based on Belief Markov Model and Ensemble Classification

Xiaozhen Zhou, Shanping Li, and Zhen Ye
College of Computer Science and Technology, Zhejiang University, Hangzhou 310012, China
Correspondence should be addressed to Xiaozhen Zhou; [email protected]

Received 17 March 2013; Revised 13 July 2013; Accepted 31 July 2013
Academic Editor: Yingwei Zhang

Copyright © 2013 Xiaozhen Zhou et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract: Computer systems are becoming extremely complex, and system anomalies dramatically influence their availability and usability. Online anomaly prediction is an important approach to managing imminent anomalies, and high accuracy relies on precise system monitoring data. However, precise monitoring data is not easily achievable because of widespread noise. In this paper, we present a method that integrates an improved Evidential Markov model with ensemble classification to predict anomalies in systems with noise. Traditional Markov models use explicit state boundaries to build the Markov chain and then predict different measurement metrics. A problem arises when the data is noisy, because even slight oscillation around the true value leads to very different predictions. The Evidential Markov chain method can deal with noisy data but is not suitable for complex data stream scenarios. The Belief Markov chain that we propose extends the Evidential Markov chain and can cope with noisy data streams. This study further applies ensemble classification to identify system anomalies based on the predicted metrics. Extensive experiments on anomaly data collected from 66 metrics in PlanetLab confirm that our approach achieves high prediction accuracy and time efficiency.

1. Introduction

As computer systems grow increasingly complicated, they become more vulnerable to various anomalies such as performance bottlenecks and service level objective (SLO) violations [1]. Computer systems therefore need to manage anomalies under time pressure, avoiding or minimizing system unavailability by monitoring the system continuously. Anomaly management methods can be classified into two categories: passive methods and proactive methods. Passive methods notify the system administrator only when errors or faults are detected. These approaches are appropriate for anomalies that can be easily measured and fixed in a simple system. In today's dynamic and complex computer systems, however, detecting some anomalies can be costly, which is unacceptable for continuously running applications. Proactive methods take preventive actions when anomalies are imminent; they are thus more appropriate for systems that must avert the impact of anomalies and achieve continuous operation. Nowadays, proactive methods

are preferred in both academic research and real-world applications. Previous work on system anomaly prediction can be categorized into data-driven, event-driven, and symptom-driven methods [2]. Event-driven methods directly analyze reported error or failure events and use error reports as input data to predict future system anomalies. Salfner and Malek use error reports as input and then perform a trend analysis to predict the occurrence of failures in a telecommunication system by determining the frequency of error occurrences [3]. Kiciman and Fox use a decision tree to identify faulty components in a J2EE application server by classifying whether requests are successful or not. These approaches rest on the basic assumption that anomaly-prone system behavior can be identified by characteristics of the anomaly [4]. This is why event-driven methods can predict only recurring anomalies that are already present in the error report. Data-driven methods learn from the temporal and spatial correlation of anomaly occurrences. They aim at recognizing

the relationship between upcoming failures and the occurrence of previous ones. Zhang and Ma use a modified KPCA method to diagnose anomalies in nonlinear processes [5]. In the nonlinear fault detection scenario, they utilize statistical analysis to improve the learning techniques [6], which is also applicable to large-scale fault diagnosis processes [7]. Liang et al. exploit these correlation characteristics of anomalies on IBM's BlueGene/L [8]. They find that the occurrence of a failure is strongly correlated with the time stamp and the location of other failures in a cluster environment. Zhang et al. propose a hybrid prediction technique based on model checking: an operational model is explored to check whether a desirable temporal property is satisfied or violated by the model itself [9]. In short, the basic idea of data-driven methods is that upcoming anomalies follow from the occurrence of previous ones. Symptom-driven methods analyze workload-related data such as input workload and memory workload in order to predict future system resource utilization. Tan and Gu [10] monitor a series of run-time metrics (CPU, memory, I/O usage, and network), use a discrete-time Markov chain to forecast future system metrics, and finally predict the system state with Naïve Bayesian classification. Luo et al. [11] build an autoregressive model from various parameters of an Apache web server to predict future system resource utilization; failures are predicted by detecting resource exhaustion. Efficient proactive anomaly management relies on system monitoring data, and the metric streams generated by monitoring infrastructures are continuously arriving and invariably noisy, so one big challenge is to provide accurate and efficient system anomaly prediction over noisy monitoring data streams. Recently, some approaches have been proposed for system anomaly prediction using the discrete-time Markov chain (DTMC) [10, 12].
However, these works do not consider that monitoring data may oscillate around the real value, as mentioned previously. DTMC, with its explicit state boundaries, yields significantly different predictions even when the metric oscillates only slightly around a boundary. Soubaras [13] proposed the Evidential Markov chain model, which extends DTMC to overcome the boundary problem caused by inaccurate monitoring metrics. The problem with the Evidential Markov chain is that, although it works excellently on static data, it cannot be applied directly to stream data: its fixed transition matrix is too restrictive for a continuously changing stream and brings in an enormous amount of computation. In this paper, we present the design and implementation of an approach to system anomaly prediction on noisy data streams. We first present an improved belief Markov chain (BMC) that fits the data stream scenario. We use a stream-based k-means clustering algorithm [14] to dynamically maintain and generate the Markov transition matrix. Only the information of micro-clusters is stored after clustering, and each newly arriving point either falls into one of the k groups or establishes a new one. Compared with the Evidential Markov chain method, in which all the data has to be stored and recalculated whenever a new point arrives to determine the Markov state,

our approach is time efficient and more feasible in a highly dynamic and complex system. We then employ the aggregate ensemble classification method [15] to determine whether the system will become anomalous in the future. Aggregate ensemble classification can address the incorrect anomaly label problem in a continuously running system. Extensive experiments on the PlanetLab dataset [16] under different parameter settings show that, on average, BMC achieves 14.8% smaller mean prediction error than the DTMC method used in various previous works [10, 12, 17, 18]. Our system anomaly prediction method (SAPredictor), which combines BMC and aggregate ensemble classification, achieves better prediction performance than other prediction models such as DTMC+Naïve Bayes, DTMC+KNN, and DTMC+C4.5. SAPredictor demonstrates the best performance on the three key criteria: 71.6% precision, 84.6% recall, and 77.5% F-measure. The main contributions of this paper are summarized as follows. (1) We propose the belief Markov chain, which improves the Evidential Markov model with a stream-based k-means clustering algorithm and makes it more suitable for system metric prediction on noisy data streams. (2) We integrate the belief Markov chain and aggregate ensemble classification as SAPredictor to predict system anomalies. (3) We validate the effectiveness of SAPredictor by extensive experiments on real system data. The rest of this paper is organized as follows. Section 2 introduces our SAPredictor method. Section 3 presents the experiments and analyzes the results. Finally, we conclude and give some future research directions in Section 4.

2. Approach Overview

In this section, we present the detailed design of SAPredictor. We first describe the system anomaly prediction problem and then propose our SAPredictor method, which is composed of two components: the belief Markov chain model and the aggregate ensemble classification model. The belief Markov chain model is used to predict the changing pattern of measurement metrics; aggregate ensemble classification is a supervised learning method that employs multiple classifiers and combines their predictions. In this work, we use a sliding window to partition the system metric stream into chunks and then train the belief Markov chain and aggregate ensemble learning models on the history. The future system status is predicted by feeding the forecast metrics into the classification model.

2.1. Problem Statement. For a system, we have a vector of observations at time t for the system metrics, Y_t = [y_{1,t}, y_{2,t}, ..., y_{n,t}]. Y_t contains n system metric time series at time t, namely y_{i,t} (i = 1, 2, ..., n), where i indexes the metric. We label Y_t at time t as normal (state 0) or anomalous (state 1) by monitoring the system state at time t. The system anomaly prediction problem we focus on in this paper is to determine



whether Y_t will fall into anomalous status in the next β steps, where β > 0 and β ∈ N. To solve this problem, we first forecast the future value y_{i,t+β} of each metric. Then, we train an ensemble classifier EC based on a sliding window [y_{i,t−d+1}, ..., y_{i,t}] of Y_t, where d is the size of the sliding window. Finally, we use EC to test on y_{i,t+β} (i = 1, 2, ..., n) and predict the state label of Y_{t+β}.

2.2. SAPredictor Approach. Figure 1 describes the SAPredictor system anomaly prediction approach. N measurement metrics (e.g., CPU, memory, I/O usage, and network) are collected from the system continuously. The collected metric streams are partitioned into chunks by a sliding window. The current and history chunks are used to train the belief Markov chain model and the aggregate ensemble learning model. The future system metrics are then predicted by the belief Markov chain model, and feeding these metrics into the aggregate ensemble classification model tells us whether the system will fall into anomaly in the future. The belief Markov chain and aggregate ensemble classification are presented in the following subsections.

2.2.1. System Metrics Value Prediction. In this section, we first explain why the Evidential Markov chain, which is based on Dempster-Shafer theory [19], is preferred over the discrete-time Markov chain for system anomaly prediction on noisy data, and then we explain the advantages of our belief Markov chain over the Evidential Markov method in a data stream environment. When we build a discrete-time Markov chain model, all the data must be divided into discrete states. Traditional discretization techniques used with discrete-time Markov chains include equal-width and equal-depth binning. Both techniques generate states with explicit boundaries from all the data. However, the monitored system metrics are usually imprecise due to system noise and measurement error.
Thus, a discrete-time Markov chain that uses explicit boundaries to divide the states can generate highly different prediction results even when the initial values are almost the same. The Evidential Markov model [13] improves on this by being capable of coping with noisy data. The following is an example of the explicit boundary problem in the discrete-time Markov chain. Suppose that a metric ranges over [0, 150] and we use the equal-width approach to discretize the range into three bins, namely [0, 50), [50, 100), and [100, 150]. Let S1, S2, and S3 denote the states when the metric is in [0, 50), [50, 100), and [100, 150], respectively. The transition matrix for the metric is a 3×3 matrix:

        (a11 a12 a13)   (0.3 0.2 0.5)
    M = (a21 a22 a23) = (0.1 0.2 0.7) .   (1)
        (a31 a32 a33)   (0.1 0.5 0.4)

Here, each element a_ij of matrix M denotes the probability of a transition from state i to state j. When we use the discrete-time Markov chain to predict a future value, a vector π_t = [π_{t,S1}, π_{t,S2}, π_{t,S3}] denotes the probability of the metric being in each state at time t. If the initial value is 99, which lies in state S2, then the corresponding probability vector

is πœ‹0 = [0, 1, 0]. We can calculate the probability vector πœ‹1 after one time unit as 0.3 0.2 0.5 πœ‹1 = πœ‹0 βˆ— 𝑀 = [0, 1, 0] βˆ— (0.1 0.2 0.7) (2) 0.1 0.5 0.4 = [0.1, 0.2, 0.7] . Here, the probability vector πœ‹1 represents that the initial value will transfer into 𝑆3 most likely, and the predicted value after one step will be 125 = (100 + 150)/2 as the mean of state 𝑆3 . However, if the initial value turns to be 101, then the vector πœ‹0σΈ€  will be [0, 0, 1]. By applying (2) again, it turns out that the prediction value will stay in state 𝑆2 with the predicted value of 75 = (50 + 100)/2 in the next step: 0.3 0.2 0.5 πœ‹1σΈ€  = πœ‹0σΈ€  βˆ— 𝑀 = [0, 0, 1] βˆ— (0.1 0.2 0.7) 0.1 0.5 0.4

(3)

= [0.1, 0.5, 0.4] . Note that there is only a slight difference between 99 and 101 in the initial value, yet the forecasted value after one step is in large difference from 75 to 125. As the example shows, discrete-time Markov chain uses explicit state boundaries, and it will have very different prediction value if the original metric is around the state boundary. To solve this problem, we propose belief Markov chain based on the Dempster-Shafer theory. The DempsterShafer theory is an inaccurate inference theory. It can handle the uncertainty caused by unknown prior knowledge and extend the basic event space to its power set. The detailed definitions for Dempster-Shafer [19] are as follows. Definition 1 (frame of discernment). Suppose that 𝑃 is the exhaustive set of random variable 𝑋, so 𝑃 = {π‘₯1 , π‘₯2 , . . . π‘₯𝑗 } and the elements in 𝑃 are mutually exclusive. Then, the set of all possible subsets of 𝑃 is called a frame of discernment 𝐴: 𝐴 = (πœ™, {π‘₯1 } , {π‘₯2 } , . . . , {π‘₯𝑗 } , {π‘₯1 , π‘₯2 } , {π‘₯1 , π‘₯𝑗 } , . . . , {π‘₯2 , π‘₯3 } , . . . , {π‘₯2 , π‘₯𝑗 } , . . . , {π‘₯1 , π‘₯2 , . . . , π‘₯𝑗 })

(4)

= (𝐴 0 , 𝐴 1 , 𝐴 2 , . . . , 𝐴 𝑗 ) . We use 𝐴 π‘˜ (π‘˜ ∈ 𝑁, π‘˜ ∈ [0, 𝑗]) to represent the subset in power set of 𝑃 which contains π‘˜ elements. Definition 2 (mass function). Have 𝑃 and 𝐴, for every subset of 𝑃; if the following statements satisfy, then the function 𝐹 is called the mass function on 𝐴: 𝐹 (πœ™) = 0, βˆ‘ 𝐹 (𝐴 π‘˜ ) = 1, 𝐹 (𝐴 π‘˜ ) ∈ [0, 1] ,

(5) 𝑃

π΄π‘˜ ∈ 2 .

Definition 3 (transferable belief model). Suppose that we have discernment frame 𝐴 and mass function 𝐹 on 𝑃. Then, the probability for each random variable 𝑋 in 𝑃 can be calculated by transferable belief model: 𝑇 (π‘₯) = βˆ‘

𝐹 (𝐴) . |𝐴|

(6)
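The boundary sensitivity illustrated by equations (2) and (3) is easy to reproduce numerically. The following sketch (plain NumPy; the function and variable names are ours, not from the paper) maps a value to its equal-width bin, advances one DTMC step with the matrix M from (1), and returns the mean of the most likely next state:

```python
import numpy as np

# Transition matrix M from equation (1); rows are current states S1..S3.
M = np.array([[0.3, 0.2, 0.5],
              [0.1, 0.2, 0.7],
              [0.1, 0.5, 0.4]])

# Means of the equal-width bins [0, 50), [50, 100), [100, 150].
state_means = np.array([25.0, 75.0, 125.0])

def dtmc_predict(value):
    """One-step DTMC prediction: map value to a state via explicit
    boundaries, advance one step, return the mean of the most likely
    next state."""
    state = min(int(value // 50), 2)   # explicit boundaries at 50 and 100
    pi0 = np.zeros(3)
    pi0[state] = 1.0
    pi1 = pi0 @ M                      # probability vector after one step
    return state_means[np.argmax(pi1)]

print(dtmc_predict(99))    # state S2 -> most likely S3 -> 125.0
print(dtmc_predict(101))   # state S3 -> most likely S2 -> 75.0
```

The two near-identical inputs land in different bins and therefore receive very different forecasts, which is exactly the problem the belief Markov chain addresses.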



[Figure 1 omitted: N metric streams collected from servers feed the belief Markov chain, which predicts future metric values; the predicted metrics are passed to the ensemble classification model, which predicts the anomaly result.]

Figure 1: System anomaly prediction approach.

[Figure 2 omitted: a metric's state space S_1, S_2, S_3, ..., S_{N−1}, S_N with cross-state regions S_{1,2}, S_{2,3}, ..., S_{n−1,n} between adjacent states.]

Figure 2: State-space with cross-state region.

The subsets of A include both single-event sets {x_i} and multiple-event combinations {x_i, x_j, ..., x_k}. This is why we need the transferable belief model to calculate the probability of a single random variable. Figure 2 illustrates a metric divided into n states, S = {S_1, S_2, S_3, ..., S_n}, where each pair of adjacent states has a state S_{i,i+1} meaning that the value lies in the cross-region between state i and state i + 1. When the BMC model predicts, the initial metric may belong entirely to a single state or to the cross-region of two adjacent states. So the discernment frame of this problem can be simplified to

    A_BMC = (φ, {x_1}, {x_2}, ..., {x_n}, {x_1, x_2}, {x_2, x_3}, ..., {x_{n−1}, x_n}) .   (7)

Then, we declare the mass function that assigns probability to each subset in A_BMC. Any function satisfying (5) can be used as the mass function. The probability of each event in P can be calculated as

    p(x_i) = F(x_i) + F(x_i, x_{i+1})/2,                        i = 1,
    p(x_i) = F(x_{i−1}, x_i)/2 + F(x_i) + F(x_i, x_{i+1})/2,    1 < i < n,   (8)
    p(x_i) = F(x_{i−1}, x_i)/2 + F(x_i),                        i = n.
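Equation (8) is the transferable-belief redistribution for this frame: the mass of each cross-region is split evenly between its two member states. A minimal sketch, with hypothetical names, assuming the masses are given as singleton masses F[i] and cross-region masses F_cross[i]:

```python
def pignistic(F, F_cross):
    """Compute p(x_i) per equation (8).
    F[i]       -- mass of singleton state {x_i}, i = 0..n-1
    F_cross[i] -- mass of cross-region {x_i, x_{i+1}}, i = 0..n-2
    Each cross-region mass is split evenly between its two states."""
    n = len(F)
    p = []
    for i in range(n):
        val = F[i]
        if i > 0:
            val += F_cross[i - 1] / 2   # share from {x_{i-1}, x_i}
        if i < n - 1:
            val += F_cross[i] / 2       # share from {x_i, x_{i+1}}
        p.append(val)
    return p

# Example: three states with some belief in the two cross-regions.
# The redistributed probabilities still sum to 1.
print(pignistic([0.4, 0.2, 0.2], [0.1, 0.1]))
```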

Finally, we infer the transition matrix describing the probabilities of moving from one state to the others, as we did in the discrete-time Markov chain. Each element p_ij of the transition matrix M_BMC denotes the probability that the current state S_i moves to state S_j. It can be calculated by

    p_ij = Σ_{t=1}^{n} (p(i)_t * p(j)_{t+1}) / (Σ_{j ∈ A_BMC} Σ_{t=1}^{n} (p(i)_t * p(j)_{t+1})) .   (9)
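Equation (9) can be read as accumulating expected transition counts from consecutive soft state assignments and then row-normalizing. A sketch under that reading (names are ours, not the paper's):

```python
import numpy as np

def soft_transition_matrix(P):
    """Estimate a transition matrix from a sequence of soft state
    probability vectors P (shape T x K): the expected count of an
    i->j transition at time t is p_t(i) * p_{t+1}(j), and each row
    is normalized to sum to 1, mirroring equation (9)."""
    P = np.asarray(P, dtype=float)
    K = P.shape[1]
    counts = np.zeros((K, K))
    for t in range(len(P) - 1):
        counts += np.outer(P[t], P[t + 1])   # expected transition counts
    row_sums = counts.sum(axis=1, keepdims=True)
    return counts / np.where(row_sums == 0, 1, row_sums)

# Soft assignments over 2 states; the value drifts from state 0 to 1.
P = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8]]
M = soft_transition_matrix(P)
print(M.round(3))   # each row sums to 1
```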

However, the Evidential Markov chain needs to store all the data and recalculate the Markov state whenever new data arrives; this is neither time efficient nor feasible for systems that need real-time response, especially data stream applications. We therefore improve the Evidential Markov chain with a stream-based k-means clustering method. Arriving data points are mapped onto k states by a data stream clustering algorithm, where each cluster represents a Markov state. For each cluster i representing state s_i, we store a transition count vector c_i; all transition counts together form a K×K transition count matrix C, where K is the number of clusters. Because we use stream clustering, there is a list of cluster operations: adding a new data point to an existing cluster, creating a new cluster, deleting clusters, merging clusters, and splitting clusters. We use the Jaccard measure [20] as a dissimilarity threshold to detect clusters. Thus, the k states adaptively change to fit the arriving data, which
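The stream-side bookkeeping described above can be sketched as follows: each arriving point is assigned to the nearest micro-cluster (or opens a new cluster when it exceeds a distance threshold), and only the cluster centers plus a transition-count matrix are kept. The plain Euclidean threshold and the center-update rule are illustrative simplifications; the paper uses a Jaccard-based dissimilarity [20] and the stream clustering algorithm of [14].

```python
class StreamMarkov:
    """Maintain micro-cluster centers (Markov states) and transition
    counts online, without storing the raw stream."""
    def __init__(self, threshold):
        self.threshold = threshold
        self.centers = []   # one center per state
        self.counts = {}    # (i, j) -> transition count
        self.prev = None    # state of the previous point

    def _assign(self, x):
        # Nearest existing center within the threshold, else a new cluster.
        if self.centers:
            i = min(range(len(self.centers)),
                    key=lambda k: abs(self.centers[k] - x))
            if abs(self.centers[i] - x) <= self.threshold:
                # Nudge the center toward the new point.
                self.centers[i] += 0.1 * (x - self.centers[i])
                return i
        self.centers.append(x)
        return len(self.centers) - 1

    def update(self, x):
        state = self._assign(x)
        if self.prev is not None:
            key = (self.prev, state)
            self.counts[key] = self.counts.get(key, 0) + 1
        self.prev = state

sm = StreamMarkov(threshold=10.0)
for v in [12, 14, 55, 57, 13, 58]:
    sm.update(v)
print(len(sm.centers))   # two states discovered
print(sm.counts)         # transition counts between them
```

Row-normalizing the counts yields the current transition matrix at any moment, so no raw data ever needs to be replayed.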

is another advantage over the Evidential Markov chain method.

2.2.2. System Status Classification. In this section, we first explain why we choose ensemble classification to forecast the system status and then how the aggregate ensemble method addresses the concept drift and noisy data problems in data streams. Tan and Gu [10] apply a single statistical classifier to a static dataset. Though this approach works well on static data, it is not applicable in a dynamic environment where system logs are generated continuously and even the underlying data-generating mechanism and causes of anomalies constantly change. To capture time-evolving anomaly patterns, many solutions have been proposed for building classification models from data streams. One simple model is online incremental learning [11, 21]. Incremental learning methods maintain a single learning model for the entire data stream and update the model continuously as new data arrives. Ensemble classification instead regards the data stream as several separate data chunks, trains classifiers on these chunks using different learning algorithms, and builds the ensemble classifier by voting among these base classifiers. Although these models have proved efficient and accurate, they assume that the data stream being learned is of high quality, without considering data errors. In real-world applications, however, such as system monitoring and sensor network data streams, erroneous data values are common. As a result, the traditional online incremental model is likely to lose accuracy on a data stream containing erroneous values. Ensemble learning is a supervised method that employs multiple learners and combines their predictions.
Unlike incremental learning, ensemble learning trains a number of models and produces the final prediction by classifier voting. Because the final prediction rests on a number of base classifiers, ensemble learning can adaptively and rapidly address the concept drift and erroneous data problems in data streams. For these reasons, we choose ensemble classification. In summary, ensembles of classifiers fall into two categories: horizontal and vertical ensemble classification [15]. Horizontal ensembles build classifiers on several buffered chunks, while vertical ensembles build classifiers with different learning algorithms on the current chunk. The vertical ensemble is shown in Figure 3. It uses r different classification algorithms (here we simply set r = 3) to build classifiers on the current chunk and combines their results into an ensemble classification model. The vertical ensemble uses only the current chunk, and its advantage is that using different algorithms to build the classifier model decreases the bias error among the classifiers. However, the vertical ensemble assumes that the data stream is error free. As discussed before, real-world data streams always contain errors, so if the current chunk consists mostly of noisy data, the result

Table 1: Monitoring metrics used for anomaly prediction.

Monitored metrics:
LOAD1      | LOAD5      | AVAILCPU
UPTIME     | FREEMEM    | FREEDISK
DISKUSAGE  | DISKSIZE   | MYFREEDISK
CPUUSE     | RWFS       | LOAD11
LOAD12     | LOAD13     | PUKPUKS1-10
NUMSLICE   | LIVESLICE  | VMSTAT1-17
SERVTEST1  | SERVTEST2  | BURP
CPUHOG     | MEMHOG     | TXHOG
RXHOG      | PROCHOG    | TXRATE
RXRATE     | PURKS1-10  | MEMINFO1-3

may have severe performance deterioration. To address this problem, the horizontal ensemble, which builds classifiers from multiple history chunks, is employed. The horizontal ensemble is shown in Figure 4. The data stream is separated into n consecutive chunks (e.g., D1 and D2 are history chunks and D3 is the current chunk), and the aim of ensemble learning is to build classifiers on these n chunks and predict data in the yet-to-arrive chunk (D4 in the figure). The advantage of the horizontal structure is that it can handle noisy data in the stream, because the prediction for a newly arriving data chunk depends on the average over different chunks; even if noise deteriorates some chunks, the ensemble can still produce relatively accurate predictions. The disadvantage of the horizontal ensemble is that the data stream changes continuously, and the information in previous chunks may become invalid, so using these old-concept classifiers will not improve the overall prediction result. Because of the limitations of both horizontal and vertical ensembles, in this paper we use a novel ensemble classification that applies k different learning algorithms to p buffered chunks, training k-by-p classifiers, as Figure 5 shows. By building an aggregate ensemble, it is capable of handling a real-world data stream containing both concept drift and data errors.
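The aggregate ensemble can be sketched with scikit-learn (assumed available): r = 3 learners are trained on each of p = 3 buffered chunks, and the resulting r-by-p classifiers vote on the yet-to-arrive chunk. The synthetic chunks and the particular learners here are illustrative, not the paper's exact setup.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)

def make_chunk(n=200):
    # Synthetic chunk: label is anomalous when both metrics are high.
    X = rng.normal(size=(n, 2))
    y = ((X[:, 0] + X[:, 1]) > 1).astype(int)
    return X, y

chunks = [make_chunk() for _ in range(3)]                            # p = 3
learners = [DecisionTreeClassifier, GaussianNB, LogisticRegression]  # r = 3

# Train the r-by-p classifier matrix: every learner on every chunk.
ensemble = [cls().fit(X, y) for X, y in chunks for cls in learners]

X_new, y_new = make_chunk()   # stands in for the yet-to-arrive chunk
votes = np.mean([clf.predict(X_new) for clf in ensemble], axis=0)
y_pred = (votes >= 0.5).astype(int)   # majority vote over all 9 classifiers
print("accuracy:", (y_pred == y_new).mean())
```

Averaging votes across both chunks and algorithms is what lets the aggregate ensemble tolerate a single noise-corrupted chunk or a single badly biased learner.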

3. Experiment and Result

3.1. Experiment Setup. We evaluate our SAPredictor method on anomaly data collected from a realistic system, PlanetLab. PlanetLab [22] is a global research network that supports the development of new network services. The PlanetLab dataset [16] used in this paper contains 66 system-level metrics such as CPU load, free memory, and disk usage, shown in Table 1. The sampling interval is 10 seconds. There are 50162 instances, among which 8700 are labeled as anomalies. Our experiments were conducted on a 2.6-GHz Intel Dual-Core E5300 with 4 GB of memory running Ubuntu 10.4. We use sliding-window validation (window size = 1000 instances) because in a real system the labeled instances are sorted in chronological order of collection time. The reason



[Figure 3 omitted: classifiers C1, C2, C3 are trained on the current chunk D3 and combined into an ensemble that predicts the future chunk D4.]

Figure 3: Vertical ensemble classification.

[Figure 4 omitted: classifiers C1, C2, C3 are trained on chunks D1, D2, D3, respectively, and combined into an ensemble that predicts the future chunk D4.]

Figure 4: Horizontal ensemble classification.

[Figure 5 omitted: classifiers C1, C2, C3 are each trained on every chunk D1, D2, D3, and the resulting 3-by-3 classifiers form an ensemble that predicts the future chunk D4.]

Figure 5: Aggregate ensemble classification.



Table 2: Comparison between BMC and DTMC at 10% noise.

Time units | MPE BMC | MPE DTMC | Improvement
1          | 14.04%  | 16.60%   | 15.42%
2          | 18.78%  | 20.96%   | 10.40%
3          | 22.98%  | 24.35%   | 5.63%
4          | 26.75%  | 27.83%   | 3.88%
5          | 27.57%  | 28.50%   | 3.26%

that we do not use cross-validation is that it randomly divides the dataset into pieces without considering the chronological order; under such circumstances, current data could be used to predict past data, which does not make sense. Thus, sliding-window validation is more appropriate for our experiments.

3.2. The Metrics Prediction Accuracy. Short-term predictions help prevent potential disasters and limit the damage caused by system anomalies. Predicting the near-term future is usually more practical and more successful than long-term prediction [5], so in our experiment we assess system state prediction in the short term. We choose the k-means discretization technique to create the state boundaries. The reason is that when the data is divided into k clusters and the midpoint of each cluster is used as a state, each state covers more adjacent data than the states produced by equal-width or equal-depth discretization. We set the number of bins to 5, 10, 15, ..., 30 and evaluate the quality of metric prediction by the mean prediction error (MPE), as in the study by Tan and Gu [10]:

    MPE = (Σ_D Σ_{m=1}^{|m|} (|x_m − x̂_m| / x_m)) / (|m| * |D|) .   (10)

D is the test dataset, and |D| is the number of instances in D. |m| is the number of system metrics, x_m is the actual value of metric m, and x̂_m is the predicted value of metric m, represented by the mean value of the samples in its bin. The lower the MPE, the more accurate the predictor. We assess the MPE in the near-term future (1-5 time units ahead) for different bin sizes (5, 10, 15, 20, 25, and 30) on the PlanetLab dataset. Figure 6 shows the MPE on PlanetLab for time units 1-5 with a bin size of 20. From these figures, we make two observations: (1) BMC achieves a smaller prediction error than DTMC from time units 1 to 5; one-step prediction shows the most notable advantage, which decreases slightly as time goes on, meaning that our algorithm fits better when the forecast period is shorter; (2) both BMC and DTMC lose prediction accuracy as time goes by, which indicates that predicting anomalies over a longer term is more challenging. Figure 7 shows the MPE on PlanetLab for different bin sizes (5, 10, 15, 20, 25, and 30) at time unit one. Both methods have a higher MPE with fewer bins. The reason is that fewer states tend to group a larger range of data into each bin; since the mean of the bin is used as the prediction value, the
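Equation (10) is just the relative error averaged over every metric of every test instance; a minimal sketch with hypothetical names:

```python
def mean_prediction_error(actual, predicted):
    """MPE per equation (10): average of |x - x_hat| / x over every
    metric of every instance. `actual` and `predicted` are lists of
    instances, each a list of metric values."""
    total, count = 0.0, 0
    for xs, xhats in zip(actual, predicted):
        for x, xhat in zip(xs, xhats):
            total += abs(x - xhat) / x
            count += 1
    return total / count

# Two instances, two metrics each: relative errors 0.1, 0.1, 0.05, 0.
actual    = [[100.0, 50.0], [200.0, 40.0]]
predicted = [[90.0, 55.0], [210.0, 40.0]]
print(mean_prediction_error(actual, predicted))
```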

Table 3: Comparison between BMC and DTMC at 20% noise.

Time units | MPE BMC | MPE DTMC | Improvement
1          | 14.67%  | 17.42%   | 15.79%
2          | 19.61%  | 21.85%   | 10.25%
3          | 23.61%  | 25.23%   | 6.42%
4          | 27.66%  | 28.73%   | 3.72%
5          | 28.42%  | 29.42%   | 3.40%

Table 4: Comparison between BMC and DTMC at 30% noise.

Time units | MPE BMC | MPE DTMC | Improvement
1          | 15.34%  | 18.27%   | 16.04%
2          | 20.47%  | 22.77%   | 10.10%
3          | 24.36%  | 26.14%   | 6.81%
4          | 27.07%  | 28.67%   | 5.58%
5          | 29.29%  | 30.38%   | 3.59%

Table 5: Comparison between BMC and DTMC at 40% noise.

Time units | MPE BMC | MPE DTMC | Improvement
1          | 15.89%  | 19.15%   | 17.02%
2          | 21.43%  | 23.76%   | 9.81%
3          | 25.14%  | 27.11%   | 7.27%
4          | 27.93%  | 29.69%   | 5.93%
5          | 30.22%  | 31.43%   | 3.85%

Table 6: Comparison between BMC and DTMC at 50% noise.

Time units | MPE BMC | MPE DTMC | Improvement
1          | 16.38%  | 20.07%   | 18.36%
2          | 22.26%  | 24.79%   | 10.21%
3          | 25.90%  | 28.12%   | 7.89%
4          | 28.83%  | 30.75%   | 6.24%
5          | 31.23%  | 32.48%   | 3.85%

gap between the prediction value and the real value is enlarged. In Tables 2, 3, 4, 5, and 6, we compare the mean prediction error of DTMC and BMC under different noise percentages. A noise percentage n means that the monitored value at state i oscillates around the true value y_i within the range [y_{i−1} + (y_i − y_{i−1}) * n%, y_{i+1} − (y_{i+1} − y_i) * n%], as illustrated in Figure 2, where y_{i−1} is the value of the previous state and y_{i+1} is the value of the next state. We choose n from 10 to 50 in our experiment because state i would be falsely recognized as state i − 1 or state i + 1 if n were larger than 50%; thus, in this paper, we set the noise percentage from 10% to 50%. The mean prediction error results in Tables 2-6 show that our proposed BMC method has better prediction quality than DTMC. Both BMC and DTMC have the smallest prediction error in one-step prediction, and the error magnifies as the prediction steps

become larger. BMC has its most notable advantage over DTMC in one-step prediction, and the advantage decreases as the number of steps grows. Based on the above observations, we conclude that our algorithm outperforms DTMC across all noise ranges and is especially well suited to forecasting imminent anomalies.

Table 7: The four cases that a prediction belongs to.

Cases                   Actually abnormal                      Actually normal
Predicted as abnormal   True positive (TP) (true warning)      False positive (FP) (incorrect warning)
Predicted as normal     False negative (FN) (missing warning)  True negative (TN) (correctly no warning)

Figure 6: Prediction accuracy when bin size is 20 (mean prediction error (%) of BMC and DTMC over 1-5 time units).

Figure 7: Prediction accuracy when time unit is 1 (mean prediction error (%) of BMC and DTMC for 5-30 bins).

3.3. Ensemble Classification with Data Stream. In this experiment, we compare three ensemble classification methods and other classification algorithms, such as decision tree and logistic regression. For ease of comparison, we first summarize the assessment criteria for the different classification methods. Suppose that a data stream has n data chunks. We aim to build a classifier that predicts the label of every instance in the yet-to-come chunk. To simulate different types of data streams, we follow the noise-selection approach used in [21]: we randomly select 20% of the chunks from each dataset as noise chunks, arbitrarily assign each instance in them a class label different from its original one, and put these noisy chunks back into the data stream. The performance of system anomaly prediction is evaluated by three criteria according to [20]: precision, recall, and F-measure. Table 7 helps explain their definitions, where state 0 denotes normal and state 1 denotes anomaly. These three criteria are defined as

    precision = TP / (TP + FP),
    recall    = TP / (TP + FN),                                  (11)
    F-measure = 2 * precision * recall / (precision + recall).
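In code, the criteria in (11) follow directly from the four cases of Table 7. A minimal sketch (the function name is ours, not from the paper):

```python
def prf(tp, fp, fn):
    """Precision, recall, and F-measure from the confusion counts of Table 7."""
    precision = tp / (tp + fp)   # fraction of warnings that were true warnings
    recall = tp / (tp + fn)      # fraction of real anomalies that were warned
    f_measure = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f_measure

# e.g. 80 true warnings, 20 incorrect warnings, 16 missing warnings
p, r, f = prf(80, 20, 16)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.8 0.833 0.816
```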

We define precision as the proportion of successful predictions for each predicted state in chunk D_{n+1}, recall as the probability that each real state in chunk D_{n+1} is successfully predicted, and F-measure as the harmonic mean of precision and recall. Repeating the above process n - 1 times yields the average precision, recall, and F-measure. Ideally, a good classifier for a noisy data stream should have high average precision, high average recall, and high average F-measure.

Table 8 shows the classification quality of the different classifiers. In this experiment, we choose C4.5, Logistic, and Naïve Bayes as our base classifiers, and we set the sliding window size to 1000 instances. Columns 2 to 4 in Table 8 are the results of single classifiers: we train the model on D_1 and test it on D_2, then train on D_2 and test on D_3, and so on. HTree, HNB, and HLogist are three horizontal ensemble classification methods that use both historical and current chunks to train the classifier: we first train on D_1 and test on D_2, then train on both D_1 and D_2 and test on D_3, repeating this process until the end of the data stream. VerEn is the vertical ensemble


Table 8: Classification quality of different classifiers.

            Tree     NB       Logistic  HTree    HNB      HLogist  VerEn    AggEn
Precision   73.05%   72.36%   70.23%    70.04%   69.68%   69.69%   72.38%   73.25%
Recall      63.04%   69.48%   55.34%    72.18%   71.05%   51.93%   66.68%   75.08%
F-Measure   61.21%   68.85%   58.33%    68.73%   68.10%   51.76%   65.41%   72.70%

Table 9: Anomaly prediction cost.

                   SAPredictor
Training time      452 ms
Prediction time    205 µs

Table 10: Prediction results for different methods.

Methods        Precision   Recall   F-Measure
KNN + DTMC     72.7%       50.5%    57.3%
C4.5 + DTMC    74.9%       73.9%    74.4%
NB + DTMC      82.7%       26.6%    25.0%
TAN + DTMC     74.1%       80.0%    76.6%
SAPredictor    71.6%       84.6%    77.5%

model, which trains all three base classifiers on the current chunk and tests them on the next chunk. The last column is the aggregate ensemble (AggEn), which builds all base classifiers on both historical and current chunks. The results in Table 8 show that AggEn performs best on all three measures, the single Naïve Bayes is second best, and VerEn is third, while HLogist and Logistic rank last.

3.4. Anomaly Prediction System Cost. We have evaluated the overhead of our anomaly prediction model. Table 9 shows the average training time and prediction time. The training time includes the time to build the BMC model and to induce the anomaly classifier; the prediction time includes the time to retrieve state transition probabilities and generate the classification result for a single data record. These results are averaged over 100 experiment runs. The total training time is within a few hundred milliseconds, and a prediction takes about 200 microseconds. These overhead measurements show that our approach is practical for online prediction of system anomalies.

3.5. SAPredictor Compared with Other Models. In this section, we compare the prediction quality of SAPredictor against DTMC combined with other state-of-the-art classifiers from the machine learning literature, namely k-Nearest Neighbor (KNN), C4.5, Naïve Bayes, and Tree-Augmented Naïve Bayesian (TAN) networks. We compare two kinds of prediction models: our SAPredictor, which applies ensemble classification to the metrics predicted by BMC, and DTMC combined with each of the single classifiers above. Performance is evaluated by the same criteria used in Section 3.3: precision, recall, and F-measure.
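The chunk-wise training schemes compared in Table 8 (single classifier, horizontal, vertical, and aggregate ensembles) can be sketched roughly as follows. The stub learner and all names here are ours; it stands in for real base classifiers such as C4.5, Logistic, or Naïve Bayes, since only the way chunks and learners are combined matters for the sketch:

```python
from collections import Counter

# Stub base learner: "trains" by memorizing the majority label of its chunks.
def train_majority(chunks):
    labels = [y for chunk in chunks for _, y in chunk]
    major = Counter(labels).most_common(1)[0][0]
    return lambda x: major

def vote(models, x):
    # majority vote over the ensemble members' predictions
    return Counter(m(x) for m in models).most_common(1)[0][0]

def predict_stream(chunks, scheme, learners=(train_majority,)):
    """Predict the instances of each chunk D_{i+1} under the given scheme."""
    preds = []
    for i in range(len(chunks) - 1):
        if scheme == "single":        # one learner, current chunk only
            models = [learners[0]([chunks[i]])]
        elif scheme == "horizontal":  # one learner, all chunks seen so far
            models = [learners[0](chunks[:i + 1])]
        elif scheme == "vertical":    # every base learner, current chunk
            models = [fit([chunks[i]]) for fit in learners]
        else:                         # "aggregate": every learner, all chunks
            models = [fit(chunks[:i + 1]) for fit in learners]
        preds.append([vote(models, x) for x, _ in chunks[i + 1]])
    return preds
```

With real learners, the aggregate scheme builds every base classifier on the historical and current chunks, which is what AggEn does in Table 8.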


Table 10 presents the experimental results of SAPredictor and of DTMC combined with the other classifiers on the PlanetLab dataset. We notice that Naïve Bayes and KNN have the worst performance: their recall scores are 26.6% and 50.5%, and their F-measure scores are 25.0% and 57.3%, respectively. SAPredictor receives the highest recall and F-measure on this dataset, 84.6% and 77.5%. Thus, our SAPredictor is considerably more accurate than the other models.
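For completeness, the noise-selection protocol used in the evaluation of Section 3.3 (20% of the chunks receive arbitrary wrong labels before being put back into the stream) can be sketched as follows; the function and parameter names are ours:

```python
import random

def inject_noise_chunks(chunks, labels, frac=0.2, seed=7):
    """Turn a fraction of the chunks into noise chunks: every instance in a
    selected chunk gets a class label different from its original one."""
    rng = random.Random(seed)
    out = [list(chunk) for chunk in chunks]        # copy; keep input intact
    noisy = rng.sample(range(len(out)), max(1, int(frac * len(out))))
    for i in noisy:
        out[i] = [(x, rng.choice([l for l in labels if l != y]))
                  for x, y in out[i]]
    return out, sorted(noisy)
```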

4. Conclusions and Future Work

In this paper, we propose a novel system anomaly prediction model, SAPredictor, which has clear advantages over discrete-time Markov chains combined with other classifiers. SAPredictor consists of two parts: a belief Markov chain (BMC) method, which extends the Evidential Markov chain so that it can deal with stream data, and an aggregate ensemble classification, which identifies anomalies based on the values predicted by BMC. SAPredictor can thus handle data streams from real applications and systems subject to noise and measurement error. Our experiments show that the BMC model achieves higher prediction accuracy than DTMC at every noise level and is especially fit for imminent anomaly prediction. SAPredictor achieves better system status prediction quality than other popular models such as DTMC + Naïve Bayes, DTMC + C4.5, and DTMC + KNN, and its small overhead makes it practical for online prediction of system anomalies.

In the future, we plan to test and, where possible, improve SAPredictor in more real applications. In this paper, we consider the system as either normal or abnormal, while in reality the situation can be more complicated; SAPredictor could be extended to distinguish different kinds of anomalies and send different levels of alerts. We also plan to publish SAPredictor as a tool and apply it to complex, distributed systems.

Acknowledgment

This research was supported by the Ministry of Industry and Information Technology of China (no. 2010ZX01042-002003-001).

References

[1] A. Ray, “Symbolic dynamic analysis of complex systems for anomaly detection,” Signal Processing, vol. 84, no. 7, pp. 1115–1130, 2004.

[2] F. Salfner, M. Lenk, and M. Malek, “A survey of online failure prediction methods,” ACM Computing Surveys, vol. 42, no. 3, article 10, 2010.
[3] F. Salfner and M. Malek, “Using hidden semi-Markov models for effective online failure prediction,” in Proceedings of the 26th IEEE International Symposium on Reliable Distributed Systems (SRDS ’07), pp. 161–174, October 2007.
[4] E. Kiciman and A. Fox, “Detecting application-level failures in component-based Internet services,” IEEE Transactions on Neural Networks, vol. 16, no. 5, pp. 1027–1041, 2005.
[5] Y. Zhang and C. Ma, “Fault diagnosis of nonlinear processes using multiscale KPCA and multiscale KPLS,” Chemical Engineering Science, vol. 66, no. 1, pp. 64–72, 2011.
[6] Y. Zhang, H. Zhou, S. J. Qin, and T. Chai, “Decentralized fault diagnosis of large-scale processes using multiblock kernel partial least squares,” IEEE Transactions on Industrial Informatics, vol. 6, no. 1, pp. 3–10, 2010.
[7] Y. Zhang, “Modeling and monitoring of dynamic processes,” IEEE Transactions on Neural Networks and Learning Systems, vol. 23, no. 2, pp. 277–284, 2012.
[8] Y. Liang, Y. Zhang, H. Xiong, and R. Sahoo, “Failure prediction in IBM BlueGene/L event logs,” in Proceedings of the 7th IEEE International Conference on Data Mining (ICDM ’07), pp. 583–588, October 2007.
[9] P. Zhang, H. Muccini, A. Polini, and X. Li, “Run-time systems failure prediction via proactive monitoring,” in Proceedings of the 26th IEEE/ACM International Conference on Automated Software Engineering (ASE ’11), pp. 484–487, November 2011.
[10] Y. Tan and X. Gu, “On predictability of system anomalies in real world,” in Proceedings of the 18th Annual IEEE/ACM International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS ’10), pp. 133–140, August 2010.
[11] G. Luo, K.-L. Wu, and P. S. Yu, “Answering linear optimization queries with an approximate stream index,” Knowledge and Information Systems, vol. 20, no. 1, pp. 95–121, 2009.
[12] X. Gu and H. Wang, “Online anomaly prediction for robust cluster systems,” in Proceedings of the 25th IEEE International Conference on Data Engineering (ICDE ’09), pp. 1000–1011, April 2009.
[13] H. Soubaras, “On evidential Markov chains,” in Foundations of Reasoning under Uncertainty, vol. 249 of Studies in Fuzziness and Soft Computing, pp. 247–264, Springer, Berlin, Germany, 2010.
[14] C. C. Aggarwal, J. Han, J. Wang, and P. S. Yu, “A framework for clustering evolving data streams,” in Proceedings of the International Conference on Very Large Data Bases (VLDB ’03), pp. 81–92, 2003.
[15] P. Zhang, X. Zhu, Y. Shi, L. Guo, and X. Wu, “Robust ensemble learning for mining noisy data streams,” Decision Support Systems, vol. 50, no. 2, pp. 469–479, 2011.
[16] Y. Zhao, Y. Tan, Z. Gong, X. Gu, and M. Wamboldt, “Self-correlating predictive information tracking for large-scale production systems,” in Proceedings of the 6th International Conference on Autonomic Computing (ICAC ’09), pp. 33–42, Barcelona, Spain, June 2009.
[17] C.-H. Lee, Y.-L. Lo, and Y.-H. Fu, “A novel prediction model based on hierarchical characteristic of web site,” Expert Systems with Applications, vol. 38, no. 4, pp. 3422–3430, 2011.
[18] D. Katsaros and Y. Manolopoulos, “Prediction in wireless networks by Markov chains,” IEEE Wireless Communications, vol. 16, no. 2, pp. 56–63, 2009.

[19] J. Y. Halpern, Reasoning about Uncertainty, MIT Press, Cambridge, Mass, USA, 2003.
[20] J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.
[21] S. Pang, S. Ozawa, and N. Kasabov, “Incremental linear discriminant analysis for classification of data streams,” IEEE Transactions on Systems, Man, and Cybernetics B, vol. 35, no. 5, pp. 905–914, 2005.
[22] Y. Zhao, Y. Tan, Z. Gong, X. Gu, and M. Wamboldt, “Self-correlating predictive information tracking for large-scale production systems,” in Proceedings of the 6th International Conference on Autonomic Computing (ICAC ’09), pp. 33–42, ACM, June 2009.
