Intelligent failure prediction models for scientific workflows

Expert Systems with Applications 42 (2015) 980–989


Intelligent failure prediction models for scientific workflows

Anju Bala*, Inderveer Chana
Computer Science and Engineering Department, Thapar University, Patiala, India

Abstract

Article history: Available online 21 September 2014

The ever-growing demand for and heterogeneity of Cloud Computing is garnering popularity with scientific communities, which utilize Cloud services to execute large-scale scientific applications in the form of sets of tasks known as workflows. Scientific workflows stipulate a process or computation as data flows and task dependencies that allow users to simply articulate multi-step, complex computational tasks; hence, proactive fault tolerance is required for their execution. To reduce the effect of workflow task failures on Cloud resources during execution, task failures can be intelligently predicted by proactively analyzing the data of multiple scientific workflows using state-of-the-art machine learning approaches for failure prediction. This paper therefore focuses on the research problem of designing intelligent task failure prediction models that facilitate proactive fault tolerance by predicting task failures for scientific workflow applications. Firstly, failure prediction models have been implemented through machine learning approaches and evaluated with performance metrics, demonstrating the maximum prediction accuracy for Naive Bayes. The proposed failure models have then been validated using Pegasus and Amazon EC2 by comparing actual task failures with predicted task failures.
© 2014 Elsevier Ltd. All rights reserved.

Keywords: Cloud Computing; Workflows; Failure prediction; Scientific workflows; Machine learning

* Corresponding author. E-mail addresses: [email protected] (A. Bala), [email protected] (I. Chana).
http://dx.doi.org/10.1016/j.eswa.2014.09.014
0957-4174/© 2014 Elsevier Ltd. All rights reserved.

1. Introduction

Cloud Computing has transformed the Information and Communication Technology industry by facilitating on-demand services, elasticity, flexibility and provisioning of computing resources based on utility (Beloglazov & Buyya, 2013). Cloud Computing adopts virtualization technologies to provide various services to the user, such as Infrastructure as a Service (IaaS), Hardware as a Service (HaaS), Platform as a Service (PaaS), Software as a Service (SaaS) and Workflow as a Service (WFaaS) (Cushing, Koulouzis, Belloum, & Bubak, 2014; Wang, Korambath, Altintas, Davis, & Crawl, 2014; Zhao, Melliar, & Moser, 2010). The layered architecture of the Cloud services, the tools required for these services, and their challenges are shown in Fig. 1. The Cloud offers WFaaS and SaaS on the top layer, where the user can utilize services over the internet without buying proprietary rights to software and applications such as Workflow Management Systems (WMS), scientific applications, e-mail, business and multimedia applications. Some of the key challenges that need to be addressed at the software layer are scheduling, data heterogeneity, fault tolerance, fault prediction, interoperability

and security. The second layer of the architecture acts as a core middleware between the software layer and the infrastructure layer; it provides PaaS for testing, deploying and controlling web applications, and comprises platforms such as Google App Engine, Microsoft Azure, Hadoop, Aneka and Heroku (Buyya, Shin Yeo, Venugopal, Broberg, & Ivona, 2009). The issues that need to be resolved at this layer are the management of big data applications, data analytics and intelligence. The infrastructure layer at the bottom consists of IaaS and HaaS, which offer various hardware resources and infrastructures on demand without purchasing. Amazon EC2, Eucalyptus, OpenNebula, Nimbus, OpenStack and Rackspace are examples of infrastructure and hardware providers. Since Cloud Computing offers high availability of these services, we are evaluating the use of Cloud services for deploying scientific applications. Scientific applications are represented as workflows that consist of a few tasks to millions of tasks with dependencies between them (Ramakrishnan, Reutiman, Chandra, & Weissman, 2013). Workflow applications for real-world business processes are identified as Cloud Workflows, and WFaaS has been used by some researchers for deploying and executing these workflows in the Cloud environment. Although Tan et al. (2009) enhanced the performance of real-world tumor analysis using Workflow-as-a-Service in Grid Computing, they did not define any architecture for WFaaS in the Cloud. Then, Pathirage, Perera,

Kumara, and Weerawarana (2011) proposed an architecture for WFaaS to host workflow applications securely in the Cloud, but they did not consider workflow scheduling. Furthermore, Wang et al. (2014) defined a WFaaS architecture for scheduling workflow applications to increase scalability and extensibility, and Cushing et al. (2014) proposed a WFaaS approach for task farming of scientific applications in the Cloud. Juve and Deelman (2010) evaluated Cloud infrastructures as an execution platform for deploying scientific workflows as WFaaS, owing to the benefits of using the Cloud such as on-demand provisioning, elasticity, provenance and reproducibility (Deelman, Livny, Berriman, & Good, 2008). Pandey, Karunamoorthy, and Buyya (2011) have also identified various advantages of executing scientific workflows on Cloud infrastructure, such as cost effectiveness, scalability, decreased runtime, on-demand resource provisioning and ease of resource management. Although WFaaS is used by only a few researchers in the Cloud, there are open challenges that need to be resolved, such as efficient and scalable management of workflows, handling application and resource failures, heterogeneity of data, failure prediction of tasks and fault-tolerant scheduling (Deelman, 2009; Gil & Deelman, 2007). Among these research issues, the key challenge is to handle resource and task failures through intelligent prediction of failures for scientific workflows, which has not been implemented by any author until now. Most of the existing works employ fault-tolerant techniques for scientific workflows such as replication, checkpointing, job migration, retry and task resubmission (Bala & Chana, 2012; Ganga, Karthik, & Christopher Paul, 2012); less research has been done to predict and detect task failures intelligently by adopting machine learning approaches for implementing proactive fault tolerance.
Workflows are used for simulation, high-energy physics, astronomy and many other scientific applications. Hence, reliability models for software and hardware failures cannot simply be applied to handle task failures proactively (Bala & Chana, 2013; Xie, Dai, & Poh, 2004), and a reactive fault-tolerant approach might not be able to handle failures intelligently (Varghese, McKee, & Alexandrov, 2010). Henceforth, intelligent task failure detection and prediction is especially challenging for scientific workflow applications with many job and data dependencies, such as Montage, CyberShake, SIPHT, Inspiral, Epigenomics and Broadband. These workflows are used in the scientific community for various applications: the Montage workflow for astronomical physics, the CyberShake workflow for earthquake hazards, the SIPHT workflow for bioinformatics, the Inspiral workflow for detecting gravitational waves, the Epigenomics workflow for genome sequence operations, and the Broadband workflow to simulate the impact of an earthquake. Thus, the goal of the proposed models is to predict task failures intelligently, using machine learning approaches, before failures occur during the execution of scientific workflow applications. Task failures can occur because resources are overutilized or unavailable, the execution time or execution cost exceeds a threshold value, required libraries are not installed properly, the system runs out of memory or disk space, and so on. In the present paper, task failures have been generated due to overutilization of resources such as CPU, RAM, disk storage and network bandwidth. The historical data of task failure parameters has been gathered for training and testing the prediction model in Weka by running multiple scientific workflow applications at different intervals in WorkflowSim. Then, the results of the failure prediction models have been compared over several evaluation metrics across machine learning algorithms, namely Naive Bayes, Random Forest, Logistic Regression (LR) and Artificial Neural Networks (ANN); the evaluation shows that Naive Bayes is the best machine learning approach for task failure prediction of multiple scientific workflow applications. To validate the accuracy of the proposed model, actual failures have been compared with predicted failures using Amazon EC2 and Pegasus. Finally, the experimental results of the proposed model have also been compared with an existing model after implementing the Broadband, Epigenomics and Montage workflows.

1.1. Motivation for the work

• The motivation for implementing task failure prediction stems from the research challenges of scientific workflows and the benefits of using Cloud services for these workflows.
• Our work aims at analyzing the problem of failure prediction so that Cloud systems are capable of making autonomic fault-tolerant decisions by predicting task failures from various resource utilization parameters, such as CPU, RAM, disk storage and bandwidth utilization.
• Most of the existing works implement fault prediction using statistical approaches that are not useful for predicting failures intelligently; our proposed approach predicts task failures proactively for scientific workflow applications using various machine learning approaches, which, to the best of our knowledge, has not been implemented in existing research works.
• Our proposed approach is more effective, as task failures due to resource overutilization are predicted through machine-learning-based approaches.

Fig. 1. Layered architecture of the Cloud along with WFaaS.

The rest of the paper is organized as follows: Section 2 discusses the related work and Section 3 presents the methodology. Section 4 describes the approaches used to implement the task failure prediction models. Section 5 details the evaluation criteria. Section 6 reports the experimental results. Section 7 concludes the paper.

2. Related work

Significant work has been done by a number of researchers on software fault prediction using a variety of approaches to obtain better prediction results, and some authors have used machine learning approaches for software fault prediction and resource provisioning. However, these approaches have not been employed for predicting task failures using scientific workflow data in the Cloud environment. Therefore, in this section we survey the work related to predicting and detecting failures: general approaches, Cloud-specific approaches, and machine learning approaches, considering the state-of-the-art techniques and the associated research challenges.

2.1. General approaches for detecting and predicting failures

Fault detection and prediction approaches have been proposed and implemented by many authors in distributed environments. Duan, Prodan, and Fahringer (2006) proposed a fault classification and data mining approach to predict different types of faults, and Jitsumoto, Endo, and Matsuoka (2007) used a fault detector to differentiate between hardware, process and transmission faults.
Fu and Xu (2007) built a neural network to approximate the number of failures in a given time interval, and Fu and Xu (2010) implemented a proactive failure prediction framework for predicting component failures based on temporal and spatial correlations, using features such as resource utilization, packet counts and system information. They used supervised learning approaches to predict failures with an average accuracy of 70.1% for online prediction and 74% for offline prediction. Further, Sindrilaru, Costan, and Cristea (2010) implemented a technique for detecting faults before they reach the actual workflow engine by intercepting SOAP (Simple Object Access Protocol) messages, providing better reaction times. Yu, Wang, and Shi (2010) proposed a failure-aware workflow scheduling approach that predicts resource failures online, but it would need to be implemented in Pegasus to predict task failures as well. Guan, Zhang, and Song (2012) proposed a failure detection and prediction approach with health care data using decision trees and Bayesian networks. Although a number of authors have proposed various approaches for fault prediction using health care data, they have not implemented task failure prediction over multiple scientific workflow data, which could be used to increase the performance of scheduling.

2.2. Specific approaches for detecting and predicting failures within the Cloud environment

Zhao et al. (2010) implemented a heartbeat message protocol for detecting replica failures, and similar work has also been

proposed by Jhawar, Piuri, and Santambrogio (2012), who use a heartbeat message protocol for detecting crash failures among VM instances in the Cloud environment; however, they did not consider detecting task failures for scientific applications. Guan, Zhang, and Song (2011) presented a fault prediction mechanism for building dependable Cloud systems with Bayesian classifiers and decision trees using health data collection. Poola, Ramamohanarao, and Buyya (2014) proposed fault-tolerant workflow scheduling that reduces cost by 70%, and suggested as future work that failure prediction approaches could be used to reduce the cost of scientific workflows. More machine learning and statistical techniques need to be developed to increase the performance of these approaches; to date, very few researchers have worked on failure prediction in the Cloud.

2.3. Machine learning approaches for predicting task failures: ANN, Naive Bayes, Random Forest and LR

Catal and Diri (2009) showed in their review that machine learning models have better characteristics for prediction than statistical methods; therefore, we explore machine learning algorithms for predicting task failures proactively. Malhotra and Jain (2012) verified that Random Forest gives better results for fault prediction. Earlier, Ohlsson, Zhao, and Helander (1998) applied multivariate analysis techniques for predicting fault-prone classes using software design metrics, with data from Ericsson Telecom AB. More recently, Suresh, Kumar, Ku, and Rath (2014) demonstrated that multivariate logistic regression is an efficient approach for software fault prediction, and Islam, Keunga, Lee, and Liu (2012) showed that the accuracy of ANN is superior for the prediction of resource utilization in the Cloud environment.
Similarly, Kousiouris, Menychtasa, Kyriazis, Gogouvitis, and Varvarigou (2014) estimated resource provisioning using Artificial Neural Networks (ANN) in Cloud platforms and confirmed the better prediction accuracy of ANN. The Naive Bayes model is regarded as a robust machine learning algorithm for software fault prediction in the work of Catal (2011). Although most of the existing work has used machine learning approaches such as Naive Bayes, ANN, Random Forest and LR for resource provisioning and software fault prediction, very little work has been reported on predicting task failures of workflow applications. Only Samak et al. (2012) have presented a failure analysis of jobs with a Naive Bayes model, which can predict the failure probability of workflow jobs with an average accuracy of 85% for scientific applications such as Broadband, Epigenomics and Montage. However, they did not compare the performance of Naive Bayes with other failure prediction models, and they did not consider task failures due to resource overutilization or resource failures. Therefore, to the best of our knowledge, none of the above approaches predicts task failures intelligently using a variety of machine learning approaches.

2.4. Research challenges and their solutions in the state of the art of failure detection and prediction in Cloud Computing

Accurate failure predictions can help mitigate the impact of failures for scientific applications, and resources, applications and services can be scheduled efficiently to limit the effect of failures. However, providing accurate predictions sufficiently far ahead is a challenging task for intricate applications such as workflows, and an accurate prediction of task failures is a prerequisite for implementing an intelligent fault-tolerant approach for scientific workflows.
The state of the art of existing approaches for failure prediction reveals that some research challenges can be resolved using intelligent failure prediction models on large infrastructure such as the Cloud; these challenges are sketched out as follows.

• Although Samak et al. (2012) implemented a job failure prediction model for application failures and data management component failures, they did not implement prediction for task failures that occur due to resource overutilization, which can further enhance the accuracy of failure prediction models.
• For efficient forecasting, enhanced failure prediction techniques need to be implemented that increase prediction accuracy. These would also be useful for handling current research challenges of the Cloud such as proactive fault tolerance, scientific data management and scheduling.
• Cost reduction is an important challenge for scientific applications; efficient mechanisms are required that proactively predict resource failures along with execution cost, to reduce the maintenance cost of scientific applications.
• Scientific workflows may consist of thousands of tasks along with their dependencies. The main challenge for these workflows is to schedule and execute workflow tasks on Cloud resources that are not currently overutilized or unavailable. Therefore, before scheduling a task on these resources, prediction-based techniques need to be used to intelligently predict overutilized or unavailable resources.
• As scientific workflows consist of heterogeneous data, data management problems can be resolved by predicting resource utilization parameters.

2.5. Our contributions

In contrast to the existing work, intelligent task failure prediction models have been proposed and implemented for workflow applications in the Cloud environment. As a proactive fault-tolerant approach requires prior knowledge of task failures, we have implemented and compared the performance metrics of task failure prediction models, namely Naive Bayes, Random Forest, Logistic Regression and ANN.
Secondly, the predicted task failures have also been compared with actual task failures using Pegasus and Amazon EC2 for the proposed failure prediction models. The experimental results show that the Naive Bayes model outperforms the other models, predicting task failures intelligently with maximum accuracy. Finally, the results have been compared with existing failure prediction models, validating that the proposed model using Naive Bayes is effective in terms of accuracy. Furthermore, the proposed models make it possible to handle resource failures as well as task failures before these failures occur.

3. Methodology

The methodology used to design the proposed models is depicted in Fig. 2. The proposed models have two modules: the first module is used to predict task failures with machine learning approaches, and the second module is used to locate the actual failures after executing the workflows on a Cloud testbed.

3.1. Task failure prediction module

The task failure prediction module is used to explore the task failure data, which is gathered from the workflow execution of scientific applications at intervals of 50 s. The historical data details are accumulated from the data repository of the Cloud. Data pre-processing involves a thorough assessment of the raw data, where distorted values or missing readings often become misleading in the formal analysis; interpolation has been used to deal with missing values. Principal Component Analysis (PCA) is applied to select the most relevant attributes and to reduce the dimensions required for the intelligent fault prediction model. Then, based on the data extracted by PCA, a task is classified as a task failure if a utilization parameter value exceeds the maximum threshold value; otherwise it is classified as not a failure. Machine learning approaches, namely Naive Bayes, ANN, LR and Random Forest, are implemented to predict task failures intelligently from the dataset of scientific workflows. Different evaluation measures are used to examine and compare the accuracy of the proposed models, and the predicted failures are saved into a database.

3.2. Actual task failure module

The actual task failure module is used to assess the actual task failures using Pegasus and Amazon EC2 (Deelman et al., 2005; Juve et al., 2009, 2010).
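As a brief aside on the prediction module of Section 3.1, its interpolation and threshold-based labelling steps can be sketched as runnable code. The feature tuple order and the 0.8 threshold are assumptions for illustration; the paper does not state the maximum threshold value it used.

```python
# Sketch of the Section 3.1 pre-processing and labelling steps.
# Feature order (cpu, ram, disk, bandwidth) and THRESHOLD are assumptions.

THRESHOLD = 0.8  # assumed maximum utilization threshold

def interpolate_missing(series):
    """Fill None gaps between known readings with the midpoint (simple sketch)."""
    filled = list(series)
    for i, v in enumerate(filled):
        if v is None:
            prev = next(x for x in reversed(filled[:i]) if x is not None)
            nxt = next(x for x in filled[i + 1:] if x is not None)
            filled[i] = (prev + nxt) / 2.0
    return filled

def label_task(sample):
    """sample = (cpu, ram, disk, bandwidth) utilization in [0, 1].
    A task is labelled failed (1) if any parameter exceeds the threshold."""
    return 1 if any(u > THRESHOLD for u in sample) else 0
```

In the full pipeline these labels, together with the PCA-reduced features, form the training set for the Weka classifiers.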
Pegasus maps abstract workflows to concrete workflows that are deployed and executed on the Amazon EC2 Cloud, with the goal of making reasonable predictions for scientific applications; it is also utilized for recording the actual task failures among scientific workflows. The actual task failure log files are monitored using the Pegasus monitoring class and analyzed by Amazon CloudWatch, and the data are stored in a relational archive, which standardizes the log files into a close approximation that can be sent back to the workflow management system. The actual failure records are then compared with the stored predicted failures to confirm the accuracy of the proposed failure prediction models.

Fig. 2. Methodology of the proposed approach.

4. Approaches used for implementing task failure prediction models

4.1. Logistic regression

Logistic regression is a commonly used approach for predicting a dependent variable from a set of independent variables. In this paper, we have predicted failure-prone classes using multivariate logistic regression with multiple independent variables of resource utilization. To find the optimal set of independent variables, there are two selection methods: forward selection, which checks all the variables at entry time, and backward elimination, which starts with all the independent variables and deletes them one by one until the stopping criterion is satisfied. We have used the forward stepwise selection method. The multivariate logistic regression formula by Hosmer and Lemeshow (1989) is shown in Eq. (1).

F(t) = e^t / (e^t + 1) = 1 / (1 + e^(−t))    (1)

The logistic function F(t) takes values between 0 and 1. With x as the explanatory variables, the logistic function can be written as F(x), as in Eq. (2):

F(x) = 1 / (1 + e^(−(b0 + b1 x1 + b2 x2 + b3 x3 + b4 x4)))    (2)

The dependent variable can take the value "task success" or "task failure", so the inverse of the logistic function, g(x), is defined in Eq. (3):

g(x) = ln[ F(x) / (1 − F(x)) ] = b0 + b1 x1 + b2 x2 + b3 x3 + b4 x4    (3)

where x1 represents the CPU utilization, x2 the RAM utilization, x3 the bandwidth utilization, x4 the disk utilization, and b0, b1, b2, b3, b4 are constants. Hence, the probability of task failure based on the four independent variables is calculated by the formula in Eq. (4):

Prob(x1, x2, x3, x4) = e^(g(x)) / (1 + e^(g(x)))    (4)
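The model of Eqs. (2)-(4) can be sketched directly; the coefficient values in the usage example are placeholders, not fitted values from the paper.

```python
# Minimal sketch of the multivariate logistic model of Eqs. (2)-(4).
# Coefficients b are illustrative placeholders, not fitted values.
import math

def failure_probability(x, b):
    """x = [x1, x2, x3, x4]: CPU, RAM, bandwidth and disk utilization.
    b = [b0, b1, b2, b3, b4]. Returns Prob(task failure) per Eq. (4)."""
    g = b[0] + sum(bi * xi for bi, xi in zip(b[1:], x))  # logit g(x), Eq. (3)
    return math.exp(g) / (1.0 + math.exp(g))             # Eq. (4)
```

In practice the coefficients would come from fitting the forward stepwise model on the WorkflowSim utilization data.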

4.2. Artificial Neural Network (ANN)

An ANN can be trained to approximate multivariate functions. In this approach, a neural network with the back-propagation method is trained on the previous historical data from the scientific workflow execution in WorkflowSim. A typical three-layer neural network is shown in Fig. 3. The network consists of an input layer, a hidden layer and an output layer. The input layer has four input neurons, x = [x1, x2, x3, x4], the output layer has two output neurons, o = [o1, o2], and one hidden layer with three hidden neurons, h = [h1, h2, h3], lies between them. The neurons at each layer are linked to the neurons of the next layer with weights w_i that are evaluated during training. For each training point x_i with weights w_i, the network computes the resultant output y_j. If y (actual output) and y_j (predicted output) differ, the network weights are modified to reduce the calculated error. In this paper, weights are updated using learning rate p = 0.7 and momentum = 0.2. Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE) are the performance measures, and a gradient-descent technique is used to minimize the error and reach a local optimum. When the calculated error over all training data is adequately small, the network has reached a local optimum. Thus, the resultant artificial neural network with back-propagation is expected to generate a better prediction output after observing the input data in real time. The proposed algorithm is shown in Algorithm 1.

Algorithm 1. FPNN with back-propagation learning
1. Initialize the network with a random weight vector w_i
2. For all training examples from j = 1 to m, do
3.   For i = 1 to n do
4.     Evaluate the output y_j = Σ_{i=1}^{n} x_i w_i
5.     Compare the network output y_j with the correct output y
6.     Error e = y − y_j
7.     Use gradient descent to minimize the error
8.     Adapt the weights in the current layer
9.     Repeat until the RMSE and MAPE have been minimized
10.  EndFor
11. EndFor

Fig. 3. Three-layer neural network.

4.3. Random Forest

The Random Forest (RF) approach can be used to generate thousands of trees. RF joins the advantages of the bagging and random feature selection methods proposed by Breiman (2001). Bagging repeatedly takes samples from the data set with a uniform probability distribution, while random feature selection explores, at each node, the finest split over a random subset of the features. The Random Forest classifies a new object from an input vector by passing the input vector down every tree in the forest. Each tree casts a unit vote for a classification of the input vector, and the forest selects the classification with the maximum votes over all the trees (Guo, Ma, Cukic, & Singh, 2004). Each random tree is constructed with the following steps:

• For M cases in the training set, sample M cases randomly.
• At every node, y fault predictors are randomly selected out of the Y input variables, with y ≪ Y and y = √Y.
• Each tree is grown to the largest extent possible, with no pruning.

4.4. Naive Bayes approach

Naive Bayes Model has been used for fault prediction which assumes independence of attributes to each other, therefore it is named as Naive. It uses Bayesian theorem to calculate the

985

A. Bala, I. Chana / Expert Systems with Applications 42 (2015) 980–989

probability of unknown instance Y is classified as class T with possible outcomes. Where T ¼ fT 1 ; T 2 g, T is considered as fSuccessjFailureg class and the probability of task failure or success is defined by Eq. (5).

PfTjYg ¼

PðYjTÞPðTÞ PðYÞ

ð5Þ

Because Naive assumes the conditional probabilities of the independent variables, therefore, PfTjYg can be decomposed into product of sums by using Eq. (6), where Y j represents n attributes as: Y 1 ; Y 2 . . . Y n represents n attributes which are conditionally independent on one another given T as output.

PðTjYÞ ¼ PðTÞ

n Y PðY j jTÞ

ð6Þ

 True positives (TP): Cloud users labeled as task failure but also evaluated as task failure.  True negatives (TN): Cloud users labeled as not failure task but evaluated as not failure.  False positives (FP): Cloud users labeled as failure task but evaluated as not failure.  False negatives (FN): Cloud users labeled as non failure task but evaluated as task failure. 5.1.2. Recall and precision Precision is defined as the ratio of correctly predictable failures to the number of all the recognized failures where Recall is define as the ratio of correctly predicted failures to the number of true failures which is defined as follows in Eq. (10).

j¼1

The probability that T will take on any point k is defined in the Eq. (7). Here sum is considered as all the possible values t j of T. The Eq. (7) can be rewritten in Eq. (8) which is a basic equation of Naive Bayes Classifier. This equation also describes to calculate the probability that T will take on any given value and can be estimated from training data with the distributions PðTÞ and PðY i jTÞ.

PðT ¼ tk ÞPðY 1 . . . Y n jT ¼ tk Þ PðT ¼ t k jY 1 . . . Y n Þ ¼ P j PðT ¼ t j ÞPðY 1 . . . Y n jT ¼ t j Þ Q PðT ¼ tk Þ i PðY i jT ¼ t k Þ Q PðT ¼ t k jY 1 . . . Y n Þ ¼ P j PðT ¼ t j Þ i PðY i jT ¼ t j Þ

ð7Þ ð8Þ

5. Evaluation criteria

The objectives of the paper are twofold. Firstly, to correctly identify which tasks might fail on Virtual Machines due to over-utilization of resources, using the ANN, Random Forest, LR and Naive Bayes models. Secondly, to validate the accuracy of the prediction models, predicted task failures have been compared with actual task failures in the Amazon EC2 Cloud. Finally, to evaluate and validate the performance of the proposed failure prediction approaches, the results have been compared with an existing failure prediction model.

5.1. Predictor performance

To assess the performance of the failure prediction models, data of scientific workflows with 25, 30, 50 and 100 tasks has been collected for different scientific workflows after a fixed interval in WorkflowSim. CloudSim classes have been used to read and trace the data values for the workload traces. The prediction accuracy of task failures has then been evaluated using the following evaluation metrics.

5.1.1. Sensitivity and specificity

These metrics measure the correctness of the prediction model: Sensitivity specifies the percentage of actual faulty tasks (Task Failed) that are correctly classified, whereas Specificity is the proportion of non-faulty tasks (Task Success) that are correctly identified (Salfner, Lenk, & Malek, 2010). The relation between these metrics is depicted in Table 1. The measures are calculated using the formulas given in Eq. (9).

\[ \mathrm{Sensitivity} = \frac{TP}{TP + FN}, \qquad \mathrm{Specificity} = \frac{TN}{FP + TN} \tag{9} \]

5.1.2. Recall and precision

Recall and Precision are calculated as given in Eq. (10).

\[ \mathrm{Recall} = \frac{TP}{TP + FN}, \qquad \mathrm{Precision} = \frac{TP}{TP + FP} \tag{10} \]

5.1.3. MAPE and RMSE

In our experiments MAPE and RMSE are evaluated to measure the error as a percentage. MAPE is used for evaluating the prediction accuracy (Malhotra & Jain, 2012), where y_j is the actual output, ŷ_j is the predicted output and m is the total number of observations in the dataset. MAPE is defined in Eq. (11); a lower value of MAPE indicates a more precise prediction technique.

\[ \mathrm{MAPE} = \frac{1}{m} \sum_{j=1}^{m} \frac{|\hat{y}_j - y_j|}{y_j} \tag{11} \]
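The confusion-matrix metrics of Eqs. (9) and (10) can be computed directly from the counts of Table 1; a minimal sketch, with hypothetical counts:

```python
def confusion_metrics(tp, fn, fp, tn):
    """Sensitivity/Specificity (Eq. (9)) and Recall/Precision (Eq. (10))
    from the confusion-matrix counts of Table 1."""
    return {
        "sensitivity": tp / (tp + fn),  # identical to recall
        "specificity": tn / (fp + tn),
        "recall": tp / (tp + fn),
        "precision": tp / (tp + fp),
    }

metrics = confusion_metrics(tp=43, fn=3, fp=5, tn=9)  # hypothetical counts
```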

The metric RMSE (Aggarwal, Singh, Kaur, & Malhotra, 2009) is described by the formula in Eq. (12); a smaller value of RMSE signifies a more effective prediction scheme.

\[ \mathrm{RMSE} = \sqrt{\frac{1}{m} \sum_{j=1}^{m} (\hat{y}_j - y_j)^2} \tag{12} \]
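A direct translation of Eqs. (11) and (12), with MAPE scaled by 100 so that it is reported as a percentage, matching the values in Table 3. The sample values are illustrative only:

```python
import math

def mape(actual, predicted):
    """Eq. (11), scaled by 100 to express the error as a percentage."""
    m = len(actual)
    return 100.0 / m * sum(abs(p - a) / a for a, p in zip(actual, predicted))

def rmse(actual, predicted):
    """Eq. (12): root mean squared prediction error."""
    m = len(actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / m)

# MAPE and RMSE for two illustrative observations
print(mape([10, 20], [8, 25]))
print(rmse([10, 20], [8, 25]))
```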

5.1.4. F-Measure and ROC

The F-Measure is computed as the harmonic mean of precision and recall, as shown in Eq. (13). The highest value of F-Measure is considered the best case, whereas the lowest value is the worst case.

\[ \text{F-Measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{13} \]
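Eq. (13) can be cross-checked against the Naive Bayes row of Table 3 in a couple of lines:

```python
def f_measure(precision, recall):
    """Eq. (13): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Cross-check with Table 3 (Naive Bayes: precision 0.94, recall 0.935)
print(round(f_measure(0.94, 0.935), 3))  # 0.937
```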

The accuracy of the prediction approaches can also be estimated graphically by comparing their ROC curves. The area-under-curve method is used to assess the ROC curve, as shown in Eq. (14), where tpr specifies the true positive rate and fpr the false positive rate (Salfner et al., 2010). The maximum value of the area under the curve signifies the best predictor.

\[ \mathrm{AreaUnderCurve} = \int_0^1 tpr(fpr)\, \mathrm{d}fpr \in [0, 1] \tag{14} \]
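In practice the integral of Eq. (14) is approximated numerically from sampled ROC points; one common sketch uses the trapezoidal rule (the points below are illustrative, not from the paper's data):

```python
def area_under_curve(roc_points):
    """Approximate Eq. (14) with the trapezoidal rule over
    (fpr, tpr) points sorted by increasing fpr."""
    pts = sorted(roc_points)
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

print(area_under_curve([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]))  # 1.0 (perfect predictor)
print(area_under_curve([(0.0, 0.0), (1.0, 1.0)]))              # 0.5 (random predictor)
```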

5.2. Methods used to validate the accuracy

We have used percentage split and Standard Error of Mean to validate the experimental results in Amazon EC2.

Table 1
Relation between sensitivity and specificity.

                      Classified class
Actual class          Task failure    Task not failure
Task failure          TP              FN                  Sensitivity
Task not failure      FP              TN                  Specificity

5.2.1. Percentage-split
Percentage split is used to divide the dataset into a training set and a test set; in this paper we have used a 66 percentage split.
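A minimal Python sketch of such a split (the shuffle seed and the toy dataset are placeholders, not details from the paper):

```python
import random

def percentage_split(rows, train_pct=66, seed=7):
    """Shuffle the dataset, then split it: train_pct% for training, the rest for testing."""
    data = list(rows)
    random.Random(seed).shuffle(data)
    cut = round(len(data) * train_pct / 100)
    return data[:cut], data[cut:]

train, test = percentage_split(range(100))  # 66 training rows, 34 test rows
```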


5.2.2. Standard Error of Mean (SEOM)
The metric SEOM is evaluated from the deviation of the predicted task failures from the actual task failures; for high accuracy, SEOM should be minimal. It is expressed by the formula in Eq. (15), where SD indicates the standard deviation and m is the number of samples.

\[ \mathrm{SEOM} = \frac{SD}{\sqrt{m}} \tag{15} \]
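Eq. (15) can be sketched as below. The paper does not state whether SD is the sample or the population standard deviation, so the sample form used here is an assumption:

```python
import math

def seom(actual_failures, predicted_failures):
    """Eq. (15): SD / sqrt(m), where the deviations are the differences between
    predicted and actual task failures and m is the number of samples."""
    devs = [p - a for a, p in zip(actual_failures, predicted_failures)]
    m = len(devs)
    mean = sum(devs) / m
    sd = math.sqrt(sum((d - mean) ** 2 for d in devs) / (m - 1))  # sample SD (assumption)
    return sd / math.sqrt(m)

print(seom([10, 12, 14, 16], [11, 11, 15, 17]))  # 0.5
```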

Table 3
Comparison of Performance Measures using percentage split.

Metric        Naive Bayes   Random Forest   LR       ANN
Sensitivity   0.935         0.891           0.848    0.913
Specificity   0.849         0.417           0.127    0.42
Recall        0.935         0.891           0.848    0.913
Precision     0.94          0.876           0.754    0.921
RMSE          19            27              35.16    29.94
MAPE          29.63         72.0051         64.023   57.041
F-Measure     0.937         0.875           0.798    0.893
ROC           0.983         0.975           0.661    0.779

6. Experimental results and discussion

The experimental results for measuring the performance of the prediction models have been evaluated by implementing all of the approaches mentioned in Section 4. WorkflowSim (Chen & Deelman, 2012) classes have been used to execute the workflow applications, and CloudSim (Calheiros, Ranjan, Beloglazov, Rose, & Buyya, 2011) classes have been utilized for storing the log files. WEKA (Hall et al., 2009) is used to implement the machine learning approaches on the data of the scientific workflow applications. The evaluated results have been validated after running the scientific workflow applications on Amazon EC2, which affords a large selection of instance types in terms of CPU, memory, storage and networking capacity to deploy and execute scientific workflows. We have used the c3.xlarge instance type, where c3 instances are the most recent generation of compute-optimized instances providing the highest performing processors at the lowest price. It is equipped with two quad-core Intel Xeon E5-2680 v2 processors, 7.5 GB RAM and 1680 GB local storage, running Pegasus 4.2 with CentOS 6.0 on VirtualBox. Our experimental setup has been categorized into four steps:

- Data collection and extraction
- Performance Measures using percentage split
- Performance Measures using SEOM
- Comparison of proposed model with existing model

6.1. Data collection and extraction

Several attributes containing task failure information have been gathered using WorkflowSim and CloudSim classes, and PCA has been applied to reduce the dimensions to the nine attributes required for the intelligent fault prediction model. Each attribute along with its description is defined in Table 2. Resource utilization parameters have been evaluated using threshold values of CPU utilization, Bw Utilization, RAM utilization and Disk utilization. The threshold value is selected based on the previous history of task failures due to over-utilization of VMs. If a utilization parameter has a value above the threshold, the status is classified as Task Failure; otherwise as Not Failure. Task id indicates which task fails on resource over-utilization, VM id signifies the VM on which the task has failed and, similarly, datacenter id shows the

Table 2
Attribute description.

S. No.   Attribute name     Description
1        Task id            Id of Task
2        VM id              Id of VM
3        Datacenter id      Id of datacenter
4        CPU utilization    Utilization of CPU (%)
5        Bw Utilization     Utilization of Bandwidth (%)
6        RAM utilization    Utilization of RAM (%)
7        Disk utilization   Utilization of Disk (%)
8        Depth              Level of task
9        Status             Failure class or not failure class

datacenter where the task would fail or succeed. As there are multiple levels in the workflows, depth specifies the level of a task in the workflow.

6.2. Performance Measures using 66 percentage split

Table 3 summarizes the results of the prediction models using a 66 percentage split, where 66 percent of the data is used for training and 34 percent as test data. It shows the Sensitivity, Specificity, Recall, Precision, RMSE, MAPE, F-Measure and ROC values, and compares the performance metrics in tabular and graphical form. The experimental results show that Naive Bayes performs best on task failure prediction for the scientific workflow data, and that ANN also performs better than Random Forest and LR.

6.2.1. Analysis of failure prediction model using percentage split

Fig. 4 details the performance of the machine learning approaches in graphical form with the 66 percentage split and the data from Table 3. We can infer from Fig. 4(a) that Naive Bayes has the highest specificity (0.849) and sensitivity (0.935), and that ANN also performs better than LR and Random Forest. It is apparent from Fig. 4(b) that Recall and Precision are highest for Naive Bayes (93% and 94%) and lowest for LR (84% and 75%). Fig. 4(c) shows that RMSE and MAPE are minimum for Naive Bayes (19% and 29.63%), whereas the maximum RMSE (35.16%) is for LR and the maximum MAPE (72.0051%) is for Random Forest. Fig. 4(d) indicates that the F-Measure and ROC values are also highest for Naive Bayes (0.937 and 0.983), whereas LR has the minimum values (0.798 and 0.661). Hence, the results clearly validate that our model using Naive Bayes is accurate enough to predict task failures, with minimum values of MAPE and RMSE. Naive Bayes based models thus yield higher prediction accuracy with percentage split than Random Forest, Logistic Regression and ANN.
Therefore, to achieve higher prediction accuracy, task failure prediction for workflow-type applications can be implemented using the Naive Bayes approach.

6.3. Performance Measures using SEOM

To evaluate the accuracy of the prediction models, the machine learning based models have been used to predict failures for different sizes of scientific workflows. The experimental results have then been validated in Pegasus and Amazon EC2. We have focused on Montage workflow data with 25 tasks, Cybershake with 30 tasks, Inspiral with 50 tasks and Sipht with 100 tasks, at time intervals of 50, 100, 150 and 200 s. Fig. 5 compares the actual and predicted task failures for the different workflow sizes based on the performance metrics at different time intervals. Fig. 5(a) depicts the results of failure prediction using the Naive Bayes algorithm, which has higher accuracy (94%) than Random Forest, LR and ANN for predicting task failures. The effect of failure


Fig. 5. Comparison of actual failures with predicted failures.

Fig. 4. Evaluation metrics using percentage split.

predictions using Random Forest is also displayed in Fig. 5(b). The accuracy of LR is lower than that of Naive Bayes and Random Forest, as shown in Fig. 5(c). The accuracy of ANN is also higher than that of LR and Random Forest for all the workflows, as shown in Fig. 5(d). Therefore, as the workflow size grows, the number of correct predictions goes up for all the machine learning models.

SEOM has been calculated using Eq. (15) above. Fig. 6 compares the SEOM of the fault prediction approaches; the results clearly validate that SEOM is minimum (6%) for Naive Bayes with an accuracy of approximately 94%, and maximum (33%) for LR with a minimum accuracy of approximately 75%. Random Forest and ANN have lower SEOM (20% and 11%) than LR but higher than Naive Bayes. Therefore, it emerges that the prediction model based on Naive Bayes might lead to the construction of optimum prediction models for intelligent failure prediction in the Cloud environment.

6.4. Comparison of proposed model with existing model

Very few authors have worked on failure prediction for tasks in workflows. As discussed in Section 2.3, some authors (Samak et al., 2011, 2012) have used machine learning approaches to analyze the performance of various scientific applications such as Broadband, Epigenome and Montage. They have used a Naive Bayes


classifier to evaluate the accuracy of job failures that occur due to application failures and component failures in scientific workflows. Therefore, to evaluate the usefulness of our proposed model, we have compared the experimental results with the existing ones by implementing Broadband with 81 tasks, Epigenome with 320 tasks and Montage with 10,429 tasks using the proposed failure prediction models. Fig. 7 depicts that the proposed model using Naive Bayes has a higher accuracy (97%) with the Epigenome workflows, as these workflows are more CPU intensive than the Broadband and Montage workflows, whereas the existing model (NB1) has an accuracy of 96%. Broadband also attains its maximum accuracy (87%) with the Naive Bayes model, whereas LR has the minimum accuracy (79%) among the proposed models and NB1 has an accuracy of 74%. The Montage workflows are I/O intensive and show an accuracy of 94% with Naive Bayes and a minimum accuracy of 83% with Random Forest, where NB1 has an accuracy of 84%. Therefore, the average accuracy over all the workflows using Naive Bayes (93%) is the maximum among the other models (ANN, LR and Random Forest), whereas the existing model has an average accuracy of 85%. Thus, it can be validated that the proposed model is effective in terms of accuracy for predicting task failures as well as resource failures by considering resource utilization for scientific applications.

Fig. 6. SEOM of failure prediction models.

Fig. 7. Comparison of existing and proposed failure prediction models.

7. Conclusion

This paper provides an effective approach to implement intelligent failure prediction models using scientific workflow data in the Cloud environment. The proposed models have been exemplified with the dataset obtained after running the Montage, Cybershake, Inspiral and Sipht workflow applications in WorkflowSim. The experimental results for the failure prediction models have then been evaluated through Naive Bayes, Random Forest, LR and ANN in WEKA using the workflow dataset, and validated for the maximum accuracy of Naive Bayes. Secondly, the analysis of results in Pegasus and Amazon EC2 has also proved the accuracy of the task failure prediction models and verified that a maximum accuracy of 94% was achieved with Naive Bayes by comparing actual task failures with predicted task failures. Finally, the failure prediction model using Naive Bayes has shown its effectiveness with an average accuracy of 93% as compared to the existing model for the Epigenome, Broadband and Montage workflows.

7.1. Future directions

We suggest the following future directions for the research community.

- The incorporation of the Naive Bayes model with other fault-tolerant techniques would enhance the efficiency of Cloud services and improve the performance of existing proactive fault-tolerant approaches.
- Failure prediction models can also be useful for resource provisioning and scheduling by predicting time- and cost-based parameters. Intelligent prediction models can help increase the scheduling efficiency of workflow applications of different sizes.
- More QoS attributes, including scalability, availability and reliability, can be considered in the future, and their impact on the overall performance can be examined.
- The intelligent prediction models would also be valuable for predicting and analyzing security threats, which can further increase the performance of various expert systems applications such as project management, risk assessment and engineering.

References

Aggarwal, K. K., Singh, Y., Kaur, A., & Malhotra, R. (2009). Empirical analysis for investigating the effect of object-oriented metrics on fault proneness: A replicated case study. Software Process: Improvement and Practice, 16, 39–62.
Bala, A., & Chana, I. (2013). VM migration approach for autonomic fault tolerance in cloud computing. In GCA'13, international conference, Las Vegas, USA (pp. 3–9).
Bala, A., & Chana, I. (2012). Fault tolerance-challenges, techniques and implementation in cloud computing. IJCSI, 9(1), 288–293.
Beloglazov, A., & Buyya, R. (2013). Managing overloaded hosts for dynamic consolidation of virtual machines in cloud data centers under quality of service constraints. IEEE Transactions on Parallel and Distributed Systems, 24(7), 1366–1379.
Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32.
Buyya, R., Yeo, C. S., Venugopal, S., Broberg, J., & Brandic, I. (2009). Cloud computing and emerging IT platforms: Vision, hype, and reality for delivering computing as the 5th utility. Future Generation Computer Systems, 25(6), 599–616.
Calheiros, R. N., Ranjan, R., Beloglazov, A., Rose, C. A. F. D., & Buyya, R. (2011). CloudSim: A toolkit for modeling and simulation of cloud computing environments and evaluation of resource provisioning algorithms. Software: Practice and Experience, 41, 23–50.
Catal, C. (2011). Software fault prediction: A literature review and current trends. Expert Systems With Applications, 38(4), 4626–4636.
Catal, C., & Diri, B. (2009). A systematic review of software fault prediction studies. Expert Systems with Applications, 36, 7346–7354.
Chen, W., & Deelman, E. (2012). WorkflowSim: A toolkit for simulating scientific workflows in distributed environments. In E-Science, IEEE 8th international conference (pp. 1–8).
Cushing, R., Koulouzis, S., Belloum, A., & Bubak, M. (2014). Applying workflow as a service paradigm to application farming. Concurrency Computation: Practice and Experience, 26, 1297–1312.
Deelman, E. et al. (2005). Pegasus: A framework for mapping complex scientific workflows onto distributed systems. Scientific Programming, 13, 219–237.
Deelman, E. (2009). Grids and clouds: Making workflow applications work in heterogeneous distributed environments. International Journal of High Performance Computing Applications Online First, 1–15.
Deelman, E., Singh, G., Livny, M., Berriman, B., & Good, J. (2008). The cost of doing science on the cloud: The montage example. In Proceedings of the 2008 ACM/IEEE conference on supercomputing, NJ (pp. 1–12).
Duan, R., Prodan, R., & Fahringer, T. (2006). Data mining-based fault prediction and detection on the grid. In Proceedings of the 15th IEEE international symposium on high performance distributed computing. Los Alamitos, CA: IEEE Computer Society Press.
Fu, S., & Xu, C.-Z. (2007). Exploring event correlation for failure prediction in coalitions of clusters. In Proceedings of the 2007 ACM/IEEE conference on supercomputing, SC '07 (41(1), pp. 1–12). New York, NY, USA: ACM.
Fu, S., & Xu, C.-Z. (2010). Quantifying event correlations for proactive failure management in networked computing systems. Journal of Parallel and Distributed Computing, 70(11), 1100–1109.

Ganga, K., Karthik, S., & Christopher Paul, A. (2012). A survey on fault tolerance in workflow management and scheduling. International Journal of Advanced Research in Computer Engineering & Technology, 1(8), 176–179.
Gil, Y., & Deelman, E. (2007). Examining the challenges of scientific workflows. IEEE Computer, 40(12), 24–32.
Guan, Q., Zhang, Z., & Song, F. (2011). Ensemble of Bayesian predictors for autonomic failure management in cloud computing. IEEE, 1–6.
Guan, Q., Zhang, Z., & Song, F. (2012). A failure detection and prediction mechanism for enhancing dependability of data centers. International Journal of Computer Theory and Engineering, 4(5), 726–730.
Guo, L., Ma, Y., Cukic, B., & Singh, H. (2004). Robust prediction of fault-proneness by random forests. In Proceedings of the 15th international symposium on software reliability engineering (pp. 417–428).
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., & Witten, I. H. (2009). The WEKA data mining software: An update. SIGKDD Explorations, 11.
Hosmer, D., & Lemeshow, S. (1989). Applied logistic regression. New York, NY, USA: John Wiley & Sons.
Islam, A., Keung, J., Lee, K., & Liu, A. (2012). Empirical prediction models for adaptive resource provisioning in the cloud. Future Generation Computer Systems, 28, 155–162.
Jhawar, R., Piuri, V., & Santambrogio, M. D. (2012). A comprehensive conceptual system-level approach to fault tolerance in cloud computing. In IEEE international systems conference (pp. 1–5).
Jitsumoto, H., Endo, T., & Matsuoka, S. (2007). ABARIS: An adaptable fault detection/recovery component framework for MPI. In Proceedings of the IEEE international parallel and distributed processing symposium (pp. 1–8). Los Alamitos, CA: IEEE Computer Society Press.
Juve, G., & Deelman, E. (2010). Scientific workflows in the cloud. Cloud Chapter, 56–74.
Juve, G. et al. (2009).
Scientific workflow applications on Amazon EC2. In E-science workshops, 5th IEEE international conference (pp. 59–66).
Juve, G. et al. (2010). Data sharing options for scientific workflows on Amazon EC2. In ACM/IEEE international conference on HPC, networking, storage and analysis, SC 10, Washington, DC, USA (p. 19).
Kousiouris, G., Menychtas, A., Kyriazis, D., Gogouvitis, S., & Varvarigou, T. (2014). Dynamic behavioral-based estimation of resource provisioning based on high-level application terms in cloud platforms. Future Generation Computer Systems, 32, 27–40.
Malhotra, R., & Jain, A. (2012). Fault prediction using statistical and machine learning methods for improving software quality. Journal of Information Processing Systems, 8, 241–262.
Ohlsson, N., Zhao, M., & Helander, M. (1998). Application of multivariate analysis for software fault prediction. Software Quality Journal, 7, 51–66.
Pandey, S., Karunamoorthy, D., & Buyya, R. (2011). Workflow engine for clouds. In R. Buyya, J. Broberg, & A. Goscinski (Eds.), Cloud computing: Principles and paradigms (pp. 321–334). Hoboken, NJ, USA: John Wiley & Sons Inc.


Pathirage, M., Perera, S., Kumara, I., & Weerawarana, S. (2011). A multi-tenant architecture for business process executions. In Proceedings 2011 IEEE international conference on web services (ICWS) (pp. 121–128).
Poola, D., Ramamohanarao, K., & Buyya, R. (2014). Fault-tolerant workflow scheduling using spot instances on clouds. In Procedia computer science, ICCS 2014. 14th international conference on computational science (Vol. 29, pp. 523–533).
Ramakrishnan, S., Reutiman, R., Chandra, A., & Weissman, J. (2013). Accelerating distributed workflows with edge resources. In IEEE international conference, IPDPS workshop (pp. 2129–2138).
Salfner, F., Lenk, M., & Malek, M. (2010). A survey of online failure prediction methods. ACM Computing Surveys, 42, 1–42.
Samak, T., Gunter, D., Goode, M., Deelman, E., Juve, G., Silva, F., & Vahi, K. (2012). Failure analysis of distributed scientific workflows executing in the cloud. In 8th International conference on network and service management (CNSM), Las Vegas, Nevada (pp. 1–9).
Samak, T., Gunter, D., Deelman, E., Juve, G., Mehta, G., Silva, F., et al. (2011). Online fault and anomaly detection for large-scale scientific workflows. In 13th IEEE international conference on high performance computing and communications (HPCC-2011). Banff, Alberta, Canada: IEEE Computer Society.
Sindrilaru, E., Costan, A., & Cristea, V. (2010). Fault tolerance and recovery in grid workflow management systems. In International conference on complex, intelligent and software intensive systems, Krakow (pp. 475–480).
Suresh, Y., Kumar, L., & Rath, S. K. (2014). Statistical and machine learning methods for software fault prediction using CK metric suite: A comparative analysis. ISRN Software Engineering, 1–16.
Tan, W., Chard, K., Sulakhe, D., Madduri, R., Foster, I., Soiland-Reyes, S., & Goble, C. (2009). Scientific workflows as services in caGrid: A Taverna and gRAVI approach. In Web services, ICWS 2009. IEEE international conference (pp. 413–420).
Varghese, B., McKee, G., & Alexandrov, V. (2010). Intelligent agents for fault tolerance: From multi-agent simulation to cluster-based implementation. In 24th international conference. WA: IEEE.
Wang, J., Korambath, P., Altintas, I., Davis, J., & Crawl, D. (2014). Workflow as a service in the cloud: Architecture and scheduling algorithms. In 14th International conference on computational science, procedia computer science (Vol. 29, pp. 546–556).
Xie, M., Dai, Y. S., & Poh, K. L. (2004). Computing system reliability: Models and analysis. Kluwer Academic Publishers.
Yu, Z., Wang, C., & Shi, W. (2010). FLAW: FaiLure-Aware Workflow scheduling in high performance computing systems. Journal of Cluster Computing, 13(4), 421–434. Hingham, MA, USA: Kluwer Academic Publishers.
Zhao, W., Melliar, P., & Moser, L. E. (2010). Fault tolerance middleware for cloud computing. In Proceedings of the 2010 IEEE 3rd international conference on cloud computing. Washington, DC, USA: IEEE Computer Society.