Data Cleansing during Data Collection from Wireless Sensor Networks


Md Zahidul Islam1*, Quazi Mamun2 and Md. Geaur Rahman1
1 School of Computing and Mathematics, Charles Sturt University, Panorama Avenue, Bathurst, NSW 2795, Australia.
2 School of Computing and Mathematics, Charles Sturt University, Locked Bag 588, Boorooma Street, Wagga Wagga, NSW 2678, Australia.
Emails: {zislam, qmamun, grahman}@csu.edu.au

* The first author would like to thank the Faculty of Business COMPACT Fund R4 P55 at Charles Sturt University, Australia.

Copyright 2014, Australian Computer Society, Inc. This paper appeared at the Australasian Data Mining Conference (AusDM 2014), Brisbane, 27-28 November 2014. Conferences in Research and Practice in Information Technology, Vol. 158. Richi Nayak, Xue Li, Lin Liu, Kok-Leong Ong, Yanchang Zhao and Paul Kennedy, Eds. Reproduction for academic, not-for-profit purposes permitted provided this text is included.

Abstract

Quality of data in Wireless Sensor Networks (WSNs) is one of the major concerns for many applications. Data quality may drop due to various reasons, including the existence of missing values and incorrect values (also known as noisy or corrupt values), which can be caused by factors such as interference and machine malfunction. A drop in data quality may seriously impact the performance of decision support systems. Thus, it is crucial to clean the data before using them. In this paper we analyze the impact of missing values in a WSN data set (collected using a Voronoi diagram based network architecture) on data mining tasks such as classification and knowledge discovery. While the quality of the data mining output (classification accuracy) suffers from the existence of missing values, this study shows an improvement when the missing values are imputed through our data cleansing scheme. The proposed scheme uses a corrupt data detection technique and a missing value imputation method for cleaning the data being collected from the sensor nodes. Our empirical analysis indicates the effectiveness of the proposed approach.

Keywords: WSN; data integrity; data cleansing; mobile data collector; Voronoi diagram

1 Introduction

The Wireless Sensor Network (WSN) has a wide range of applications in both military and civilian operations and has therefore attracted huge attention. WSNs are usually deployed in unattended and often hostile environments such as military and homeland security operations (Karlof & Wagner 2003, Douceur 2002, Newsome et al. 2004, Ye et al. 2005, Zhu, Setia, Jajodia & Ning 2004, Xiao et al. 2006, Mamun 2011). Various studies on WSNs show that it is possible for an attacker to spread malicious code over the whole network by exploiting different mechanisms of sensor nodes without physical contact (Giannetsos et al. 2009, Sharma & Ghose 2011). This may disturb the data collection process from the sensor nodes and introduce incorrect and missing data. Another type of error in sensor data takes place when a sensor's energy level is fading away (Ni et al. 2009). Usually, sensors are deployed in remote and unattended areas, and thus it is impractical to change the batteries of the sensor nodes. Sometimes sensors (such as automatic weather stations (Khan et al. 2012)) use solar energy to recharge their batteries; a long night followed by a cloudy day can cause a sensor to have a flat battery. The sensing capabilities reduce as the energy level deteriorates. In these circumstances, the data produced by such sensors can be erroneous or even completely missing. Erroneous or missing data can also be produced because of interference and malfunctions.

The existence of missing values in a data set can seriously impact the performance of decision support systems. In order to explain this better, we empirically test the impact of the existence of missing values on the Intel Lab data set that is publicly available from the Intel Berkeley Research lab (IBRL-Web [online available: http://db.lcs.mit.edu/labdata/labdata.html] 2014). We artificially create missing values in the data set and then calculate the classification accuracy by applying the C4.5 classifier (Quinlan 1996) on the original data set and on the data set having missing values. In this empirical test, 10% of the total attribute values are considered to be missing in the data set with missing values. The procedures for creating missing values and calculating classification accuracy are presented in detail in Section 4. The classification accuracies on the data set without missing values and the data set with missing values are presented in Figure 1. We can see from the figure that the classification accuracy drops significantly on the data set with missing values. This suggests that there is a need to clean the erroneous sensor data set obtained from a WSN.
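This effect is easy to reproduce. Below is a minimal sketch of the test, using scikit-learn's DecisionTreeClassifier as a stand-in for C4.5; the 10% ratio follows the description above, while how the sensor records are loaded into X and y is left open.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    rng = np.random.default_rng(0)

    def inject_missing(X, ratio=0.10):
        # Mark `ratio` of all attribute values (not records) as missing.
        Xm = X.astype(float)
        flat = Xm.ravel()
        idx = rng.choice(flat.size, size=int(ratio * flat.size), replace=False)
        flat[idx] = np.nan
        return Xm

    def accuracy(X, y):
        # 10-fold cross-validated accuracy of a decision tree classifier.
        return cross_val_score(DecisionTreeClassifier(), X, y, cv=10).mean()

    # X, y: attribute matrix and (discretized) class labels of the data set.
    # acc_full = accuracy(X, y)
    # Xm = inject_missing(X)
    # keep = ~np.isnan(Xm).any(axis=1)   # drop records with missing values
    # acc_missing = accuracy(Xm[keep], y[keep])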

Figure 1: The classification accuracies of C4.5 classifiers on the data sets without missing values and with missing values.

The traditional approach to data cleansing requires a manual review of the data, where a domain expert inspects the collected data looking for outliers and unusual events. Data cleansing using traditional methods can be slow, expensive and tedious. Often the manual review of collected data can take several months, delaying the release of the collected data.


Moreover, manual review may not be able to scale up when a large amount of data is collected (Dereszynski & Dietterich 2007). Therefore, it is crucial to automate the data cleansing process.

Automatic data cleansing is an active research area. A method was proposed by Dereszynski and Dietterich to clean sensor data automatically (Dereszynski & Dietterich 2007). It first uses the long-term historical records of a single sensor to derive a probabilistic model of the sensor's behavior over time. Next, the quality of future readings of the sensor is assessed by computing their likelihood given the model. Based on this assessment, the state of the sensor is evaluated as either good or bad. If the state of a sensor is evaluated as bad then the method ignores the data produced by the sensor during that state; instead, it predicts the data based on the model and uses the predicted data for analysis. The method accepts the sensor data if the state of the sensor is evaluated to be good.

The performance of this method (Dereszynski & Dietterich 2007) heavily depends on the probabilistic model, which is built from the historical data of a single sensor of interest. In a sensor network there are typically a number of sensors surrounding any given sensor, and the data produced by the surrounding sensors could be used to make the model more robust. The surrounding sensors that monitor neighboring and overlapping areas often provide redundant data. This redundancy can be exploited to learn the interdependencies of the sensors, which in turn can lead to more accurate predictions without requiring long-term historical records (Ramirez 2011). Multiple sensors are used by a cleansing method (Ramirez 2011) which predicts/assesses the data of a given sensor by considering the sensed data from a number of surrounding sensors. Due to the computational expense of the method, it is applied as a post-processing step instead of an on-line monitoring step. Three learning algorithms, namely artificial neural network (ANN), k-nearest neighbors (KNN) and locally weighted regression (LWR), are used for the prediction of sensor data. A sensor reading/datum is then compared with the predicted datum in order to compute the likelihood of the sensor reading being erroneous. If a sensor reading is considered to be erroneous then it is replaced by a new value which is estimated from the predicted value and the actual sensed value. The learning algorithms used in the method have some limitations; for example, the ANN algorithm generally takes a long training time, and LWR needs to process the whole training data set every time it handles a new instance.

A possible problem in using historical data for the identification of corrupt values and the imputation of missing values is the dissimilarity between the current and historical data in terms of their patterns. Therefore, some data cleansing techniques (Rahman & Islam 2014, 2011, 2013a,b, Cheng et al. 2012, Rahman et al. 2012) first find groups/clusters of similar data and then clean a datum by using the properties of the data that are similar to the value being cleaned. There are also some offline cleaning methods (Mayfield et al. 2010) that are used in a pre-processing step of sensor data analyses. Often the collected data are analyzed through various statistical and data mining techniques for knowledge discovery and future prediction, so it is common to prepare a static data set and then analyze the data off-line.
In this paper we first examine the impact of missing values in a WSN data set on data analysis using a classifier. We find that the classification accuracy drops when there are missing values in a WSN data set (see Figure 1). We then present a data cleansing scheme where we first identify incorrect data. The identified incorrect data are then artificially considered to be missing.


Finally, these artificial missing values, along with any other natural missing values that exist in the data set, are imputed.

An advantage of the proposed scheme is the use of the data collected from neighboring sensors within a close time period. Due to the similarity of the geographic locations of the sensors and the time period when the data are collected, the data are expected to be similar to each other. This similarity supports the corrupt data detection and missing value imputation techniques in achieving better results. Our empirical analysis indicates that the classification accuracy improves when a classifier is built on the data set where missing values are imputed (see Figure 5). Additionally, we present detailed experimental results indicating successful corrupt data detection and missing value imputation in a WSN data set. The best noise detection performance achieved in our experimentation on the Intel Lab data set is a detection of as many as 42.7% of the total incorrect values (i.e. 42.7% Error Recall) with 83% Error Precision (see Table 2). Similarly, we achieve 91.7% of the best possible imputation performance (see Table 4).

The remainder of the paper is organized as follows. We present the network architecture model for the proposed data cleansing approach in Section 2. Section 3 presents our data cleansing scheme. In Section 4, we describe the experimental setup and present the simulation results. Finally, in Section 5, we present the conclusions of the study.

2 Network Architecture Model

The proposed data cleansing approach can be deployed over any hierarchical network, such as cluster based, tree based and chain oriented networks. In this study we consider the chain oriented topology, where multiple chains can be constructed and all the chains are restricted to Voronoi cells (Mamun 2013, Mamun et al. 2013). Additionally, mobile data collectors (MDCs) are used to collect data from the deployed sensor nodes (Mamun 2011). Figure 2 presents the architectural model of the chain oriented topology. The leader nodes, presented as dots, collect data from the sensor nodes within the Voronoi cell that each leader node represents. The leader nodes then send the data to the MDCs when the MDCs visit the polling points, which they do regularly. Various approaches in the literature (Mamun 2011) extend the data gathering scheme for large-scale wireless sensor networks by using multiple MDCs and the spatial division multiple access (SDMA) technique. Figure 2 shows an example where two MDCs simultaneously travel within the network in order to collect data from the leaders.

Since the Base Station (BS) is generally situated outside the sensing field, the distant sensor nodes would be likely to deplete their energy much faster than the nearby nodes if they were transmitting data directly to the BS (Zhao & Yang 2012a, Chen et al. 2011, Zhao & Yang 2012b). Therefore, some recent studies (Liang et al. 2013, Zhang & Chen 2011, Fei et al. 2011) proposed the use of mobile devices (sink mobility) for data gathering. Mobile data collection devices can prolong the network lifetime to a great extent by supporting balanced energy consumption among sensor nodes (Zhi et al. 2010, Ma & Yang 2008). Therefore, we utilize multiple MDCs and use the SDMA technique (Mamun 2011), dividing the sensing field into a number of non-overlapping regions. Each region is served by one MDC, which gathers data from the leaders in the region by traversing through it.


Figure 2: The network architecture model for the proposed anomaly node detection technique (Mamun et al. 2013).

A Voronoi diagram is constructed with respect to the leader nodes in order to determine the traversal paths of the MDCs. Each MDC is equipped with two antennas in order to allow it to take advantage of the SDMA technique for collecting data from two distinct compatible leader nodes in the same region concurrently. This can reduce the data collection time, allowing an MDC to travel to a region more frequently. We consider a number of polling points (presented in Figure 2 as a dot within a circle) where an MDC stops to collect data from the leader nodes. Polling points are supposed to lie in the middle of the leader nodes in a region so that an MDC can take full advantage of the SDMA technique. An MDC can decode the multiplexed signals concurrently transmitted by the leader nodes of a region; a detailed discussion on the physical layer for concurrent data uploading is provided in the literature (Mamun 2011).

We assume that MDCs have access to power supply through the BSs and are not power poor. The MDCs regularly collect data from the leader nodes and periodically transfer the data to the base stations. Hence, the base station collects all the data from the sensor nodes through the leader nodes and MDCs. It then prepares a data set for further data analysis through statistical approaches and data mining techniques in order to discover knowledge and predict the future.
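As a rough illustration (not the authors' exact construction), the Voronoi structure over the leader nodes can be computed with scipy. Treating the midpoint of each pair of Voronoi-adjacent leaders as a candidate polling point is our simplifying assumption, motivated by the requirement that an MDC should be able to reach two compatible leaders concurrently via SDMA.

    import numpy as np
    from scipy.spatial import Voronoi

    # Leader node coordinates within one MDC region (illustrative values).
    leaders = np.array([[0.0, 0.0], [4.0, 1.0], [2.0, 5.0], [6.0, 4.0]])

    vor = Voronoi(leaders)

    # Each Voronoi ridge separates the cells of two leaders; vor.ridge_points
    # lists those leader pairs. The midpoint of such a pair is a candidate
    # polling point from which an MDC could serve both leaders at once.
    polling_points = [leaders[pair].mean(axis=0) for pair in vor.ridge_points]

    for pair, point in zip(vor.ridge_points, polling_points):
        print(f"leaders {pair} -> candidate polling point {point}")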

3 The Data Cleansing Approach in the Proposed Scheme

Based on the assumption that the MDCs have access to sufficient power supply, we consider them to be computationally powerful enough to run data cleansing techniques when they collect the data from the leader nodes. An MDC uses the following five steps to carry out the data cleansing tasks, as shown in Figure 3 (a sketch of the control flow follows the list).

Step 1: Collect the Dataset Do and Make a Copy Dc.
Step 2: Identify Corrupt Values in Dc.
Step 3: Consider the Corrupt Data as Missing Values in Dc.
Step 4: Impute All Missing Values in Dc.
Step 5: Return Do, Dc and a report R to the BS.
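The five steps map naturally onto a small driver routine. A minimal sketch, assuming the data set is a NumPy array of records by attributes and that detect() and impute() stand for any suitable plug-in techniques (such as CAIRAD and FIMUS, discussed below):

    import numpy as np

    def cleanse(D_o, detect, impute):
        """Steps 1-5 on an MDC. detect() returns a boolean mask of corrupt
        cells; impute() fills every NaN cell of its argument."""
        D_c = D_o.astype(float)          # Step 1: work on a copy of D_o
        corrupt = detect(D_c)            # Step 2: flag suspect values
        D_c[corrupt] = np.nan            # Step 3: treat them as missing
        D_c = impute(D_c)                # Step 4: fill all missing values
        # Step 5: report R flags every cell that was modified, whether it
        # was originally missing or identified as corrupt.
        report = np.isnan(D_o.astype(float)) | corrupt
        return D_o, D_c, report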

Step 1: Collect the Dataset Do and Make a Copy Dc .

An MDC travels to a polling point and collects data from the leader nodes that are allocated to the polling point. The collected data are stored in a two dimensional dataset Do where rows represent records and columns represent attributes. Each record represents the data collected by a sensor at a particular time. A sensor may collect data on a number of attributes, such as temperature, humidity, light and voltage, and each column represents the data on a particular attribute. Therefore, a record ri contains the data on the attributes as collected by a sensor at a particular time, and the notation rij represents the j-th attribute value of the i-th record.

A data set contains a set of attributes a = {a1, a2, ..., an} where ak represents the k-th attribute. The k-th attribute can have a range of possible values called the domain of the attribute. For example, the domain of the attribute temperature can be [-30, 50], meaning that the lowest possible value in this example is -30 degrees Celsius and the highest possible value is 50 degrees Celsius. Continuous numerical values can be discretized into categories such as [-30, 0], [1, 30] and [31, 60]. The notation akp represents the p-th category of the k-th attribute.

Every sensor collects data at a regular interval, such as once every 30 seconds. Hence, if there are 50 sensors allocated to a polling point and an MDC travels to the polling point once every 30 minutes, then each time it travels to the polling point it collects data from all 50 sensors for this period and thereby creates an original dataset Do that has 50 × 30 × 2 = 3000 records. Each record contains values on the attributes; some of the values can be incorrect and some can be completely missing. The proposed scheme makes a copy of the dataset Do into Dc.

Step 2: Identify Corrupt Values in Dc.

We then use an existing corrupt data detection technique such as CAIRAD (Rahman et al. 2012) that identifies any record having a possible incorrect value, as well as the value which is suspected to be corrupt. Note that the proposed scheme can use any suitable corrupt data detection technique and is not limited to CAIRAD only. CAIRAD first discretizes Dc.


Figure 3: The Data Cleansing Steps of an MDC.

It then computes the actual co-appearances of each pair of (discretized) attribute values belonging to two different attributes. For example, the actual co-appearance of akp and aln is the number of times the p-th category of the k-th attribute co-appears with the n-th category of the l-th attribute in the same record, counted over the whole dataset. It also calculates an expected number of co-appearances of the pair of values, assuming that each value of an attribute is equally likely to appear. Both actual and expected co-appearances are computed for all possible pairs of values.

Now, for each record ri, CAIRAD explores any incorrect attribute value. Each value of a pair having a significantly lower actual number of co-appearances than the expected number receives a score of 1. On the other hand, if the actual number of co-appearances is not significantly less than the expected number, each value of the pair receives a score of 0. If there are M attributes in a dataset then each attribute value of the record ri is tested against all other (M − 1) attributes. Finally, each value of ri receives a total score, and if the score of an attribute value exceeds a threshold then the value is considered to be corrupt.

Step 3: Consider the Corrupt Data as Missing Values in Dc.

The attribute values rij; ∀i, j that are identified to be corrupt in Step 2 are now considered to be missing/unavailable. Additionally, it is possible that there are some other attribute values in Dc that are originally missing; that is, values that were missing when the data were first collected from the nodes. Both types of missing values are treated as missing in Dc.

Step 4: Impute All Missing Values in Dc.

All the missing values are then imputed using an existing imputation technique such as FIMUS (Rahman & Islam 2014). Note that the proposed scheme can use any existing technique and is not limited to FIMUS only. FIMUS imputes the missing values of Dc based on the co-appearances of values belonging to two attributes, the similarities of values belonging to an attribute and the correlations of attributes. The basic concept of the technique is to impute a missing value rij with a domain value (of the j-th attribute) which has a high co-appearance with the other attribute values (i.e. rik; ∀k) of the record ri, taking into account the correlations of the other attributes with the j-th attribute. Moreover, the technique also considers the similarity between each value rik; ∀k and the other domain values of the k-th attribute.

Step 5: Return Do, Dc and a report R to the BS.

Finally, when the missing values are imputed in Dc, the proposed scheme returns Do and Dc to the BS. It also prepares a report R that contains a flag for each value that has been modified in Dc, either because it was originally missing or because it was identified to be corrupt. Therefore, analysts can double check whether the modifications are sensible, if necessary.
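To make the co-appearance idea behind Steps 2 and 4 concrete, here is a loose sketch of the scoring in Step 2. It is a simplified reimplementation of the co-appearance analysis, not the published CAIRAD algorithm; in particular, the "significantly lower than expected" test is reduced to a simple ratio threshold, which is an assumption on our part.

    import numpy as np
    from collections import Counter

    def coappearance_scores(D, ratio=0.3):
        """D is a 2-D array of discretized category codes (records x
        attributes). A cell earns one point for every attribute pair in
        which its actual co-appearance count falls below `ratio` times the
        expected count under the uniform-appearance assumption."""
        n, m = D.shape
        scores = np.zeros((n, m), dtype=int)
        for j in range(m):
            for k in range(j + 1, m):
                pair_counts = Counter(zip(D[:, j], D[:, k]))
                # Expected co-appearances if every category of each
                # attribute were equally likely: n / (|dom(j)| * |dom(k)|).
                expected = n / (len(set(D[:, j])) * len(set(D[:, k])))
                for i in range(n):
                    if pair_counts[(D[i, j], D[i, k])] < ratio * expected:
                        scores[i, j] += 1
                        scores[i, k] += 1
        return scores

A value whose total score exceeds a chosen threshold (CAIRAD itself uses a co-appearance score threshold λ with a default of 0.3, as discussed in Section 4.3) would then be flagged as corrupt and handed to Step 3.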

4 Simulation and Analysis

4.1 Data Set

In this section we use the publicly available Intel Lab data (IBRL-Web [online available: http://db.lcs.mit.edu/labdata/labdata.html] 2014) to demonstrate the usefulness of the proposed data cleansing scheme in improving the quality of a wireless sensor network dataset. In the Intel Lab data set there are altogether 54 sensor nodes, each of which collects data on temperature, humidity, light and voltage once every 30 seconds. We consider that all these 54 nodes are allocated to a polling point and that the MDC travels to the polling point once every 30 minutes. Therefore, we copy 60 records from each of the 54 sensors for the same 30-minute period and thereby produce a dataset with 3240 records, which we call D. However, we find that D has 120 records with missing values. We first remove the records that originally have missing values and thereby obtain a pure data set having 3120 records without any missing values. We use the pure data set in the experimentation.

4.2 Simulation of Corrupt Data

We then artificially create some corrupt values, for which the actual values are known from the dataset. We use the following assumptions while artificially creating the corrupt values (Rahman et al. 2012). We consider four noise patterns, namely Simple, Medium, Complex and Blended. If a record has any noisy value, then in the Simple pattern the record can have at most one noisy value, in the Medium pattern a record can have 2 to 50% of its attributes with noisy values, and in the Complex pattern 50% to 80% of the attributes of a record can have noisy values. The Blended pattern is a mixture of the three patterns: it contains 25% of the noisy records in the simple pattern, 50% in the medium pattern and 25% in the complex pattern. Since the Blended pattern combines all three noise patterns, we may expect it to reflect a natural scenario.

For each noise pattern, we use various noise levels (1%, 3%, 5% and 10%), where an x% noise level means that x% of the total attribute values (not records) of a data set are noisy. There are altogether 3120 records and each record has four attribute values. Therefore, there are 3120 × 4 = 12480 values, out of which, say, 1% (i.e. 124 values) are made noisy. Since for the Simple pattern each record can have at most one noisy value, there are then 124 noisy records. Moreover, we use three different noise outside ranges of 10%, 30% and 50%, based on the domain of an attribute. For example, at the 10% setting, 10% of the total noisy values (i.e. 12 values) receive a noisy value outside the original domain of the attribute; if the domain of an attribute is [0, 10] then those noisy values fall outside this range. We also consider two noise models, namely uniformly distributed (UD) and Overall, while creating the noisy values. In the UD model the noisy values are equally distributed among the attributes, whereas in the Overall model they are not; in the worst case, all corrupt values may belong to a single attribute only.

Based on the 4 noise levels, 3 noise outside ranges, 2 noise models and 4 noise patterns we have altogether 96 (4 × 3 × 2 × 4) noise combinations, and for each of the combinations we create 10 data sets with noisy values.


For example, for the combination having the "simple" noise pattern, "1%" noise level, "10%" noise range and the "overall" noise model (see Table 5), we generate 10 data sets with noisy values. We therefore create altogether 960 data sets (96 combinations × 10 data sets/combination) for the Intel Lab data set. We then apply CAIRAD (Rahman et al. 2012) on each noisy dataset and identify the noisy values. The accuracy of the identification is evaluated through Error Recall (ER) and Error Precision (EP) (Zhu, Wu & Yang 2004). ER is the ratio of the number of correctly identified corrupt values to the total number of corrupt values, while EP is the ratio of the number of correctly identified corrupt values to the total number of identified corrupt values. The values of both ER and EP vary between 0 and 1, where a higher value indicates better noise detection.
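The two metrics are direct to compute once the true and detected noise positions are known. A small sketch, under the assumption that both are represented as boolean masks over all attribute values:

    import numpy as np

    def error_recall_precision(true_noise, flagged):
        """ER and EP for noise detection. true_noise marks the artificially
        corrupted cells; flagged marks the cells the detector reports."""
        correct = np.logical_and(true_noise, flagged).sum()
        er = correct / true_noise.sum()           # share of corrupt cells found
        ep = correct / max(flagged.sum(), 1)      # share of flags that are right
        return er, ep

(The max(..., 1) simply guards against a detector that flags nothing.)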

4.3 Analysis of the Performance of a Corrupt Data Detection Technique

The overall average noise detection performance on the Intel Lab data set is presented in Table 1. The table shows that the noise detection performance in terms of ER and EP is 0.130 and 0.514, respectively, where each figure is the average of the corresponding indicator over the 960 data sets having noisy values. The EP value of 0.514 means that 51.4% of the total noisy values identified by CAIRAD are originally noisy. Considering the EP values typically achieved by different noise detection techniques on regular data sets, the EP value obtained by CAIRAD on the Intel Lab data set is very encouraging. For example, the EP values achieved by three different techniques, namely EDIR, RDCL and CAIRAD, on the Adult data set (publicly available in the UCI machine learning repository (Frank & Asuncion 2010)) are 0.171, 0.348 and 0.545, respectively (Rahman et al. 2012).

Table 1: Overall average noise detection performance on the Intel Lab data set.

Data Cleansing Technique    ER      EP
CAIRAD                      0.130   0.514

In Table 2 we also present the maximum ER and EP values achieved by CAIRAD in our experiments on the Intel Lab data set. CAIRAD achieves an ER value of 0.427 for the combination having "1%" noisy values, "50%" noise range, "UD" noise model and "medium" noise pattern. On the other hand, CAIRAD achieves an EP value of 0.830 for the combination having "10%" noisy values, "50%" noise range, "UD" noise model and "complex" noise pattern. This indicates that 83% of the identified noisy/incorrect values are actually noisy. Moreover, 42.7% of the actually noisy values are identified in the data set. This is a reasonably high achievement in noise detection, which is expected to increase the quality of any data analysis, including data mining and statistical analyses.

Table 2: Maximum achievable noise detection performance by CAIRAD on the Intel Lab data set.

Data Cleansing Technique    ER      EP
CAIRAD                      0.427   0.830

It is also worth noting that CAIRAD uses a user defined threshold, called the co-appearance score threshold (λ), to determine whether a value is noisy or correct. CAIRAD uses a default value of 0.3 for λ, and the results that we present in Table 1 and Table 2 are obtained based on this default value. However, it may be possible to achieve even higher ER and EP by adjusting λ. We therefore calculate the ER and EP values for different λ values for the noise combination having "1%" noisy values, "50%" noise range, "UD" noise model and "medium" noise pattern, as shown in Figure 4. From the figure we can see that CAIRAD achieves a high ER value (67.7%) for λ = 0.1; that is, 67.7% of the total noisy values are detected by CAIRAD. The noise detection is expected to increase the data quality and benefit subsequent data analysis on the sensor network data set.

Figure 4: The noise detection performance for different λ values.

4.4 Simulation of Missing Values

After the noise detection we now aim to correct the noisy values. We consider the artificially created noisy values as missing and then impute them by using an imputation technique. Note that a data set may also originally contain missing values; for example, 120 records of the Intel Lab data set originally have missing values.

For the 4 noise levels, 3 noise outside ranges, 2 noise models and 4 noise patterns we have altogether 96 (4 × 3 × 2 × 4) missing combinations, and for each of the combinations 10 data sets with missing values are created. For the combination having the "simple" noise pattern, "1%" noisy values, "10%" noise range and the "overall" model (see Table 6), 10 data sets with missing values are generated. Therefore, altogether 960 data sets (96 combinations × 10 data sets/combination) with missing values are created for the Intel Lab data set. The original values of the artificially created missing values are known.

We then apply FIMUS on each data set to impute the missing values. The accuracy of the imputation is evaluated through two commonly used metrics: the Index of Agreement (d2) (Willmott 1982) and the Root Mean Square Error (RMSE) (Junninen et al. 2004). Both metrics estimate the difference between an original value and an imputed value, where the closer the values the better the imputation. The values of d2 vary between 0 and 1, where a higher value indicates a better imputation. The RMSE values vary between 0 and infinity, where a lower RMSE value indicates a better imputation.

4.5 Analysis of the Performance of a Missing Value Imputation Technique

We now present the overall imputation performance (i.e. the average value of the performance indicators over the 960 data sets having missing values) on the Intel Lab data set in Table 3. The d2 of the imputed values is 0.888, which is very high, meaning that the imputed values are very close to the original values. It is worth mentioning that the maximum d2 value is 1.000, which would indicate that the original and imputed values are identical. Therefore, the d2 value of 0.888 means that we achieve 88.8% of the maximum possible imputation accuracy, by which we mean the case where all values would be imputed accurately. Similarly, the RMSE of the imputed values is 0.121, which again indicates a good imputation, given that the minimum possible RMSE value is zero, the maximum possible RMSE value is infinity, and a lower RMSE value indicates a better imputation.
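Both metrics reduce to a few lines. A sketch of the standard formulas, assuming (as the magnitudes above suggest) that attribute values are normalized before comparison so that RMSE is comparable across attributes:

    import numpy as np

    def index_of_agreement(original, imputed):
        # Willmott's d2: 1 minus the ratio of the squared imputation error
        # to the potential error around the mean of the original values.
        o_bar = original.mean()
        num = ((imputed - original) ** 2).sum()
        den = ((np.abs(imputed - o_bar) + np.abs(original - o_bar)) ** 2).sum()
        return 1.0 - num / den

    def rmse(original, imputed):
        # Root mean squared error between original and imputed values.
        return np.sqrt(((imputed - original) ** 2).mean())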

Table 3: Overall average imputation performance on the Intel Lab data set.

Data Cleansing Technique    d2      RMSE
FIMUS                       0.888   0.121

In Table 4 we present the maximum d2 and RMSE values achieved by FIMUS in our experiments on the Intel Lab data set. FIMUS achieves a d2 value of 0.917 for the combination having "1%" noisy values, "50%" noise range, "UD" noise model and "simple" noise pattern. The high d2 value indicates a high agreement between the original and imputed values. Besides, for the same combination FIMUS achieves a low RMSE value of 0.104, which also indicates a high imputation accuracy by FIMUS on the Intel Lab data set.

Table 4: Maximum achievable imputation performance by FIMUS on the Intel Lab data set.

Data Cleansing Technique    d2      RMSE
FIMUS                       0.917   0.104

4.6 Analysis of the Effectiveness of Imputation based on the Prediction Accuracy

We now analyze the effectiveness of the imputation based on the prediction accuracy of a decision tree algorithm, namely C4.5 (Quinlan 1996). Note that since all attributes of the Intel Lab data set are numerical, we first categorize the values of the attribute "voltage" by applying an existing discretization algorithm called PD (Yang & Webb 2009) so that the attribute can be used as the class attribute while building a decision tree. We then use a 10-fold cross validation to evaluate the classification accuracy without missing values, with missing values and with imputed values, as follows.

For each fold, the data set D having n records is divided into two sub data sets, namely the testing data set Dtesting and the training data set DO. The testing data set contains n/10 records of D and the training data set contains the remaining 9n/10 records of D. From the training data set we then create three data sets as follows. The first data set is the original training data set DO, in which there are no records with missing values. The second data set DC is obtained as follows: using a 10% missing ratio, the Overall missing model and the Blended missing pattern, we artificially create missing values in DO and get a data set DF having missing values; we then remove the records having missing values from DF and get a data set DC that contains records without any missing values. The third data set DF′ is obtained by imputing the missing values of DF. We then build decision trees (DTs), namely DTO, DTC and DTF′, by applying the C4.5 algorithm on DO, DC and DF′, respectively.

We next calculate the classification accuracies of the DTs on the testing data set Dtesting, as shown in Figure 5. We can see that the classification accuracy of DTC, which is built from the data set DC (obtained after removing the records with missing values), is much lower than the classification accuracy of DTO, which is built from the original data set (without missing values). We achieve a higher classification accuracy with DTF′, which is built from the imputed data set DF′ (with imputed values), than with DTC. The improvement in accuracy indicates the usefulness of the imputation approach in wireless sensor network data sets.

Figure 5: The classification accuracy of C4.5 classifiers on the data sets without missing values, with missing values and with imputed values.
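A compact sketch of this per-fold comparison, using scikit-learn's DecisionTreeClassifier as a stand-in for C4.5; inject_missing() and impute() are hypothetical helpers standing for the missing-value creation described above and for any imputation technique such as FIMUS:

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    def compare_fold(X, y, train, test, inject_missing, impute):
        """Accuracies of trees built on D_O (original), D_C (records with
        missing values removed) and D_F' (imputed), on one shared test set."""
        def acc(Xtr, ytr):
            return DecisionTreeClassifier().fit(Xtr, ytr).score(X[test], y[test])

        X_o, y_o = X[train], y[train]
        X_f = inject_missing(X_o)            # 10% missing, Overall + Blended
        keep = ~np.isnan(X_f).any(axis=1)    # D_C: drop incomplete records
        X_fp = impute(X_f)                   # D_F': fill the missing cells
        return acc(X_o, y_o), acc(X_f[keep], y_o[keep]), acc(X_fp, y_o)

    # for train, test in KFold(n_splits=10, shuffle=True).split(X):
    #     acc_O, acc_C, acc_Fp = compare_fold(X, y, train, test,
    #                                         inject_missing, impute)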

4.7 Detailed Experimental Results on Corrupt Data Detection and Imputation

We also present the detailed noise detection and imputation performances for all 96 combinations in Table 5 and Table 6. Table 5 shows the noise detection performance of CAIRAD in terms of ER and EP for the 96 noise combinations. For each combination of noise level, noise range, noise model and noise pattern, the table presents the average values of the performance indicators over the 10 data sets having noisy values. For example, for the combination of "1%" noise level, "10%" noise range, "Overall" noise model and "Medium" noise pattern, the average ER and EP are 0.297 and 0.276, respectively. Moreover, Table 6 presents the imputation performance of FIMUS in terms of d2 and RMSE for the 96 missing combinations, again averaged over the 10 data sets of each combination. For example, for the combination of "1%" noise level, "10%" noise range, "Overall" noise model and "Simple" noise pattern, the average d2 and RMSE are 0.913 and 0.107, respectively.

Table 5: Noise detection performance of a data cleansing technique (such as CAIRAD) on the Intel Lab data set. Each cell gives ER / EP for the given noise pattern, averaged over the 10 data sets of that noise combination.

Noise Level  Noise Range  Noise Model  Simple         Medium         Complex        Blended
1%           10%          Overall      0.108 / 0.232  0.297 / 0.276  0.300 / 0.292  0.223 / 0.273
1%           10%          UD           0.087 / 0.217  0.248 / 0.300  0.238 / 0.301  0.154 / 0.254
1%           30%          Overall      0.128 / 0.278  0.306 / 0.302  0.327 / 0.334  0.239 / 0.317
1%           30%          UD           0.122 / 0.266  0.349 / 0.331  0.315 / 0.322  0.240 / 0.308
1%           50%          Overall      0.121 / 0.268  0.419 / 0.410  0.369 / 0.382  0.264 / 0.379
1%           50%          UD           0.152 / 0.350  0.427 / 0.406  0.402 / 0.400  0.265 / 0.359
3%           10%          Overall      0.035 / 0.285  0.150 / 0.445  0.143 / 0.446  0.093 / 0.371
3%           10%          UD           0.032 / 0.260  0.128 / 0.436  0.158 / 0.460  0.087 / 0.387
3%           30%          Overall      0.045 / 0.418  0.195 / 0.587  0.200 / 0.599  0.126 / 0.560
3%           30%          UD           0.049 / 0.442  0.190 / 0.547  0.204 / 0.589  0.130 / 0.550
3%           50%          Overall      0.054 / 0.466  0.266 / 0.671  0.231 / 0.655  0.173 / 0.643
3%           50%          UD           0.054 / 0.467  0.270 / 0.678  0.246 / 0.643  0.168 / 0.640
5%           10%          Overall      0.030 / 0.419  0.088 / 0.503  0.083 / 0.478  0.048 / 0.444
5%           10%          UD           0.022 / 0.359  0.083 / 0.475  0.097 / 0.516  0.042 / 0.378
5%           30%          Overall      0.027 / 0.449  0.151 / 0.682  0.148 / 0.685  0.081 / 0.601
5%           30%          UD           0.029 / 0.472  0.136 / 0.654  0.133 / 0.653  0.079 / 0.590
5%           50%          Overall      0.025 / 0.511  0.149 / 0.751  0.166 / 0.737  0.095 / 0.677
5%           50%          UD           0.027 / 0.499  0.160 / 0.735  0.162 / 0.725  0.103 / 0.706
10%          10%          Overall      0.015 / 0.597  0.040 / 0.564  0.044 / 0.577  0.020 / 0.453
10%          10%          UD           0.014 / 0.529  0.044 / 0.592  0.041 / 0.582  0.023 / 0.517
10%          30%          Overall      0.013 / 0.599  0.054 / 0.742  0.061 / 0.763  0.024 / 0.586
10%          30%          UD           0.013 / 0.656  0.061 / 0.748  0.062 / 0.715  0.024 / 0.608
10%          50%          Overall      0.009 / 0.663  0.056 / 0.818  0.053 / 0.828  0.025 / 0.770
10%          50%          UD           0.007 / 0.631  0.059 / 0.819  0.055 / 0.830  0.023 / 0.694

Table 6: Imputation performance of a data cleansing technique (such as FIMUS) on the Intel Lab data set. Each cell gives d2 / RMSE for the given noise pattern, averaged over the 10 data sets of that missing combination.

Noise Level  Noise Range  Noise Model  Simple         Medium         Complex        Blended
1%           10%          Overall      0.913 / 0.107  0.878 / 0.123  0.861 / 0.130  0.896 / 0.117
1%           10%          UD           0.896 / 0.117  0.850 / 0.141  0.899 / 0.113  0.880 / 0.122
1%           30%          Overall      0.888 / 0.122  0.887 / 0.123  0.867 / 0.134  0.882 / 0.121
1%           30%          UD           0.880 / 0.122  0.884 / 0.117  0.879 / 0.128  0.880 / 0.128
1%           50%          Overall      0.906 / 0.108  0.859 / 0.136  0.889 / 0.120  0.883 / 0.125
1%           50%          UD           0.917 / 0.104  0.863 / 0.132  0.885 / 0.121  0.908 / 0.115
3%           10%          Overall      0.894 / 0.117  0.873 / 0.129  0.870 / 0.134  0.880 / 0.128
3%           10%          UD           0.896 / 0.121  0.884 / 0.120  0.874 / 0.129  0.887 / 0.120
3%           30%          Overall      0.908 / 0.112  0.876 / 0.127  0.886 / 0.120  0.889 / 0.123
3%           30%          UD           0.911 / 0.108  0.885 / 0.121  0.899 / 0.115  0.897 / 0.116
3%           50%          Overall      0.912 / 0.107  0.888 / 0.123  0.879 / 0.125  0.895 / 0.120
3%           50%          UD           0.909 / 0.111  0.872 / 0.129  0.882 / 0.124  0.891 / 0.123
5%           10%          Overall      0.901 / 0.114  0.882 / 0.124  0.871 / 0.129  0.893 / 0.119
5%           10%          UD           0.889 / 0.122  0.882 / 0.124  0.883 / 0.123  0.881 / 0.125
5%           30%          Overall      0.911 / 0.109  0.879 / 0.125  0.879 / 0.126  0.879 / 0.127
5%           30%          UD           0.908 / 0.112  0.874 / 0.129  0.886 / 0.122  0.890 / 0.122
5%           50%          Overall      0.910 / 0.111  0.880 / 0.123  0.870 / 0.132  0.888 / 0.122
5%           50%          UD           0.904 / 0.114  0.881 / 0.123  0.887 / 0.120  0.881 / 0.125
10%          10%          Overall      0.904 / 0.115  0.886 / 0.122  0.886 / 0.123  0.892 / 0.119
10%          10%          UD           0.910 / 0.111  0.880 / 0.124  0.883 / 0.125  0.887 / 0.122
10%          30%          Overall      0.901 / 0.116  0.888 / 0.121  0.879 / 0.123  0.890 / 0.120
10%          30%          UD           0.909 / 0.113  0.882 / 0.123  0.886 / 0.120  0.892 / 0.120
10%          50%          Overall      0.903 / 0.116  0.880 / 0.125  0.886 / 0.122  0.894 / 0.120
10%          50%          UD           0.900 / 0.116  0.878 / 0.123  0.885 / 0.122  0.878 / 0.127

5 Conclusion

In this study we first discuss a Mobile Data Collector (MDC) based approach to sensor network data collection. Due to a number of reasons, including flat batteries, equipment malfunction and node compromise, wireless sensor network data may contain incorrect and missing values. We therefore discuss the quality of data mining results obtained from an uncleaned data set and compare it with the result obtained from a clean data set (see Figure 1). It shows that the data mining quality can drop significantly in the presence of incorrect and missing data.

We then present a data cleansing scheme to identify the incorrect data and impute all missing values. We suggest applying the data cleansing techniques within an MDC even before it transmits the data to the base station. An advantage of this approach can be the utilization of the traveling time while the MDC is moving between two polling points. Once the missing values are imputed, we carry out the data mining analysis on the imputed data set and show the improvement in the prediction accuracy of the classifier built from the imputed data set (see Figure 5). Moreover, we also present the accuracy of incorrect data identification and missing value imputation, which suggests the effectiveness of the data cleansing approaches used in this study for the sensor network data set (see Table 1 to Table 6).

As part of our future work, we aim to propose novel corrupt data detection and missing value imputation techniques that are specifically catered for wireless sensor network data sets. For example, although the proposed data cleansing scheme uses data from neighboring sensors within a close time period, the actual cleansing techniques (CAIRAD and FIMUS) do not take further advantage of time series data. It would be interesting to investigate whether further consideration of time could increase the accuracy of incorrect data detection and missing value imputation. Since the sensors in the wireless network have limited energy, it would also be interesting to investigate the efficiency of our proposed algorithm in such domains. Moreover, we also aim to compare a number of existing techniques with the proposed techniques in future.

References

Chen, Y., Tang, Y., Xu, G., Qian, H. & Xu, Y. (2011), A data gathering algorithm based on swarm intelligence and load balancing strategy for mobile sink, in 'Intelligent Control and Automation (WCICA), 2011 9th World Congress on', IEEE, pp. 1002-1007.

Cheng, K., Law, N. & Siu, W. (2012), 'Iterative bicluster-based least square framework for estimation of missing values in microarray gene expression data', Pattern Recognition 45(4), 1281-1289.

Dereszynski, E. W. & Dietterich, T. G. (2007), Probabilistic models for anomaly detection in remote sensor data streams, in 'Proceedings of the Twenty-Third Conference on Uncertainty in Artificial Intelligence (UAI 2007)'.

Douceur, J. R. (2002), The sybil attack, in 'Peer-to-peer Systems', Springer, pp. 251-260.

Fei, X., Boukerche, A. & Yu, R. (2011), An efficient markov decision process based mobile data gathering protocol for wireless sensor networks, in 'Wireless Communications and Networking Conference (WCNC), 2011 IEEE', IEEE, pp. 1032-1037.

Frank, A. & Asuncion, A. (2010), 'UCI machine learning repository'. Accessed July 7, 2013. URL: http://archive.ics.uci.edu/ml

Giannetsos, T., Dimitriou, T. & Prasad, N. R. (2009), Self-propagating worms in wireless sensor networks, in 'Proceedings of the 5th international student workshop on Emerging networking experiments and technologies', ACM, pp. 31-32.

IBRL-Web (2014). Accessed August 7, 2014. URL: http://db.lcs.mit.edu/labdata/labdata.html

Junninen, H., Niska, H., Tuppurainen, K., Ruuskanen, J. & Kolehmainen, M. (2004), 'Methods for imputation of missing values in air quality data sets', Atmospheric Environment 38(18), 2895-2907.

Karlof, C. & Wagner, D. (2003), 'Secure routing in wireless sensor networks: Attacks and countermeasures', Ad hoc networks 1(2), 293-315.

Khan, M. A., Islam, M. Z. & Hafeez, M. (2012), Evaluating the performance of several data mining methods for predicting irrigation water requirement, in 'Proceedings of the Tenth Australasian Data Mining Conference - Volume 134', Australian Computer Society, Inc., pp. 199-207.

Liang, W., Schweitzer, P. & Xu, Z. (2013), 'Approximation algorithms for capacitated minimum forest problems in wireless sensor networks with a mobile sink', Computers, IEEE Transactions on 62(10), 1932-1944.

Ma, M. & Yang, Y. (2008), Data gathering in wireless sensor networks with mobile collectors, in 'Parallel and Distributed Processing, 2008. IPDPS 2008. IEEE International Symposium on', IEEE, pp. 1-9.

Mamun, Q. (2011), Constraint-Minimizing Logical Topology for Wireless Sensor Networks, PhD thesis, Monash University.

Mamun, Q. (2013), 'A tessellation-based localized chain construction scheme for chain-oriented sensor networks', Sensors Journal, IEEE 13(7), 2648-2658.

Mamun, Q., Islam, R. & Kaosar, M. (2013), Ensuring data integrity by anomaly node detection during data gathering in wsns, in 'Security and Privacy in Communication Networks', Springer, pp. 367-379.

Mayfield, C., Neville, J. & Prabhakar, S. (2010), Eracer: a database approach for statistical inference and data cleaning, in 'Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data', ACM, pp. 75-86.

Newsome, J., Shi, E., Song, D. & Perrig, A. (2004), The sybil attack in sensor networks: analysis & defenses, in 'Proceedings of the 3rd international symposium on Information processing in sensor networks', ACM, pp. 259-268.

Ni, K., Ramanathan, N., Chehade, M. N. H., Balzano, L., Nair, S., Zahedi, S., Kohler, E., Pottie, G., Hansen, M. & Srivastava, M. (2009), 'Sensor network data fault types', ACM Transactions on Sensor Networks (TOSN) 5(3), 25.

Quinlan, J. R. (1996), 'Improved use of continuous attributes in C4.5', Journal of Artificial Intelligence Research 4, 77-90.

Rahman, M. G. & Islam, M. Z. (2011), A decision tree-based missing value imputation technique for data pre-processing, in 'Australasian Data Mining Conference (AusDM 11)', Vol. 121 of CRPIT, ACS, Ballarat, Australia, pp. 41-50. URL: http://crpit.com/confpapers/CRPITV121Rahman.pdf

Rahman, M. G. & Islam, M. Z. (2013a), kDMI: A novel method for missing values imputation using two levels of horizontal partitioning in a data set, in 'The 9th International Conference on Advanced Data Mining and Applications (ADMA 2013), Part II, LNAI 8347', Hangzhou, China, pp. 250-263.

Rahman, M. G. & Islam, M. Z. (2013b), 'Missing value imputation using decision trees and decision forests by splitting and merging records: Two novel techniques', Knowledge-Based Systems 53, 51-65.

Rahman, M. G. & Islam, M. Z. (2014), 'FIMUS: A framework for imputing missing values using co-appearance, correlation and similarity analysis', Knowledge-Based Systems 56, 311-327.

Rahman, M. G., Islam, M. Z., Bossomaier, T. & Gao, J. (2012), CAIRAD: a co-appearance based analysis for incorrect records and attribute-values detection, in 'Neural Networks (IJCNN), The 2012 International Joint Conference on', IEEE, pp. 1-10.

Ramirez, G. (2011), 'Assessing data quality in a sensor network for environmental monitoring'.

Sharma, K. & Ghose, M. (2011), 'Cross layer security framework for wireless sensor networks', International Journal of Security and Its Applications 5(1), 39-52.

Willmott, C. (1982), 'Some comments on the evaluation of model performance', Bulletin of the American Meteorological Society 63, 1309-1369.

Xiao, M., Wang, X. & Yang, G. (2006), Cross-layer design for the security of wireless sensor networks, in 'Intelligent Control and Automation, 2006. WCICA 2006. The Sixth World Congress on', Vol. 1, IEEE, pp. 104-108.

Yang, Y. & Webb, G. I. (2009), 'Discretization for naive-bayes learning: managing discretization bias and variance', Machine Learning 74(1), 39-74.

Ye, F., Luo, H., Lu, S. & Zhang, L. (2005), 'Statistical en-route filtering of injected false data in sensor networks', Selected Areas in Communications, IEEE Journal on 23(4), 839-850.

Zhang, X. & Chen, G. (2011), Energy-efficient platform designed for SDMA applications in mobile wireless sensor networks, in 'Wireless Communications and Networking Conference (WCNC), 2011 IEEE', IEEE, pp. 2089-2094.

Zhao, M. & Yang, Y. (2012a), 'Bounded relay hop mobile data gathering in wireless sensor networks', Computers, IEEE Transactions on 61(2), 265-277.

Zhao, M. & Yang, Y. (2012b), 'Optimization-based distributed algorithms for mobile data gathering in wireless sensor networks', Mobile Computing, IEEE Transactions on 11(10), 1464-1477.

Zhi, Z., Dayong, L., Shaoqiang, L., Xiaoping, F. & Zhihua, Q. (2010), Data gathering strategies in wireless sensor networks using a mobile sink, in 'Control Conference (CCC), 2010 29th Chinese', IEEE, pp. 4826-4830.

Zhu, S., Setia, S., Jajodia, S. & Ning, P. (2004), An interleaved hop-by-hop authentication scheme for filtering of injected false data in sensor networks, in 'Security and Privacy, 2004. Proceedings. 2004 IEEE Symposium on', IEEE, pp. 259-271.

Zhu, X., Wu, X. & Yang, Y. (2004), Error detection and impact-sensitive instance ranking in noisy datasets, in 'Proceedings of the National Conference on Artificial Intelligence', AAAI Press / MIT Press, pp. 378-384.
