Review Paper Int. J. of Recent Trends in Engineering and Technology, Vol. 4, No. 2, Nov 2010

Verification of Detection of Principal Components in Low Interaction Honeypots using StatistiXL Tool

Geeta Sharma1, Mandeep Kaur2

1 ME 2nd year, IT Deptt., Panjab University, Chandigarh, India. Email: [email protected]
2 Lecturer, IT Deptt., Panjab University, Chandigarh, India. Email: [email protected]

Abstract: Honeypots are used to detect malicious activity by attackers. It is generally assumed that new types of attack cannot be detected by low interaction honeypots. However, principal component analysis (PCA) can be applied to low interaction honeypot data to detect known as well as unknown attacks. PCA requires no prior knowledge of attack types and has low computational requirements, which makes it suitable for online detection systems. PCA has already been used to detect new attacks in low interaction honeypots; the aim of this work is to verify the number of principal components detected, by comparing the results of the StatistiXL tool with those of PCA in Matlab. The detection is based on measuring changes in the residual space using the squared prediction error (SPE) statistic. When attack vectors are projected onto the residual space, attacks that are not represented by the main hyperspace create new directions with high SPE values. StatistiXL is a tool that acts as a plug-in for Excel. KFSensor software has been used to turn a system into a honeypot and to collect the data. Seven main parameters have been extracted from the logs and used for PCA: source port number, destination port number, protocol used, severity of attack, number of bytes received, source IP and destination IP. The paper verifies the PCA of the StatistiXL tool against PCA in Matlab.

Keywords: Honeypot, Standard Deviation, Principal Component Analysis, Correlation, Covariance.

I. INTRODUCTION
Honeypots are fake information servers strategically positioned in a test network to capture new viruses or worms. A honeypot is a security resource whose value lies in being probed, attacked, or compromised [1]. Honeypots are very valuable for collecting different types of attack traffic. However, characterizing attackers' activities present in honeypot traffic data can be challenging due to the high dimensionality of the data and the large volumes of traffic data collected. Principal component analysis is a quantitatively rigorous method for achieving this simplification. The method generates a new set of variables, called principal components. Each principal component is a linear combination of the original variables. All the principal components are orthogonal to each other, so there is no redundant information. The principal components as a whole form an orthogonal basis for the space of the data. PCA is mathematically defined as an orthogonal linear transformation that transforms the data to a new coordinate system such that the greatest variance by any projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. PCA is theoretically the optimum transform for given data in least squares terms.

The rest of the paper is organized as follows. Section II overviews related work. Section III provides a brief summary of principal component analysis. The dataset used in this study and the pre-processing are described in Section IV. The StatistiXL tool is discussed, and the process of applying the Principal Components module of the StatistiXL tool to the preprocessed honeypot data is presented, in Section V. The results of StatistiXL are compared with PCA in Matlab in Section VI. Finally, the paper is concluded in Section VII.

© 2010 ACEEE DOI: 01.IJRTET 04.02.79

II. RELATED WORK
A honeypot is a security resource whose value lies in being probed, attacked, or compromised [1]. Honeypots are very valuable for collecting different types of attack traffic. However, characterizing attackers' activities present in honeypot traffic data can be challenging due to the high dimensionality of the data (or large number of variables) and the large volumes of traffic data collected. N. Provos proposed a virtual framework for designing a low-interaction honeypot daemon [2]. M. Dacier proposed a script generator tool which improves the interactivity of the honeypot [3]. Principal component analysis (PCA) is a widely used multivariate statistical technique for reducing the dimensionality of variables, unveiling latent structures and detecting outliers in data sets [4], [5]. It has been widely used in multi-disciplinary research areas such as Internet traffic analysis, economics, image processing, and genetics. PCA is
mainly used to reduce the dimensionality of a data set into a few uncorrelated variables, principal components (PCs), which retain most of the variation in the original data. The resulting principal components are linear combinations of the original variables, are orthogonal, and are ordered with the first principal component having the largest variance. The use of PCA to structure network traffic flow was introduced by Lakhina [6], where PCA is used to decompose the structure of Origin-Destination flows from two backbone networks into three main constituents, namely periodic trends, bursts and noise. S. Almotairi [7] applied PCA to traffic flows of low-interaction honeypots to detect the structure of attackers' activities and to break honeypot traffic into seven dominant clusters.

III. PRINCIPAL COMPONENT ANALYSIS
PCA involves a mathematical procedure that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components. PCA is a multivariate statistical technique that has been widely used in multi-disciplinary research areas such as Internet traffic analysis, economics, image processing, and genetics, to name only a few. PCA is mainly used to reduce the dimensionality of a data set into a few uncorrelated variables, principal components (PCs), which retain most of the variation in the original data [1, 7-9]. The resulting principal components are linear combinations of the original variables, are orthogonal, and are ordered with the first principal component having the largest variance. Although the number of resulting principal components is equal to the number of original variables, much of the variance in the original set of p variables can be retained by the first k PCs, where k < p. Thus, the original p variables can be replaced by the new k principal components. Principal component analysis has the following advantages:
• It does not require any distributional assumptions and can be used with many types of data.
• The extracted principal components are uncorrelated.
• The first few principal components retain most of the variation in the original data.

IV. DATA COLLECTION AND PREPROCESSING
A networked machine is made a low interaction honeypot using KFSensor software. The KFSensor TCP dump, i.e. the logs, is collected. Logs have been collected from 18-11-09 to 2-12-09. The logs are daily reports in the form of XML documents. These log files contain information about the protocol used, number of bytes transferred, source IP address, destination IP address, source port, destination port, time of attack, etc. We have considered a total of 3100 records for the analysis. Daily logs are combined into a single log file for preprocessing and analysis. The seven major parameters are extracted from the XML log files on a Linux operating system; Linux shell programming has been used to extract the features from each log. The features extracted are protocol used, source port number, destination port number, source IP address, destination IP address, severity level and size of received bytes. For analysis, data is required in numeric form, but some of the data is in the form of IP addresses and protocol names. So the unique IP addresses used as source and destination are determined, and each is assigned a unique number in place of the IP address. Further, the major protocols used are TCP and UDP, so these are also assigned numeric integers (i.e. 1 and 2 respectively). The data is stored in an Excel sheet.

V. STATISTIXL TOOL
StatistiXL is an Excel-based tool capable of performing many statistical functions. These include Analysis of Variance (ANOVA), Cluster Analysis, Contingency Tables, Simple, Partial, Multiple and Canonical Correlation, Linear and Circular Descriptive Statistics, Classification and Grouping Discriminant Analysis, Factor Analysis, Goodness of Fit Tests, Simple and Multiple Linear Regression, Nonparametric Tests, Principal Component Analysis (PCA), and Univariate and Multivariate t-Tests. For verification we have used the Principal Components module of the tool. Data is provided in numeric form in an Excel sheet. Here a complex data set containing D variables is transformed into a smaller set of new variables which maximize the variance of the original data set. All of the new variables are uncorrelated. The tool is simple to use. We provide the whole data set, containing thousands of records, as the input variable range, and provide the output variable range for each column as the number of distinct values. The output of the tool is in the form of tables and a scree plot of the eigenvalues against the principal components. The tool provides the option of calculating the principal components using either the correlation matrix or the covariance matrix method.

A. Principal Components using Correlation Matrix
The tool provides the descriptive statistics and computes the correlation matrix, eigenvalues and eigenvectors. The descriptive statistics are shown in Table I.

TABLE I. DESCRIPTIVE STATISTICS

Variable      Mean        Std Dev.     Std Err    N
Protocol      1.004       0.065        0.001      3099
Severity      1.948       0.250        0.004      3099
Bytes         606.452     2315.671     41.597     3099
ClientIP      141.621     90.063       1.618      3099
ClientPort    5841.183    11152.049    200.329    3099
HostIP        7.571       1.590        0.029      3099
HostPort      4590.187    2929.687     52.627     3099
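The Std Err column of Table I appears to be the standard error of the mean, that is, the standard deviation divided by the square root of N. The following minimal Python sketch checks that reading against the tabulated values. The interpretation of the column and the helper name `standard_error` are assumptions made for illustration; the StatistiXL output itself does not state the formula.

```python
import math

def standard_error(std_dev, n):
    """Standard error of the mean: s / sqrt(n)."""
    return std_dev / math.sqrt(n)

# (Std Dev., Std Err) pairs copied from Table I; N is 3099 for every variable.
TABLE_I = {
    "Protocol": (0.065, 0.001),
    "Severity": (0.250, 0.004),
    "Bytes": (2315.671, 41.597),
    "ClientIP": (90.063, 1.618),
    "ClientPort": (11152.049, 200.329),
    "HostIP": (1.590, 0.029),
    "HostPort": (2929.687, 52.627),
}
N = 3099

for name, (std_dev, reported) in TABLE_I.items():
    computed = standard_error(std_dev, N)
    print(f"{name:<10} computed={computed:9.3f}  reported={reported:9.3f}")
```

Small differences in the last digit are expected, because the standard deviations in Table I are themselves rounded to three decimal places.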

StatistiXL computes the mean and standard deviation in order to compute the correlation matrix. It then computes the eigenvalues and eigenvectors for all seven components.
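How a correlation matrix follows from the column means and standard deviations can be sketched in a few lines of Python. The toy records and the helper names `zscores` and `correlation_matrix` below are hypothetical and are not taken from the honeypot logs or from StatistiXL. The sketch only illustrates the standard definition: each column is converted to z-scores, and the correlation between two columns is the sum of the products of their z-scores divided by n - 1.

```python
import math

def zscores(column):
    """Centre a column on its mean and scale it by its sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    std_dev = math.sqrt(sum((x - mean) ** 2 for x in column) / (n - 1))
    return [(x - mean) / std_dev for x in column]

def correlation_matrix(columns):
    """Pearson correlation matrix of equally long numeric columns."""
    n = len(columns[0])
    z = [zscores(col) for col in columns]
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

# Hypothetical toy records (illustration only, not honeypot data).
bytes_received = [120, 560, 80, 2300, 40]
client_port = [1045, 3321, 1100, 50211, 1027]
host_port = [445, 80, 445, 8080, 135]

R = correlation_matrix([bytes_received, client_port, host_port])
for row in R:
    print(["%.3f" % value for value in row])
```

The eigenvalues of a matrix built this way sum to the number of variables, which is why the correlation-method eigenvalues for seven variables add up to approximately 7.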

TABLE II. EIGENVALUES (CORRELATION METHOD)
Explained Variance (Eigenvalues)

Value         PC 1      PC 2      PC 3      PC 4      PC 5      PC 6      PC 7
Eigenvalue    2.169     1.149     1.095     0.935     0.762     0.702     0.188
% of Var.     30.979    16.413    15.642    13.360    10.886    10.032    2.689
Cum. %        30.979    47.391    63.034    76.393    87.279    97.311    100.000

The outcome of the explained variance, i.e. the eigenvalues and the principal components, is shown in the form of a scree plot by the tool itself.

Figure 1. Scree Plot (correlation method); x-axis: Component Number, y-axis: Eigenvalue.

B. Principal Components using Covariance Matrix Method
The covariance matrix can also be computed, and the principal components calculated, through the StatistiXL (sxl) tool.

TABLE III. EIGENVALUES (COVARIANCE METHOD)
Explained Variance (Eigenvalues)

Value         PC 1             PC 2          PC 3           PC 4        PC 5       PC 6       PC 7
Eigenvalue    125272060.655    7649819.75    5355149.823    7949.457    2.278      0.028      0.003
% of Var.     90.590           5.532         3.873          0.006       0.000      0.000      0.000
Cum. %        90.590           96.122        99.994         100.000     100.000    100.000    100.000

Figure 2. Scree Plot (covariance method); x-axis: Component Number, y-axis: Eigenvalue.

VI. APPLICATION OF PCA USING MATLAB
MATLAB is a high-performance language for technical computing. It integrates computation, visualization, and programming in an easy-to-use environment where problems and solutions are expressed in familiar mathematical notation. MATLAB is an interactive system whose basic data element is an array that does not require dimensioning. This makes it possible to solve many technical computing problems, especially those with matrix and vector formulations, in a fraction of the time it would take to write a program in a scalar non-interactive language such as C or Fortran.

A. Principal Component Analysis using Correlation Matrix
Before applying PCA, the data set is first standardized. Standardization is done because the data set consists of different kinds of data items. It is carried out by computing the standard deviation of each column and then dividing each column by its standard deviation. PCA computes the values of pc, zscores and pcvars (the eigenvalues). The computed values of pcvars are given in Table IV.

TABLE IV. PRINCIPAL COMPONENTS

PC    Pcvars
1     2.1686
2     1.1489
3     1.0951
4     0.9351
5     0.7619
6     0.7021
7     0.1882

The scree plot of pcvars against the principal components is generated in Matlab.

Figure 3. Scree Plot (Matlab); x-axis: Principal Component, y-axis: pcvars.

B. Principal Component Analysis using Covariance Matrix Method
In Matlab, to apply PCA using the covariance method, we do not calculate the standard deviation, as the data does not need to be standardized. We apply the princomp function directly to the data.

TABLE V. PRINCIPAL COMPONENTS

PC    Pcvars
1     125270000
2     7650000
3     5360000
4     10000
5     0
6     0
7     0
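Why the covariance method produces such large and uneven eigenvalues can be seen from Table I alone. The eigenvalues of a covariance matrix sum to the total of the individual variable variances, so an unstandardized variable with a very large spread dominates the result. The Python sketch below squares the rounded Table I standard deviations and prints each variable's share of the total variance. It is an illustration under that assumption only and does not reproduce the eigen-decomposition performed by StatistiXL or Matlab.

```python
# Standard deviations copied from Table I (rounded to three decimals there).
STD_DEV = {
    "Protocol": 0.065,
    "Severity": 0.250,
    "Bytes": 2315.671,
    "ClientIP": 90.063,
    "ClientPort": 11152.049,
    "HostIP": 1.590,
    "HostPort": 2929.687,
}

# pcvars for the covariance method, copied from Table V.
TABLE_V_PCVARS = [125270000, 7650000, 5360000, 10000, 0, 0, 0]

variances = {name: sd ** 2 for name, sd in STD_DEV.items()}
total_variance = sum(variances.values())
shares = {name: 100.0 * var / total_variance for name, var in variances.items()}

for name in sorted(shares, key=shares.get, reverse=True):
    print(f"{name:<10} variance={variances[name]:15.1f}  share={shares[name]:6.2f}%")

print("total variance      :", round(total_variance))
print("sum of Table V pcvars:", sum(TABLE_V_PCVARS))
```

In this reading, ClientPort alone accounts for roughly nine tenths of the total variance, which is in line with the first covariance-method component carrying about 90% of the variance, and the total variance agrees closely with the sum of the Table V values.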

Eigenvalues close to or above one contribute to the principal components. Here the number of principal components is calculated as four. The first component (PC1) is highly correlated with the total number of basic flows, total number of TCP ports targeted, total duration of basic flows, total number of source packets, and total number of source bytes. The first component indicates high interaction between attackers and the honeypot on open ports and, as the variance suggests, is the most important component. PC2 is highly correlated with closed TCP ports. This component suggests vertical and horizontal scan activities which focus on very specific ports. In PC3, activities target closed UDP ports and could be interpreted as spam, worm activities, or misconfigured servers. PC4 is a subset of the first component and represents short attacks against specific open ports.
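The retention rule stated at the start of the paragraph above can be made concrete with the correlation-method pcvars from Table IV. This Python sketch computes the percentage and cumulative percentage of variance, then counts the components under a strict eigenvalue-greater-than-one cutoff and under a looser "close to one" cutoff. The value 0.9 used for "close to one" is an assumption chosen for illustration, since the paper does not give a numeric tolerance; the helper names are likewise hypothetical.

```python
# pcvars (eigenvalues) for the correlation method, copied from Table IV.
PCVARS = [2.1686, 1.1489, 1.0951, 0.9351, 0.7619, 0.7021, 0.1882]

def percent_variance(eigenvalues):
    """Share of the total variance explained by each component, in percent."""
    total = sum(eigenvalues)
    return [100.0 * value / total for value in eigenvalues]

def cumulative(values):
    """Running total of a list of numbers."""
    running, out = 0.0, []
    for value in values:
        running += value
        out.append(running)
    return out

def count_retained(eigenvalues, threshold):
    """Number of components whose eigenvalue exceeds the threshold."""
    return sum(1 for value in eigenvalues if value > threshold)

pct = percent_variance(PCVARS)
cum = cumulative(pct)
strict_count = count_retained(PCVARS, 1.0)  # eigenvalue strictly above one
loose_count = count_retained(PCVARS, 0.9)   # assumed tolerance for "close to one"

for i, (p, c) in enumerate(zip(pct, cum), start=1):
    print(f"PC {i}: {p:7.3f}%  cumulative {c:7.3f}%")
print("retained with threshold 1.0:", strict_count)
print("retained with threshold 0.9:", loose_count)
```

With these values the strict cutoff keeps three components and the looser cutoff keeps four, which matches the two counts discussed in the paper, and the first four components together explain about 76% of the variance, as in Table II.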

The scree plot of pcvars against the principal components, showing the number of principal components, is given in Figure 4.

Figure 4. Scree Plot (Matlab); x-axis: Principal Component, y-axis: pcvars (x 10^7).

VII. COMPARISON
A. Comparison of sxl with Matlab using Correlation Matrix Method
On comparison, the results agree once rounded to three decimal places, and sxl shows 100% accuracy with respect to PCA in Matlab using the correlation matrix method. The comparison is shown in Table VI.

TABLE VI. CORRELATION METHOD COMPARISON

B. Comparison of sxl with Matlab using Covariance Matrix Method
On comparison, the values seem to be quite different, but when they are expressed in exponent form and the corresponding graphs are compared, the results are approximately the same. Moreover, the calculated number of components is also the same, i.e. 4.

TABLE VII. COVARIANCE METHOD COMPARISON

VII. CONCLUSION
Eigenvalues close to or above one contribute to the principal components. Here the number of principal components is calculated as four for the correlation matrix method. StatistiXL shows three principal components, as it considers only eigenvalues greater than one as contributing to the principal components. The covariance matrix method results are also comparable in both cases. Hence the tool is verified. We also conclude that, as our honeypot data is heterogeneous in nature, the correlation matrix method rather than the covariance method should be used for further analysis. PCA in the StatistiXL tool shows 100% accuracy with respect to PCA in Matlab; the only difference is the rounding performed in the computation in the two cases. StatistiXL acts as a plug-in to MS Excel and provides a graphical interface for the computation. It is easy to use and takes less memory than Matlab. So if statistical functions have to be computed on Excel data, StatistiXL is preferable to Matlab.

REFERENCES
[1] L. Spitzner, Honeypots: Tracking Hackers. Addison-Wesley, 2003.
[2] N. Provos, "A virtual honeypot framework," in 13th USENIX Security Symposium, Aug 2004.
[3] C. Leita, K. Mermoud, and M. Dacier, "ScriptGen: An automated script generation tool for honeyd," in 21st Annual Computer Security Applications Conference (ACSAC), Dec 2005.
[4] I. T. Jolliffe, Principal Component Analysis, 2nd ed., ser. Springer Series in Statistics. New York: Springer, 2002.
[5] J. E. Jackson, A User's Guide to Principal Components, 1st ed. Wiley-Interscience, 2003.
[6] A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. Kolaczyk, and N. Taft, "Structural analysis of network traffic flows," in ACM SIGMETRICS, 2004.
[7] S. Almotairi, A. Clark, G. Mohay, and J. Zimmermann, "Characterization of attackers' activities in honeypot traffic using principal component analysis," in Network and System Security NSS 2008. IEEE Computer Society Proceedings, Oct 2008.