Profiling Phishing Email Based on Clustering Approach

2013 12th IEEE International Conference on Trust, Security and Privacy in Computing and Communications

Profiling Phishing Email Based on Clustering Approach

Isredza Rahmi A. Hamid and Jemal H. Abawajy
Parallel and Distributed Computing Lab, School of Information Technology
Deakin University, 3217, Victoria, Australia
{iraha, jemal}@deakin.edu.au

A favorable approach is the use of content-based filtering [6], [8], [19], which makes it possible to discriminate between normal and phishing emails. Several studies on content-based email classification have focused on classifier-related issues. In addition, behavior-based email classification approaches [7], [5], [9], [10] can learn attacker activities in order to classify phishing and normal emails. A hybrid feature selection approach based on a combination of content-based and behavior-based features has also been proposed [28]. Despite the increasing development of anti-phishing services and technologies, the number of phishing email messages continues to rise rapidly. Phishing has become more and more complicated, and phishers continually change the way they perpetrate phishing attacks to defeat anti-phishing techniques. Recent approaches exploit social networking sites by sending malicious or bogus links. Furthermore, most phishing emails are nearly identical to normal emails. Therefore, existing anti-phishing techniques such as content-based approaches are not effective for detection even though they may be effective for blocking. Also, most existing email filtering approaches are static and can easily be defeated by modifying the contents of emails and link strings. The main problem addressed in the literature is the detection of phishing emails based on significant features such as hyperlinks, number of words, email subjects and others. This paper investigates a different aspect of phishing: profiling phishing emails. Phishers normally have their own signatures or techniques.
Thus, a phisher's profile can be expected to show a collection of different activities. Profiles can be understood as a collection of information on the activities of a related individual or a group involved in the activity. Hence, profiles can be used to provide information on the different phishers involved. By building profiles, phishing activities can be better predicted as well as observed. In this paper, we propose an algorithm for profiling email-born phishing (ProEP) attacks. The aims of the profiling are to detect phishing emails, predict phishing profiles and identify the phishers. Our proposed approach increases accuracy compared to the profiling techniques proposed in [14], [15], [16]. Different machine learning techniques have been compared based on their accuracy. Our major contributions are summarized as follows:

Abstract—In this paper, an approach for profiling email-born phishing activities is proposed. Profiling phishing activities is useful in determining the activity of an individual or a particular group of phishers. By generating profiles, phishing activities can be well understood and observed. Typically, work in the area of phishing is aimed at the detection of phishing emails, whereas we concentrate on profiling the phishing email. We formulate the profiling problem as a clustering problem, using the various features in the phishing emails as feature vectors. Further, we generate profiles based on clustering predictions. These predictions are then utilized to generate complete profiles of these emails. The performance of the clustering algorithms at the earlier stage is crucial for the effectiveness of this model. We carried out an experimental evaluation to determine the performance of many classification algorithms by incorporating a clustering approach in our model. Our proposed profiling email-born phishing algorithm (ProEP) demonstrates promising results with the RatioSize rules for selecting the optimal number of clusters.

Keywords—Profiling, Phishing, Clustering Algorithm

I. INTRODUCTION

Email phishing is a type of semantic attack in which victims are sent emails that mislead them into providing credential information, such as account numbers, passwords, or other personal information, to an attacker. Phishers attract their victims by using fake emails pretending to be from legitimate companies and well-known business organizations such as eBay, PayPal and Suntrust. Phishing can take several forms, but the main goal of the phishers is always to lure people into giving up important information. In some cases, they implant malicious software that controls a computer so that it can participate in future phishing scams. Phishing emails are one of the fastest growing scams nowadays, and solving this problem is very challenging. A report by the Anti-Phishing Working Group (APWG) [22], [24] found that the number of unique phishing emails reported by consumers rose 20% in the first half of 2012 compared to the same period in 2011. There was also a 73% increase in the number of unique phishing websites detected in the first half of 2012 compared to 2011. Although there are many sources of phishing attacks, such as websites, SMS, social networks, and forum posts and comments, this paper focuses on email-born phishing attacks. Detection of phishing emails in the email system allows users to regain a valuable means of communication. In order to address the growing phishing email problem, a number of possible countermeasures have been developed.

978-0-7695-5022-0/13 $26.00 © 2013 IEEE. DOI 10.1109/TrustCom.2013.76


1. We propose a method for extracting phishing email features based on a weighting of email features, selecting features according to the priority ranking produced by the information gain algorithm.

2. We propose a profiling phishing email model and a profiling of email-born phishing (ProEP) algorithm for filtering phishing emails.

3. We examine the impact the number of clusters has on the clustering algorithms when filtering phishing email, in order to find the optimal number of clusters. We show that selecting the optimal number of clusters is very important to ensure that the generated profiles are generalized, and that general profiles can improve the accuracy of phishing detection.

4. We provide empirical evidence that our proposed approach reduces false positives and false negatives with regard to underfitting and overfitting issues.

The rest of the paper is organized as follows: Section II describes the background and related work on profiling and classification models. Section III describes the proposed classification model for phishing email detection, in which cluster predictions become elements of profiles; the profiles constructed are then used to train the classifier algorithm. Section IV presents the evaluation methodology, and the experimental results are shown and discussed in Section V. Finally, Section VI concludes the work and highlights directions for future research.

II. PROBLEM OVERVIEW AND RELATED WORK

Profiling is categorizing suspects or prospects from a large data set and gathering a set of characteristics of a particular group based on past knowledge. One identified use of profiling is to profile customer buying patterns and market products accordingly. Market Basket Analysis [20] profiles customers based on their buying patterns; this information is then used by companies to learn the nature of competitive markets. Generally, customer profiling deals with non-sensitive data such as age, buying pattern, payment method and others, and is a very important tool for companies to better understand customer needs. Many other areas have been identified for profiling techniques, such as genetic profiling [11], which analyses a person's entire genome in order to reveal the majority of their genetic variations. Recently, mobile users' browsing behavior patterns have been used to improve service performance and thus increase user satisfaction; they offer valuable insights into how to enhance the mobile user experience through dynamic content personalization and recommendation, or location-aware services [12]. There have also been studies on profiling the behaviour of Internet backbone traffic, where significant behaviour is discovered from massive traffic and anomalous events such as large-scale scanning activities, worm outbreaks, and denial-of-service attacks are identified [13].

Our work is motivated by the studies in [14], [15], [16]. The work in [14] formulates the profiling problem as a multi-label classification problem, using hyperlinks in the phishing emails as features, and structural properties of emails along with the WHOIS information on hyperlinks as profile classes. In contrast, our work represents phishing emails using the vector space model, in which each phishing email is considered as a vector of binary and continuous values. The work in [15] concentrates on several clustering approaches to determine the optimal number of phishing groups, which then aids in the development of profiles based on the individual clusters; however, the actual profiling is not carried out in that paper. [15] also differs from our work in that we determine the optimal number of clusters based on ratio size. The Kmeans clustering algorithm is used to cluster the data in [16]; several consensus functions and algorithms are then used to combine these independent clusterings into a final consensus clustering, which is used to train the classification algorithm and, finally, to classify the whole data set, with tenfold cross validation used to evaluate accuracy. As Kmeans requires a user-defined number of clusters, the number of clusters is fixed in advance as an input parameter of their algorithm. Our work considers various types of data, both continuous and categorical. Besides that, we use the profiling of email-born phishing (ProEP) algorithm to profile the activities of phishing groups, and we develop techniques to automatically obtain the optimal number of group profiles.

A. Overview of Classification Model

Classification is a methodical approach to building a classification model from an input set. The model generated by a learning algorithm ought to predict the unknown class label based on the input data. Fig. 1 shows the basic classification model. The training data set, which consists of data with known labels, trains the classification algorithm to generate a classification model. This classification model is then applied to the testing dataset, which consists of data with unknown class labels. The initial clustering algorithms used to obtain the new profiles are discussed in the next section.

[Figure 1: Training Data -> Classification Algorithm (Learn Model) -> Model -> Apply Model -> Testing Data]

Figure 1. Basic classification approach.

B. Initial Clustering Algorithm

Clustering is a suitable technique for determining data distribution and patterns in the underlying data. The objective of clustering is to determine the dense and sparse regions in the data set. Data clustering has been studied in diverse domains with various emphases. The basic principle of clustering is a concept of a distance or similarity metric, which mainly applies to numerical data. The problem of similarity becomes complex when the data are of the categorical type, which do not have a


natural ordering of values. These techniques unfortunately generate the final data partition based on incomplete information [18]. A further drawback is that the nature of the features varies: some features are continuous-valued and others categorical. Combining such features in a single clustering algorithm would produce an inaccurate group of clusters. Some researchers have proposed cluster ensembles to overcome this issue, where several independent clusterings are performed based on different subsets of the complete feature vector [17], [18] and then combined in a cluster ensemble to form a final, consensus clustering. We compare the Kmeans and TwoStep clustering algorithms in Table 1.

1) The first stage is profiling email-born phishing activities, where profiles are generated based on the clustering algorithm's predictions; thus, clusters become elements of profiles. We employ a clustering algorithm to produce cluster predictions for the phishing emails, and these predictions are then utilized to generate complete profiles of the emails. The number of profiles used in this earlier stage is crucial for the effectiveness of the model.

2) The second stage is the classification of emails. The profiles generated in the profiling stage are used to train the classification algorithm, which is expected to predict the unknown class label based on the input data. The key objective of the profiles is therefore to train the classification algorithm for better generalization capability, so that it can predict the class label accurately for unknown records.
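The two-stage scheme above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature vectors are made up, a naive k-means stands in for the clustering stage, and a nearest-profile lookup stands in for the trained classifier.

```python
# Illustrative two-stage sketch: (1) cluster emails into profiles,
# (2) classify unseen emails against the generated profiles.
# Feature vectors and labels are toy values, not the paper's data.

def assign(point, centroids):
    """Index of the nearest centroid (squared Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))

def kmeans(points, k, iters=20):
    """A minimal k-means: returns final centroids and cluster assignments."""
    centroids = [list(p) for p in points[:k]]          # naive initialisation
    for _ in range(iters):
        clusters = [assign(p, centroids) for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, clusters) if a == c]
            if members:
                centroids[c] = [sum(col) / len(col) for col in zip(*members)]
    return centroids, clusters

# Stage 1: profiling -- cluster labelled emails; each cluster becomes a profile.
emails = [(0.9, 0.8), (0.1, 0.2), (0.8, 0.9), (0.2, 0.1)]   # toy feature vectors
labels = ["phishing", "ham", "phishing", "ham"]
centroids, clusters = kmeans(emails, k=2)
profiles = {}
for c in range(2):
    member_labels = [l for l, a in zip(labels, clusters) if a == c]
    profiles[c] = max(set(member_labels), key=member_labels.count)  # majority label

# Stage 2: classification -- an unseen email takes the label of its nearest profile.
print(profiles[assign((0.85, 0.75), centroids)])   # -> phishing
```

The design point this illustrates is that the classifier is trained against profiles (cluster summaries), not against raw emails, which is what gives the model its generalization behaviour.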

TABLE 1. KMEANS AND TWOSTEP CLUSTERING ALGORITHMS.

                          Kmeans Clustering   TwoStep Clustering
Analyze large data files  Yes                 Yes
Type of dataset           Continuous          Categorical and continuous
Number of clusters        User-specified      Selected automatically, but
                                              needs a termination condition


In the Kmeans algorithm, points are first assigned to the initial centroids, where the mean value is taken as the centroid. Points are then assigned to the updated centroids, and the centroids are updated once more. The algorithm terminates when no more changes occur; at this point, the centroids have identified the natural grouping of points. The major drawback of the Kmeans algorithm is its sensitivity to the selection of the initial partition: it may converge to a local minimum of the criterion function value if the initial partition is not properly chosen.

The TwoStep clustering algorithm consists of two steps. In the pre-clustering step, it scans the data records in sequence and decides whether the current record can be added to one of the previously formed clusters or should start a new cluster, based on the log-likelihood or Euclidean distance measure. The Euclidean distance is used if the variables are all continuous, while the log-likelihood distance handles both continuous and categorical attributes. The cluster step then takes the sub-clusters resulting from the pre-cluster step as input and groups them into the desired number of clusters. As the number of sub-clusters is smaller than the number of original records, traditional clustering methods can be applied effectively.
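The Kmeans sensitivity to the initial partition noted above can be demonstrated with a toy example: the same four points converge to different partitions, one of which is a local minimum with much higher within-cluster scatter. The data and initial centroids are illustrative.

```python
# Sketch of the Kmeans local-minimum drawback: two different initial
# partitions of the same toy 2-D points converge to different inertias.

def dist2(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_inertia(points, init, iters=50):
    """Minimal k-means from given initial centroids; returns final inertia
    (total squared distance of each point to its final centroid)."""
    centroids = [list(c) for c in init]
    for _ in range(iters):
        clusters = [min(range(len(centroids)), key=lambda c: dist2(p, centroids[c]))
                    for p in points]
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, clusters) if a == c]
            if members:
                centroids[c] = [sum(col) / len(col) for col in zip(*members)]
    return sum(dist2(p, centroids[a]) for p, a in zip(points, clusters))

corners = [(0, 0), (0, 4), (10, 0), (10, 4)]                 # rectangle corners
print(kmeans_inertia(corners, init=[(0, 0), (10, 0)]))       # left/right split: 16.0
print(kmeans_inertia(corners, init=[(0, 0), (0, 4)]))        # stuck top/bottom: 100.0
```

Both runs converge, but only the first reaches the global optimum, which is exactly the behaviour the text attributes to a badly chosen first partition.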

[Figure 2: Existing classification model: Email -> Feature Extraction -> Feature Selection -> Detection Algorithm -> % accuracy. Proposed classification model: Email -> Feature Extraction -> Feature Selection -> Profiling Algorithm -> Training and Testing Data -> Detection Algorithm -> % accuracy.]

Figure 2. Phishing profiling model.

III. PROFILING MODEL

In this section, we discuss the proposed phishing profiling classification model shown in Fig. 2. When developing the profiling model, the number of clusters to be used, the profiling algorithm, and how the consistency of the profiling model is maintained are important decisions. The notation in Table 2 is used in the Profiling Email-born Phishing (ProEP) algorithm. The proposed phishing profiling approach consists of two stages: the profiling stage, in which profiles are generated from clustering predictions, and the classification stage, in which the profiles are used to train the classifier.

TABLE 2. NOTATION FOR THE ProEP ALGORITHM

Notation   Description
K_A        Total number of continuous variables used in the data set.
K_B        Total number of categorical variables used in the data set.
L_k        Number of categories for the kth categorical variable.
R_k        Range of the kth continuous variable.
N          Total number of records in the data set.
N_j        Number of records in cluster j.
μ̂_k        Estimated mean of the kth continuous variable across the entire dataset.
σ̂²_k       Estimated variance of the kth continuous variable across the entire dataset.
μ̂_jk       Estimated mean of the kth continuous variable in cluster j.
σ̂²_jk      Estimated variance of the kth continuous variable in cluster j.
N_jkl      Number of data records in cluster j whose kth categorical variable takes the lth category.
N_kl       Number of data records whose kth categorical variable takes the lth category.
d(j, s)    Distance between clusters j and s.
<j, s>     Index of the cluster formed by combining clusters j and s.

A. Profiling Email-born Phishing Activities

For data clustering, an important issue that must be addressed is how many cluster models are appropriate and diverse for a given data set. Note that in our profiling phishing model, the data are trained using profiles generated from a clustering algorithm. In this section, we propose a new method of selecting the optimal number of clusters based on


the TwoStep clustering algorithm's 'auto-clustering' approach. Note that the algorithm in Algorithm 1 is different from the ones discussed in [25], [27]: unlike the previous algorithms, the optimal number of clusters is based on the ratio size value. Algorithm 1 shows the pseudo-code of the profiling email-born phishing (ProEP) algorithm. The algorithm takes as input all emails, x = [x_1, x_2, ..., x_n], where x_i is the value of attribute i in the email and n is the total number of attributes. The attribute values for each email may be categorical and/or continuous. The goal of the ProEP algorithm is to determine the optimal number of clusters to serve as profiles. Steps 7 to 12 are therefore the first steps performed by the ProEP algorithm, identifying the optimal number of clusters that will serve as profiles for the profiling phishing classification model.

Algorithm 1: ProEP Algorithm
2:  Input x
3:  BEGIN
4:  numcluster = 2; ratiosize = 0
5:  FOR (each incoming EMAIL) DO
6:      Compute the log-likelihood function
7:      Perform Bayesian Information Criterion
8:      Calculate the change ratio
9:      Calculate the ratio of the distance measure
10:     Calculate the ratio size RS(k)
11:     UNTIL ΔRS(k) ≈ ΔRS(k + 1)
12:     Profiles generation
13: ENDFOR
14: END ProEP

1) Compute the log-likelihood function: The log-likelihood distance is a probability-based distance. The distance between two clusters is related to the decrease in log-likelihood as they are merged into one cluster. In calculating the log-likelihood, normal distributions for continuous variables and multinomial distributions for categorical variables are assumed. It is also assumed that the variables, as well as the cases, are independent of each other. The distance between clusters i and j is defined as

    d(i, j) = ξ_i + ξ_j − ξ_<i,j>,                                              (1)

where

    ξ_v = −N_v ( Σ_{k=1}^{K_A} (1/2) log(σ̂²_k + σ̂²_vk) + Σ_{k=1}^{K_B} Ê_vk ),  (2)

and

    Ê_vk = −Σ_{l=1}^{L_k} (N_vkl / N_v) log(N_vkl / N_v).

Here ξ_v in (2) can be interpreted as the dispersion within cluster v, where v is i, j, or <i, j>. The first part, Σ_{k=1}^{K_A} (1/2) log(σ̂²_k + σ̂²_vk), measures the dispersion of the continuous variables within cluster v. The term σ̂²_k is added to avoid a degenerate situation when σ̂²_vk = 0; if σ̂²_vk alone were used, d(i, j) would be decreasing in the log-likelihood function after merging clusters i and j. The second part, Ê_vk, is used as a measure of dispersion for the categorical variables. The two clusters j and s with the smallest distance d(j, s) are then merged in each step. The log-likelihood function for the step with k clusters is calculated as l_k = Σ_{v=1}^{k} ξ_v, which can be interpreted as the dispersion within clusters; if only continuous variables are used, l_k is the dispersion of the continuous variables.

2) Perform Bayesian Information Criterion (BIC): The TwoStep clustering algorithm uses the agglomerative hierarchical clustering method in the cluster step to automatically determine the optimal number of clusters in the input data. It produces a sequence of partitions in one run: 1, 2, 3, ... clusters. To determine the number of clusters automatically, the TwoStep clustering algorithm uses a two-stage procedure that works effectively with the hierarchical clustering method. First, the BIC for each number of clusters within a specified range is calculated in order to find an initial estimate of the number of clusters. The BIC is computed as

    BIC(k) = −2 l_k + m_k log N,

where m_k is the number of independent parameters. The ratio of the change in BIC at each consecutive merging, relative to the first merging, determines the initial estimate.

3) Calculate the change ratio: Let ΔBIC(J) be the difference in BIC between the model with J clusters and that with (J + 1) clusters, ΔBIC(J) = BIC(J) − BIC(J + 1). The change ratio for model J is then

    R1(J) = ΔBIC(J) / ΔBIC(1).

4) Calculate the ratio size: In this stage, the model C_k indicated by the BIC criterion is refined by taking the ratio of the minimum inter-cluster distance for that model and the next larger model C_{k+1},

    R2(k) = d_min(C_k) / d_min(C_{k+1}),

where C_k is the cluster model containing k clusters and d_min(C_k) is the minimum inter-cluster distance for cluster model C_k. The same ratio is then computed for model C_{k−1} and the following model C_k, and this is repeated for each subsequent model until the ratio R2(2) is reached. Next, the initial estimate is refined by finding the ratio size value for each cluster model. The minimum size ratio for each cluster model is

    Rmin(C_k) = minsize_k / N,

where minsize_k is the minimum size of a sub-cluster for the cluster model C_k containing k clusters and N is the total number of records. Similarly, the maximum size ratio for each cluster model is calculated as

    Rmax(C_k) = maxsize_k / N,

where maxsize_k is the maximum size of a sub-cluster for cluster model C_k. The ratio size RS(k) is then given by

    RS(k) = Rmax(C_k) / Rmin(C_k).

Let ΔRS(k) be the difference between the RS(k) and RS(k + 1) ratios, ΔRS(k) = RS(k + 1) − RS(k). If ΔRS(k) ≈ ΔRS(k + 1), then the model with (k + 1) clusters is selected as the optimal number of clusters.
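The log-likelihood distance d(i, j) = ξ_i + ξ_j − ξ_<i,j> can be sketched numerically as follows. This is a toy example with one continuous and one categorical variable per record; the records and the use of the dataset-level variance inside the log follow the definitions above, but the data values are made up.

```python
# Numerical sketch of the log-likelihood distance for mixed data.
# Each record is a (continuous value, category) pair; toy data only.
import math
from collections import Counter

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def xi(cluster, sigma2_k):
    """xi_v for one continuous and one categorical variable. sigma2_k is the
    dataset-level variance of the continuous variable, added inside the log
    to avoid log(0) when the within-cluster variance is zero."""
    n = len(cluster)
    cont = [r[0] for r in cluster]
    cats = Counter(r[1] for r in cluster)
    term_cont = 0.5 * math.log(sigma2_k + variance(cont))
    term_cat = -sum((c / n) * math.log(c / n) for c in cats.values())  # entropy
    return -n * (term_cont + term_cat)

def d(ci, cj, sigma2_k):
    """Decrease in log-likelihood when clusters ci and cj are merged."""
    return xi(ci, sigma2_k) + xi(cj, sigma2_k) - xi(ci + cj, sigma2_k)

data_i = [(1.0, "html"), (1.2, "html")]
data_j = [(5.0, "text"), (5.4, "text")]
s2 = variance([r[0] for r in data_i + data_j])    # dataset-level variance

# merging two well-separated clusters costs far more than merging similar ones
print(d(data_i, data_j, s2) > d(data_i, [(1.1, "html")], s2))   # -> True
```

This matches the intended behaviour of the distance: it is large exactly when merging would mix dissimilar continuous values and categories, which is why the algorithm merges the pair with the smallest d(j, s) at each step.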

The data set consists of both continuous and categorical variables. Continuous variables are standardized by default, as all the data has been normalized. We then used the log-likelihood distance measure. In the first run, we set the number of clusters to 2 and the ratio size to zero. If the ratio size value becomes constant, we assume that the clustering has achieved balance, where the size of each cluster is equitably spread. The algorithm is run on the Mod15 and Bad17 data sets as well, and profiles are constructed for each data set.
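The RatioSize stopping rule can be sketched mechanically as follows. The sub-cluster sizes per candidate model are hypothetical, chosen only so that the RS values level off; they are not from the paper's data.

```python
# Sketch of the RatioSize rule: RS(k) = Rmax(C_k) / Rmin(C_k), and the
# optimal k+1 is taken where consecutive RS differences become stable.
# The size lists below are illustrative, not the paper's clusters.

def ratio_size(sizes):
    n = sum(sizes)
    return (max(sizes) / n) / (min(sizes) / n)   # = max(sizes) / min(sizes)

# hypothetical sub-cluster sizes for models with k = 2 .. 6 clusters
models = {2: [60, 2], 3: [40, 18, 4], 4: [24, 14, 12, 12],
          5: [19, 12, 11, 10, 10], 6: [13, 12, 10, 10, 10, 7]}
rs = {k: ratio_size(s) for k, s in models.items()}
delta = {k: rs[k + 1] - rs[k] for k in sorted(models)[:-1]}

# first k where the change in RS is (approximately) stable; select k + 1
optimal = next(k + 1 for k in sorted(delta)[:-1]
               if abs(delta[k] - delta[k + 1]) < 1.0)
print(optimal)   # -> 5
```

A balanced clustering (ratio size close to constant) is exactly the condition the algorithm interprets as having found the right number of profiles.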

5) Profiles generation result: To generate profiles, the result produced by the clustering algorithm is used. The profiling technique for selecting the partition attribute is based on the log-likelihood distance measure, because it can handle different types of variables. Let the dataset under investigation be denoted by D = {d_1, d_2, ..., d_N}, where N is the total number of records. Each email is x = {x_1, x_2, ..., x_k}, where x_i is the value of the ith attribute in the email and k is the total number of attributes. The ith clustering on this dataset is denoted by C_i = {C_i1, C_i2, ..., C_ik}, for each clustering i.

IV. EXPERIMENTAL EVALUATION

In this section, we use simulation to analyse the performance of the proposed profiling phishing model and profiling algorithm. We first discuss the experimental setup, and then present and discuss the results.

A. Profiling Process

The data used for the evaluation of our method comes from Khonji's anti-phishing studies website [26]. This study extracts 47 features as proposed in [1]. The datasets are publicly available from SpamAssassin's ham corpus [3] and Jose Nazario's phishing corpus [2]. The feature subset selection and classification methods were automated using Weka [4]. This paper centres on supervised data, where we assume that the data sets are classified into ham and phishing data. The data set consists of 6837 emails, of which 4150 are ham and 2687 are phishing. In order to account for the widely differing values of different features, the data was normalized before the information gain algorithm was run for feature ranking. Features with continuous values are normalized using the quotient of the actual value over the maximum value of that feature, so that continuous values are limited to the range [0, 1]. This splits the data more evenly and improves the results achieved with the information gain algorithm.
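The normalization step described above (dividing each continuous feature by its maximum observed value, which assumes non-negative values such as counts) can be sketched as:

```python
# Sketch of per-feature max normalization: each value is divided by the
# feature's maximum so the result lies in [0, 1]. Toy feature values.

def normalize(column):
    m = max(column)
    return [v / m for v in column] if m else column

body_num_words = [120, 300, 60, 600]     # hypothetical word counts
print(normalize(body_num_words))         # -> [0.2, 0.5, 0.1, 1.0]
```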


B. Feature Ranking and Feature Selection

Table 3 shows the ranking of the 47 phishing email features and the corresponding information gain values. We then considered three sets of data based on the Information Gain (IG) value: 1) the 15 features with the highest IG values (top15), 2) the next 15 highest IG values (mod15), and 3) all remaining features with the lowest IG values (bad17).
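Ranking features by information gain, as used to form the Top15/Mod15/Bad17 subsets, can be sketched as follows. The feature values and labels are toy data; `info_gain` here is the standard IG(feature) = H(label) − H(label | feature), not the paper's Weka run.

```python
# Sketch of information-gain feature ranking over toy binary features.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature, labels):
    n = len(labels)
    cond = 0.0
    for value in set(feature):
        subset = [l for f, l in zip(feature, labels) if f == value]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

labels = ["phish", "phish", "ham", "ham"]
features = {"bodyhtml": [1, 1, 0, 0],      # perfectly predictive of the label
            "urlatchar": [1, 0, 1, 0]}     # uninformative
ranked = sorted(features, key=lambda f: info_gain(features[f], labels), reverse=True)
print(ranked)   # -> ['bodyhtml', 'urlatchar']
```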

TABLE 3. SUMMARY DATASET

Rank  IG Value   Attribute                     Type         Set
1     0.863473   Externalsascore               Continuous   Top15
2     0.774079   Externalsabinary              Categorical  Top15
3     0.707139   Urlnumlink                    Continuous   Top15
4     0.669168   Bodyhtml                      Categorical  Top15
5     0.609389   Urlnumperiods                 Continuous   Top15
6     0.413923   Sendnumwords                  Continuous   Top15
7     0.410465   Urlnumexternallink            Continuous   Top15
8     0.388836   Bodynumfunctionwords          Continuous   Top15
9     0.305396   Bodydearword                  Categorical  Top15
10    0.26253    Subjectreplyword              Categorical  Top15
11    0.239928   Sendunmodaldomain             Categorical  Top15
12    0.236486   Bodynumwords                  Continuous   Top15
13    0.223582   Bodynumchars                  Continuous   Top15
14    0.209478   Bodymultipart                 Categorical  Top15
15    0.188858   Urlnumip                      Continuous   Top15
16    0.167831   Subjectrichness               Continuous   Mod15
17    0.162627   Urlnumimagelink               Continuous   Mod15
18    0.152108   Urlip                         Categorical  Mod15
19    0.138833   Bodynumuniqwords              Continuous   Mod15
20    0.131354   Urlbaglink                    Continuous   Mod15
21    0.079708   Subjectbankword               Categorical  Mod15
22    0.071806   Urlwordloginlink              Categorical  Mod15
23    0.064012   Bodyrichness                  Continuous   Mod15
24    0.062774   Urlwordherelink               Categorical  Mod15
25    0.060004   Subjectnumchars               Continuous   Mod15
26    0.054802   Senddiffreplyto               Categorical  Mod15
27    0.04644    Urlwordclicklink              Categorical  Mod15
28    0.040429   Urltwodoains                  Categorical  Mod15
29    0.039988   Scriptonclick                 Categorical  Mod15
30    0.033942   Bodysuspensionword            Categorical  Mod15
31    0.029681   Urlport                       Categorical  Bad17
32    0.029681   Urlnumport                    Continuous   Bad17
33    0.027261   Subjectnumwords               Continuous   Bad17
34    0.025421   Bodyverifyyouraccountphrase   Categorical  Bad17
35    0.02096    Scriptstatuschange            Categorical  Bad17
36    0.02096    Subjectverifyword             Categorical  Bad17
37    0.017061   Urlnumdomains                 Continuous   Bad17
38    0.007951   Urlatchar                     Categorical  Bad17
39    0.007692   Urlunmodalbaglink             Categorical  Bad17
40    0.007576   Urlwordupdatelink             Categorical  Bad17
41    0.005868   Scriptpopup                   Categorical  Bad17
42    0.005767   Scriptjavascript              Categorical  Bad17
43    0.003931   Urlnuminternallink            Continuous   Bad17
44    0.003555   Subjectdebitword              Categorical  Bad17
45    0.000949   Subjectfwdword                Categorical  Bad17
46    0.000792   Bodyform                      Categorical  Bad17
47    0          Scriptunmodalload             Categorical  Bad17

C. Classification Experiment

In order to show that our proposed model can improve the performance of existing classification approaches (Kmeans and random selection) in phishing classification, we ran the test data through the classification algorithms with (1) profiles selected randomly (profilerandom), (2) profiles generated from the Kmeans clustering approach


(profilekmeans), and (3) profiles generated from the ProEP algorithm (profileproep). For the evaluation of the accuracy and effectiveness of the profiles, all experiments used the same 90:10 testing:training split of the overall dataset. Table 4 shows that 90% of the overall dataset is set aside as testing data, with the remainder as training data. The training set is used to train the learning algorithms, while the testing set is used to obtain performance characteristics such as the Receiver Operating Characteristic (ROC), accuracy, and True Positive (TP) and False Positive (FP) rates.

The TwoStep clustering algorithm generates 2 profiles, compared to 10 profiles for our proposed method. These profiles are then used to train 5 different types of classifiers. Our proposed algorithm outperformed the TwoStep clustering for most classification algorithms.

TABLE 4. DATASET DISTRIBUTION

Email        Total  90% (Testing)  10% (Training)
Ham          4150   3735           415
Phishing     2687   2418           269
Grand Total  6837   6153           684

D. Classification Algorithm

Initially, we ran a series of complete tests for different classifiers. Robust classifiers were chosen because they represent the essential types of classifiers available in Weka [4]. We used the default settings for all classification algorithms in our initial preliminary testing: 1) Naïve Bayes (NB), using estimator classes; 2) AdaBoost (AB), boosting a nominal class classifier using the AdaBoost M1 method; 3) Random Forest (RF), constructing a forest of random trees; 4) Decision Tree (DT), using a simple decision table majority classifier; and 5) OneR, which uses the minimum-error attribute for prediction, discretizing numeric attributes.

E. Optimal Number of Clusters

Fig. 3 shows the relationship between the ratio size of the clusters and the number of clusters, executed on the Top15 dataset. This dataset contains 6 categorical variables and 9 continuous variables. We calculate the ratio size for each cluster model, where the ratio size is the ratio between the size of the largest and the smallest cluster. The graph shows that the ratio size value becomes constant when the number of clusters is 10, indicating that a good clustering with a balanced ratio size can be achieved with 10 clusters.

[Figure 3: Ratio size (y-axis, 0 to 40) plotted against the number of clusters (x-axis, 0 to 11) for the Top15 dataset, with the optimal number of clusters marked where RatioSize(k) == RatioSize(k + 1).]

TABLE 5. COMPARATIVE ANALYSIS OF THE NUMBER OF CLUSTERS ASSIGNED BY THE TWOSTEP AND PROEP CLUSTERING ALGORITHMS (TOP15, % ACCURACY)

Algorithm  Number of clusters  AB    DT    NB    OneR  RF
TwoStep    2                   80.5  39.3  90.5  39.3  73.0
ProEP      10                  95.1  95.1  98.2  39.3  96.7

F. Evaluation Criteria

Evaluation of the performance of a profiling phishing email model is based on the counts of test data correctly and incorrectly predicted by the model. To give an indication of the most accurate final clustering, we used the ROC value, accuracy, TP rate and TN rate, comparing the profiling phishing email model for each of the learning algorithms. The performance metric for accuracy is defined as

    Accuracy = (number of correct predictions) / (total number of predictions).

Equivalently, the performance of a model can be expressed in terms of its error rate, which is given by

    Error rate = (number of wrong predictions) / (total number of predictions).

First, we compute clusters for the whole dataset, following the procedure described in Algorithm 1. This clustering is then used as training data to determine the performance of the classification algorithms incorporated into the scheme. In order to maintain equality among the three types of profiles, we generate the same number of profiles from each clustering approach: profilerandom, profilekmeans and profileproep. Profilerandom consists of profiles generated by taking the average value of 10 sets of randomly selected profiles from the training dataset; profilekmeans is generated by the Kmeans clustering algorithm with the number of clusters set to 10; and profileproep consists of profiles generated by the ProEP algorithm. We compared the performance of the three sets of testing data in Fig. 4 using the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC), because it is a robust measure. The ROC curve graphically displays the trade-off between sensitivity and specificity for each cutoff value. Fig. 4 shows the weighted average ROC values over the two classes of emails, phishing and normal, for the top15, mod15 and bad17 data sets. Larger ROC values correspond to better predictability of the classes. Hence, we used the ROC value to evaluate the effectiveness of the profiles in all experiments. We compared several profiling algorithms in their ability to improve the classification result. Preliminary tests demonstrated that the ROC values for the profileproep algorithm outperformed the other algorithms. The outcome shows improvement in all classifiers when they are trained using profileproep as training data.
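The AUC used above can be computed directly from its probabilistic interpretation: the chance that a randomly chosen phishing email scores higher than a randomly chosen ham email. The scores below are hypothetical classifier outputs, not the paper's results.

```python
# Sketch of AUC via its rank interpretation: P(score(phish) > score(ham)),
# with ties counted as one half. Scores are made-up classifier outputs.

def auc(pos_scores, neg_scores):
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

phishing_scores = [0.9, 0.8, 0.7, 0.4]
ham_scores = [0.5, 0.3, 0.2, 0.1]
print(auc(phishing_scores, ham_scores))   # -> 0.9375
```

An AUC of 1.0 means the two classes are perfectly separated by the score, and 0.5 corresponds to random guessing, which is why larger ROC values indicate better class predictability.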

TABLE 4. DATASET DISTRIBUTION (Email / Ham / Phishing / Grand Total)

TABLE 5. NUMBER OF CLUSTERS SELECTED (Algorithm: TwoStep, ProEP / Number of clusters)

Figure 3. Ratio size values for the Top15 dataset.

The comparative analysis between the TwoStep clustering algorithm and the ProEP algorithm is shown in Table 5. TwoStep clustering selects the optimal number of clusters based on the ratio of distance measures, while the ProEP algorithm uses the ratio size value. Based on the simulation results on the Top15 dataset, the TwoStep clustering
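The exact definition of ProEP's ratio size value is not given here; as an illustrative sketch, under the assumption that it compares cluster sizes so that well-distributed clusterings score close to 1, selecting the number of clusters could look like this (all names and data are hypothetical):

```python
from collections import Counter

def ratio_size(labels):
    """Ratio of smallest to largest cluster size; 1.0 means perfectly balanced.
    (Assumed definition for illustration -- not the paper's exact formula.)"""
    sizes = Counter(labels).values()
    return min(sizes) / max(sizes)

def pick_k(clusterings):
    """Given {k: cluster labels}, pick the k whose clusters are most evenly sized."""
    return max(clusterings, key=lambda k: ratio_size(clusterings[k]))

# Toy cluster assignments for candidate k = 2, 3, 4 (hypothetical data).
clusterings = {
    2: [0, 0, 0, 0, 0, 1],  # very unbalanced: sizes 5 and 1
    3: [0, 0, 1, 1, 2, 2],  # perfectly balanced: sizes 2, 2, 2
    4: [0, 0, 1, 2, 3, 3],  # sizes 2, 1, 1, 2
}
print(pick_k(clusterings))  # 3
```

This matches the observation later in the paper that the cluster counts selected by ProEP are "well distributed" compared to TwoStep's ratio-of-distance criterion.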


Table 6 and Table 7 show the accuracy and the True Positive (TP) and False Positive (FP) rates for all datasets. The TP rate of a classifier is estimated as:

TP rate = (Total positives correctly classified) / (Total positives)

and the FP rate of the classifier is:

FP rate = (Total negatives incorrectly classified) / (Total negatives)


Both tables show that profileproep has the highest accuracy, the highest TP rate and the lowest FP rate for all datasets when tested on all classifiers.

G. Model Overfitting and Underfitting


The errors committed by a classification model are generally of two kinds: training errors and generalization errors. Training error is the number of misclassifications committed on the training records, while generalization error is the expected error of the model on previously unseen records. A good classification model must fit the training data well and also classify unseen records accurately; that is, it must have a low training error rate as well as a low generalization error. To investigate the effects of overfitting and underfitting, we applied various amounts of testing data, from the smallest to the largest. Fig. 5 shows the testing error rate for the Top15 feature set using a Naïve Bayes (NB) classification algorithm with each set of training profiles. Our proposed profile, profileproep, performs best, with the lowest error rate compared to profilekmeans and profilerandom on the Top15 dataset. As the number of testing records increases, the training error rate remains low; the added records did not degrade the performance of the training profiles, because they generalize well to the testing data and improved the clustering accuracy. Nevertheless, we believe that choosing the optimal number of clusters is crucial to increasing the accuracy of the filtering results.
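The gap between training error and generalization error can be demonstrated with any classifier that memorizes its training data (an illustrative sketch with toy data, not the paper's NB setup): a 1-nearest-neighbour rule fits the training set perfectly, yet still errs on unseen records when the training data contains noise.

```python
def nn1_predict(train, x):
    """1-nearest-neighbour on 1-D points: label of the closest training point."""
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

def err(train, data):
    """Fraction of (x, label) pairs in data misclassified by 1-NN on train."""
    wrong = sum(1 for x, y in data if nn1_predict(train, x) != y)
    return wrong / len(data)

# Toy 1-D data with labels 0/1; the training point at x=2.1 has a flipped (noisy) label.
train = [(0.0, 0), (1.0, 0), (2.0, 1), (3.0, 1), (2.1, 0)]
test  = [(0.5, 0), (1.9, 1), (2.2, 1), (2.9, 1)]
print(err(train, train))  # 0.0  -- the model memorizes its training set
print(err(train, test))   # 0.25 -- but the noisy point causes a test error
```

Low training error alone is therefore no guarantee of low generalization error, which is why the error-rate curves in Fig. 5 are measured on held-out testing data.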

Figure 4. ROC values of the top15, mod15 and bad17 datasets for profilekmeans, profilerandom and profileproep across the AB, NB, DT, OneR and RF classifiers.

VI. CONCLUSION


In this paper, we proposed a different classification model by formulating the profiling problem as a clustering problem, using the various features of phishing emails as feature vectors. We then generate profiles from the clustering predictions, so that clusters become elements of profiles. Our proposed model involves two steps. The first step is profiling the phishing data set based on the clustering algorithm's predictions. The next step is classification, where the data set is split according to training:testing ratios and the profiles generated in the profiling stage are used to train the classification algorithms, which then predict the unknown class labels of the input data. The proposed algorithm is compared with Kmeans clustering and a random selection approach on a phishing email data set.
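A minimal sketch of this two-step idea, under the assumptions that a profile is represented by its cluster's centroid and that a simple nearest-profile rule stands in for the trained classifiers (the feature vectors and names below are hypothetical):

```python
def centroid(points):
    """Mean vector of a cluster -- here a profile is simply its centroid."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def build_profiles(clusters):
    """Step 1 (profiling): one profile per cluster, keyed by the cluster's label."""
    return {label: centroid(pts) for label, pts in clusters.items()}

def classify(profiles, x):
    """Step 2 (classification): assign the label of the nearest profile."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(profiles, key=lambda label: dist(profiles[label], x))

# Toy feature vectors (e.g. hyperlink count, suspicious-word count -- assumed features).
clusters = {
    "phishing": [(8.0, 5.0), (9.0, 6.0), (7.0, 5.0)],
    "ham":      [(1.0, 0.0), (2.0, 1.0), (1.0, 1.0)],
}
profiles = build_profiles(clusters)
print(classify(profiles, (8.0, 4.0)))  # phishing
print(classify(profiles, (1.0, 2.0)))  # ham
```

In the paper's actual pipeline the second step uses trained classifiers (AB, NB, DT, OneR, RF) rather than a nearest-centroid rule; the sketch only illustrates how cluster-derived profiles become the training material for classification.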


Figure 5. Number of testing data vs. error rate (Top15 feature set, NB classifier).

TABLE 6. ACCURACY OF DATASETS

Dataset  Profile         AB     DT     NB     OneR   RF
Top15    Profilekmeans   30.64  60.70  40.79  42.50  48.69
Top15    Profilerandom   90.61  90.61  86.96  39.30  93.72
Top15    Profileproep    95.06  95.06  98.23  39.30  96.72
Mod15    Profilekmeans   41.93  39.30  39.98  41.93  39.36
Mod15    Profilerandom   59.66  51.84  66.43  60.70  63.91
Mod15    Profileproep    69.61  70.08  49.63  69.22  69.23
Bad17    Profilekmeans   46.60  46.55  52.59  61.40  47.00
Bad17    Profilerandom   51.83  42.63  53.12  56.42  49.75
Bad17    Profileproep    94.44  92.39  40.09  44.68  92.07

TABLE 7. TP AND FP RATE FOR 3 PROFILES

Dataset  Profile         AB (TP/FP)  DT (TP/FP)  NB (TP/FP)  OneR (TP/FP)  RF (TP/FP)
Top15    Profilekmeans   0.31/0.57   0.61/0.61   0.41/0.40   0.43/0.72     0.49/0.40
Top15    Profilerandom   0.91/0.13   0.91/0.13   0.87/0.10   0.39/0.39     0.94/0.08
Top15    Profileproep    0.95/0.04   0.95/0.04   0.98/0.02   0.39/0.39     0.97/0.04
Mod15    Profilekmeans   0.42/0.38   0.39/0.39   0.40/0.40   0.42/0.38     0.39/0.39
Mod15    Profilerandom   0.60/0.37   0.52/0.38   0.66/0.34   0.61/0.61     0.64/0.33
Mod15    Profileproep    0.70/0.29   0.70/0.34   0.50/0.35   0.69/0.29     0.69/0.31
Bad17    Profilekmeans   0.47/0.59   0.47/0.59   0.53/0.55   0.61/0.59     0.47/0.59
Bad17    Profilerandom   0.52/0.53   0.43/0.43   0.53/0.53   0.56/0.56     0.50/0.52
Bad17    Profileproep    0.94/0.07   0.92/0.09   0.40/0.40   0.45/0.39     0.92/0.10

The experimental results showed that classification accuracy is improved by adopting the ProEP algorithm for selecting the number of clusters. First, the test results showed that the optimal number of clusters is critical to the clustering accuracy of the TwoStep cluster, and that the cluster counts selected by our method are well distributed compared to those of the TwoStep algorithm. Second, the test results showed that our proposed algorithm achieves the highest accuracy and ROC value compared to the Kmeans and random selection approaches. This paper has explored the clustering approach as a profile generator for detecting phishing email. We plan to investigate how the Kmeans and TwoStep approaches can be integrated to further improve clustering accuracy.

REFERENCES

[1] M. Khonji, A. Jones, and Y. Iraqi, "A Study of Feature Subset Evaluators and Feature Subset Searching Methods for Phishing Classification", Proceedings of the 8th Annual Collaboration, Electronic Messaging, Anti-Abuse and Spam Conference (CEAS '11), ACM, New York, USA, pp. 135-144, 2011.
[2] J. Nazario, Phishing corpus, http://monkey.org/~jose/wiki/doku.php?id=phishingcorpus, accessed July 2012.
[3] SpamAssassin, Public corpus, http://spamassassin.apache.org/publiccorpus/, accessed July 2012.
[4] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. Witten, "The WEKA Data Mining Software: An Update", SIGKDD Explorations, Vol. 11, pp. 10-18, 2009.
[5] F. Toolan and J. Carthy, "Feature Selection for Spam and Phishing Detection", Anti-Phishing Working Group 2nd Annual eCrime Researchers Summit (eCrime), pp. 1-12, 2010.
[6] I. Fette, N. Sadeh, and A. Tomasic, "Learning to Detect Phishing Emails", Proceedings of the 16th International Conference on World Wide Web, ACM, New York, NY, USA, pp. 649-656, 2007.
[7] I.R.A. Hamid and J.H. Abawajy, "Hybrid Feature Selection for Phishing Email Detection", 11th International Conference on Algorithms and Architectures for Parallel Processing (ICA3PP), Springer, Berlin, Germany, pp. 266-275, 2011.
[8] M. Bazarganigilani, "Phishing E-Mail Detection Using Ontology Concept and Naïve Bayes Algorithm", International Journal of Research and Reviews in Computer Science (IJRRCS), Vol. 2, No. 2, pp. 1-4, 2011.
[9] M. Chandrasekaran, V. Shankaranarayanan, and S. Upadhyaya, "CUSP: Customizable and Usable Spam Filters for Detecting Phishing Emails", NYS Symposium, Albany, NY, pp. 1-8, 2011.
[10] I.R.A. Hamid and J.H. Abawajy, "Phishing Email Feature Selection Approach", IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom), pp. 916-921, 2011.
[11] Human Genetics Commission and the UK National Screening Committee, "Profiling the Newborn: A Prospective Gene Technology?", http://www.hgc.gov.uk/UploadDocs/Contents/Documents/Final%20Draft%20of%20Profiling%20Newborn%20Report%2003%2005.pdf, pp. 1-43, 2005.
[12] R. Keralapura, A.N. Antonio, and Z. Zhang, "Profiling Users in a 3G Network Using Hourglass Co-Clustering", ACM, pp. 341-352, 2010.
[13] K. Xu, Z. Zhang, and S. Bhattacharyya, "Internet Traffic Behavior Profiling for Network Security Monitoring", IEEE/ACM Transactions on Networking, Vol. 16, No. 6, pp. 1241-1252, 2008.
[14] J.L. Yearwood, M. Mammadov, and D. Webb, "Profiling Phishing Activity Based on Hyperlinks Extracted from Phishing Email", Social Network Analysis and Mining, Springer Vienna, Vol. 2, Issue 1, pp. 5-16, 2012.
[15] J.L. Yearwood, D. Webb, L. Ma, P. Vamplew, B. Ofoghi, and A.V. Kelarev, "Applying Clustering and Ensemble Clustering Approaches to Phishing Profiling", Proc. of the 8th Australasian Data Mining Conference (AusDM '09), pp. 25-34, 2009.
[16] B.H. Kang, J.L. Yearwood, A.V. Kelarev, and D. Richards, "Consensus Clustering and Supervised Classifications for Profiling Phishing Emails in Internet Commerce Security", Knowledge Management and Acquisition for Smart Systems and Services, Lecture Notes in Computer Science, Vol. 6232, Springer Berlin Heidelberg, pp. 235-246, doi:10.1007/978-3-642-15037-1_20, 2010.
[17] A.V. Kelarev, A. Stranieri, J.L. Yearwood, and H. Jelinek, "Empirical Investigation of Consensus Clustering for Large ECG Data", 25th International Symposium on Computer-Based Medical Systems (CBMS), pp. 1-4, 2012.
[18] N. Iam-On, T. Boongoen, S. Garrett, and C. Price, "A Link-Based Cluster Ensemble Approach for Categorical Data Clustering", IEEE Transactions on Knowledge and Data Engineering, Vol. 24, No. 3, pp. 413-425, 2012.
[19] R.B. Basnet and A.H. Sung, "Classifying Phishing Emails Using Confidence-Weighted Linear Classifiers", International Conference on Information Security and Artificial Intelligence (ISAI 2010), IEEE, pp. 108-112, 2010.
[20] B. Hoanca and K. Mock, "Using Market Basket Analysis to Estimate Potential Revenue Increases for a Small University Bookstore", Conference for Information Systems Applied Research (CONISAR 2011), Wilmington, North Carolina, USA, Vol. 4, No. 1822, pp. 1-11, 2011.
[21] K.S. Xu, M.M. Kliger, C. Yilun, P.J. Woolf, and A. Hero, "Revealing Social Networks of Spammers Through Spectral Clustering", Proc. of the IEEE International Conference on Communications, Dresden, Germany, doi:10.1109/ICC.2009.5199418, 2009.
[22] Anti-Phishing Working Group, "Phishing Activity Trends Report, 1st Half 2011". [Online]. Available: http://docs.apwg.org/reports/apwg_trends_report_h1_2011.pdf.
[23] Anti-Phishing Working Group, "Phishing Activity Trends Report, 1st Quarter 2012", July 2012. [Online]. Available: http://docs.apwg.org/reports/apwg_trends_report_q1_2012.pdf.
[24] Anti-Phishing Working Group, "Phishing Activity Trends Report, 2nd Quarter 2012", September 2012. [Online]. Available: http://docs.apwg.org/reports/apwg_trends_report_q2_2012.pdf.
[25] J. Bacher, K. Wenzig, and M. Vogler, "SPSS TwoStep Cluster: A First Evaluation", Erlangen-Nürnberg, http://www.statisticalinnovations.com/products/twostep.pdf, 2004 (retrieved February 12, 2013).
[26] M. Khonji, Anti-Phishing Studies, http://khonji.org/phishing_studies.html, accessed July 2012.
[27] IBM, "Cluster (TwoStep clustering algorithms)", http://publib.boulder.ibm.com/infocenter/spssstat/v20r0m0/topic/com.ibm.spss.sta.
[28] I.R.A. Hamid, J.H. Abawajy, and T. Kim, "Using Feature Selection and Classification Scheme for Automating Phishing Email Detection", Studies in Informatics and Control, ISSN 1220-1766, Vol. 22(1), pp. 61-70, 2013.