2011 International Joint Conference of IEEE TrustCom-11/IEEE ICESS-11/FCST-11

Phishing Email Feature Selection Approach Isredza Rahmi A Hamid and Jemal Abawajy, Senior Member, IEEE School of Information Technology, Deakin University, Melbourne, Australia. {iraha, jemal}@deakin.edu.au

Abstract—Phishing emails are increasingly dynamic and pose a high risk of significant data, brand and financial loss to average computer users and organizations. To address this problem, we propose a hybrid feature selection approach based on a combination of content-based and behavior-based features. Our proposed hybrid feature selection achieves a 93% accuracy rate, comparing favourably with other approaches. In addition, we successfully tested the quality of our proposed behavior-based features using Information Gain, Gain Ratio and Symmetrical Uncertainty.

Keywords- Internet Security, Behavior-based, Feature Selection, Phishing.

I. INTRODUCTION

Protecting Internet users from phishing attacks is a significant research issue [20]. Phishing emails are on the rise, as attackers send messages ranging from very simple to highly sophisticated in order to lure Internet users into giving up their credentials. Fraudulent emails harm their victims by claiming to come from trusted companies and deceiving them into providing sensitive information such as account numbers or passwords. Generally, the phishing email content urges the victim to update personal information to avoid losing access rights to services provided by the organization. The email lures the user to a bogus web site implemented by the attacker, or plants malware on the user's computer that can spy on the user's activity. According to the Anti-Phishing Working Group phishing trend report, the number of phishing attacks through email doubled between 2005 and 2009, from about 170,000 to about 440,000 [2]. Based on a Gartner survey, adults in the U.S. have received an average of 112 phishing e-mails, with the average loss per victim estimated at $4,362.

There are a number of good phishing email detection approaches [5], [6], [8], [12], [14], [17], [20]. Although there are clear advantages to filtering phishing at the email level, the number of phishing attacks keeps increasing. This is because attackers constantly change their modus operandi to defeat anti-phishing techniques. Moreover, most phishing emails are nearly identical to normal email, so existing anti-phishing techniques are not able to fully detect phishing attacks. Furthermore, most existing email filtering approaches are static, and static approaches can easily be defeated by modifying the contents of the emails and the link strings.

In this paper, we present a hybrid feature selection approach that combines content-based and behavior-based features to detect phishing email. We discuss behavior-based features in phishing emails which cannot be disguised by an attacker. We analyze the email header, which is usually neglected by others; specifically, we examine the message-ID tag and the sender email address in order to mine the attacker's behavior. This study applies the proposed hybrid feature selection to 6923 emails drawn from the Nazario [15] phishing email collection, ranging from 2004 to 2007, and from SpamAssassin [18] as ham emails. The results show that the proposed hybrid feature selection approach is effective in identifying and classifying phishing email.

The remainder of this paper is organized as follows. Section II describes recently proposed research on phishing email detection. Section III presents the hybrid feature selection approach, including the data and feature set used in the experiments. Section IV describes the experimental setup and performance metrics, Section V presents the performance analysis results, and Section VI concludes the work and discusses directions for future research.

978-0-7695-4600-1/11 $26.00 © 2011 IEEE DOI 10.1109/TrustCom.2011.126

II. RELATED WORK

A number of anti-phishing techniques have been proposed in recent years to combat phishing attacks, and various feature selection approaches have been introduced to detect phishing. Most prior research [6], [12], [17] focused on the email content in order to classify emails as either abnormal or normal. A previous attempt by [12] presents an approach based on natural structural characteristics of emails; the features include the richness of words in the email. They tested 400 emails using a Support Vector Machine classifier, and their results show that detection improves as more features are used. However, the significance of the results is hard to assess because of the small size of the email collection used in the experiment. Fette et al. [6] considered a URL-based approach and the presence of JavaScript in emails for their feature selection. Nine features were extracted from the email, and the last feature was obtained from a WHOIS query. They followed a similar approach to [12] but used larger datasets of about 7000 normal emails and 860 phishing emails. URL-based features are not the best approach, because phishers can use freely available tools to obfuscate a URL and make it look usable. An approach that focuses on keywords to classify phishing emails is discussed in [17]. The features represent the frequency of 43 keywords that appear in phishing and legitimate emails. They tested on a public collection of about 1700 phishing emails and 1700 legitimate emails from private mailboxes. As phishing emails always look similar to normal email, this approach might no longer be reliable.

Behavior-based phishing detection approaches have been proposed by [5], [8], [14]. Zhang et al. [8] detect abnormal mass-mailing hosts in the network layer by mining the traffic in the session layer. Toolan et al. [5] investigated 40 features and proposed behavioral features: the number of words in the send field, the total number of characters in the sender field, the difference between the sender's domain and the reply-to domain, and the difference between the sender's domain and the email's modal domain. Ahmed Syed et al. [14] proposed behavioral blacklisting using 4 log-based features on live data. Ma et al. [9] claimed to classify phishing emails based on hybrid features; they used 7 features derived from 3 types of email features (content, orthographic and derived features), which can also be considered a content-based approach. It has been observed that the outputs of different classifiers vary from one another on the same corpora [20]. The impact of classifier rescheduling in multi-tier classification of phishing email, in order to find the best scheduling in the classification process, is also discussed in [20]; empirical evidence shows that rescheduling the classifiers gives diverse outcomes in terms of accuracy as well as the number of false positive instances.

In terms of detecting phishing using content, text-based classification is not the best approach, because phishing messages are nearly identical to normal emails. Content-based filtering might be a more effective technique if the messages had a long lifetime and a large amount of duplication; however, attackers tend to produce new content to avoid detection, for example by building phishing pages from non-HTML components such as images, Flash objects and Java applets. Moreover, the updating rate of filters is often defeated by the changing rate of the attacks, because phishing e-mails continuously modify senders and link strings. Therefore, this remains an open problem. Although there are clear advantages to filtering phishing attacks at the email level, there is at present very little research specifically designed to target phishing emails based on phishing behavior.

This paper proposes a hybrid phishing email feature selection approach. We mine email header information to evaluate attacker behavior, examining the sender's email address and the sender's domain name. Email messages have two basic parts: the header and the body. The header contains information about who the message was sent from, the recipients, the date and the route, with optional fields such as received, reply-to, subject and message-ID; this is followed by the body of the message. In our analysis, we considered the "message-ID" and the "from" tag in the email header. We experimented with five features belonging to the email structure and an additional two features extracted based on sender behavior. The features are listed below:

1) Domain_email_sender (DES): This binary feature represents the similarity between the domain name extracted from the email sender and the domain of the message-ID. If they are similar, we consider the email normal and set the value to 0; otherwise we set the value to 1 to indicate that the email is abnormal. This feature was proposed by [5].

2) Subject_blacklist_words (SBW): This binary feature represents the appearance in the subject of an email of blacklist words included in the bags of words in [12]. If the email subject contains a blacklist word, the email is abnormal and the value is set to 1. This feature has been used in [9].

3) URL_IP (URLIP): This numerical feature is the number of links that use an IP address. This feature has been used in [1].

4) URL_dots (URLD): This numerical feature is the number of links in the email that contain more than 5 dots. This feature has been used in [12], although there the maximum number of dots in each link is calculated instead.

5) URL_symbol (URLS): This numerical feature is the number of links in the email that contain symbols. It has been used in [19], but we incorporate additional symbols such as "%" and "&" to detect obfuscated URLs.

In [7], we presented the basic system model for phishing detection. In this paper, we extend our previous work by proposing a hybrid feature selection approach based on a combination of content-based and behavior-based features. The behavior features proposed in this paper capture the sender behavior, i.e., whether a sender sends emails from more than a single domain, and whether a domain name is used by more than one email sender.
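The content-based features above can be sketched as small extractor functions. This is our own illustrative sketch, not the paper's implementation: the blacklist word set here is a made-up stand-in (the paper uses the words from [11]/[12], which are not listed), and the parsing is deliberately simplistic.

```python
import re

# Hypothetical stand-in blacklist; the paper's actual 18 words are not listed.
BLACKLIST_WORDS = {"verify", "account", "suspended", "urgent", "password"}

def domain_email_sender(sender_domain: str, message_id_domain: str) -> int:
    """DES feature: 0 if sender domain matches the message-ID domain, else 1."""
    return 0 if sender_domain.lower() == message_id_domain.lower() else 1

def subject_blacklist_words(subject: str) -> int:
    """SBW feature: 1 if any blacklist word appears in the subject, else 0."""
    words = set(re.findall(r"[a-z]+", subject.lower()))
    return 1 if words & BLACKLIST_WORDS else 0

def url_ip(urls) -> int:
    """URLIP feature: number of links whose host is a raw IP address."""
    ip_host = re.compile(r"^https?://\d{1,3}(?:\.\d{1,3}){3}")
    return sum(1 for u in urls if ip_host.match(u))

def url_dots(urls) -> int:
    """URLD feature: number of links containing more than 5 dots."""
    return sum(1 for u in urls if u.count(".") > 5)

def url_symbol(urls) -> int:
    """URLS feature: number of links containing '@', '%' or '&'."""
    return sum(1 for u in urls if any(s in u for s in "@%&"))
```

For example, `domain_email_sender("paypal.com", "qmb02q")` returns 1 (mismatched domains, flagged as abnormal), while `url_ip(["http://192.168.0.1/login"])` counts one IP-based link.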

III. HYBRID FEATURE SELECTION APPROACH

In this section, we describe the proposed hybrid feature selection (HFS) algorithm. The algorithm uses the domain_email_sender (DES), subject_blacklist_word (SBW), URL_dots (URLD), URL_symbol (URLS), URL_IP (URLIP), Unique_sender (US) and Unique_domain (UD) feature values. Table 1 describes the notation commonly used in this algorithm.

TABLE 1. List of notations

Notation | Description
ES       | Email sender
SE       | Email's subject
DES      | Domain email sender
MID      | Message ID
DMID     | Domain message ID
URL      | Uniform Resource Locator

The HFS algorithm determines a feature matrix used to predict whether an email message is a phishing message or not. We developed a methodology to extract seven features from each email [7].
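The feature-matrix construction can be read as a simple loop over emails and extractors. The sketch below is our own high-level rendering; the concrete extractor functions for DES, SBW, URLD, URLS, URLIP, US and UD are assumed rather than given.

```python
def hfs_feature_matrix(emails, extractors):
    """Build the 7-column feature matrix described by the HFS algorithm.

    `emails` is a list of parsed email records; `extractors` is an ordered
    list of feature functions (DES, SBW, URLD, URLS, URLIP, US, UD) -- the
    concrete implementations are assumed, not specified here.
    """
    matrix = []
    for email in emails:                      # step 1: each incoming email
        row = [f(email) for f in extractors]  # steps 2-6: extract each feature
        matrix.append(row)                    # step 7: grow the feature matrix
    return matrix
```

The resulting rows feed directly into the classifiers evaluated later in the paper.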


First, each email message is partitioned into four components: ES, SE, MID and URL. Then the DES value is extracted from ES, while DMID is extracted from MID, as shown in Table 2. The inputs to the HFS algorithm are DES, SBW, URLD, URLS, URLIP, US and UD, as shown in Fig. 1. In step 1, a count is read for each email. In steps 2 to 5, each incoming email runs functions that verify the sender domain, identify blacklist words, perform URL feature matching and identify sender behavior in order to extract the features and, finally, construct the feature matrix.

Algorithm HFS
INPUT: DES, SBW, URLD, URLS, URLIP, US, UD
BEGIN
1:  FOR (each incoming EMAIL) DO
2:    FOR (i to K) DO
3:      Verify sender domain;
4:      Identify blacklist word;
5:      Perform URL feature matching;
6:      Identify sender behavior;
7:      Construct feature matrix;
8:    ENDFOR
9:  ENDFOR
END HFS

Figure 1. Hybrid Feature Selection (HFS) Algorithm

TABLE 2. Extracted data from email sender and message-ID

ES                 | DES        | MID                                 | DMID
service@paypal.com | paypal.com | q4-6$c--0--$-w@yahoo.com            | yahoo.com
support@paypal.com | paypal.com | YYBTIESSYZXKLGFQVFPTNK…@hotmail.com | hotmail.com
mark@yahoo.com     | yahoo.com  | …@spawn.se7en.org                   | spawn.se7en.org
mark@talios.com    | talios.com | …@spawn.se7en.org                   | spawn.se7en.org

1) Verify sender domain

To verify the sender domain, we compare the values of DES and DMID. If the DES value is the same as DMID, f_i = 0; otherwise f_i = 1. The relationship between DES and DMID is defined as follows (1):

    f_i = 0 if DES ≡ DMID
    f_i = 1 if DES ≠ DMID        (1)

2) Identify blacklist word

To identify whether the email's subject contains a blacklist word, we analyze the subject against the 18 blacklist words proposed in [11]. If the email subject has the feature value SBW_ij, it gets the corresponding score f_ij. The total score is then (2):

    f_ij = Σ_{t=1}^{18} SBW_ij        (2)

If f_ij > 0, the email's subject is labeled as phishing; if f_ij ≤ 0, the email's subject is labeled as normal.

3) URL Feature Matching

URL feature matching determines whether a URL in a message is malicious or not. The URL-based features are: IP address, more than 5 dots, and the symbols '%', '&' and '@'. We check these 3 features of each tested URL and give the corresponding score; finally, the total score for each feature is calculated. If F ≤ 0, the email message is labeled as a phishing message, otherwise it is a normal message. First, we construct the frequency distribution of each feature F_i, i = 1,…,3, over all datasets, where FD_i denotes the frequency distribution of the values that F_i takes across all datasets. We decide a set R_i to score, which may contain one value or multiple values; if a feature value does not exist in R_i, we set it to zero. The URL feature matching procedure Score(FD_i, R_i) takes FD_i and R_i as inputs, and its output is the scoring tabulation of the feature values within R_i. For each element in R_i, the corresponding score is calculated recursively as follows: a) Calculate the number n_i of datasets that have the feature value R_ij, where R_ij is an element of R_i which may contain zero, one or multiple values. b) After j rounds over the elements of R_i, a scoring model (R_ij; S_ij) of the feature F_i is produced. If a URL message has the feature value R_ij, it gets the corresponding score S_ij. The total score is then (3):

    S = Σ_{t=1}^{3} S_ij        (3)

Because the number of occurrences of each feature varies (for example, the URL-dots feature can take multiple values across links), each feature value needs to be normalized before the classification process. Features with numerical values are normalized as the quotient of the actual value over the maximum value of that feature, so that numerical values are limited to the range [0, 1]. The value for each feature is then (4):

    F_ij = S_ij / max S_ij        (4)

4) Identify sender behavior

In this section, we present the sender behavior (SB) algorithm to determine unique sender and unique domain behavior. The data mining for sender behavior analyzes the email header, which has the structure shown in Table 3.
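Read this way, the unique-sender and unique-domain flags can be sketched compactly. This is our own interpretation of the SB pseudo-code, chosen to reproduce the US/UD values illustrated in Table 3; the helper names are ours.

```python
from collections import defaultdict

def sender_behavior(records):
    """For each (sender, message-ID domain) record, compute:
    US = 1 if that sender appears with more than one message-ID domain,
    UD = 1 if that domain is used by more than one sender.
    One interpretation of the SB algorithm, matching Table 3's examples."""
    domains_per_sender = defaultdict(set)
    senders_per_domain = defaultdict(set)
    for sender, dmid in records:
        domains_per_sender[sender].add(dmid)
        senders_per_domain[dmid].add(sender)
    return [(1 if len(domains_per_sender[s]) > 1 else 0,
             1 if len(senders_per_domain[d]) > 1 else 0)
            for s, d in records]
```

On the four Table 3 rows, this yields (US, UD) = (1, 0), (1, 0), (0, 0), (0, 0): the paypal.com sender used two different message-ID domains, while no domain was shared between senders.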

TABLE 3. Datasets for sender behavior

ES                 | DES        | MID                                | DMID            | US | UD
service@paypal.com | paypal.com | q4-6$c--0--$-w@qmb02q              | qmb02q          | 1  | 0
service@paypal.com | paypal.com | YYBTIESSYZXKLGFQVFPTN…@hotmail.com | hotmail.com     | 1  | 0
mark@talios.com    | talios.com | …@spawn.se7en.org                  | spawn.se7en.org | 0  | 0
mark@talios.com    | talios.com | …@spawn.se7en.org                  | spawn.se7en.org | 0  | 0

Fig. 2 shows the pseudo-code of the SB algorithm. The inputs to SB are the lists of email senders (ES) and domain message-IDs (DMID). In steps 2 to 15, each incoming email's sender and domain message-ID are compared against those of every other email message.

Algorithm SB
INPUT: ES, DMID
BEGIN
1:  FOR (each incoming EMAIL) DO
2:    FOR (i to K) DO
3:      FOR (j to K) DO
4:        IF (ES[i] == ES[j] && DMID[i] == DMID[j]) THEN
5:          US = US;
6:          UD = UD;
7:        ELSEIF (ES[i] == ES[j] && DMID[i] != DMID[j]) THEN
8:          US++;
9:          UD++;
10:       ELSEIF (ES[i] != ES[j] && DMID[i] == DMID[j]) THEN
11:         US = US;
12:         UD++;
13:       ELSE
14:         US = US;
15:         UD = UD;
16:       ENDIF
17:     ENDFOR
18:     IF (US > 0) THEN
19:       US = 1;
20:     ELSEIF (UD > 0) THEN
21:       UD = 1;
22:     ELSE
23:       US = 0;
24:       UD = 0;
25:   ENDFOR
END SB

Figure 2. Algorithm for mining sender behavior

If the same email sender sends email using the same domain, the unique sender (US) and unique domain (UD) values are left unchanged. If the same email sender sends email using a different domain, both US and UD are incremented. If a different email sender sends email using the same domain, US is left unchanged while UD is incremented by 1. If none of these conditions is satisfied, US and UD are left unchanged. In steps 18 to 24, any US or UD value greater than 0 is set to 1; otherwise it is set to 0.

5) Constructing Feature Matrix

In this section, we construct the feature matrix of the 7 features F_i, i = 1,…,7, for all phishing and normal datasets. Note that F1, F2, F6 and F7 are binary, while F3, F4 and F5 are numerical values ranging from 0 to 1. The R_i value for each feature is summarized in Table 4.

TABLE 4. Feature matrix used in hybrid scheme

Feature | Description                                             | R_i value
F1      | Similarity of domain email sender and domain message-ID | R1 = {0, 1}
F2      | Email's subject blacklist word                          | R2 = {0, 1}
F3      | IP-based URL                                            | R3 = {0~1}
F4      | URL contains more than 5 dots                           | R4 = {0~1}
F5      | URL contains symbol                                     | R5 = {0~1}
F6      | Unique sender behavior                                  | R6 = {0, 1}
F7      | Unique domain behavior                                  | R7 = {0, 1}

Let E = {e_1, e_2, …, e_|E|} and F = {f_1, f_2, …, f_|F|} denote the set of all emails and the feature vector space, respectively, so that |E| is the total number of emails and |F| is the size of the feature vector. Let a_ik be the value of the kth feature of the ith document. Each document is then represented as A_i = (a_i1, a_i2, …, a_i|F|), with i = 1, 2, …, |E| and k = 1, 2, …, |F|, where each document consists of A = {ES, SE, DES, MID, DMID, URL}.

IV. PERFORMANCE ANALYSIS

A. Experiment

This section presents our experimental setup. For our preliminary experiments, we used the freely available pre-classified phishing datasets from [15]; classification was performed using WEKA (Waikato Environment for Knowledge Analysis). We used 4 phishing data files that have been used in prior research, including work by [3], [4], [5], [6], [10], [11], [13] and [17]. As the ham collection, we used the easy ham directory of the SpamAssassin Project [18], which provides 2364 ham emails. From the overall collection we generated 3 datasets, each containing approximately 3000 phishing and ham emails. Set 1 consists of all ham emails together with the phishing emails ranging from November 2004 to November 2005. Set 2 is taken from the third phishing collection together with all ham emails. Finally, set 3 contains the last phishing dataset and the whole ham collection as well. Datasets for set 2 and set 3


which contained unreadable symbols, Chinese-language text or Nigerian online scams were excluded. Table 5 summarizes the 4 phishing dataset files.

TABLE 5. Phishing dataset files summary

Duration                      | Email messages
27 Nov 2004 – 13 June 2005    | 414
14 June 2005 – 14 Nov 2005    | 443
15 Nov 2005 – 7 August 2006   | 1423
7 August 2006 – 7 August 2007 | 2279
Total phishing datasets       | 4559

For each set, the data was divided into two randomly generated disjoint subsets to form the training set and the testing set. The classifier was trained on the training data, while accuracy was measured on the testing data. We constructed two configurations with training:testing ratios of 60:40 and 70:30. For the 60:40 configuration, each training set contains 1800 emails and each testing set approximately 1200; for the 70:30 configuration, each training set contains 2100 emails and each testing set 900.

B. Performance Metric

To measure the effectiveness of the classification, we refer to the four possible outcomes: (1) True positive (TP): the classifier correctly identifies an instance as positive. (2) False positive (FP): the classifier incorrectly identifies an instance as positive when it is in fact negative. (3) True negative (TN): the classifier correctly identifies an instance as negative. (4) False negative (FN): the classifier incorrectly identifies an instance as negative when it is in fact positive. To measure the effectiveness of our approach, we measure the accuracy of several classification algorithms, where accuracy (A) is the percentage of all decisions that were correct: A = (TP + TN) / (TP + TN + FP + FN).

V. RESULTS AND DISCUSSIONS

This section presents the classification outcome of the extracted features.

A. Feature Selection

Table 6 presents the experimental results for 4 classifiers, Bayes Net (BN), AdaBoost, Decision Table (DT) and Random Forest (RF), on the three sets of data. Our results show that the hybrid feature selection, combining content-based and behavior-based features, is quite promising; this indicates that features based on sender and domain behavior could be considered for determining phishing email. We tested the 3 sets of data on the two data-size configurations, 60:40 and 70:30 (training:testing). For the 60:40 configuration, Bayes Net outperformed the other classifiers and achieved the highest accuracy, 92%, on set 2; in contrast, AdaBoost showed the lowest accuracy, only 80%, on set 3. For the 70:30 configuration, Random Forest achieved the highest accuracy on set 2 compared with the other three algorithms, while AdaBoost yet again showed the lowest accuracy, only 75%, on set 2. These results suggest that Bayes Net works well because of its capability to manipulate tokens and associated probabilities according to the user's classification decisions and empirical performance, which agrees with [1].

TABLE 6. Classification results (accuracy) for the 3 sets of data

Classifier | 60:40 Set 1 | 60:40 Set 2 | 60:40 Set 3 | 70:30 Set 1 | 70:30 Set 2 | 70:30 Set 3
BN         | 88%         | 92%         | 87%         | 84%         | 87%         | 92%
AdaBoost   | 91%         | 91%         | 80%         | 84%         | 75%         | 89%
DT         | 88%         | 89%         | 86%         | 86%         | 91%         | 91%
RF         | 89%         | 91%         | 87%         | 86%         | 93%         | 92%

B. Comparative Analysis

In Table 7, we compare our results with existing works that used the same dataset as [12] and achieved at least 80% accuracy. Fette et al. [6] proposed 10 features, mostly based on URLs and script presence, and achieved 96% accuracy using Random Forest as the classifier. Abu-Nimeh et al. [17] examined 43 keywords generated using TF-IDF (Term Frequency-Inverse Document Frequency) as indicators to determine the best machine learning technique for phishing email detection. They compared the accuracy of several machine learning methods, including Logistic Regression (LR), Classification and Regression Trees (CART), Bayesian Additive Regression Trees (BART), Support Vector Machines (SVM), Random Forests (RF) and Neural Networks (NNet), for predicting phishing emails, and found that the neural network performed best with 94.5% accuracy. Toolan et al. [5] used a content-based and behavior-based approach to classify phishing email similar to the one described in the current paper. They used 22 features tested on 3 datasets comprising 6097 samples and achieved approximately 97% accuracy. Finally, we include our own work, which proposes hybrid feature selection using 7 features; we achieved 93% accuracy over 6923 samples. Even though our accuracy is slightly lower, we obtained it using the fewest features compared with the other approaches.

TABLE 7. Comparison of our approach with existing works

Work                         | Features | Feature approach                   | Sample | Accuracy
Fette et al. (2006) [6]      | 10       | URL-based and script-based         | 7810   | 96%
Abu-Nimeh et al. (2007) [17] | 43       | Keyword-based                      | 2889   | NN (94.5%), RF (94.4%), SVM (94%), LR (93.8%), BART (93.2%), CART (91.6%)
Toolan et al. (2009) [5]     | 22       | Behavioral-based and content-based | 6097   | Dataset 1 (97%), Dataset 2 (84%), Dataset 3 (79%)
Our approach                 | 7        | Hybrid features                    | 6923   | Set 1 (91%), Set 2 (93%), Set 3 (92%)
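The paper ran its classifier comparison in WEKA; the sketch below mirrors that kind of comparison using scikit-learn analogues on synthetic data, since the real feature matrix is not reproduced here. Naive Bayes stands in for Bayes Net and a decision tree stands in for the decision table, so the numbers it prints are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 7-feature phishing/ham matrix.
X, y = make_classification(n_samples=3000, n_features=7, random_state=0)

# 70:30 train/test split, as in the second experimental configuration.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

classifiers = {
    "NB (stand-in for Bayes Net)": BernoulliNB(),
    "AdaBoost": AdaBoostClassifier(random_state=0),
    "DT (stand-in for Decision Table)": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}
for name, clf in classifiers.items():
    acc = clf.fit(X_tr, y_tr).score(X_te, y_te)  # accuracy on held-out data
    print(f"{name}: {acc:.2%}")
```

Swapping in the actual extracted feature matrix for `X` and the phishing/ham labels for `y` would reproduce the structure of Table 6's experiment.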

C. Performance of each Feature

In our experiments, we used 7 features to determine whether an email is phishing or not. However, not every feature is a good indicator; some perform well and some are weak. Therefore, we assess the quality of our proposed behavior-based features using three algorithms, Information Gain (IG), Gain Ratio (GR) and Symmetrical Uncertainty (SU), tested on set 1.


TABLE 8. Ranking of each feature for set 1

Rank | IG   | Feature | GR   | Feature | SU   | Feature
1    | 0.38 | SBW     | 0.54 | SBW     | 0.52 | SBW
2    | 0.23 | URLIP   | 0.52 | URLIP   | 0.38 | URLIP
3    | 0.19 | US      | 0.27 | URLD    | 0.24 | US
4    | 0.11 | URLD    | 0.24 | US      | 0.19 | URLD
5    | 0.03 | UD      | 0.05 | URLS    | 0.04 | UD
6    | 0.02 | DES     | 0.03 | UD      | 0.02 | DES
7    | 0.01 | URLS    | 0.02 | DES     | 0.02 | URLS
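Information Gain, one of the three ranking measures used for Table 8, can be computed from first principles as I(F;Y) = H(Y) − H(Y|F). The sketch below uses toy binary data of our own, not the paper's corpus.

```python
from math import log2

def entropy(probs):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)

def information_gain(pairs):
    """IG of a feature for a label: pairs is a list of (feature, label),
    both discrete. Computes H(Y) - H(Y|F)."""
    n = len(pairs)
    labels = [y for _, y in pairs]
    h_y = entropy([labels.count(c) / n for c in set(labels)])
    h_y_given_f = 0.0
    for v in set(f for f, _ in pairs):
        subset = [y for f, y in pairs if f == v]
        h_y_given_f += (len(subset) / n) * entropy(
            [subset.count(c) / len(subset) for c in set(subset)])
    return h_y - h_y_given_f

# A perfectly predictive binary feature yields IG = H(Y) = 1 bit:
print(information_gain([(0, 0), (0, 0), (1, 1), (1, 1)]))  # → 1.0
```

Gain Ratio then divides IG by the entropy of the feature itself, and Symmetrical Uncertainty normalizes IG by the sum of both entropies; the same `entropy` helper covers all three.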

Table 8 shows the ranking of each feature, where "subject_blacklist_word" is confirmed as the best feature by all three algorithms. "URL_symbol" is the least effective feature under IG and SU, while "domain_email_sender" is the least significant feature under GR. Our proposed behavior features are among the top six features under all three algorithms. This shows that our proposed hybrid features are effective and could be considered for determining phishing emails.

VI. CONCLUSION AND FUTURE DIRECTION


In this paper, we proposed behavior-based features to detect phishing emails by observing sender behavior. We mine the sender behavior to identify whether an email comes from a legitimate sender or not. This hybrid feature selection approach achieved 93% accuracy. The work presented in this paper used only part of the potential features. As future work, we will improve the normalization process and identify features for phishing profiling, which could be used to profile the attacker.


REFERENCES

[1] A. Bergholz, G. Paab, F. Reichartz, S. Strobel, J. H. Chung: Improved Phishing Detection using Model-Based Features. In Proceedings of the International Conference on E-mail and Anti-Spam (2008).
[2] The Anti-Phishing Working Group. Available: http://www.apwg.org/
[3] C. Liu: Fighting Unicode-Obfuscated Spam. Proceedings of E-Crime Research (2007).
[4] F. Toolan, J. Carthy: Phishing Detection using Classifier Ensemble. In E-Crime Researchers Summit (2009).
[5] F. Toolan, J. Carthy: Feature Selection for Spam and Phishing Detection. In E-Crime Researchers Summit (2010).
[6] I. Fette, N. Sadeh, A. Tomasic: Learning to Detect Phishing Emails. Technical report, School of Computer Science, Carnegie Mellon University (2006).
[7] I. R. A Hamid, J. Abawajy: Hybrid Feature Selection for Phishing Email Detection. The 11th International Conference on Algorithms and Architectures for Parallel Processing, Melbourne, Australia (2011).
[8] J. Zhang, Z. Du, W. Liu: A Behavior-based Detection Approach To Mass-Mailing Host. In Proceedings of the Sixth International Conference on Machine Learning and Cybernetics (2007).
[9] L. Ma, B. Ofoghani, P. Watters, S. Brown: Detecting Phishing Emails Using Hybrid Features. Symposia and Workshops on Ubiquitous, Autonomic and Trusted Computing (2009).
[10] L. Zhou, Y. Shi, D. Zhang: A Statistical Language Modeling Approach to Online Deception Detection. IEEE Transactions on Knowledge and Data Engineering (2007).
[11] M. Bazarganigilani: Phishing E-Mail Detection Using Ontology Concept and Naïve Bayes Algorithm. International Journal of Research and Reviews in Computer Science (IJRRCS) (2011).
[12] M. Chandrasekaran, K. Narayanan, S. Upadhyaya: Phishing Email Detection Based on Structural Properties. Proceedings of the NYS Cyber Security Conference (2006).
[13] M. Chandrasekaran, V. Shankaranarayanan, S. Upadhyaya: CUSP: Customizable and Usable Spam Filters for Detecting Phishing Emails. NYS Symposium, Albany, NY (2008).
[14] N. Ahmed Syed, N. Feamster, A. Gray: Learning To Predict Bad Behavior. NIPS 2007 Workshop on Machine Learning in Adversarial Environments for Computer Security (2008).
[15] J. Nazario: Phishing Corpus. Available: http://www.monkey.org/jose/wiki/doku.php?id=phishingcorpus
[16] R. B. Basnet, A. H. Sung: Classifying Phishing Emails Using Confidence-Weighted Linear Classifiers. 2010 International Conference on Information Security and Artificial Intelligence (ISAI 2010).
[17] S. Abu-Nimeh, D. Nappa, X. Wang, S. Nair: A Comparison of Machine Learning Techniques for Phishing Detection. Proceedings of the APWG eCrime Researchers Summit, Pittsburgh, USA (2007).
[18] SpamAssassin public corpus. Available: http://spamassassin.apache.org/publiccorpus
[19] W. N. Gansterer, D. Polz: E-Mail Classification for Phishing Defense. In LNCS Advances (2009).
[20] Md. R. Islam, J. Abawajy, M. Warren: Multi-tier phishing email classification with an impact of classifier rescheduling. Proceedings of the 10th International Symposium on Pervasive Systems, Algorithms and Networks (2009).