A survey of Learning Based Techniques of Phishing Email Filtering ...

8 downloads 21045 Views 873KB Size Report
Sep 16, 2015 - incoming email contents with embedded links with attached HTML forms. ... code), two of spam filter features (such as Spamassassin Boolean ...
A survey of Learning Based Techniques of Phishing Email Filtering Ammar ALmomani,Tat-Chee Wan, Ahmad Manasrah,Altyeb Altaher,Eman Almomani, Karim Al-Saedi,Ahmad ALnajjar and Sureswaran Ramadass

A survey of Learning Based Techniques of Phishing Email Filtering 1

Ammar ALmomani,1,2Tat-Chee Wan, 3Ahmad Manasrah,1Altyeb Altaher,2Eman Almomani, 1 Karim Al-Saedi,1Ahmad ALnajjar and 1Sureswaran Ramadass 1 National Advanced IPv6 Centre (NAV6), UniversitiSains Malaysia, 11800 USM, Penang, Malaysia ammarali,tcwan,altyeb,karim,alnajjar,[email protected] 2 School of Computer Sciences, UniversitiSains Malaysia, 11800 USM, Penang, Malaysia [email protected] 3 Faculty of Information Technology and Computer Sciences, Yarmouk University, 21163,Irbid, Jordan [email protected]

Abstract Phishing email is one of the major issues of the web nowadays ensuing in monetary losses for organizations and individual users. Varied approaches are developed to filter phishing emails. The current paper focuses on machine learning applications used to detect and predict phishing emails. Moreover, a comparative study and analysis of the assorted filtering ways on the market for organizations and individual users is additionally undertaken. The conducted survey is an organized guide to support the current state of the literature, in read of the subject’s wide-ranging scope of papers.

Keywords: Phishing Email Filtering, Machine Learning 1. Introduction A phishing email is a special type of spam message. Such an email is a criminal mechanism related to social engineering schemes that rely on forged email claims originating from a legitimate company or bank. Subsequently, through an embedded link within the email, the phisher attempts to redirect users to fake Web sites. These fake Web sites are designed to obtain financial data from victims fraudulently, including usernames, passwords, and credit card numbers [1-3] Phishing emails pose a serious threat to electronic commerce because they are used broadly to defraud both individuals and financial organizations on the Internet. The problem on phishing emails has increasingly grown. A survey by Gartner [4]on phishing attacks shows that approximately 3.6 million clients in the US have lost money to phishing attacks. Total losses have reached approximately US$3.2 billion, whereas the number of user victims increased from 2.3 million in 2006 to 3.6 million in 2007, an increase of 56.5%. One of the newest reports by the Anti-Phising Work Group (APWG) eCrime Trends Report[5].cited that phishing attacks have been increasing yearly. Statistically, phishing in Q1 2011 grew by 12% over Q1 2010. Several problems result from phishing email, including the defrauding of banks, financial companies, and users. The current paper presents an overview of the machine learning applications used to detect and predict phishing emails. The evaluation and comparison of different approaches in the literature on phishing email filtering are given a great deal of attention. The current paper is organized as follows: Section 2 details the types of phishing attacks. Section 3 presents the machine-learning techniques for phishing email detection and prediction. Section 4 details the comparative results. Finally, Section 5 Summaries.

International Journal of Digital Content Technology and its Applications(JDCTA) Volume6,Number18,October 2012 doi:10.4156/jdcta.vol6.issue18.14

119

A survey of Learning Based Techniques of Phishing Email Filtering Ammar ALmomani,Tat-Chee Wan, Ahmad Manasrah,Altyeb Altaher,Eman Almomani, Karim Al-Saedi,Ahmad ALnajjar and Sureswaran Ramadass

2 .Types of Phishing Attack Phishing is a particular type of spam that employs two techniques, deceptive phishing and malware-based phishing. The first technique is related to social engineering schemes, which depend on forged email claims that originate from a legitimate company or bank. Subsequently, through an embedded link within the email, the phisher attempts to redirect users to fake Web sites. These fake Web sites are designed to obtain financial data from victims fraudulently, including usernames, passwords, credit card numbers, or personal information. The second technique involves technical deception schemes that rely on malicious software programs spread through deceptive emails or by detecting and using security holes in the user’s computer to obtain the victim’s online account information directly. Sometimes, the phisher attempts to misdirect the user to a fake Web site or to a legitimate one monitored by proxies [6]. The current study focuses on deceptive phishing using social engineering schemes. Figure 1 explains the place of phishing email in phishing attack techniques.

phishing

social engineering

technical subterfuge

phishing email embedded  phishing URL website

crimeware onto PCs or  proxies

Figure 1: Types of Phishing Attacks

3. Machine Learning techniques for phishing email filtering. In this paper, machine-learning techniques for phishing email detection and prediction was divided into five groups aptly explained below. Descriptions of the on-hand filtering methods will be given.

3.1 Methods based on Bag-of-Words model This method is a learning-based phishing email filter that cares about the input data as a formless set of words that can be implemented either on a portion or the entire email message[7] Support vector machine (SVM is one of the most commonly used classifier in phishing email detection. In 2006, the SVM classifier was proposed for phishing email filtering (Chandrasekaran, Narayanan et al. 2006). SVM worked based on training email samples and a predefined transformation : Rs → F, which builds a map from features to produce a transformed feature space, storing the email samples of the two classes with a hyperplane in the transformed feature space. The decision rules appear in the following formula. Ӯ

sign

αi xi K ū, ǭ

S

Where K ū, ǭ =Ө ū .Ө ǭ . Is the kernel function and αi ,i = 1..n and S maximize margin space separate between hyperplane[8] Naïve bays classifiers: is a simple probabilistic classifier which works based on Bayes' theorem with powerful “naïve” independence assumptions. In 2006, Naïve Bayes classifier was proposed for phishing email filtering in Microsoft (Ganger 2006). This classifier which was used in text

120

A survey of Learning Based Techniques of Phishing Email Filtering Ammar ALmomani,Tat-Chee Wan, Ahmad Manasrah,Altyeb Altaher,Eman Almomani, Karim Al-Saedi,Ahmad ALnajjar and Sureswaran Ramadass

classification can be a learning-based variant of keyword filtering. To ensure preciseness, all features are statistically independent. Term frequency-inverse document frequency (TF-IDF): it is used for word weights, as feature for the clustering. The document frequency of the word w is implemented by DF(w) which is defined as the number of email messages in the collected data set where the word w appears in the document at least once as shown in the formula [9]. Wxy= TFxy. Log Where wxy is the weight of xth Word in the yth document (email), TFxy is the occurrences number of the xth word (w) in the yth document (email), DFx is the number of email messages in which the ith word(w) occurs, and S, as above, is the total number of messages in the training dataset Boosting: boosting algorithm is combining many hypotheses like “One-level decision trees”. The main idea of this algorithm depends on each phase of the classification process where a fragile (not very accurate) learner is trained, then the output results are used to reweigh the data for the future stages, while the larger weight is assigned to the input samples which are misclassified. This algorithm was implemented in [10]. k-Nearest neighbor(k-NN) is a classifier and was proposed for phishing email filtering by Gansterer [11]. Using this classifier, the decision is made as follows: Based on k nearest training input, samples are chosen using a pre-defined similarity function; after that, the email x is labeled as belonging to the same class as the bulk among this k samples. Some approaches based on the last algorithms related to phishing emails filtering appear below.

3.2 Compared multi Classifiers algorithms These approaches in general depend on comparison between sets of classifiers. Presently, more and more research has used new classifier algorithms like Random Forests (RF). RFs are classifiers that merge several tree predictors, where each tree depends on the values of a random Vector sampled separately can handle large numbers of variables in a data set. Another algorithm like Logistic Regression (LR) is one of the most widely used statistical model in several fields for binary data prediction. It used because of simplicity. Neural Networks (NNet) classifiers which consist of three layers (input layer, hidden layer, and output layer) . The power of NNet comes from the nonlinearity of the hidden neuron layers. In consequence, nonlinearity is good for the network to learn complex mappings. Sigmoid function is the commonly-used function in neural networks [12]. Some of the authors whose approaches are related to this technique appear below[12]. compared six classifiers relating to machine learning technique for phishing prediction, namely, Bayesian Additive Regression Trees (BART), LR [13]SVM, RF, NNet, and Classification and Regression Trees (CART). He used 43 features for training and testing six classifiers in the data set. Most of the classifiers used supervised learning and performed 10-fold-cross-validation. The technique implemented by cross validation is a method to estimate the error rate efficiently and without bias. It measures two outputs of email either phishing =1 or legitimate =0. Dataset contains 2,889 emails, 59.5% accounting for legitimate message. The author suggests that the use of cost-sensitive measures penalizes the false positives nine times more than the false negatives. For this setup the RF classifier produces the best results with an F-measure of 90%. The results indicated that there is no standard classifier for phishing email prediction. For example, if some classifiers have low levels of FP, they will have a high level of FN. LR whose FP 04.89%, but obtained a large number of FN at17.04%.The number of features is so big which means that ‘classification noise’ is affected on the accuracy level. [11], proposed classification based approach to detect phishing email in an email stream. This approach tries to make Comparisons between binary classification approaches (spam vs. not spam) and ternary classification approach (ham, spam, and phishing). Authors used 30 features, 15 features from other researchers and they established various new 15 “online and offline” features for filtering phishing

121

A survey of Learning Based Techniques of Phishing Email Filtering Ammar ALmomani,Tat-Chee Wan, Ahmad Manasrah,Altyeb Altaher,Eman Almomani, Karim Al-Saedi,Ahmad ALnajjar and Sureswaran Ramadass

email. Ranking of features are based on the ratio-gain algorithm. Authors used 4000 emails for each class ”ham, spam, phishing message”, then by training the 1000 email .The result of Comparisons between binary and ternary classification shows that the new features increase the accuracy of classification with ternary classification .the accuracy reached up to 99.7 with a false positive rate of 0.2% by adaptive a support vector machine (SVM). In addition, author increased the costs for misclassified in false positives to five times the costs of other types of misclassifications, author divides the features in three groups as follow . Feature set F1 contains all features proposed by authors of this approach, feature set F2 contains the features F1Without the spamassessian, and feature set f3 proposed only from other researcher. However this technique suffers from higher cost because online features working based on the status of the internet connection, so the extraction of many online features may effect on performance and scalability of the e-mail filtering system, especially in large-scale business email servers, this cause a bottleneck of online features, furthermore this technique needs large mail servers.

3.3 hybrid system These approaches work based on combining different algorithms of classifiers working together to enhance results. Some of the authors‟ approaches appear below. FRALECis a hybrid system proposed by [14]to classify e-mails into two classes, ham email and phishing email. This system consists of three filters: first, Naïve Bayes classifier which classifies the textual content of e-mails to assign them to the economic (legitimate) or non-economic (phishing) categories, second, a rule-based classifier classifies the non-grammatical features of emails into three categories: fraudulent, legitimate, and suspicious, third, emulator-based classifier -filter of virtual accesses, which classifies the responses from URL Web sites referenced by hyper links enclosed in emails. Emulator filter works on two rules: first rule, the email is assigned to the fraud class if the body of an economic email includes forms. The second rule: the email is assigned to the legitimate class category if the body of an economic email does not include forms, links, or images as shown in. The author used 1,038 emails (10 emails as legitimate and 1,028 as phishing emails) . The precision in the best result was 96%. However, from the data set, that the number of legitimate emails that can be seen is not enough to give us clear results. This technique consumes time because it has to go through many layers before it give us the final decision. Multi-tier classifications a method using three classifiers proposed by,[15]. He suggests that his multi-tier classification method of classifying phishing emails has the best arrangement in the classification process. In his method, phishing email features will be extracted and classified in a sequential fashion by an adaptive multi-tier classifier while the outputs will be sent to the decision classifier process where c1, c2, and c3 are classifiers in three tiers. If the message is misclassified by any of the tiers of the classifiers, the third tier will make the final decision on the classification process. The best result came from c1 (SVM), c2 (AdaBoost), and c3 (Naive Bayes) adaptive algorithm. The average accuracy of the three tiers was 97%. However, this technique is characterized by lengthy time consumption and complexity of analysis since this technique requires many stages before arriving at the final decision with a note that 3% of its data set suffers from misclassification. profiling of phishing e-mailis a new method proposed by[10]. The authors concentrated on embedded hyperlink information by extracting structural features and WHOIS information [16]to extract 12 features representing phishing emails. These features are divided into two classes. Extraction features are denoted by a binary value. An email is classified as “1” if it has any of the 12 features, and “0” if none. Classifier algorithm SVM and boosting algorithm were then used to generate multi-label class predictions, scheduled on three datasets formed from embedded hyperlink information in phishing emails. The classes that were generated are hacked site, hosted site, and legitimate site. In addition, authors tested 2,038 real emails with four–fold cross-validation. The accuracy of classification was high. However, this technique depended on the phishing emails with

122

A survey of Learning Based Techniques of Phishing Email Filtering Ammar ALmomani,Tat-Chee Wan, Ahmad Manasrah,Altyeb Altaher,Eman Almomani, Karim Al-Saedi,Ahmad ALnajjar and Sureswaran Ramadass

embedded hyperlinks only, whereas, many of the phishing email attacks are created without hyperlinks, therefore, this technique still manifests weakness in its classification role.

3.4 Classifiers Models-Based Features These approaches work based on building full models that are able to create new features with many adaptive algorithms and classifiers to produce the final results[17]. Some of the approaches appear below. PHONEY: mimicking user response was proposed by [18]as a novel approach. This technique detects phishing email attacks using sham responses which mimic real users, basically, reversing The character of the victim and the enemy. The PHONEY technique is installed between a user‟s MTA and MUA and processes all arriving emails to prevent phishing attacks. The PHONEY analyzes the incoming email contents with embedded links with attached HTML forms. Thereafter, the control is passed to the content scanner which receives the Web page for analysis, then, extracts the data from the Web page. The extracted data are compared against the entries in hashDB including all information like password or username, among others. The hashDB has two fields representing the token names along with their fake values. The PHONEY technique has tested and evaluated 20 different phishing emails over eight months. It was found out that the collected data are so small a size of real data to address a big problem like phishing emails and this technique requires time because of the need to reverse the characters of the victim and the phisher as shown in (Figure 2).

 

 

 

 

 

Figure 2: Block diagram of PHONEY Architecture[18] Training SmartScreen phishing filter by “Microsoft”, uses the feedback data of more than 300,000 hotmail users, [19]. This technique works based on extraction of more than 100,000 email attributes using the learning algorithm based on Bayesian statistics. The expert team of Microsoft adapted the latest spamming and phishing techniques. However, after testing the technique via internet explorer, the best score of recall was 89% without false positives [20]. However, the huge number of features selected in this technique will consume time, memory, and „classification noise‟. PILFERS is a proposed method to detect phishing emails by[21]. This technique works based on 10 different features representing phishing emails. Nine features extracted from the email itself, while the tenth feature represents the age of linked-to-domain names, which can be extracted from a WHOIS query at the time the email is received [16]. The S.A. tool[22], was used to identify if this email has spam features or not. This technique works based on 10-fold cross-validation with random forest and SVM as A classifiers to train and test the dataset. They tested 860 phishing emails and 6,950 legitimate emails. The result of the PILFER with S.A. features was 0.12% false positive rate, and 7.35% false negative rate, respectively, which means that a sizeable number of phishing and ham emails were not well classified. Adaptively trained dynamic markov chains and novel latent class topic model was proposed by[23]. The proposed model is able to study the statistical filtering of phishing emails and train the characteristic features of emails based on classifiers, then identify new phishing emails with different contents. The author proposed new features trained by machine-learning techniques using chains algorithm with novel latent class-topic models . First, extract a total of 27 basic features; these

123

A survey of Learning Based Techniques of Phishing Email Filtering Ammar ALmomani,Tat-Chee Wan, Ahmad Manasrah,Altyeb Altaher,Eman Almomani, Karim Al-Saedi,Ahmad ALnajjar and Sureswaran Ramadass

features include four of structural features (such as the total number of body parts), eight of link features (such as the total number of links), four of element features (such as HTML or JavaScript code), two of spam filter features (such as Spamassassin Boolean results: spam or ham), nine of word list features (such as nine-word stems: account, update, confirm, verify, secure, notify, log, click, inconvenient). Second, features extracted by dynamic Markov chain depended on the likelihood of capturing a message belonging to a specific class. Third, 50 latent topic model features (clusters of words appear together in emails)[24]. Applying this algorithm on many of the phishing email features and by studying the contents of the original emails, this model was able to reduce the memory consumed compared with other research. This system reduced the number of false positive up to 0.03%, false negative to 1.89%, and accuracy at 99.77%. Furthermore, this model is developed in a real-life environment at a commercial ISP [25]. However, this technique selects a big number of features, about 81, and it has many algorithms for classification which means it is time-consuming and in the analytical process, a large number of dynamic Markov chain suffers from high memory requirement. Robust classif er model is proposed by[26]. This model can detect phishing emails by hybrid features with selected features based on the information gain algorithms. The author used seven features selected after ranking many features by the information gain algorithm. These features represent the more powerful features in his study which include links, total number of invisible links, nonmatching URLs, forms, scripts, and body, blacklist words. The author used email presentation based on converting all features to numerical values on a different range like the “number of invisible links” may be under five, while body blacklist words should contain hundreds of words. The values of features are normalized before the classification process is limited to the range [0, 1]. To detect phishing emails, the author proposed five stages: first , feature generator including the seven features mentioned above, second: machine learning method selection by an adaptive five machine learning algorithm, three: information gain is created by induction, four: feature evaluation selects a smaller vector space of feature. Finally, refined feature matrix to optimize feature sets. The accuracy of the short feature data result was the best using decision tree algorithm for small vector space using four of the seven features which attained an accuracy of 99.8%. However, the author depends on a few numbers of features only, which means several phishing attacks used other features. The data set used by the author related with live emails received by WestPac only; this is not used by other authors and not benchmarked. There is no t fixed aspect of this technique because the ranking features change depending on the email streaming ,other researcher used hybrid features [27].

3.5 Clustering of phishing email Clustering is the process of defining data grouped together according to similarity. It is usually an unsupervised machine learning algorithms. This group of machine-learning techniques depends on filtering phishing emails based on clustering emails via online or offline mode. One of the most used clustering techniques is k-means clustering. K-means clustering is an offline and unsupervised algorithm which begins by determining cluster k as the assumed center of this cluster. Any random email object or features vector can be selected as initial center then undertake the three steps provided below: Determine the centre coordinate; Determine the distance of each email object (vector) to the centre group of email objects based on a minimum distance[28]. Determine the centre coordinate 1- Determine the distance of each email object (vector) to the centre 2- Group of email object based on minimum distance [28] Some of the approaches related with this technique appear below. cluster phishing emails automatically proposed by [29]based on orthographic features that include HTML features, size of document, text content, and other elements. This is achieved by eliminating redundant features at the same time. The system collects the possible features based on examination with adaptive k-means clustering algorithm to produce the objective function values (the

124

A survey of Learning Based Techniques of Phishing Email Filtering Ammar ALmomani,Tat-Chee Wan, Ahmad Manasrah,Altyeb Altaher,Eman Almomani, Karim Al-Saedi,Ahmad ALnajjar and Sureswaran Ramadass

summation of distance between all features vector and cluster center) over a range of acceptance values across many feature subsets. The best cluster detected by the final value which determines the distribution of the objective function. The author tested 2,048 emails from an Australian bank over five months, without any knowledge about the emails. However, this technique suffers from general emails that are unknown. More discussions are explored on the relationship between cluster value and objective function values. However, this is not enough most especially that k-means algorithm works via offline technique only. Consensus Clustering proposed by [9]depends on a combined method of unsupervised clustering algorithms and supervised classification algorithms. First, use independent clustering to randomize data, second, build one Consensus Clustering by combining the independent clusters, third, train the data by Consensus Clustering, and finally, use the supervised classification algorithms to classify all data in clustering. This system adapted many algorithms including Term Frequency Inverse Document Frequency (TF-IDF) word weights, Global k-Means algorithm (GKM)[30], Multiple Start k-Means algorithm (MSKM)[31], Nearest Neighbor clustering (NN), and then discovered an optimal consensus among versatile clustering by Consensus Aggregation algorithm again to build a full model called Consensus Multiple Start Nearest Neighbor clustering algorithm (CMSNN). This technique increased the speed of classification with better accuracy than kmeans algorithm. However, because of the large number of algorithms used, the system’s performance is affected and the level of accuracy is still less than 94%. Anomaly detection identifies to detect and filter phishing emails was proposed by [32]. This technique works in real life environment using Stochastic Learning-Based Weak Estimators (SLWE)[33], and Maximum Likelihood Estimator (MLE) filters. The author used anomaly detection identifier to complement the normal behavior of a system. His model classifies emails to a normal or abnormal document derived from the characteristics of data learnt. This model classifies emails based on the text classification (TC). Every email should be assigned to either category Ci, whereas legitimate emails are considered as normal observations, while phishing emails work as a complement, thus, they are classified under anomaly. In this manner, the two classes can be distinguished: ham and phishing emails. Information Gain algorithm selected 200 features from 1,000 features. This technique needs training to build rules while the estimate is updated at each timed input. The author used two dataset tests: 1,200 ham emails with 600 phishing messages from the user’s inbox. The result was 97% for true positive, and 98% for false positive based on two filter algorithms (SLWE, MLE) . However, this technique suffers from a huge number of features which affect system performance. Moreover, this technique needs unlimited training that can consume huge storage.

3.6 Evolving Connectionist System (ECOS) for phishing email detection Evolving Connectionist System (ECOS) is a connectionist architecture that simplifies the evolution processes using knowledge discovery. It can be a neural network or a set of networks that run continuously and change their structure and functionality through a continuous relationship with the environment and with other systems. This system, like traditional expert systems, works with unfixed number of rules used to develop the artificial intelligence (AI)[34]. It is flexible with respect to the dynamic rule, works on either online or offline mode, and interacts dynamically with the changing environment. Such a system can solve the complexity and changeability of many real world problems. It grows throughout the process and adopts many techniques. ECOS has the following characteristics. They evolve in an open space, not fixed dimensions, This system is able to work online, incremental, fast learning - possibly through data propagation in memory when data are passed, Works on lifelong learning mode, ECOS learn as an individual system, as well as part of an evolutionary population of such systems, ECOS has evolving structures using constructive learning, ECOS learns locally via local partitioning from the problem space, which allows fast adaptation and process tracing over time, They simplify different kinds of knowledge demonstration and extraction,

125

A survey of Learning Based Techniques of Phishing Email Filtering Ammar ALmomani,Tat-Chee Wan, Ahmad Manasrah,Altyeb Altaher,Eman Almomani, Karim Al-Saedi,Ahmad ALnajjar and Sureswaran Ramadass

generally–memory-based, statistical, and symbolic knowledge[35, 36].an approache related to this method appear below. Phishing Evolving Clustering Method (PECM) is proposed by [37, 38]propose a novel concept that adapts the Evolving Clustering Method for Classification (ECMC). PECM functions are based on the level of similarity between two groups of features of phishing e-mails. PECM model proved highly effective in terms of classifying e-mails into phishing e-mails or ham e-mails in online mode, speed and use of a one-pass algorithm. PECM also proved its capability to classify e-mail by decreasing the level of false positive and false negative rates while increasing the level of accuracy to 99.7%. The model was built to work in online mode and can learn continuously without consuming too much memory because it works on a one-pass algorithm. Therefore, data are accessed one time from the memory and then the rule is created depending on the evolution of the profile if the characteristics of the phishing e-mail have been changed. While this technique need continuous feeding.

4. Comparison results of Machine learning for phishing Email filtering In table 5; you will find a comparative study on the advantages and disadvantages of each classification group. Table3. Meta analyses of content-based filtering approaches for phishing Email detection based on machine learning techniques Table 5.Comparison the advantages and disadvantages between machine learning techniques for phishing emails detection and prediction. No

1

Techniques Used Methods based on Bag-of-Words model

2 Compared multi Classifiers algorithms 3 hybrid system

4 Classifiers Model-Based Features

5

Advantages

-Time consuming - huge number of features -consuming memory

- High level of accuracy - create new type of features like Markov features

-huge number of features -many algorithm for classification which mean time consuming -higher cost -need large mail server and high memory requirement -Less accuracy because it depend on unsupervised learning , need feed continuously Need feed continuously

-Fast in classification process Clustering of Phishing Email

6

Evolving Connectionist System (ECOS) for phishing email detection

Disadvantages

-Build secure connection between user’s mail transfer Agent (MTA) and mail user agent (MUA) -Provide clear idea about the effective level of each classifier on phishing email detection -High level of accuracy by take the advantages of many classifiers

fast ,less consuming memory, high accuracy, Evolving with time, online working

Non standard classifier -Time consuming because this technique has many layers to make the final result

5. Summaries In the current study, the problem of phishing email attacks, especially on electronic commerce security, which has economical and ethical implications can be discussed. The most accepted and well-developed approach to fight phishing emails is via learning-based filtering approaches. The current approaches include many filters based on various classification techniques used in

126

A survey of Learning Based Techniques of Phishing Email Filtering Ammar ALmomani,Tat-Chee Wan, Ahmad Manasrah,Altyeb Altaher,Eman Almomani, Karim Al-Saedi,Ahmad ALnajjar and Sureswaran Ramadass

different parts of email messages. The Naïve Bayes classifier and support vector machine approaches, occupy a special place because of their high accuracy and simplicity to help produce many practical solutions. However, the big number of classifiers causes the need for systematic evaluation and comparative studies. Additionally, only a few authors discuss and test the limitations of the approaches presented. More and more existing techniques for filtering phishing emails have limitations ,technology techniques still have many limitation on accuracy or performance because they are “time consuming”, costly, and the huge number of rules created from learning techniques have increased, many algorithms have been adapted, but still there is no standard technique that is able to stop phishing attacks in general, or phishing emails as a special case. Moreover, most works were done on offline mode which requires data collection, data analysis, and a profile creation phase to be completed first. Offline approach is static in nature, which means that if changes happen on phishing email features, all the phases need to be repeated in order to adapt with the changes. Thus, there is still need new approaches and new technology techniques that are able to solve all limitations related with phishing email detection and prediction.

ACKNOWLEDGMENTS This research is supported by National Advanced IPv6 Centre of Excellence (NAV6) , Universiti Sains Malaysia (USM).

References [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12]

Anti-Phishing Working Group, "Phishing Activities Trends REPORT," ed, apwg, http://www.antiphishing.org/reports/apwg_report_Q1_2011.pdf , 2011 Jianyi Zhang, Shoushan Luo, Zhe Gong,Xi Ouyang,Chaohua Wu,Yang Xin , "Protection Against Phishing Attacks: A Survey," IJACT: International Journal of Advancements in Computing Technology, vol. 3,no.9, pp. 155-164, 2011. Z. G. ZHANG Weifeng, Tian Xiantao, "Detecting Phishing Web Pages Based on Image Perceptual Hashing Technology," IJACT: International Journal of Advancements in Computing Technology, vol. 4, p. 139 ~ 145, 2012. GARTNER. (2007, December 17). Gartner Survey Shows Phishing Attacks Escalated in 2007; More than $3 Billion Lost to These Attacks. Available: http://www.gartner.com/it/page.jsp?id=565125 IID, "eCrime Trends Report," 2011 APWG. (2010). Phishing Activity Trends Report Available: http://www.antiphishing.org/reports/apwg_report_Q1_2010.pdf, 2010 Blanzieri, EnricoBryl, Anton, "A survey of learning-based techniques of email spam filtering," Artificial Intelligence Review, Springer Netherlands, vol. 29,no.1, pp. 63-92, 2008. Chapelle, O., "Training a support vector machine in the primal," Neural Computation, vol. 19, no.5,pp. 1155-1178, 2007. R. Dazeley, et al., "Consensus Clustering and Supervised Classi cation for Pro ling Phishing Emails in Internet Commerce Security," in Knowledge Management and Acquisition for Smart Systems and Services, Berlin Heidelberg, pp. 235-246, 2010. Yearwood, JMammadov, MBanerjee, A, "Profiling Phishing Emails Based on Hyperlink Information," in 2010 International Conference on Advances in Social Networks Analysis and Mining, Odense,,IEEE, Denmark 2010, pp. 120-127. Wilfried N. Gansterer David P\, et al., "E-Mail Classification for Phishing Defense,", Springer-Verlag, presented at the Proceedings of the 31th European Conference on IR Research on Advances in Information Retrieval, Toulouse, France,PP. 449-460, 2009. Abu-Nimeh, SNappa, D Wang, X Nair, S, "A comparison of machine learning techniques for phishing detection, ACM," in Proceedings of the eCrime Researchers Summit,, Pittsburgh, PA,, pp. 60-69. 2007.

127

A survey of Learning Based Techniques of Phishing Email Filtering Ammar ALmomani,Tat-Chee Wan, Ahmad Manasrah,Altyeb Altaher,Eman Almomani, Karim Al-Saedi,Ahmad ALnajjar and Sureswaran Ramadass

[13] [14] [15] [16] [17] [18] [19] [20] [21] [22] [23] [24] [25] [26] [27] [28] [29] [30] [31] [32] [33] [34]

Shih, K.H.Cheng, C.C.Wang, Y.H, "Financial Information Fraud Risk Warning for Manufacturing Industry-Using Logistic Regression and Neural Network," Romanian Journal of Economic Forecasting, pp. 54-71, 2011. del Castillo, M.Iglesias, Ángel Serrano, J., "An Integrated Approach to Filtering Phishing Emails Computer Aided Systems Theory – EUROCAST 2007." vol. 4739, R. Moreno Díaz, et al., Eds., ed: Springer Berlin / Heidelberg, pp. 321-328, 2007. Islam, M.R.Abawajy, J.Warren, M., "Multi-tier phishing email classification with an impact of classifier rescheduling,IEEE," in the International Symposium on Pervasive Systems, Algorithms, and Networks, Kaohsiung, Taiwan pp. 789-793, 2009. InterNIC. (2010). Whois search,” InterNIC Public information Regarding Internel Domain Name Registration Services. Available: http://www.internic.net/whois.html.,2012 Olivo, C.K.Santin, A.O.Oliveira, L.S., "Obtaining the Threat Model for E-mail Phishing1," Applied Soft Computing, ScienceDirect ,PP. 1568-4946, 2011. Chandrasekaran, M.Chinchani, R.Upadhyaya, S., "Phoney: Mimicking user response to detect phishing attacks," in In: Symposium on World of Wireless, Mobile and Multimedia Networks, IEEE Computer Society,WoWMoM pp. 668-672, 2006. Markus Jakobsson, Steven Myers, " Phishing and Countermeasures, Microsoft’s antiphishing technologies and tactics, Wiley," pp. 551–562, 18 MAY 2007. P. Robichaux and D. L. Ganger, "Gone phishing: Evaluating anti-phishing tools for windows. Technical report, 3Sharp LLC," September 2006. Fette, I.Sadeh, N.Tomasic, A.., "Learning to detect phishing emails," in Proceedings of the 16th International World Wide Web Conference (WWW 2007), ACM Press, Banff, Alberta, Canada, pp. 649-656. 2007. pop3proxy. (2010). SpamAssassin for Win32. Available: http://sourceforge.net/projects/sawin32/,2010 Bergholz, A.Chang, J.H.Paaß, G.Reichartz, F.Strobel, S.., "Improved phishing detection using model-based features," in Proceedings of the Conference on Email and Anti-Spam (CEAS), Mountain View,CA., 2008. Khonji, M.Iraqi, Y.Jones, A., "Enhancing Phishing E-Mail Classifiers: A Lexical URL Analysis Approach," International Journal for Information Security Research (IJISR), vol. 2,no.1/2, 2012. Bergholz, A.Paaß, G.D’Addona, L.Dato, D., "A Real-Life Study in Phishing Detection," in Seventh annual Collaboration, Electronic messaging, Anti-Abuse and Spam Conference Redmond, Washington,, CEAS, US, 2010. Ma, L.Ofoghi, B.Watters, P.Brown, S., "Detecting phishing emails using hybrid features," ,IEEE,pp. 493-497, 2009, A. Hamid, I.Abawajy, J., "Hybrid Feature Selection for Phishing Email Detection," Algorithms and Architectures for Parallel Processing, pp. 266-275, 2011. Ram Basnet, Srinivas Mukkamala, and Andrew H. Sung. Sung "Detection of Phishing Attacks: A Machine Learning Approach " Studies in Fuzziness and Soft Computing, springer, vol. 226/2008, pp. 373-383, 2008. Ma, L.Yearwood, J.Watters, P., "Establishing phishing provenance using orthographic features,IEEE," pp. 1-10, 2009. Bagirov, A.M., "Modified global k-means algorithm for minimum sum-of-squares clustering problems," Pattern Recognition, vol. 41,NO.10, pp. 3192-3199, 2008. Elavarasi, S.A.Akilandeswari, J.Sathiyabhama, B, "A SURVEY ON PARTITION CLUSTERING ALGORITHMS," learning, vol. 1,NO.1, 2011. Zhan, J.Thomas, L., "Phishing detection using stochastic learning-based weak estimators," in Computational Intelligence in Cyber Security (CICS), 2011 IEEE Symposium on, pp. 55-59, 2011. Oommen, B.J.Rueda, L., "Stochastic learning-based weak estimation of multinomial random variables and its applications to non-stationary environments", Pattern Recogn,VPL.39,NO.1, 2006. Benuskova, LubicaKasabov, Nikola, "Evolving Connectionist Systems (ECOS)," in Computational Neurogenetic Modeling, ed: Springer US, pp. 107-126. 2007.

128

A survey of Learning Based Techniques of Phishing Email Filtering Ammar ALmomani,Tat-Chee Wan, Ahmad Manasrah,Altyeb Altaher,Eman Almomani, Karim Al-Saedi,Ahmad ALnajjar and Sureswaran Ramadass

[35] [36] [37] [38]

N. Kasabov, Z. S. H. Chan, Q. Song and D. Greer. Evolving Connectionist Systems with Evolutionary Self-Optimisation,VOL.173,ISSU.1,2005. Angelov, P.P.Angelov, P.Filev, D.P.Kasabov, N., Evolving intelligent systems: methodology and applications vol. 12: Wiley-IEEE Press, 2010. Ammar Al-Momani,Wan T.C, K Al-Saedi, "An Online Model on Evolving Phishing E-mail Detection and Classification Method," journal of applied science, vol. 11, pp. 3301-3307, 2011. Ammar Almomani, Tat-Chee Wan, Altyeb Altaher, "Evolving Fuzzy Neural Network for Phishing Emails Detection," Journal of Computer Science, vol. 8,no.7, pp. 1099-1107, 2012.

129