New Filtering Approaches for Phishing Email

André Bergholz¹, Jan De Beer², Sebastian Glahn¹, Marie-Francine Moens², Gerhard Paaß¹, Siehyun Strobel¹

¹ Fraunhofer IAIS, St. Augustin, Germany
² K.U. Leuven, Heverlee, Belgium

Abstract

Phishing emails usually contain a message from a credible-looking source requesting that the user click a link to a website where she or he is asked to enter a password or other confidential information. Most phishing emails aim at withdrawing money from financial institutions or gaining access to private information. Phishing has increased enormously in recent years and is a serious threat to global security and the economy. There are a number of possible countermeasures to phishing. These range from communication-oriented approaches such as authentication protocols, through blacklisting, to content-based filtering approaches. We argue that the first two approaches are currently not broadly implemented or exhibit deficits. Content-based phishing filters are therefore necessary and widely used to increase communication security. A number of features are extracted that capture the content and structural properties of the email. Subsequently a statistical classifier is trained using these features on a training set of emails labeled as ham (legitimate), spam, or phishing. This classifier may then be applied to an email stream to estimate the classes of new incoming emails. In this paper we describe a number of novel features that are particularly well suited to identifying phishing emails. These include statistical models for low-dimensional descriptions of email topics, sequential analysis of email text and external links, the detection of embedded logos, and indicators for hidden salting. Hidden salting is the intentional addition or distortion of content not perceivable by the reader. For empirical evaluation we have obtained a large realistic corpus of emails prelabeled as spam, phishing, and ham (legitimate). In experiments our methods outperform other published approaches for classifying phishing emails. We discuss the implications of these results for the practical application of this approach in the workflow of an email provider. Finally, we describe a strategy for updating the filters and adapting them to new types of phishing.

1 Introduction

Phishing has increased enormously in recent years and is a serious threat to global security and the economy. Criminals try to convince unsuspecting online users to reveal sensitive information, e.g., passwords, account numbers, social security numbers, or other personal information. Unsolicited mail disguised as coming from reputable online businesses such as financial institutions is often the vehicle for luring individuals to these fake, usually temporary sites. According to the Anti-Phishing Working Group (APWG) [2], an average of 996 unique phishing websites were detected by APWG per day in 2007, and on average 141 unique brands were hijacked per month. While most phishing campaigns target financial institutions like banks, APWG reports an increasing number of attacks against government agencies like the US Internal Revenue Service or tax authorities in the UK and Australia. In addition there are phishing attacks against non-traditional sites, such as automotive associations. Highly targeted attacks on the employees or members of a certain company, government agency, or organization are called “spear phishing”. Here the phisher wants to gain access to a company’s computer system.

There are a large number of possible countermeasures to phishing scams. As technical approaches like secure email authentication are not yet commonly used and require a high administrative overhead, we concentrate on countermeasures based on the contents of phishing emails. In this paper we describe a number of features that are highly informative about phishing attempts. Some of these features, e.g., a low-dimensional model for the description of email topics, an identification procedure for embedded logos, and a hidden salting detection approach, have been developed especially for this task within our project. We use machine learning techniques to determine the relative importance of the features from labeled training data. This yields a classifier integrating the features in such a way that new emails can be classified correctly with a high degree of confidence. We tested the approaches described in this paper on a large real-life corpus of emails prelabeled as spam, phishing, and ham (legitimate). For both the classification of phishing versus ham and the classification of phishing versus ham and spam we achieve very low error rates and outperform other published approaches for classifying phishing emails.

The filters described in this paper can be installed on the server side, e.g., by large email providers. Keeping the filters up to date is extremely important as many new phishing scams are detected each day. We describe an active learning approach that can help email providers update the filters to cover new phishing scams. We discuss the implications of these results for the practical application of this approach in the workflow of an email provider. It would also be possible to use our filters on the client side, which would make processing of encrypted emails possible.

We have previously published first results of our work [4]. In this paper we extend the work in several major ways: First, we incorporate a large number of new features, in particular graphical features, such as the hidden salting detection, the image distortion, and the logo detection. Second, we report on many more experiments, in particular on datasets captured from real life. Third, we also consider the question of spam, as in real life spam emails cannot be completely eliminated from an email stream before a dedicated phishing filter is applied. Fourth, we outline an active learning deployment strategy.

Section 2 distinguishes the types of phishing attacks, and Section 3 describes the current approaches to phishing filtering, ranging from communication-based filters to content-based filters. For content-based filtering the computation of informative email features is most important. These are described in the subsequent sections, ranging from simple structural features to a number of statistical models capturing special aspects of phishing. Section 5 is devoted to hidden salting detection, which aims at identifying the intentional addition or distortion of content not perceivable by the user. Our system extracts invisible content from an email and predicts its perception by the user using a cognitive model. Section 6 describes the experiment data, the setup, and the evaluation metrics. Section 7 presents the results of the experiments. The last section contains a summary including an assessment of the different phishing countermeasures.

2 Types of Phishing Attacks

Two different types of phishing attacks may be distinguished: malware-based phishing and deceptive phishing [15]. In malware-based phishing, malicious software is spread by deceptive emails or by exploiting security holes in the computer software, and is installed on the user’s machine. The malware may then capture user input, and confidential information may be sent to the phisher.

The focus of this paper is deceptive phishing, in which a phisher sends out deceptive emails pretending to come from a reputable institution, e.g., a bank. In general, the phisher urges the user to click a link to a fraudulent site where the user is asked to reveal private information, e.g., passwords. This information is exploited by the phisher, e.g., by withdrawing money from the user’s bank account. A number of tricks are common in deceptive phishing:

• Social engineering: The invention of plausible stories, scenarios, and methodologies to produce a convincing context, and in addition the use of personalized information.
• Mimicry: The email and the linked website closely resemble official emails and the official websites of the target. This includes the use of genuine design elements, trademarks, logos, and images.
• Email spoofing: Phishers hide the actual sender’s identity and show a faked sender address to the user.
• URL hiding: Phishers attempt to make the URLs in the email and the linked website appear official and legitimate, and hide the actual link addresses.
• Invisible content: Phishers insert information into the phishing mail or website that is invisible to the user and aimed at fooling automatic filtering approaches.
• Image content: Phishers create images that contain the text of the message only in graphical form.

Phishing causes large losses to the economy. As the targeted organizations want to avoid bad press, they are reluctant to provide precise information on losses. In 2007 Gartner performed a survey of 4500 U.S. adults using the internet [19]. A fraction of 3.3% reported having lost money in phishing scams in 2007. This corresponds to 3.6 million people in the US. The average loss was $886, yielding a total damage of $3.2 billion. These numbers do not account for the loss of reputation and reduced customer trust. It is alarming that 11% of the participants say they use no security software at all, such as anti-virus or anti-spyware products.

A study by Dhamija et al. [14] confirms that many internet users have problems detecting phishing attacks. Even if users are given the explicit task of identifying phishing scams, many of them cannot distinguish a legitimate website from a spoofed website. In the study, the best phishing site was able to deceive more than 90% of the participants. In addition, users often do not know which security signs indicate the trustworthiness of a website, e.g., the padlock symbolizing secure transmission. A quarter of the participants used only the content of the website to evaluate its authenticity, without looking at any other portions of the browser. The user study of Wu et al. [42] comes to the same conclusions.

3 Countermeasures Against Phishing

A variety of countermeasures against phishing have been proposed. In this section we give an overview of these measures. We distinguish network- and encryption-based measures, black- and whitelisting, and content-based measures.

3.1 Network- and Encryption-Based Countermeasures

Communication-oriented measures aim at establishing secure communication [15]. A first protection against malware-based phishing attacks is the installation of virus scanners and regular software updates. An additional measure is email authentication. There are several technical proposals, but currently the vast majority of email messages do not use any. Other approaches are password hashing (transmitting a hash of the password together with the domain name) and two-factor authentication, where two independent credentials are used for authentication, e.g., smartcards and passwords. Recently some banks (e.g., Postbank in the Netherlands, Bank Austria) have started using transaction authentication numbers (TANs) sent to the user on demand by Short Message Service (SMS) over the mobile phone network [3]. As the TAN encodes the amount to be transferred and the receiving account number, it cannot be used for transfers to other accounts and is not affected by man-in-the-middle attacks. This reduces the phishing risk to a large extent.

Note that authentication schemes like the mobile TAN require a considerable infrastructure, are time-consuming, and incur costs. Currently this cannot be done for many communication scenarios that are susceptible to phishing. Examples are communication in social networks as well as product search engines. A phisher can submit an offer to a shopping search engine like Froogle at a low price [33] and thus direct a stream of visitors to his site. The search engine usually does not control the payment transactions and will not insist on a complicated authorization protocol. Hence, additional phishing countermeasures are necessary. They concentrate on filtering the contents of emails and websites.
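The password hashing idea mentioned above, in which a hash of the password together with the domain name is transmitted instead of the password itself, can be illustrated with a minimal sketch. The function name `site_password`, the choice of HMAC-SHA256, and the example domains are assumptions made for illustration only; the text does not specify a concrete scheme.

```python
import hashlib
import hmac


def site_password(master_password: str, domain: str) -> str:
    """Derive a per-site credential from the password and the domain name.

    Assumption: HMAC-SHA256 keyed with the password. A production scheme
    would additionally need salting and key stretching.
    """
    digest = hmac.new(master_password.encode("utf-8"),
                      domain.lower().encode("utf-8"),
                      hashlib.sha256)
    return digest.hexdigest()


# Hypothetical domains: the genuine site and a look-alike used by a phisher.
genuine = site_password("correct horse", "bank.example")
spoofed = site_password("correct horse", "bank-login.example")

# The value sent to the look-alike domain differs from the genuine
# credential, so capturing it does not reveal what the real site expects.
print(genuine != spoofed)
```

Because the domain enters the hash, a credential captured on a spoofed domain cannot simply be replayed at the legitimate one.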

3.2 Blacklisting and Whitelisting

One approach to phishing prevention concentrates on checking web addresses when a page is rendered in a web browser. In the Mozilla Firefox browser, for instance, each web page requested by a user is checked against a blacklist of known phishing sites [32]. This list is automatically downloaded to the local machine and updated at regular intervals. It is well known, however, that new phishing sites appear frequently; e.g., in December 2007 about 35 new phishing sites were detected per hour [2]. The average time for phishing sites to be online is only 3 days; many sites disappear within hours. It takes some time until a new phishing site is reported and added to the blacklist. Therefore the effectiveness of blacklisting is limited. An additional problem is caused by distributed phishing attacks [25], where the links in a phishing email point to a large number of different servers hosted on a botnet. Gupta concludes that well-crafted spam and phishing messages regularly manage to pass the blacklist filters [21]. As phishing attacks pertain to a finite number of target institutions, whitelist approaches have been proposed. Here a whitelist of “good” URLs is compared to external links in incoming emails. This approach seems more promising, but maintaining a list of trustworthy sources can be a time-consuming and labor-intensive task. Another drawback of whitelists is that they can produce false positives, i.e., filtered-out ham emails, whereas blacklists can only produce false negatives, i.e., missed spam emails, a less severe type of error.
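The asymmetry between the two list types can be made concrete with a small sketch. The host names and the two sets below are hypothetical; real lists are downloaded and refreshed regularly, and real matching is more elaborate than an exact host comparison.

```python
from urllib.parse import urlsplit

# Hypothetical lists for illustration only.
BLACKLIST = {"phish.example.net"}
WHITELIST = {"www.mybank.example"}


def host_of(url: str) -> str:
    """Return the lower-cased host name of a URL ('' if there is none)."""
    return (urlsplit(url).hostname or "").lower()


def blacklist_verdict(url: str) -> str:
    # Unknown hosts pass, so a newly created phishing site is missed
    # (false negative) until it has been reported and listed.
    return "block" if host_of(url) in BLACKLIST else "pass"


def whitelist_verdict(url: str) -> str:
    # Unknown hosts are blocked, so a legitimate but unlisted site is
    # rejected (false positive).
    return "pass" if host_of(url) in WHITELIST else "block"


print(blacklist_verdict("http://new-phish.example.org/"))
print(whitelist_verdict("https://shop.example.com/"))
```

The first call lets an unlisted phishing host through, while the second rejects an unlisted legitimate host, which mirrors the error types described above.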

3.3 Content-Based Filtering

In this paper we concentrate on content-based filtering approaches to detect phishing attacks. These evaluate the content of an email or the associated website and try to identify different tricks for producing a plausible phishing attack, e.g.,

• The detection of typical formulations urging the user to enter confidential information.
• The detection of design elements, trademarks, and logos for known brands. Note that only relatively few brands are attacked, e.g., 144 different brands in December 2007 [2].
• The identification of spoofed sender addresses and URLs.
• The detection of invisible content inserted to fool automatic filtering approaches.
• The analysis of images which may contain the message text.

The filters statistically combine the evidence from many features and classify a communication as phishing or non-phishing. As discussed later, many different features may be an indicator of phishing. Early filtering approaches manually assign a “weight” to each feature and classify an email or website as phishing if the score exceeds a threshold (cf. [44]).

3.3.1 The Logic of Statistical Phishing Filters

Statistical phishing filters are a way to assess the relative importance of features. The feature weights are modified in such a way that, on a set of emails labeled as ham, spam, or phishing (the training set), the filter performance becomes optimal with respect to some quality criterion. However, the weighted addition of features is not the only way to combine features. If features interact, so that a phishing mail is indicated only when a specific combination of features occurs, then more complex non-linear feature combinations have to be utilized. In statistics and data mining a large number of classifier models have been developed which are adequate in these situations [36]. Statistical classifiers automatically assess the relevance of the input features $x = (x_1, \ldots, x_m)$ of an email and establish a function to determine the desired classification $y$ of the email (e.g., phishing or non-phishing):

$$y = f(x, \gamma) \qquad (1)$$
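As a minimal illustration of equation (1), $f$ can be taken to be a thresholded weighted sum of the features, which is the simplest of the feature combinations discussed above. The three features, the weights in `gamma`, and the threshold are hypothetical values chosen for the example; they are not taken from the experiments in this paper.

```python
from typing import Sequence


def f(x: Sequence[float], gamma: Sequence[float], threshold: float = 0.0) -> str:
    """Classify feature vector x using the parameter vector gamma (weights)."""
    score = sum(w * v for w, v in zip(gamma, x))
    return "phishing" if score > threshold else "non-phishing"


# Hypothetical features:
# [has a deceptive link, contains a form, number of external links]
gamma = [2.0, 1.0, 0.1]

print(f([1, 1, 3], gamma, threshold=2.5))  # score 2.0 + 1.0 + 0.3 = 3.3
print(f([0, 0, 4], gamma, threshold=2.5))  # score 0.4
```

In a statistical filter the entries of `gamma` would be fitted to labeled training data rather than set by hand.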

The vector of unknown parameter values $\gamma = \hat{\gamma}$ is determined in a training phase in such a way that the relation between $x$ and $y$ in the training set $(x_1, y_1), \ldots, (x_D, y_D)$ is reproduced according to some optimization criterion. In the application phase the same features are extracted from a new incoming email $x_{new}$. Based on these features and the model, the classifier produces a classification $\hat{y}_{new} = f(x_{new}, \hat{\gamma})$ of the email. To assess the classifier performance, different quality measures can be computed on an independent test set that was not used for training (cf. Section 6).

If the number of free parameters in $\gamma$ is large, the classifier tends to reproduce random effects in the training data. To avoid this “overfitting” and to concentrate on systematic relations between $x$ and $y$, modern classifiers use regularization techniques, e.g., by penalizing large parameter values. In our experiments we have used the Support Vector Machine (SVM) classifier proposed by Vapnik [39]. An SVM computes the boundary between classes in such a way that the distance to the closest instances of the classes, represented as real-valued vectors, is maximized. Because all computations are based on inner products, arbitrary kernels can be used for SVMs, which enables processing of a large number of features.

As classifiers are inductively trained from examples, the classification is not correct with certainty but only with some probability. The average performance on an independent test set indicates how often a classifier is correct on average. It is possible, however, to estimate for each individually classified email the probability that the proposed classification is correct. This is done by computing a probability of correct classification directly or by comparing the predictions of different classification models, which form an “ensemble of classifiers”. Especially for emails which are dissimilar to all emails of the training set, the uncertainty of classification is usually high. As the class probabilities are different from 0 and 1, these emails may be considered “borderline cases”. On the one hand, these borderline cases may be specially marked in the inbox and the user may be asked to check them individually. On the other hand, borderline cases are often emails with new characteristics, e.g., new types of phishing tricks. Therefore they could be labeled as ham, spam, or phishing by a human annotator and added to the training set. Subsequently the classifiers may be re-trained on this enhanced training set. It can be shown that with such “active learning” approaches [22] the classifiers can be adapted to emails with new characteristics with a minimum number of additional annotations.

3.3.2 Content-Based Filtering for Websites

One type of statistical phishing filter evaluates the content of a website to identify phishing scams. The Internet Explorer 7 web browser, for instance, offers a built-in classifier that filters web pages based on their characteristics [35]. It includes blacklisting and computes a number of features whose weighted sum can be interpreted as a probability score for phishing. The weights are frequently updated from central servers that determine them by machine learning algorithms. No performance figures are given for this approach. The accuracy of detecting phishing sites by published statistical methods leaves room for improvement. Zhang et al. [44] report that they are able to catch about 90% of phishing sites with 1% false positives. Miyamoto et al. [31] obtain a true positive rate of 94% with a false negative rate of 0% based on a limited sample of 100 websites. Stepp and Collberg [37] discuss some browser-based anti-phishing tools that try to detect website spoofing and acknowledge their usefulness, as they automate processes the user should perform anyway.

3.3.3 Content-Based Filtering for Emails

An alternative way to identify phishing scams is the filtering of the email content. In recent years, spam filters based on email content have been widely discussed [20]. However, we suspect that the identification of phishing emails is quite different from spam classification. A spammer simply wants to contact a user and inform him about some product, while the phisher has to deliver a message that has an unsuspicious look and pretends to come from some reputable institution. Therefore, many techniques used in spamming, such as deliberate typos to defeat spam filters, usually do not appear in phishing emails, and different features for filtering may be adequate. In the following we describe current approaches to phishing detection based on email filtering.

Chandrasekaran et al. start with the following features for phishing email classification: (1) a number of style marker features for emails (e.g., the total number of words divided by the number of characters), (2) structural attributes (e.g., the structure of the greeting provided in the email body), and (3) the frequency distribution of selected function words (e.g., “click”) [11]. They evaluate these features on a small corpus of 400 emails using a Support Vector Machine classifier and report precision and recall values of up to 100%. However, the significance of the results is difficult to assess given the small size of the email collection.

Microsoft uses data from honeypots and the feedback of more than 300,000 Hotmail users to train its SmartScreen phishing filter [35]. The filter evaluates more than 100,000 attributes of an email and uses a learning algorithm based on Bayesian statistics. A team of experts constantly tunes the filter and adapts it to the latest spamming and phishing techniques; however, no performance figures are given. In a comparison of Windows phishing filters the Internet Explorer filter scored best, with a recall of 89% and no false positives [34].

Fette et al. follow a similar approach, but use a larger publicly available corpus of about 7000 legitimate (ham) emails and 860 phishing emails [16, 17]. They propose ten different features to identify phishing scams. Nine of these features can be extracted from the email itself, while the tenth feature, the age of linked-to domain names, has to be obtained by a WHOIS query at the time the email is received. Among the remaining features is the score of a publicly available spam filter, SpamAssassin. By 10-fold cross-validation Fette et al. arrive at a false positive rate of 0.13% and a false negative rate of 3.6%, which corresponds to an F-measure of 97.6%.

Abu-Nimeh et al. investigate the performance of different popular classifiers used in text mining, e.g., logistic regression, classification and regression trees, Bayesian additive regression trees, Support Vector Machines, random forests, and neural networks [1]. A public collection of about 1700 phishing mails and 1700 legitimate mails from private mailboxes is used. For this setup the random forest classifier yields the best result with an F-measure of 90%. However, other classifiers had similar performance, and different rankings result if other criteria are used.

The authors conclude that the most promising way to improve predictive accuracy will be the inclusion of additional features. This is the approach chosen in this paper. In the next section we describe a number of features we used for phishing filtering.
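The correspondence between the error rates and the F-measure quoted above for Fette et al. can be checked with a short calculation. The sketch assumes that the F-measure refers to the phishing class and uses the approximate corpus sizes given in the text (about 7000 ham and 860 phishing emails), so the result is only expected to agree up to rounding.

```python
def f_measure(n_phish: float, n_ham: float, fp_rate: float, fn_rate: float) -> float:
    """F-measure for the phishing class from class sizes and error rates."""
    tp = n_phish * (1.0 - fn_rate)  # phishing emails that are caught
    fn = n_phish * fn_rate          # phishing emails that are missed
    fp = n_ham * fp_rate            # ham emails flagged as phishing
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Approximate corpus sizes and the reported error rates.
value = f_measure(n_phish=860, n_ham=7000, fp_rate=0.0013, fn_rate=0.036)
print(round(value, 3))  # 0.976
```

With these inputs the precision is about 0.989 and the recall is 0.964, which reproduces the reported 97.6%.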

4 Features for Content-based Filtering of Phishing Emails

In this section we present the features that are designed to discriminate phishing mails and that serve as input to the classifiers. Besides customary features we present a number of novel features targeted at phishing detection, such as a special version of topic models, logo detection, and hidden salting detection.

4.1 Basic Features

One main strategy of phishing emails is to lure users to some website where they are urged to reveal confidential data. It is thus natural to look at the email structure and at external links for the detection of phishing emails. In related work several sets of basic features for phishing detection have been proposed [11, 1, 17]. Based on these works we extract a total of 27 basic features. One key difference of our approach is that all our basic features can be derived directly from the email itself. In particular, we do not use features that require information about specific websites, such as the age of linked-to domains.

Structural Features (4) Structural features can be captured from the HTML tree and reflect the body part structure of an email. The MIME standard defines a number of possible message formats. We record four features: the total number of body parts, the number of discrete and composite body parts, and the number of alternative body parts, which are different representations of the same content.

Link Features (8) Link features reflect various properties of links contained in an email. We record eight features: the total number of links, the number of internal and external links, the number of links with IP numbers, the number of deceptive links (links where the URL visible to the user is different from the URL the link is pointing to), the number of links behind an image, the maximum number of dots in a link, and a Boolean indicating whether there is a link whose text contains one of the following words: click, here, login, update.

Element Features (4) Element features reflect what kinds of web technologies are used in an email. We record four Boolean features of whether HTML, scripting and in particular JavaScript, and forms are used.

Spam Filter Features (2) We use an untrained, off-line version of SpamAssassin to generate two features, the score and a Boolean of whether or not an email is considered as spam. A message is considered spam if its score is greater than 5.0. It is important to note that we consider only features intrinsic to an email; we do not use any kinds of black- or whitelists. In a real-life scenario, the inclusion of such information would probably lead to performance improvements.

Word List Features (9) Finally, we use a positive word list, i.e., a list of words hinting at the possibility of phishing. For each word in the list we record a Boolean feature of whether or not the word occurs in the email. The list contains a total of nine words or word stems: account, update, confirm, verify, secur, notif, log, click, and inconvenien.

In the following we propose two kinds of advanced features. Both of them can be viewed as classifiers themselves, because they are based on models. The outputs of these models serve as features in our global email classification process.
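Some of the link and word list features of this section can be sketched in a few lines. This is an illustrative approximation, not the extraction code used in our experiments: the helper names (`LinkCollector`, `basic_features`), the rule that a link is deceptive when its visible text is itself a URL naming a different host, and the counting of dots in the `href` value are assumptions made for the example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urlsplit

WORD_STEMS = ["account", "update", "confirm", "verify", "secur",
              "notif", "log", "click", "inconvenien"]
LINK_WORDS = {"click", "here", "login", "update"}
IP_HOST = re.compile(r"^\d{1,3}(\.\d{1,3}){3}$")


class LinkCollector(HTMLParser):
    """Collect (href, visible text) pairs of <a> elements."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href") or ""
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def host(url):
    """Lower-cased host name of a URL ('' if there is none)."""
    return (urlsplit(url).hostname or "").lower()


def basic_features(html_body):
    parser = LinkCollector()
    parser.feed(html_body)
    parser.close()
    links = parser.links
    # Crude tag stripping to obtain the visible text for the word list.
    text = re.sub(r"<[^>]+>", " ", html_body).lower()
    feats = {
        "n_links": len(links),
        "n_ip_links": sum(bool(IP_HOST.match(host(h))) for h, _ in links),
        # Assumed rule: visible text looks like a URL but names another host.
        "n_deceptive": sum(
            t.lower().startswith(("http://", "https://")) and host(t) != host(h)
            for h, t in links),
        "max_dots": max((h.count(".") for h, _ in links), default=0),
        "link_word": any(w in t.lower().split()
                         for _, t in links for w in LINK_WORDS),
    }
    for stem in WORD_STEMS:
        feats["word_" + stem] = stem in text
    return feats


# Hypothetical email body with one deceptive IP link and one ordinary link.
example = ('<p>Please confirm your account.</p>'
           '<a href="http://192.0.2.7/login.php">https://www.mybank.example</a> '
           '<a href="http://news.example.org/a.b">click here</a>')
print(basic_features(example))
```

The resulting dictionary can be turned into a numeric vector and passed to the classifier together with the remaining features.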

4.2 Topic Features

Unobservable “latent” topics are clusters of words that tend to appear together in emails. We can expect that in a phishing email the words “click” and “account” often appear together, while in regular financial emails the words “market”, “prices”, and “plan” may co-occur. Topic models collect these words frequently co-occurring in a training set and form a number of K different clusters or topics. Each word is typically assigned to only one or very few topics with a high weight. Consequently each document usually has a number of different topics. The knowledge of the prevalent topics of an email may help in classification. If the model is applied to a new email, it calculates for each content-bearing word in the email the probabilities that it belongs to each of the topics.

Usual latent topic models do not take into account different classes of documents, e.g., phishing or non-phishing. In [4] we developed a new statistical model, the latent Class-Topic Model (CLTOM), which extends latent Dirichlet allocation (LDA) [6] in such a way that it incorporates category information of emails during the model inference for topic extraction. This method yields word clusters that are more focused on the distinction of phishing emails from legitimate emails. We first have to train the class-topic model using a training collection of emails annotated with classes. The estimated model may then be applied to a new, uncategorized email and assigns a latent topic to each word of the email. The proportions of the different topics in the email then serve as input features for the classifier.

Figure 1: The graphical model of CLTOM

We now describe the class-topic model in more detail. The CLTOM assumes a probabilistic generative process for the content of a message. In general, we assume that there are $C$ different classes of messages and $K$ different latent topics, where $K$ is a parameter fixed in advance. First, a class indicator $c \in \{1, \ldots, C\}$ is sampled from a multinomial distribution parameterized by $\lambda = (\lambda_1, \ldots, \lambda_C)$. Given class $c$, a $K$-dimensional probability vector $\theta = (\theta_1, \ldots, \theta_K)$ for the topic composition of the message is generated by a Dirichlet distribution parameterized by $\alpha_c = \{\alpha_{c1}, \ldots, \alpha_{cK}\}$. Then, for each word position $n$ in the message, a topic $z_n$ is sampled from the multinomial $\theta$. Finally a word $w_n$ is randomly generated according to a topic-specific word distribution, which is a multinomial parameterized by $\beta_{z_n}$ describing the probability $\beta_{z_n, w}$ for each possible word $w$ in the set of all possible words $W$. Under this assumption, the probability of the $N$ words $w = (w_1, \ldots, w_N)$ in an email is

$$p(w \mid \lambda, \alpha, \beta) = p(c \mid \lambda) \int p(\theta \mid \alpha, c) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta)\, d\theta.$$
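The generative process just described can be sketched as a small sampler. The toy vocabulary and the values of `lam`, `alpha`, and `beta` are hypothetical and serve only to make the four sampling steps concrete; classes and topics are indexed from 0 here, and the Dirichlet draw is implemented by normalizing Gamma variates.

```python
import random


def sample_message(lam, alpha, beta, vocab, n_words, rng):
    """Sample one message from the CLTOM generative process.

    lam:   class probabilities (length C)
    alpha: Dirichlet parameters per class (C x K)
    beta:  word probabilities per topic (K x len(vocab))
    """
    # 1. Class indicator c ~ Multinomial(lam).
    c = rng.choices(range(len(lam)), weights=lam)[0]
    # 2. Topic proportions theta ~ Dirichlet(alpha_c), via normalized Gammas.
    g = [rng.gammavariate(a, 1.0) for a in alpha[c]]
    total = sum(g)
    theta = [v / total for v in g]
    words = []
    for _ in range(n_words):
        # 3. Topic z_n ~ Multinomial(theta).
        z = rng.choices(range(len(theta)), weights=theta)[0]
        # 4. Word w_n ~ Multinomial(beta_z).
        words.append(rng.choices(vocab, weights=beta[z])[0])
    return c, theta, words


# Hypothetical toy parameters: C = 2 classes, K = 2 topics, 4 words.
vocab = ["click", "account", "market", "prices"]
lam = [0.5, 0.5]
alpha = [[5.0, 0.5], [0.5, 5.0]]
beta = [[0.45, 0.45, 0.05, 0.05],
        [0.05, 0.05, 0.45, 0.45]]

c, theta, words = sample_message(lam, alpha, beta, vocab, 8, random.Random(0))
print(c, [round(t, 2) for t in theta], words)
```

Because each class has its own `alpha` row, messages of different classes tend to favor different topics, which is the property the class-topic model exploits.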

By the introduction of the class variable $c$ and separate parameters $\alpha_c$ for each class, the model is expected to capture more class-relevant semantic features compared with fully unsupervised LDA. Note that the word order in a message is not considered; a message is represented as a bag of words. Figure 1 shows the graphical model of CLTOM, which summarizes the stochastic process mentioned above. The learning of the CLTOM is done using a variational EM algorithm [6]. For a corpus $\mathcal{D} = \{w_1, \ldots, w_D\}$, the bag-of-words representation of a collection of training examples $x_1, \ldots, x_D$, the log-likelihood is

$$L = \log p(\mathcal{D} \mid \lambda, \alpha, \beta) = \sum_{d=1}^{D} \log p(w_d \mid \lambda, \alpha, \beta). \qquad (2)$$

Here, the exact posterior inference for the latent variables is intractable, so we employ a mean-field variational method for the inference of $\{(c_d, \theta_d, z_d)\}_{d=1}^{D}$. For a message $w_d$, the distribution of the latent variables $(c_d, \theta_d, z_d)$ is fully factorized as $q(c_d, \theta_d, z_d) = q(c_d \mid \nu_d)\, q(\theta_d \mid \eta_d) \prod_{n=1}^{N_d} q(z_{dn} \mid \phi_{dn})$. The two kinds of variational parameters $\nu_d$ and $\phi_{dn}$ characterize the posterior multinomial distributions for $c_d$ and $z_{dn}$, and $\eta_d$ is the variational parameter of the posterior Dirichlet $q(\theta_d \mid \eta_d)$. Then, the log-likelihood in (2) can be lower-bounded by

$$L \ge \sum_{d=1}^{D} \Big( E_{q_d}[\log p(c_d \mid \lambda)] + E_{q_d}[\log p(\theta_d \mid c_d, \alpha)] + E_{q_d}[\log p(z_d \mid \theta_d)] + E_{q_d}[\log p(w_d \mid z_d, \beta)] + H(q_d) \Big) \equiv F,$$

where $q_d = q(c_d, \theta_d, z_d)$ and $H(q_d)$ is the entropy term for the posterior distribution $q_d$. We maximize the lower bound $F$ based on the variational approximation.

In the E-step of the variational EM algorithm, the distributions of the hidden variables, $q(c)$, $q(\theta)$, and $q(z)$, are estimated. Note that $q(c_d)$ for training messages is fixed prior to learning, because the category information is already given for the messages in the training corpus. In the M-step, the model parameters $\lambda$, $\alpha$, and $\beta$ are determined. The two multinomial parameters $\lambda$ and $\beta$ can be calculated in closed form. The Dirichlet parameter $\alpha$ is estimated iteratively using a Newton-Raphson method [30], because there is no closed-form solution for this parameter. These steps are alternated until convergence.

When a new message $w_m$ arrives, the semantic features for the message are derived through inference in the trained CLTOM. In this application phase, unlike in the training phase, $q(c \mid \nu_m)$ is also estimated along with $q(\theta \mid \eta_m)$ and $q(z \mid \phi_m)$, because the category information is not known for the newly arriving message. Finally, the $K$-dimensional posterior mean value $\bar{\theta}_m = (\bar{\theta}_{m1}, \ldots, \bar{\theta}_{mK})$, with $\bar{\theta}_{mk} = \eta_{mk} / \sum_{l} \eta_{ml}$, is calculated from the Dirichlet $q(\theta \mid \eta_m)$ and is provided as the semantic feature for the incoming message, where $\bar{\theta}_{mk}$ is the mean word probability of topic $k$ in message $m$. In experiments we showed that the CLTOM algorithm yields features that lead to a better phishing classification performance than equivalent features based on the LDA algorithm, which does not take into account the class information [4].
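The final step, turning the variational Dirichlet parameters of a message into topic features, is a simple normalization. The sketch below assumes that the parameter vector has already been obtained by inference; the values used are hypothetical.

```python
def topic_features(eta):
    """Posterior mean of theta under Dirichlet(eta): eta_k / sum_l eta_l."""
    total = sum(eta)
    return [e / total for e in eta]


# Hypothetical variational parameters for a message with K = 3 topics.
print(topic_features([6.0, 3.0, 1.0]))  # [0.6, 0.3, 0.1]
```

The resulting vector sums to one and is appended to the other features of the email before classification.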

4.3 Characterizing Text Sequences by Dynamic Markov Chains

The text-based features considered so far are based on the bag-of-words approach, which ignores the order of the words. With the introduction of features based on dynamic Markov chains we overcome this limitation. The idea here is to model the “language” of each class of messages. For each class a probabilistic automaton is learned that outputs, for a new message, the probability with which the message belongs to that class.

Dynamic Markov chains are based on probability theory and are a model for the sequential composition of messages belonging to a specific class. For each class of interest a model is generated from training data. For a new message we may compute its likelihood for the different classes. The likelihood for a class divided by the sum of all likelihoods may be directly interpreted as the probability that the message belongs to that class.

Dynamic Markov chain generation is a technique developed for arithmetic compression [13], the problem of compressing arbitrary binary sequences. A sequence is thought to be generated by a random source. Cormack et al. developed a technique for the incremental construction of a Markov chain approximating the source [13]. These dynamic Markov chains have been successfully applied to text classification problems in various domains [29]. Each class is considered as a different source, and each text belonging to the class is treated as a message emitted from the corresponding source. The source is approximated by incrementally enhancing the initial starting chain. Given a sufficiently large number of training examples, the iterative approximation of the unknown source permits the accurate estimation of the likelihood that a given sequence originated from that source. By comparing these likelihoods for different sources the sequence may be classified.

The Markov chain classification can be summarized as follows. Let the sequence of bits $(b_1, \ldots, b_n)$ be the representation of an email. Let $M$ be a model for the sequence of bits $(b_1, \ldots, b_n)$, which predicts the probability $p(b_i \mid b_1, \ldots, b_{i-1}, M)$ that bit $b_i = 1$ given the sequence of previous bits $b_1, \ldots, b_{i-1}$. Then the average log-likelihood of the email for the model is

$$\log p(b_1, \ldots, b_n \mid M) = \log \prod_{i=1}^{n} p(b_i \mid b_1, \ldots, b_{i-1}, M) \qquad (3)$$

Assume that we have estimated a model $M_c$ for each of the classes ham, spam and phishing. The class $c$ for which $\log p(b_1, \ldots, b_n \mid M_c)$ is maximal is the most likely class of the email $x$. Therefore the classification of $x$ can be formulated based on the maximal log-likelihood for all classes $C$ as:

$$c^* = \arg\max_{c \in C} \log p(b_1, \ldots, b_n \mid M_c)$$

where $M_c$ is the model for class $c \in C$. The training of the dynamic Markov chain models is done by building a more and more detailed sequence model as more training data arrives. It is effectively a compression technique for observed data, and therefore its space requirements grow without bound as more emails arrive. In [4] we presented an approach to reduce the size of the models: by judging the value of each specific example in terms of its impact on the classification accuracy, training examples can be heuristically omitted, similar to uncertainty sampling in active learning [22]. We empirically showed that our heuristic adaptation technique reduces the amount of space needed for each model by about two thirds. Bratko et al. [7] achieve good results for the classification of spam emails using the approach described above. In contrast to their work, we extract features from dynamic Markov chain models that differ in the input for the generation process. For the DMCText features, we convert emails into plain text, meaning that all headers etc. are removed and file attachments are discarded. The reason for the exclusion of header information is that we feel it might make synthetic test data trivially separable through the inclusion of domain names or IP addresses. In addition we construct DMCLink features. To this end, we extract the external links and, if available, the link texts. We omit all links that point to external graphics or documents. For both DMCText and DMCLink we build two models, one for ham emails and one for phishing emails, and extract four features each: the average log-likelihood values for each of the two models and two Boolean features indicating membership in either class.
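The classification rule of equation (3) can be illustrated with a much simpler stand-in for the dynamic Markov chain. The sketch below assumes a fixed-order Markov model over bits with add-one smoothing, which is not the incremental state-cloning construction of [13]; the class name `BitMarkovModel`, the context length of 8 bits and the helper names are hypothetical. It only shows how per-class models yield log-likelihoods, a winning class, and normalized class probabilities.

```python
import math
from collections import defaultdict


class BitMarkovModel:
    """Fixed-order Markov model over bits (an assumed stand-in for a DMC)."""

    def __init__(self, order=8):
        self.order = order
        # context -> [count of 0, count of 1], initialised to 1 (add-one smoothing)
        self.counts = defaultdict(lambda: [1, 1])

    def _contexts(self, bits):
        for i, bit in enumerate(bits):
            yield tuple(bits[max(0, i - self.order):i]), bit

    def train(self, bits):
        for context, bit in self._contexts(bits):
            self.counts[context][bit] += 1

    def log_likelihood(self, bits):
        # log p(b_1..b_n | M) = sum_i log p(b_i | previous bits, M)
        total = 0.0
        for context, bit in self._contexts(bits):
            zeros, ones = self.counts.get(context, (1, 1))
            total += math.log((ones if bit else zeros) / (zeros + ones))
        return total


def to_bits(text):
    """Represent a text as the bit sequence of its UTF-8 bytes."""
    return [(byte >> k) & 1 for byte in text.encode("utf-8") for k in range(7, -1, -1)]


def classify(text, models):
    """Return the class with maximal log-likelihood and normalised class probabilities."""
    bits = to_bits(text)
    scores = {label: model.log_likelihood(bits) for label, model in models.items()}
    best = max(scores, key=scores.get)
    # Likelihood of a class divided by the sum of all likelihoods (computed stably).
    norm = sum(math.exp(s - scores[best]) for s in scores.values())
    probs = {label: math.exp(s - scores[best]) / norm for label, s in scores.items()}
    return best, probs
```

One model is trained per class by calling `train` on the bit sequences of that class's emails; `classify` then compares the resulting log-likelihoods.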

4.4 Image Features

To defeat usual text-based filtering techniques spammers and phishers started to embed the message into attached images, a trick known as image spam or image phishing. In addition, invisible text is often inserted into the body of an email. According to Commtouch image spam accounted for 30-50% of all spam in 2006 [12].

Figure 2: A part of an image with embedded text and one logo

Figure 3: The extracted connected components

4.4.1 Image Distortion

Spammers started to “obscure” image text to defeat OCR tools by introducing small random distortions and stains, which can easily be read by humans but drastically reduce the accuracy of OCR. To detect such distortions, different features for images were developed based on low-level image processing techniques, e.g., based on color moment and color heterogeneity [5, 9, 41]. For image phishing these techniques are less relevant, as phishing emails should look clean and professional, as if they come from reputable senders. Therefore OCR techniques are potentially valuable for phishing emails if a high-quality OCR solution is used. For the experiments in this paper we implemented the algorithm described in [5], which is based strongly on CAPTCHA breaking techniques. We compute two features that are able to detect random background noise and broken or merged characters. Emails that contain these obscuring techniques have a high potential to be spam.

4.4.2 Logo detection

A phishing email looks much more authentic if it contains a proper logo of the affected institution, e.g., a bank. Note that only relatively few brands are reported to be attacked, e.g., 144 different brands in December 2007 [2]. Zhu and Doermann [45] discuss the state of the art of logo detection and identification in images. For phishing detection the presence of logos is a feature that works only in combination with other features. For example, the combination of hidden text and a logo of a well-known company can make it easier to decide whether or not an email is phishing. For this paper we implemented a novel technique for logo detection [28]. Compared to the standard object detection scenario, detecting and comparing logos is a relatively easy task, as the logo will not be subject to major transformations, such as occlusion or change of perspective. This is due to the fact that it is the phisher’s goal to use the logo in exactly the same way the bank or company would.

To detect a logo we first perform a color segmentation on the image. Then we identify all connected components in this image. The next step is to form a graph from these components (one component per node) and match this graph to the trained graphs of the original logos. This algorithm is very fast because in general a logo graph is composed of only very few nodes. Figure 2 shows an image with an embedded logo and Figure 3 depicts the extracted connected components. As logo detection features we record for each email the total count of detected logos, the count of distinct logos, and for each known logo a Boolean feature indicating whether or not the logo appears in the email.
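The second step of the pipeline above, identifying connected components after color segmentation, can be sketched as a breadth-first flood fill. This is a minimal illustration under assumptions: the image is taken to be a small grid of color-segment ids that an earlier segmentation step has already produced, 4-connectivity is assumed, and the function name is hypothetical. The graph construction and graph matching of [28] are not shown.

```python
from collections import deque


def connected_components(labels):
    """Group 4-connected pixels with the same colour-segment id.

    labels: 2D list of segment ids (assumed output of a colour segmentation).
    Returns a list of (segment_id, [(row, col), ...]) in scan order.
    """
    rows, cols = len(labels), len(labels[0])
    seen = [[False] * cols for _ in range(rows)]
    components = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c]:
                continue
            colour = labels[r][c]
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    inside = 0 <= ny < rows and 0 <= nx < cols
                    if inside and not seen[ny][nx] and labels[ny][nx] == colour:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            components.append((colour, pixels))
    return components
```

Each returned component would become one node of the graph that is matched against the stored logo graphs.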

5 Hidden Text Salting Detection Methodology

Common tricks of spammers known as message salting are the inclusion of random strings and diverse combinatorial variations of spacing, word spelling, word order, etc. Some salting techniques, called hidden salting, cause messages to appear the same to the human eye even if the machine-readable forms are very different. This makes automated message processing very hard and is especially suited for phishing. Hidden salting often makes use of content modification using HTML, such as inserting white text on a white background and other invisible information in HTML messages, as well as the inclusion of bogus HTML tags or HTML tables. As phishers constantly invent new hidden salting techniques, we have designed a new technique for the automatic detection of these tricks.

Given some text, e.g. an email, the main idea of our technique is to separate what is seen by the user from what is really contained in the text (but hidden from the user). We use a computer vision technique to segment text into partitions that have the same reading order, and we use a document image understanding technique to find the reading order of these partitions. Specifically, given an email as input, a text production process, e.g. a Web browser, creates a parsed, internal representation of the email text and drives the rendering of that representation onto some output medium, e.g. a browser window. Our proposed method of detecting text salting takes place during this rendering process, and is composed of two steps:

• Step 1: we tap into the rendering process to detect hidden content (= possible manifestations of salting);
• Step 2: we feed the intercepted, visible text into an artificially intelligent cognitive model to obtain the text truly perceived by the user.

For Step 1, we intercept requests for drawing text primitives, and build an internal representation of the characters that appear on the user screen. This representation is a list of attributed glyphs (= positioned shapes of individual characters, with rendering attributes and any concealing shapes). Glyphs are listed in their compositional order (= order in which they are drawn). Then, we test for glyph visibility (= which glyphs are seen by the user) according to these glyph visibility conditions:

1. clipping: glyph drawn within the physical bounds of the drawing clip, which is a type of ‘spatial mask’;
2. concealment: glyph not concealed by other glyphs or shapes;
3. font color: glyph’s fill color contrasts well with the background color;
4. glyph size: glyph size and shape are sufficiently large.

Failure to comply with any of these conditions results in an invisible glyph, which we use to identify hidden salting. E.g. zero-sized fonts violate the ‘glyph size’ condition. To resolve salting detected through these conditions, we eliminate all invisible glyphs (i.e. retain only what is perceived).

For Step 2, given a set of retained (visible) glyphs, we detect an additional type of hidden salting as follows: we define the order in which glyphs are most likely read by humans (reading order); we cannot assume that reading order = compositional order in the source, since spammers are known to exploit this (e.g. the ‘slice-and-dice’ trick). Then, given a set of visible, positioned glyphs covering a specific area, e.g. a page, we use a cognitive model of text perception to define the glyph reading order (see Figure 4). The task of the cognitive model is to find a coherent partitioning of the page, with proper reading directions assigned to each partition (‘coherent’ and ‘proper’ w.r.t. some quality metric). We identify salting when the reading order does not correspond to the compositional order in the source.
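The four glyph visibility conditions of Step 1 can be expressed as a small filter over a list of attributed glyphs. The sketch below is illustrative only: the `Glyph` fields, the contrast measure and both thresholds are assumptions made for this example, not the representation or the values used in our implementation, and concealment is reduced to a precomputed flag.

```python
from dataclasses import dataclass

MIN_SIZE = 2.0       # assumed minimum width/height in pixels
MIN_CONTRAST = 30.0  # assumed minimum per-channel colour difference


@dataclass
class Glyph:
    char: str
    x: float
    y: float
    width: float
    height: float
    fill: tuple          # RGB fill colour
    background: tuple    # RGB colour behind the glyph
    concealed: bool = False  # True if another glyph or shape covers it


def contrast(a, b):
    """Crude contrast measure: largest per-channel difference (an assumption)."""
    return max(abs(p - q) for p, q in zip(a, b))


def violated_conditions(glyph, clip):
    """Return the names of the visibility conditions this glyph fails."""
    left, top, right, bottom = clip
    violations = []
    inside = (left <= glyph.x and glyph.x + glyph.width <= right
              and top <= glyph.y and glyph.y + glyph.height <= bottom)
    if not inside:
        violations.append("clipping")
    if glyph.concealed:
        violations.append("concealment")
    if contrast(glyph.fill, glyph.background) < MIN_CONTRAST:
        violations.append("font color")
    if glyph.width < MIN_SIZE or glyph.height < MIN_SIZE:
        violations.append("glyph size")
    return violations


def visible_text(glyphs, clip):
    """Keep only glyphs that satisfy all conditions, in compositional order."""
    return "".join(g.char for g in glyphs if not violated_conditions(g, clip))
```

Any glyph with a non-empty violation list counts as a manifestation of hidden salting; the names of the violated conditions could then be counted per email.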

5.1 Cognitive Model of Text Segmentation

In order to detect hidden salting visually, we segment text into partitions that have the same reading order. Our cognitive model of text segmentation is an instance of a search algorithm, which follows a greedy, limited-width, breadth-first search strategy. This search algorithm takes as input a page of text, and outputs a partitioning of that page with the most likely reading direction for each partition.

5.1.1 Definitions

Let $\mathcal{S}$ be the search space of the algorithm, consisting of all possible partitionings of the page. Let $\mathcal{P}$ be a single partitioning. To label every partition $P \in \mathcal{P}$ with a reading direction $D$ from a set of candidate reading directions $\mathcal{D}$, we introduce a labelling function $F : P \mapsto (D_o, \rho(P))$ which assigns the most likely reading direction $D_o$ to its argument partition $P$. We rely on the four heuristics that define the reading order discussed below. In addition, on the basis of these heuristics, $F$ generates a confidence score $\rho(P) \in [0, 1]$ as a measure of the joint belief in the logical relatedness of the glyphs contained in $P$ and the assignment of $D_o$ as ‘proper’ reading direction. We also define an objective function $O : \mathcal{P} \mapsto [0, 1]$ which measures the coherence of its argument partitioning $\mathcal{P}$. The measure is based on the confidence scores $\rho(P)$ of all contained partitions $P \in \mathcal{P}$.

Figure 4: Reading direction example: a page contains two text blocks (dashed), whose perceived contents arise from rendering a virtual grid of text glyphs (shown aside). The grid traversal order during rendering can be manipulated, resulting in a compositional glyph order (dark arrows) that differs from the reading order (light arrows). Our cognitive model restores the glyphs’ true reading order by deriving the proper reading direction of the identified text blocks.

5.1.2 Search Process

Let $A_d$ be the set of candidate partitionings at depth $d$ of the search tree. Initially, $d = 1$ and $A_1$ contains the initial partitioning $\mathcal{P}_i = \{P_i\}$. Its sole partition $P_i$ covers the entire page and is labelled by $F$. Proceeding from depth $d$ to $d + 1$ (cf. breadth-first search), $A_{d+1}$ is generated from $A_d$ as follows (see Figure 5): we repeat over all $\mathcal{P}_c \in A_d$, where $\mathcal{P}_c$ is the current partitioning in this step of the search algorithm. $\mathcal{P}_c$ has $\min(\chi_p, |\mathcal{P}_c|)$ of its partitions $P \in \mathcal{P}_c$ selected for refinement, with $\chi_p \ge 1$ (see footnote 1). The selection is based on the confidence scores $\rho(P)$. For every selected partition $P$ separately, we sample $\chi_r(P) \ge 1$ alternative refinements. A refinement of $P$ is a subpartitioning of $P$, denoted $\mathcal{P}(P)$. The refinement sampling process is constrained and guided by visual cues, such as salient visual gaps between the glyphs in $P$. Reading directions are assigned to all subpartitions by the application of $F$ to all of $\mathcal{P}(P)$. This results in a refined partitioning $\mathcal{P}_c' = (\mathcal{P}_c \setminus P) \cup \mathcal{P}(P)$. Only when the refined partitioning is a significant improvement over the current partitioning, that is, when $O(\mathcal{P}_c') \ge O(\mathcal{P}_c) + \epsilon$, will $\mathcal{P}_c'$ be included in $A_{d+1}$ (cf. greedy search). If, after repeating over the $\chi_p$ selected partitions and the $\sum_P \chi_r(P)$ alternative refinements, no improvement over $\mathcal{P}_c$ is found, $\mathcal{P}_c$ itself is included in $A_{d+1}$ to preserve it as a possible candidate partitioning. Finally, after repeating over all $\mathcal{P}_c \in A_d$, the $\chi_n \ge 1$ best candidate partitionings in $A_{d+1}$ are retained to narrow down the search to the best candidates only (cf. limited-width search). As quality criterion we use the objective function.

The search algorithm is terminated at the start of depth $d$ when one of several stopping criteria is fulfilled (e.g. when $d = \chi_d \ge 1$). Then, $O$ pinpoints the optimal candidate partitioning $\mathcal{P}_o \in A_d$ as a local or global optimum over $\mathcal{S}$. The glyphs’ reading order readily follows from the labellings in $\mathcal{P}_o$. The labelling function $F$ outputs the most likely reading direction $D_o \in \mathcal{D}$ of a given partition $P$ in isolation, along with a confidence score $\rho(P)$ for the newly labelled $P$.

Footnote 1: All $\chi_*$ are free parameters in our algorithm.

Figure 5: Page partitioning search process. Part of a fictive search tree is shown, detailing the transition from depth 2 to depth 3. In layer 2a, the partitions selected for refinement are shown hatched. Layer 2b depicts the generated refined partitionings, crossing out those with no improvement in objective value (indicated in the bottom-right page corner; the values are exemplary). The breadth of the search tree is restricted to a maximum of two candidate partitionings. This causes the elimination of the (at first retained) rightmost candidate for depth 3. For clarity, reading directions are omitted from the figure.

5.1.3 Reading Direction Assignment

The reading direction assignment function $F$ uses heuristics (layout and linguistic features) to filter out the most likely reading direction. These feature scores are computed for each partition; then, their product is consolidated into a single confidence score, parameterized by an empirically-tuned penalty factor. When efficiency is a concern, the heuristics can be used as a cascade, where additional forms of evidence are only computed when the current confidence in the reading direction is low. We use the following layout and linguistic features:

1. Evidence 1. Layout: glyph alignment is measured both horizontally and vertically to detect line orientation. On parallel lines (horizontally), glyphs are expected to be aligned. Across lines (vertically), glyphs may be misaligned when variable-width fonts are used. Also, with fixed-width fonts, we count the number of glyphs in a line: very few glyphs in consecutive lines may indicate a false reading direction (see footnote 2).
2. Evidence 2. Word length: we define a word as a sequence of glyphs on a single line, delimited by whitespace. We compare the word length distribution to the word length distribution in a reference corpus: significant statistical differences may indicate a false reading direction.
3. Evidence 3. Characters: we apply the same technique as in the previous step to character sequences (3-grams).
4. Evidence 4. Common words: we apply the same technique as in the previous two steps to the relative number of common words, found in a dictionary of the most frequent words obtained from a reference corpus.

Layout and word length are orientation-aware features, approximately equally informative with variable-width fonts; with fixed-width fonts, word length is more effective. Common words are the most discriminative feature in all line orientations and reading directions, but also the most computationally costly feature (term lookup).
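To make the role of a linguistic feature concrete, the following sketch scores two candidate reading directions of a small glyph grid using only the common-words evidence (Evidence 4). It is a simplified illustration: the tiny word list, the grid representation and the function names are assumptions for this example, the score is a plain ratio rather than a statistical comparison against a reference corpus, and the combination with the other evidence and the penalty factor is not shown.

```python
# Assumed stand-in for a dictionary of most frequent words from a reference corpus.
COMMON_WORDS = {"the", "of", "and", "to", "your", "account", "is", "please"}


def read_grid(grid, direction):
    """Return the text lines obtained by traversing a glyph grid in one direction."""
    if direction == "left-to-right":
        return ["".join(row) for row in grid]
    if direction == "top-to-bottom":
        return ["".join(column) for column in zip(*grid)]
    raise ValueError("unknown direction: " + direction)


def common_word_score(lines):
    """Fraction of whitespace-delimited words that are in the common word list."""
    words = [word.lower() for line in lines for word in line.split()]
    if not words:
        return 0.0
    return sum(word in COMMON_WORDS for word in words) / len(words)


def most_likely_direction(grid, directions=("left-to-right", "top-to-bottom")):
    """Pick the candidate reading direction with the highest common-word score."""
    scores = {d: common_word_score(read_grid(grid, d)) for d in directions}
    best = max(scores, key=scores.get)
    return best, scores[best]
```

For a grid whose glyphs were composed column by column, the top-to-bottom traversal recovers dictionary words while the left-to-right traversal does not, which is the kind of mismatch between compositional order and reading order that signals salting.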
The features that are used in the determination of the reading order are generic and, although illustrated and tested for the English language, can be used for detecting the reading order for any language, including segmented languages (where word tokens are delimited by white space) and unsegmented languages (e.g., Asian languages such as Chinese). In the latter case word lengths are recognized by using a dictionary of words or a language model in the considered language (manually or automatically acquired). In case certain glyph sequences are not found in the dictionary, a default, close-to-zero probability can be used for each glyph that is not part of a recognized word. In case of ambiguity, i.e., when different configurations of words can be recognized, the word length feature score can be averaged over the different possible readings. Alternatively, when considering a language, new features for determining the reading order can be added, or some of the features described above can be deleted.

Footnote 2: For most Indo-European languages.
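The greedy, limited-width, breadth-first search of Section 5.1.2 can be sketched generically, with the labelling function and the refinement sampling passed in as callbacks. Everything concrete below is an assumption made for illustration: partitions are tuples of glyph positions, the least confident partitions are selected first, a single value `chi_r` is used for all partitions, and the toy `confidence`, `refine` and `objective` functions are hypothetical stand-ins for the heuristics of Section 5.1.3.

```python
def search_partitioning(page, confidence, refine, objective,
                        chi_p=1, chi_r=2, chi_n=2, chi_d=5, eps=1e-3):
    """Greedy, limited-width, breadth-first search over partitionings of a page."""
    candidates = [(page,)]  # A_1: one partition covering the entire page
    for _ in range(chi_d - 1):  # stop when depth chi_d is reached
        next_candidates = []
        for current in candidates:
            # Select min(chi_p, |current|) partitions; lowest confidence first (assumed).
            selected = sorted(current, key=confidence)[:min(chi_p, len(current))]
            improved = False
            for part in selected:
                for sub in refine(part, chi_r):  # up to chi_r alternative refinements
                    refined = tuple(p for p in current if p is not part) + tuple(sub)
                    if objective(refined) >= objective(current) + eps:  # greedy step
                        next_candidates.append(refined)
                        improved = True
            if not improved:
                next_candidates.append(current)  # keep as a possible candidate
        next_candidates.sort(key=objective, reverse=True)
        candidates = next_candidates[:chi_n]  # limited width
    return max(candidates, key=objective)


# Toy callbacks (hypothetical): a partition is a sorted tuple of glyph x-positions.
def confidence(part):
    return 1.0 / (1.0 + max(part) - min(part))  # tight groups count as coherent


def refine(part, chi_r):
    # Split at the chi_r widest gaps, mimicking "salient visual gaps" as cues.
    cuts = sorted(range(1, len(part)), key=lambda i: part[i] - part[i - 1], reverse=True)
    return [(part[:i], part[i:]) for i in cuts[:chi_r]]


def objective(partitioning):
    # Glyph-weighted mean confidence, halved for every additional partition.
    total = sum(len(p) for p in partitioning)
    mean = sum(confidence(p) * len(p) for p in partitioning) / total
    return mean * 0.5 ** (len(partitioning) - 1)


result = search_partitioning((0, 1, 10, 11), confidence, refine, objective)
```

In this toy setting the search separates the two groups of positions at the wide gap and then stops refining, because further splits no longer improve the objective by at least `eps`.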

5.2 Related Work

5.2.1 Content Segmentation

Our approach for detecting hidden text salting segments and recognizes email content through the analysis of the result of a rendering process, and, as such, it is related to segmentation and content recognition in computer vision. Image segmentation is a fundamental component in computer vision, typically defined as an exhaustive partitioning of an input image into regions, each of which is considered homogeneous with respect to some property [24]. A feature vector is built for every pixel (e.g. comprising brightness, color, texture, motion features), that also contains information about the pixel location. Here, we segment pieces of text that have the same property (reading direction) and that are segregated from each other based on layout properties (visual boundaries). Other common approaches used for segmenting images are clustering techniques (e.g. graph-theoretic clustering, minimal spanning trees [43]), where the focus is on efficiently searching the space of the many possible partitionings of an image.

5.2.2 Reading Direction

Our cognitive model of partitioning text and assigning reading directions is borrowed from the document image understanding field, which researches the transformation of the informative content of a document from paper into an electronic format outlining its logical content. Its objectives and scope are broader than those of Optical Character Recognition (OCR) research, which focusses mainly on the recognition of isolated, individual characters [23]. Most commercial OCR systems implement some form of document image understanding, since present documents can no longer be restricted or assumed to have a plain structure. Of direct relevance to our work are the document image understanding techniques for layout analysis, which aim to automate the zoning process (= detection and classification of zones on a document image, e.g. a gray-level, pixelated image that results from scanning a page) in various data types (e.g. text, graphic, table) and to represent them in a geometric and/or logical structure [10]. In general, document image understanding faces the more difficult problem of capturing structural content from noisy, degraded, skewed, limited-resolution input images, such as scanned images of carbon copy documents or low-resolution faxed documents, and hence mainly targets confined, controlled domains. In our work, however, we target adversarial behavior in a competitive environment. For instance, we anticipate the use of unconventional reading directions, which may be unsupported in document image understanding systems but are correctly interpreted by humans’ perceptive abilities.

6 Classification Experiments

6.1 Phishing Classifiers

As discussed in Section 3.3.1, content-based phishing classifiers have to integrate a large number of features. It turns out that methods used for text classification also yield the best results for content-based phishing classification [1]. Prominent methods are Support Vector Machines [39] and variants of tree classifiers like Random Forests [8]. A key requirement is that they are able to limit the influence of unimportant features and only capture systematic dependencies. It has been shown by theoretical analysis and practical experiments that they are reliably able to achieve this.

In the context of phishing classification different setups may be used. On the one hand, we may simultaneously address the classification of emails into ham, spam and phishing, leading to a classification problem with three classes. Alternatively, we may assume that spam emails are eliminated beforehand by a dedicated spam filter and only the two classes ham and phishing have to be distinguished. A third possibility is to merge spam and phishing emails into one class of “unwanted” emails. In this paper we mainly address the two-class problem with ham and phishing emails. In addition we investigated the classification into ham and unwanted emails.

Statistical phishing classifiers only work if the features of the new emails correspond to the composition of emails in the training set. If emails with a completely new combination of feature values occur, the filter has to be adapted and retrained. It turns out that the filters are quite robust, as the classification usually depends not on a single feature value but on the cumulative effect of several features.

6.2 Active Learning

Advanced classifiers not only predict a class for a new email but are also able to estimate the probability that this class is the true class. This can be used to provide a self-diagnosis of the reliability of classifications. If unusual combinations of feature values occur which are dissimilar from all examples seen in the training set, then the classifier will usually be undecided and predict low probability values for all classes. This effect can be used to adapt classifiers to new types of phishing emails. If an email with high classification uncertainty is detected, it may contain a novel combination of features which never occurred in the training set. Active learning strategies select such training examples for annotation by experts [22]. It can be shown that this type of example selection provides the highest information gain with minimal annotation effort. Active learning is well-suited for the selection of examples in the expert’s lab of spam and phishing filter providers.
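The selection of uncertain emails for expert annotation can be sketched as follows. This is a minimal illustration that assumes the "least confident" criterion, i.e. emails are ranked by the probability of their most likely class; the function name, the input format and the annotation budget are hypothetical, and other uncertainty measures would fit the description equally well.

```python
def select_for_annotation(class_probabilities, budget):
    """Return the ids of the emails whose predicted class is least certain.

    class_probabilities: dict mapping email id -> dict of class -> probability.
    budget: number of emails the experts can annotate (an assumed parameter).
    """
    ranked = sorted(class_probabilities,
                    key=lambda email_id: max(class_probabilities[email_id].values()))
    return ranked[:budget]


predictions = {
    "mail-a": {"ham": 0.99, "phishing": 0.01},
    "mail-b": {"ham": 0.55, "phishing": 0.45},
    "mail-c": {"ham": 0.20, "phishing": 0.80},
}
to_label = select_for_annotation(predictions, budget=1)
```

The selected emails would be labeled by experts and added to the training set before the classifier is retrained.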

6.3 Description of Data

We have compiled a real-life dataset of 20000 ham and phishing emails over a period of almost seven months, from April 2007 to mid-November 2007. These emails were provided by our project partners. We assume a ratio of four to five ham emails for one phishing email. Specifically, our corpus consists of 16364 ham emails and 3636 phishing emails. For an additional experiment we added 20000 spam emails from the same time period to create an overall corpus of 40000 emails. We chose such a large number of spam emails to approximately reflect the fact that in real-life the majority of emails are unsolicited.

6.4 Feature Processing and Feature Selection

The features used in our system come from a variety of different sources. It is thus necessary to postprocess them to supply the classifiers with unified inputs.

6.4.1 Feature Processing

It is not advisable to directly use the unmodified feature values as input for a classifier. We perform scaling and normalization. Scaling guarantees that all features have values within the same range. Many classifiers are based on distance measures, such as the Euclidean distance, which overemphasizes features with large values. We perform a Z-transformation to ensure that all features have an empirical mean of 0 and an empirical standard deviation of 1. Additionally, we normalize the length of the feature vectors to one, which is adequate for inner-product based classifiers.

6.4.2 Feature Selection

In practice, machine learning algorithms tend to degrade in performance when faced with many features that are not necessary for predicting the correct label [26, 38]. The problem of selecting a subset of relevant features, while ignoring the rest, is a challenge that all learning schemes are faced with. Feature selection can be understood as a search in a state space. In this space every state represents a feature subset. Operators that add and eliminate single features determine the connection between the states. Because there are $2^n - 1$ different non-empty subsets of $n$ features, a complete search through the state space is impractical and heuristics need to be employed.

A common and widely used technique is the wrapper approach proposed by Kohavi et al. [27]: a search algorithm uses the classifier itself as part of the evaluation function. The classifier operates on an independent validation set; the search algorithm systematically adds and subtracts features to a current subset. In our experiments, we apply the so-called best-first search strategy, which expands the current node (i.e., the current subset), evaluates its children and moves to the child node with the highest estimated performance. We combine the best-first search engine with compound operators [27], which dynamically combine the set of best-performing children. The underlying idea is to shorten the search for each iteration by considering not only the information of the “best” child node (as described in the greedy approach above) but also the other evaluated children. More formally, by ranking the operators with respect to the estimated performance of the children, a compound operator $c_i$ can be defined to be the combination of the best $i + 1$ operators.
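The idea of compound operators can be illustrated with a simplified wrapper search. The sketch below is a greedy hill-climbing variant under several assumptions: only "add one feature" operators are considered, there is no open list or backtracking as in a full best-first search, and the `evaluate` callback stands in for training the classifier and measuring its performance on the validation set. The feature names and gain values in the toy evaluator are hypothetical.

```python
def select_features(features, evaluate, max_steps=20):
    """Greedy wrapper search with compound 'add feature' operators.

    evaluate(subset) must return an estimated performance for a frozenset
    of feature names (assumed to come from a validation set).
    """
    current = frozenset()
    best_score = evaluate(current)
    for _ in range(max_steps):
        # Evaluate all children and rank the operators by child performance.
        children = sorted(((evaluate(current | {f}), f)
                           for f in features if f not in current), reverse=True)
        if not children or children[0][0] <= best_score:
            break  # no child improves on the current subset
        candidate, cand_score = current | {children[0][1]}, children[0][0]
        # Compound operator c_i: combine the best i + 1 operators while it helps.
        for _, feature in children[1:]:
            compound = candidate | {feature}
            score = evaluate(compound)
            if score <= cand_score:
                break
            candidate, cand_score = compound, score
        current, best_score = candidate, cand_score
    return current, best_score


# Toy evaluator (hypothetical gains; unknown features slightly hurt performance).
GAIN = {"dmc_text": 0.5, "topic": 0.3, "links": 0.1}


def toy_evaluate(subset):
    return sum(GAIN.get(feature, -0.05) for feature in subset)


selected, score = select_features(
    ["dmc_text", "topic", "links", "noise_1", "noise_2"], toy_evaluate)
```

Because the compound operator adds several well-ranked features in one step, the toy search reaches its final subset after a single expansion instead of one expansion per feature.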

6.5 Evaluation Criteria

We use 10-fold cross-validation as our evaluation method and report a variety of evaluation measures. For each email four different scenarios are possible: true positive (TP, correctly classified phishing email), true negative (TN, correctly classified ham email), false positive (FP, ham email wrongly classified as phishing), and false negative (FN, phishing email wrongly classified as ham). We do not report accuracy, i.e., the fraction of correctly classified emails, as this measure is only of limited interest in a scenario where the different classes are very unevenly distributed. More importantly, we report standard measures, such as precision, recall, and F-measure as well as the false positive and the false negative rate. These measures are defined as:

$$\text{precision} = \frac{|TP|}{|TP| + |FP|}, \qquad \text{recall} = \frac{|TP|}{|TP| + |FN|}, \qquad f = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}},$$

$$\text{fpr} = \frac{|FP|}{|FP| + |TN|}, \qquad \text{fnr} = \frac{|FN|}{|TP| + |FN|}.$$

Note that in email classification, errors are not of equal importance. A false positive is much more costly than a false negative. It is thus desirable to have a classifier with a low false positive rate. For most classifiers the false positive rate may be reduced at the cost of a higher false negative rate by changing a decision threshold.
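The evaluation measures of this section follow directly from the four counts. The sketch below simply transcribes the definitions; the function name and the example counts are illustrative, and no handling of empty denominators is included.

```python
def classification_measures(tp, fp, tn, fn):
    """Compute precision, recall, F-measure, FP rate and FN rate from counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),
        "fnr": fn / (tp + fn),
    }


# Hypothetical counts for one cross-validation fold.
measures = classification_measures(tp=90, fp=10, tn=890, fn=10)
```

Note that recall and the false negative rate always sum to one, which is why lowering the false positive rate through a stricter decision threshold is paid for in recall.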

7 Results of Experiments

In an earlier work [4] we have shown that by the introduction of model-based features we were able to substantially outperform previous results by Fette et al. [17] on a public dataset. On this dataset we achieved with our best feature combination an F-measure of 99.46% compared to 97.64% by Fette et al., which corresponds to a reduction in error of more than 75%. The results on the public corpus for different sets of features are summarized in Table 1. We also pointed out that the corpus was synthetically composed using emails from different sources and that indeed results on a real-life dataset were somewhat worse.

Features            FP-Rate   FN-Rate   Precision   Recall    F-Measure
Fette et al. [17]   0.13%     3.62%     98.92%      96.38%    97.64%
All features        0.01%     1.30%     99.88%      98.70%    99.29%
Basic features      0.20%     6.39%     98.26%      93.61%    95.88%
DMCText             0.00%     4.02%     100.00%     95.98%    97.95%
Topic               0.20%     2.72%     98.33%      97.28%    97.80%
DMCText + Topic     0.03%     1.89%     99.76%      98.11%    98.93%
Feature selection   0.00%     1.07%     100.00%     98.93%    99.46%

Table 1: Classification results for different feature sets on a public corpus

General Results. For the real-life data described in Section 6.3 we achieve excellent results. Table 2 shows the results for classification for different sets of features, especially the model-based features, i.e., the semantic topic features and the dynamic Markov chain features. We have calculated the statistical significance in a paired t-test and conclude that the improvement of using all features vs. the basic features is significant with a t-value of 8.70 with nine degrees of freedom, which makes the improvement significant with a probability of more than 0.999.

Features            FP-Rate   FN-Rate   Precision   Recall    F-Measure
Basic features      0.28%     5.24%     98.67%      94.76%    96.67%
DMCText             0.26%     0.36%     98.84%      99.64%    99.24%
DMCLink             0.01%     3.83%     99.94%      96.17%    98.02%
Topic (K = 10)      0.56%     5.66%     97.39%      94.34%    95.84%
Salting             0.74%     13.74%    96.46%      86.26%    91.07%
Image distortion    1.68%     56.17%    85.22%      43.83%    57.89%
Logo detection      0.10%     56.56%    98.93%      43.44%    60.37%
All features        0.01%     0.31%     99.94%      99.69%    99.82%
Feature selection   0.01%     0.19%     99.97%      99.81%    99.89%

Table 2: Overall classification results

One can observe that the model-based features by themselves are already very powerful. In particular, the DMCText features alone already achieve a performance of more than 99% F-measure. Nevertheless the inclusion of other features during feature selection leads to a further relative reduction of errors of about 85%. This is in line with our previous observations. Note that the DMCLink features, salting, image distortion and logo detection are new in this work.
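The relative error reductions quoted in the text can be reproduced from the F-measures in the tables. The short calculation below assumes that "error" means 100% minus the F-measure; this reading is an assumption, although it is consistent with both quoted figures. The function name is illustrative.

```python
def relative_error_reduction(f_before, f_after):
    """Relative reduction of the error 100 - F (in percent points) between two results."""
    error_before = 100.0 - f_before
    error_after = 100.0 - f_after
    return (error_before - error_after) / error_before


# DMCText alone vs. feature selection on the real-life corpus (Table 2).
real_life = relative_error_reduction(99.24, 99.89)
# Fette et al. vs. our best feature combination on the public corpus (Table 1).
public = relative_error_reduction(97.64, 99.46)
```

Under this assumption the first value comes out at roughly 0.855 and the second at roughly 0.77, matching "about 85%" and "more than 75%".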


Salting features As for the salting features, Table 2 shows that hidden salting is indeed an issue for phishing, as phishing emails can be identified with an F-measure of more than 90% based on these features alone. The most prominent salting trick employed in our dataset is the font color trick. It is a good separator between ham and phishing emails, because it appears in 86% of the phishing emails (3124 of 3636). Note that more than 10% of the ham emails contain a salting feature. More details about the distribution of salting tricks can be found in Table 3.

Salting Trick       Ham      Phishing
Concealment           221        0
Clipping               16        0
Font color           1597     3124
Font size              21        1
Number of emails    16364     3636

Table 3: Salting statistics for our corpus
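As a small worked example, the prevalence of each salting trick per class can be derived directly from the counts in Table 3. The sketch below assumes that each count is the number of emails in which the trick occurs at least once; since one email may use several tricks, the summed ham rate is only an upper bound on the share of ham emails with any salting feature.

```python
# Counts copied from Table 3: trick -> (ham emails, phishing emails).
HAM_TOTAL = 16364
PHISHING_TOTAL = 3636

trick_counts = {
    "Concealment": (221, 0),
    "Clipping": (16, 0),
    "Font color": (1597, 3124),
    "Font size": (21, 1),
}


def prevalence(count: int, total: int) -> float:
    """Share of emails (in percent) in which a trick occurs."""
    return 100.0 * count / total


# trick -> (percent of ham emails, percent of phishing emails)
rates = {
    trick: (prevalence(ham, HAM_TOTAL), prevalence(phish, PHISHING_TOTAL))
    for trick, (ham, phish) in trick_counts.items()
}

# Upper bound on the share of ham emails with at least one trick
# (assumes nothing about overlap between tricks).
ham_any_upper_bound = prevalence(
    sum(ham for ham, _ in trick_counts.values()), HAM_TOTAL
)

if __name__ == "__main__":
    for trick, (ham_rate, phish_rate) in rates.items():
        print(f"{trick:12s} ham {ham_rate:6.2f} %   phishing {phish_rate:6.2f} %")
    print(f"Ham emails with any trick: at most {ham_any_upper_bound:.2f} %")
```

The font color rate for phishing rounds to the 86% quoted in the text, while the same trick occurs in just under 10% of ham emails.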

Image features On the other hand, image distortion and logo detection by themselves are not very useful features. The simple reason is that these features are based on images in an email, and not all emails contain images. This also explains why precision for these features is relatively high whereas recall is relatively low. However, these two features are useful add-ons: their inclusion helps to further improve an already very good classifier.

Image distortion is not very common, both among ham and phishing emails. This makes sense as phishers try to mimic ham emails as closely as possible. Nevertheless, the distribution of the image distortion is not exactly the same in ham and phishing emails. First, in our dataset the ham emails contain far fewer images than the phishing emails. Second, though the average distortion is similar for both classes of emails, the variance is much larger for the ham emails. This is intuitively understandable as ham emails simply contain a much larger variety of images. There are similar issues with the logo detection features. Ham emails simply contain far fewer logos. It is thus natural that a classifier based on this feature alone has a good precision, but a low recall. Many phishing emails, namely those without logos, are missed. In conjunction with other features the logo detection feature is valuable.

Similar to Fumera et al. [18] we did some experiments with Optical Character Recognition (OCR) of text contained in embedded images. This extracted text can be used as an additional feature for text classification. It turned out, however, that the resulting features did not improve classification performance for the current email corpus, which was already close to optimal. For other corpora OCR may play a more important role.


Feature selection We performed feature selection as described in Section 6.4.2. In the cross-validation experiments we applied feature selection on each individual fold to make the results more comparable to related work. Note that we can no longer use 90% of the data for training in each individual fold, because we reserve 30% of the training data for the feature selection process, i.e., for the validation of individual models based on different feature sets. As shown in Table 2 we achieved an even better result using fewer features and less training data. Again, the improvement over the basic features is statistically significant with a probability of more than 0.999.

We also observed which features were selected in the process. Besides our model-based features (i.e., DMCText, DMCLink and topic features), some structural features (total number of body parts, number of discrete and composite body parts), some link features (total number of links, number of deceptive links, number of image links) and the wordlist features were among the selected ones. This seems reasonable as the structural features are complementary to the purely content-based advanced features. None of the image features, i.e., image distortion or logo detection, were selected during the feature selection process. We take this as an indication that even for emails that do contain images the message text and information about links are still more useful for the detection of phishing.

Cascade classifiers Cascade classifiers were first introduced in the context of image classification [40]. They aim to minimize the computational effort during the classification phase by reducing the number of features to be computed for each example to be classified. A first classifier is trained based on one or very few “cheap” features. If an example can be classified with a sufficient degree of confidence then a lot of processing effort is avoided.
Otherwise, a second classifier based on more features is employed, and so on. This can be viewed as a feature-on-demand approach. We designed a fixed set of cascades by arranging the features selected in the feature selection process according to their computational effort. We observed that cascade classifiers yield a classification performance similar to before; they increase training time by 10-15% but in return decrease classification time for new emails by 70-75%.

Robustness of Classifiers Recall that our corpus contains emails collected from April 2007 to November 2007. We wanted to find out whether classification models from one time period can be used during later time periods. To this end, we trained the classifier with all emails from April to June and applied it to the emails from July. As expected, the performance turns out to be a bit lower, but we still arrive at an F-measure of 98.66% with a false positive rate of 0.0% and a false negative rate of 2.64%. The good news is that we do not misclassify any legitimate email, but somewhat more phishing emails are missed, which corresponds to one’s intuition. Note that this result says something about the robustness of the classifiers. During the month of July we have to expect new types of phishing mails which were not used for training the classifier. The result indicates that keeping the same classifier for a whole month causes some decrease in performance, but this decrease is small. By annotating new “suspect” emails and re-training the classifier with these new examples the drop in performance may be avoided.

Spam emails To test the effect of spam emails we added 20000 spam emails from the same time period to our corpus to create an overall corpus of 40000 emails. The spam emails were merged with the phishing class to form an overall “unwanted” class. We ran an unchanged version of our program on this extended corpus using the features from the feature selection and observed that the performance deteriorates only very slightly to an F-measure of 98.48%. Details are in Table 4.

Corpus       FP-Rate   FN-Rate   Precision   Recall    F-Measure
W/o spam     0.01 %    0.19 %    99.97 %     99.81 %   99.89 %
With spam    0.93 %    2.04 %    99.01 %     97.96 %   98.48 %

Table 4: Classification results for a corpus with spam emails

Note that in general spam classification may require different features than phishing filtering, as spammers use different salting methods. In our classifier we only used an untrained version of SpamAssassin as a spam indicator. Nevertheless the performance remained quite high. If a dedicated spam filter were used as an additional feature, the performance could be increased considerably.
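To make the feature-on-demand idea from the paragraph on cascade classifiers more concrete, the following minimal sketch runs an email through stages ordered by cost and stops as soon as one stage is confident. It is an illustration under stated assumptions, not the implementation used in the experiments: the `Stage` class, the confidence threshold of 0.9, the feature names and the hand-written scoring functions (stand-ins for trained classifiers) are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Email = Dict[str, float]  # hypothetical representation: raw values keyed by name


@dataclass
class Stage:
    name: str
    score: Callable[[Email], float]  # estimated phishing probability in [0, 1]


def cascade_classify(email: Email, stages: List[Stage],
                     confidence: float = 0.9) -> Tuple[str, str]:
    """Return (label, name of the stage that made the decision)."""
    if not stages:
        raise ValueError("at least one stage is required")
    p = 0.5
    for stage in stages:
        p = stage.score(email)  # this stage's features are computed only now
        if p >= confidence:
            return "phishing", stage.name
        if p <= 1.0 - confidence:
            return "ham", stage.name
    # No stage was confident enough: fall back to the last stage's best guess.
    return ("phishing" if p >= 0.5 else "ham"), stages[-1].name


expensive_calls = 0  # counts how often the costly stage actually runs


def cheap_score(email: Email) -> float:
    # Hand-written stand-in for a classifier trained on cheap link counts.
    if email["deceptive_links"] >= 2:
        return 0.95
    if email["links"] == 0:
        return 0.05
    return 0.5


def expensive_score(email: Email) -> float:
    # Stand-in for a costly model-based feature; here just a stored number.
    global expensive_calls
    expensive_calls += 1
    return email["text_model_score"]


stages = [Stage("cheap", cheap_score), Stage("expensive", expensive_score)]

obvious_phish = {"links": 3, "deceptive_links": 2, "text_model_score": 0.99}
obvious_ham = {"links": 0, "deceptive_links": 0, "text_model_score": 0.01}
unclear = {"links": 1, "deceptive_links": 0, "text_model_score": 0.97}

results = [cascade_classify(e, stages) for e in (obvious_phish, obvious_ham, unclear)]

if __name__ == "__main__":
    print(results)
    print("expensive stage evaluated", expensive_calls, "time(s)")
```

In this toy run only the unclear email reaches the second stage, which is the mechanism behind the reduced classification time: most emails never trigger the costly feature computation.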

8 Summary

Phishing has become a serious threat to global security and economy. Because of the fast rate of emergence of new phishing websites and because of distributed phishing attacks it is difficult to keep blacklists up to date. Therefore content-based phishing filters are needed to fill the remaining security gap. In this paper we have described a number of features of phishing mails which are able to reduce the rate of successful phishing attempts far below 1%. In contrast to many other approaches most features are not handcrafted but are themselves statistical models, which capture different email aspects and are trained using annotated training emails.

The classifiers are robust and can identify most future phishing emails. Of course they can be further improved by including black- and whitelists. As new phishing emails appear frequently it is nevertheless necessary to update the filters at short time intervals. As discussed in Section 3.3.1 we may select emails for annotation using the “active learning” approach. This reduces the human effort for annotation while selecting the most informative emails for training. In the next phase of the AntiPhish project we will implement this approach for an email stream in a realistic environment. By selecting “borderline emails” for annotation by a small task force of human experts the filters will be continually re-trained and updated. With this scheme the quality of the filters can be kept at a high level using the infrastructure which is already available at companies that constantly update spam filters and virus scanners.
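The selection of “borderline emails” can be illustrated with uncertainty sampling, one common active-learning strategy. The sketch below is an assumption about how such a selection might look, not necessarily the procedure of Section 3.3.1: it presumes the classifier outputs a phishing probability per email and picks those closest to 0.5. The function name, email ids and scores are made up for illustration.

```python
from typing import Dict, List


def select_borderline(scores: Dict[str, float], budget: int) -> List[str]:
    """Return the ids of the `budget` emails the classifier is least sure about.

    `scores` maps an email id to an estimated phishing probability in [0, 1];
    emails with a probability closest to 0.5 are considered most informative.
    """
    ranked = sorted(scores, key=lambda email_id: abs(scores[email_id] - 0.5))
    return ranked[:budget]


# Hypothetical classifier outputs for five incoming emails.
incoming = {"a": 0.02, "b": 0.48, "c": 0.97, "d": 0.61, "e": 0.35}

to_annotate = select_borderline(incoming, budget=2)

if __name__ == "__main__":
    print("send to human experts:", to_annotate)
```

With a fixed annotation budget, the experts only see the emails on which the current filter is undecided; confidently classified emails are not forwarded for labeling.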

9 Acknowledgments

This paper is based upon work performed within the FP6-027600 project AntiPhish (http://www.antiphishresearch.org/). The authors would like to thank the European Commission for partially funding the AntiPhish project as well as all the AntiPhish project partners for their interest, support, and collaboration in this initiative.

References

[1] S. Abu-Nimeh, D. Nappa, X. Wang, and S. Nair. A comparison of machine learning techniques for phishing detection. In Proceedings of the eCrime Researchers Summit, 2007.
[2] Anti-Phishing Working Group. Phishing activity trends report for the month of December 2007, 2008. http://www.antiphishing.org/reports/apwg report oct 2007.pdf, accessed on 28.04.08.
[3] Bank Austria. Faq mobile TAN, 2008. http://www.bankaustria.at/de/19825.html, accessed on 25.01.08.
[4] A. Bergholz, J.-H. Chang, G. Paaß, F. Reichartz, and S. Strobel. Improved phishing detection using model-based features. In Proceedings of the Conference on Email and Anti-Spam (CEAS), 2008.
[5] B. Biggio, G. Fumera, I. Pillai, and F. Roli. Image spam filtering using visual information. In ICIAP ’07: Proceedings of the 14th International Conference on Image Analysis and Processing, pages 105–110, Washington, DC, USA, 2007. IEEE Computer Society.
[6] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[7] A. Bratko, G. V. Cormack, B. Filipic, T. R. Lynam, and B. Zupan. Spam filtering using statistical data compression models. Journal of Machine Learning Research, 6:2673–2698, 2006.
[8] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[9] B. Byun, C.-H. Lee, S. Webb, and C. Pu. A discriminative classifier learning approach to image modeling and spam image identification. In CEAS 2007 Fourth Conference on Email and Anti-Spam, August 2-3, 2007, Mountain View, California USA, 2007.
[10] R. Cattoni, T. Coianiz, S. Messelodi, and C. Modena. Geometric layout analysis techniques for document image understanding: a review. Technical report, IRST, Trento, Italy, 1998.
[11] M. Chandrasekaran, K. Narayanan, and S. Upadhyaya. Phishing email detection based on structural properties. In Proceedings of the NYS Cyber Security Conference, 2006.
[12] Commtouch. Commtouch q3 spam statistics: Spam problem reaches new peak, expands in every dimension, 2008. http://www.commtouch.com/Site/News Events/pr content.asp?news id=767&cat id=1, accessed on 11.05.08.
[13] G. V. Cormack and R. N. Horspool. Data compression using dynamic markov modelling. The Computer Journal, 30(6):541–550, 1987.
[14] R. Dhamija, J. D. Tygar, and M. Hearst. Why phishing works. In Proceedings of the SIGCHI conference on Human Factors in Computing Systems, pages 581–590, 2006.
[15] A. Emigh. Phishing attacks: Information flow and chokepoints. In M. Jakobsson and S. Myers, editors, Phishing and Countermeasures, pages 31–64. Wiley, 2007.
[16] I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. Technical report, School of Computer Science Carnegie Mellon University, CMU-ISRI-06-112, 2006.
[17] I. Fette, N. Sadeh, and A. Tomasic. Learning to detect phishing emails. In Proceedings of the International World Wide Web Conference (WWW), pages 649–656, 2007.
[18] G. Fumera, I. Pillai, and F. Roli. Spam filtering based on the analysis of text information embedded into images. Journal of Machine Learning Research, 7:2699–2720, 2006.
[19] Gartner. Gartner survey shows phishing attacks escalated in 2007; more than $3 billion lost to these attacks. http://allpaynews.com/node/3820, retrieved on April 27th, 2008, 2007.
[20] J. Goodman, G. V. Cormack, and D. Heckerman. Spam and the ongoing battle for the inbox. Communications of the ACM, 50:25–33, 2007.
[21] M. Gupta. Spoofing and countermeasures. In M. Jakobsson and S. Myers, editors, Phishing and Countermeasures, pages 65–104. Wiley, 2007.
[22] S. C. H. Hoi, R. Jin, and M. R. Lyu. Large-scale text categorization by batch mode active learning. In Proceedings of the 15th international conference on World Wide Web, pages 633–642, 2006.
[23] S. Impedovo, L. Ottaviano, and S. Occhinegro. Optical character recognition - a survey. ACM International Journal of Pattern Recognition and Artificial Intelligence, 5:1–24, 1991.
[24] A. K. Jain, M. N. Murty, and P. J. Flynn. Data clustering: a review. ACM Computing Surveys, 31:264–323, 1999.
[25] M. Jakobsson and A. Tsow. Making takedown difficult. In M. Jakobsson and S. Myers, editors, Phishing and Countermeasures, pages 461–467. Wiley, 2007.
[26] G. H. John. Enhancements to the Data Mining Process. PhD thesis, Stanford University, CA, USA, 1997.
[27] R. Kohavi and G. John. Wrappers for feature subset selection. AI Journal, 97(1-2):273–324, 1997.
[28] I. Konya, C. Seibert, S. Glahn, and S. Eickeler. A robust front page detection algorithm for large periodical collection, 2008. Submitted for publication.
[29] Y. Marton, N. Wu, and L. Hellerstein. On compression-based text classification. In Proceedings of the European Colloquium on IR Research (ECIR), pages 300–314, 2005.
[30] T. P. Minka. Estimating a Dirichlet distribution. Technical report, Microsoft Research, 2003.
[31] D. Miyamoto, H. Hazeyama, and Y. Kadobayashi. A proposal of the AdaBoost-based detection of phishing sites. In Proceedings of the Joint Workshop on Information Security, 2007.
[32] Mozilla. Phishing protection, 2007. http://www.mozilla.com/en-US/firefox/phishing-protection/; accessed on 23.10.07.
[33] S. Myers. Introduction to phishing. In M. Jakobsson and S. Myers, editors, Phishing and Countermeasures, pages 1–29. Wiley, 2007.
[34] P. Robichaux and D. L. Ganger. Gone phishing: Evaluating anti-phishing tools for windows. Technical report, 3Sharp LLC, September 2006.
[35] J. L. Scarrow. Microsoft’s anti-phishing technologies and tactics. In M. Jakobsson and S. Myers, editors, Phishing and Countermeasures, pages 551–562. Wiley, 2007.
[36] F. Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys, 34:1–47, 2002.
[37] M. Stepp and C. Collberg. Browser-based antiphishing tools. In M. Jakobsson and S. Myers, editors, Phishing and Countermeasures, pages 493–513. Wiley, 2007.
[38] S. Thrun et al. The MONK’s problems: A performance comparison of different learning algorithms. Technical Report CMU-CS-91-197, Carnegie Mellon University, Pittsburgh, PA, 1991.
[39] V. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, 1995.
[40] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. 2001.
[41] Z. Wang, W. Josephson, Q. Lv, M. Charikar, and K. Li. Filtering image spam with near-duplicate detection. In CEAS 2007 Fourth Conference on Email and Anti-Spam, August 2-3, 2007, Mountain View, California USA, 2007.
[42] M. Wu, R. Miller, and S. Garfinkel. Do toolbars actually prevent phishing? In M. Jakobsson and S. Myers, editors, Phishing and Countermeasures, pages 514–521. Wiley, 2007.
[43] C. Zahn. Graph-theoretical methods for detecting and describing gestalt clusters. IEEE Transactions on Computers, 20:68–86, 1971.
[44] Y. Zhang, J. Hong, and L. Cranor. Cantina: A content-based approach to detecting phishing web sites. In Proceedings of the International World Wide Web Conference (WWW), 2007.
[45] G. Zhu and D. Doermann. Automatic document logo detection. In Ninth International Conference on Document Analysis and Recognition, 2007. ICDAR 2007 Vol. 2., volume 2, pages 864–868, 2007.
