Combining visual and textual features for filtering spam emails

4 downloads 145819 Views 202KB Size Report
and the genre of email contents becomes more and more different ... various proposals have been made in the literature to ad- dress this kind of spam, too.
Combining visual and textual features for filtering spam emails Francesco Gargiulo, Carlo Sansone Dipartimento di Informatica e Sistemistica - University of Naples Federico II {francesco.grg, carlo.sansone}@unina.it Abstract The presence of spam can seriously compromise normal user activities, forcing them to navigate through mailboxes to find the - relatively few - interesting emails. Even if a quite huge variety of spam filters has been developed until now, this problem is far to be resolved since spammers continuously modify their malicious techniques in order to bypass filters. In this paper we present a system for overcoming some of the problems that still remain with spam filters when checking emails that can contain attached images. A comparison with other state-of-the-art spam filters on a database containing both image and textual spam is also reported.

1

Introduction

Different counter-measures to spam have been proposed so far, using regulatory or technical approaches. The legislative approach did not obtain the desired results. Several technical approaches have thus been implemented in different anti-spam filters currently used to detect unsolicited bulk emails [8]. In the past, researchers have first addressed this problem as a text classification or categorization problem [2]. However, as spammers’ techniques continue to evolve and the genre of email contents becomes more and more different, keywords-based anti-spam approaches alone are no longer sufficient. Then, different techniques have been used to analyze the mail text, the majority of which are learning-based approaches [4, 7]. Among the different tricks used by spammers, an emerging kind of spam practice is the so-called image spam. Here spammers send their messages in attached images that are readable by human but hidden from a text-based filter. Even if image spam is relatively new, various proposals have been made in the literature to address this kind of spam, too. Most approaches use some form of embedded text detection within images. The ra-

978-1-4244-2175-6/08/$25.00 ©2008 IEEE

tionale is that spam images should contain a text whose content can spread unsolicited commercial messages. In particular, Wu et al. [9] defined a set of visual features for detecting characteristics common in spam images, such as embedded text and banner features. These features are then combined with message text features for training a one-class SVM that should be able to detect when legitimate (ham) emails are outside the spam class. A different approach is instead followed in [5]. Here the authors propose to process attached images with a state-of-the-art OCR and then to forward OCR outputs to a text-based spam filter. In [3] the authors presented an approach to image spam detection based on an algorithm for speed sensitive feature selection. All the aforementioned approaches, however, cannot be used when text within images is voluntarily distorted and/or obfuscated. As it was noted in [1], in fact, now spammers try to make OCR and text detection techniques ineffective without compromising human readability, by placing text on non-uniform background, or by using techniques like the ones exploited in CAPTCHAs. In a recent paper [6] we defined a method for overcoming some problems that still exist on the state-ofart spam filters when addressing image spam. We fused the key ideas of some of the previously described approaches, by defining two different sets of features. A first set characterizes an image from a global point of view, in order to detect artifacts that typically indicate the presence of spam. The second one has been devised for detecting text into an image, by explicitly taking into account the fact that it is typically deformed and/or obfuscated by spammers. Note that such a method can be only applied for detecting image spam. In this paper we want to investigate how the above mentioned approach can be extended in order to filter out also spam emails that do not contain images. In particular, we propose to extract from the email’s body the same set of features extracted in [6] from texts detected within attached images. In this way, we take into account the fact that spammers typically use techniques

for obfuscating words in the emails’ body that are similar to those considered for obscuring textual messages within images. The choice of these textual features is also motivated by the fact that they can be extracted in a faster way with respect to other proposals [5]. Moreover, we defined a suitable criterion for effectively fusing results coming from the different classifiers which operate on the above cited feature sets. The rest of the paper is organized as follows. First, we present the proposed architecture for coping with spam, by giving details about the proposed features and the way we combined them. Then, the database used for assessing the performance of our system is introduced and the experimental results are presented. A comparison with other state-of-the-art spam filters is also reported. Finally, some conclusions are drawn.

2

The Proposed System

In this paper we propose a novel approach for the detection of the spam in which two different image processing techniques (see Figure 1) are combined with the use of features extracted from the emails’ body. The first image processing technique is devoted to extract some global features from each image attached to an email. Such features should also be able to detect if images were adulterated or not, by considering the complexity of the image itself as it is perceived from an human being. The second processing is carried out by means of two steps. First, there is a preprocessing phase that makes use of an OCR; then, a feature extraction process starts in order to characterize if the text extracted from the OCR has been voluntarily obfuscated and/or distorted. The differences between our use of an OCR and previous ones like the one presented in [5], is that in our case OCR was used as a tool for obtaining a fingerprint of an adulterated image, which is then characterized by means of a suitable set of features. The latter set is also extracted from the emails’ body, since the techniques used by spammers for obfuscating words in the emails’ body are similar to those adopted for obscuring textual messages within images. Each of the above described feature set is used as input of a binary classifier that, after a suitable training phase, can decide if the image under test is ham or spam (see Figure 1). Finally, a combination modules takes the final decision by using a suitably chosen decision rule (see Section 2.3).

2.1

Visual Features

The first set of features, that we called visual features, are obtained directly from the image attached to

Figure 1. The proposed approach for spam filtering the mails. In order to give a visual characterization of the image that is able to discriminate between normal and adulterated images we consider a first category of features that describe the texture of an image from a statistic point of view. All the statistical texture measures are based on the co-occurrence matrices. Spatial gray level co-occurrence estimates image properties related to second-order statistics. In particular, we considered the following features extracted from the co-occurrence matrices (see also [6] for further details): • Contrast: the ratio between the brightest and the darkest value of the image • Entropy: an index of the brightness variation among the pixels in an image • Energy: the spectral content of an image • Correlation: an index of the correlation degree among the pixels • Homogeneity: a measure of the brightness variation within the image Another category of features that can be used for characterizing images from a global point of view are based on the complexity of an image for a human reader. We chose to consider a feature proposed also in [1], i.e.: • Perimetric Complexity: it is defined as the squared length of the boundary between black and white pixels (the perimeter) in the whole image, divided by the black area.

2.2

Text-based Features

When an OCR is used for processing images whose embedded texts have been distorted or obfuscated, the

majority of the words cannot be correctly detected. Furthermore, several characters that typically are not present in common-sense words can appear. Similarly, spammers usually try to obfuscate the textual part of an email’s body by using the same trick, i.e. by substituting some characters in order to bypass keyword-based filters. So, we defined another set of features for obtaining a characterization of this kind of obfuscated text. These features are mainly based on the presence of special characters, i.e. those characters that should not be frequently present in a legitimate text. The whole set we considered is made up of the following characters: { !, ”, #, $, %, &, ’, (, ), *, +, ,, -, . . . , /, @, ˆ }. Starting from this set we defined six Text-based features calculated starting from both the output of the OCR and the text coming from the email’s body: • text length: the number of characters of the whole text • words number: the number of words in the text • ambiguity: the ratio between the number of special and normal characters • correctness: the ratio between the number of words that do not contain special characters and the number of words that contain special characters • special length: the maximum length of a continuous sequence of special characters • special distance: the maximum distance between two special characters.

2.3

Classifier Combination

The combination of an ensemble of classifiers can be of great benefit in many practical pattern recognition applications. Through the appropriate choice of a combining rule, it is possible to dampen the overall effect of the independent errors in each observation domain, thus reaching performance better than those of any single classifier. Anyway, there are some problems that must be taken into account in our domain: • It is necessary to define a method for combining a non constant number of classifiers, since it is not possible to a priori know whether there is one or more images attached to the email and/or there are textual information to be processed. • Padding-attacks should be avoided. That is, the possibility that an attacker puts a spam message

within a normal context, for example by attaching an image containing an embedded spam message to an email that contains legitimate text. In order to deny padding-attacks we do not use a majority voting approach. Instead, for each email we count the number of the classifiers (whose number can vary in practice from one to a dozen) that vote for the spam class and use a threshold τ for deciding if the email under test is spam or ham. The choice of a value of τ equal to zero means that only emails for which all the classifiers vote for the ham class are considered legitimate. By suitably varying such a threshold we can switch among different behaviors of the system, that can then adapt itself to the specific users’ requirements.

3

Experimental Results

In the following we will first present the database used for assessing the effectiveness of the proposed approach, then evaluate if the use of both visual and textbased features can improve the performance of the system with respect to the use of a single set of features. Finally, we make a comparison of our approach with a state-of-the-art anti-spam filter, i.e. SpamAssassin equipped with two different spam image plug-ins. Even if other image spam datasets are freely available [3], they are typically made up of images only, without the original emails. This cannot permit a comparison of our system with SpamAssassin. So, we collected a new dataset, whose details are given in Table 1, composed by 14723 emails, 12946 of which contains spam messages. Emails were collected from the mailboxes of some users of the studenti.unina.it mailserver in a period of about three years (2005-2007). This mailserver hosts the mailboxes of all the students of the University of Naples Federico II. Among those emails, 178 contain ham images and 3731 contain spam images. Total # of Emails Emails with Images Spam Ham Spam Ham 12946 1777 3731 178 Table 1. The dataset used in our tests. As in [6] we adopted a Decision Tree for the classification stage of our approach, since its classification speed makes it very effective for an on-line implementation. In particular, a C4.5 (J48) coming from the open source tool Weka1 was selected. Moreover, it is worth noting that in all the test reported hereinafter the results 1 http://www.cs.waikato.ac.nz/ml/weka/

are given in terms of Precision, Recall and F1-measure averaged by using a 10-fold cross validation estimate. In Table 2 the results obtained on the considered dataset by our approach are presented. As it is evident, the adoption of both visual and text-based features can improve the performance obtainable by the system with respect to the case in which only text-based features are used. By varying the value of τ , we can obtain a system that optimizes the F1-measure or a system that is able to produce very few false positives (i.e., a very good Precision value). Features Visual Textual (from OCR) Textual (from body) Combination (τ =0) Combination (τ =1)

Precision 0.941 0.854 0.924 0.948 0.998

Recall 0.992 0.988 0.972 0.981 0.700

F1 0.965 0.916 0.947 0.964 0.822

Table 2. Results obtained by with different feature sets. In Table 3 we report a comparison of the results obtained by our system with those obtainable with SpamAssassin in its standard configuration and equipped with two plug-ins devised for filtering image spam, namely Bayes-OCR2 and Fuzzy-OCR. For a fair comparison, we choose the configuration of our system that maximizes the Precision values, as it is typically done by SpamAssassin. As it is evident, our approach significantly outperforms both Bayes-OCR and Fuzzy-OCR, by reaching a significantly higher Recall value with the same number of false positives. Finally, note the time needed for processing the whole dataset by our system are practically the same needed by SpamAssassin with Fuzzy-OCR, while it is significantly faster than SpamAssassin equipped with Bayes-OCR. Our System (τ =1) SA with Bayes-OCR SA with Fuzzy-OCR Our System SA

Precision 0.998 0.999 0.997 0.924 0.999

Recall 0.700 0.372 0.371 0.972 0.812

F1 0.822 0.542 0.541 0.947 0.896

Table 3. Comparison between the proposed system and SpamAssassin (SA). Last two rows refer to textual mails only.

2 This plug-in is available for download at the URL: http://prag.diee.unica.it/n3ws1t0/?q=node/107

4

Conclusions

In this paper we presented a system for addressing the spam email problem, which takes into account some of the recent evolutions of the spammers tricks. Tests on a dataset of emails containing also attached images confirmed the effectiveness of the approach and its applicability with respect to other widely used opensource tools, such as SpamAssassin. In the future we plan to better characterize the contribution of the various component of our architecture, in order to find the best trade-off between obtainable performance and computational complexity.

References [1] B. Biggio, G. Fumera, I. Pillai, F. Roli, Image Spam Filtering Using Visual Information, Proc. of the 14th Intern. Conf. on Image Analysis and Processing, Modena, Italy, pp. 105–110, 2007. [2] W. Cohen, Learning rules that classify e-mail, AAAI Spring Symposium on Machine Learning in Information Access, pp. 18–25, 1996. [3] M. Dredze, R. Gevaryahu, A. Elias-Bachrach, Learning Fast Classifiers for Image Spam, Proc. of the 4th Conf. on Email and Anti-Spam, 2007. [4] H. Drucker, D. Wu, and V.N. Vapnik, Support Vector Machines for Spam Categorization, IEEE Trans. on Neural Networks 10(5), pp. 1048–1054, 1999. [5] G. Fumera, I. Pillai, F. Roli, Spam filtering based on the analysis of text information embedded into images, Journal of Machine Learning Research, 7, pp. 2699–2720, 2006. [6] F. Gargiulo, C. Sansone, Visual and OCR-based Features for detecting Image Spam, in A. JuanCscar and G. Sanchez-Albaladejo (Eds.), Pattern Recognition in Information Systems, INSTICC Press, pp. 154-163, 2008. [7] V. Metsis, I. Androutsopoulos, G. Paliouras, Spam Filtering with Naive Bayes - Which Naive Bayes?, Proc. of the 3rd Conf. on Email and Anti-Spam, 2006. [8] G. Schryen, Anti-Spam Measures, Analysis and Design, Springer, 2007. [9] C.-T. Wu, K.-T. Cheng, Q. Zhu, Y.-L. Wu, Using Visual Features for Anti-Spam Filtering, Proc. of the IEEE International Conference on Image Processing 2005, September 11-14, Genova, Italy, 2005.