DISTRIBUTED LAYER-3 E-MAIL CLASSIFICATION FOR SPAM CONTROL

Muhammad N. Marsono, M. Watheq El-Kharashi, and Fayez Gebali
Department of Electrical and Computer Engineering, University of Victoria, Victoria BC, Canada
e-mail: {mmarsono, watheq, fayez}@ece.uvic.ca

Sudhakar Ganti
Department of Computer Science, University of Victoria, Victoria BC, Canada
e-mail: [email protected]

Abstract

This paper proposes a distributed layer-3 e-mail classification for spam control. E-mail packets are classified in transit and tagged with an intra-packet spam score that indicates whether the packet belongs to a legitimate or a spam e-mail. During e-mail packet reassembly, the tags for an e-mail are aggregated to give an inter-packet spam score. The naïve Bayes inference technique is used to evaluate the performance of the proposed approach against full e-mail classification. Our simulation results show that the proposed approach exhibits spam precision (and confidence) comparable to that of full e-mail classification. Spam recall increases from 63% to 85% depending on the maximum transmission unit size, approaching the 87% of full e-mail classification. For a 67% spam-to-legitimate ratio, we obtain a reduction of the end servers' workload by 42% to 57% of the total e-mail traffic (across all maximum transmission unit sizes tested). Thus, the proposed approach can complement existing anti-spam systems by pre-processing e-mail packets on upstream nodes. Layer-3 e-mail processing has lower processing complexity than layer-7 processing and is viable for high-throughput hardware-based implementations.

Keywords: Layer-3 e-mail processing; naïve Bayes technique; prioritized e-mail processing; spam control.

1 Introduction

Unsolicited commercial e-mails, or spam, constitute approximately two-thirds of the e-mail traffic over the Internet [?]. Content-based e-mail classification (one of the most effective approaches in anti-spam systems) extracts e-mail features for classification. Several supervised-learning techniques have been proposed, with the naïve Bayes technique the most widely used for spam classification and filtering (e.g., [?]). Current anti-spam systems require fully reassembled e-mails for layer-7 processing and are therefore expensive because of their high buffering requirements. Furthermore, current software-based implementations, even on custom general-purpose processor-based (GPP-based) systems (e.g., [?]), are limited in e-mail processing throughput because they cannot scale with the increase in network load. Specialized e-mail processing architectures (e.g., naïve Bayes inference engines [?]) could be used to cope with the increase in network processing throughput requirements.

The association of spam with worms poses higher security threats. Certain worms (e.g., SoBig, MyDoom, and Mimail) have been used to create e-mail relays (on infected systems) that increase the distribution base and elude certain spam detection mechanisms [?]. It is reported that up to 90% of spam e-mails are generated by these worm-infected systems [?]. The increase in spam traffic and the possible surge during worm outbreaks motivate the work presented in this paper.

This paper investigates the effectiveness of layer-3 e-mail classification for spam control. We propose a novel distributed layer-3 e-mail processing approach and investigate its performance, using the naïve Bayes technique, against full e-mail (layer-7) classification. E-mail packets are individually classified and tagged with an intra-packet score. During reassembly, the tags are aggregated to give an overall inter-packet score that indicates the legitimacy of an e-mail. This pre-classification stage identifies enough spam to reduce the e-mail volume that end e-mail servers must process by up to 57%.

The rest of this paper is organized as follows. Section 2 briefly introduces the proposed approach. Section 3 describes intra-packet and inter-packet aggregation using the naïve Bayes technique. Section 4 presents the efficiency metrics, simulation results, and an analysis of false positives. Section 5 concludes the paper.

2 Distributed Layer-3 E-Mail Processing

Current layer-7 e-mail processing, which is better suited to end-to-end processing, restricts both the physical deployment (e.g., end users, edge, or core network nodes) and the implementation options (hardware versus software). While specialized hardware is needed to achieve higher processing throughput, e-mail processing beyond layer 3 is much more complex because it requires packet reassembly, byte-stream alignment, and state tracking [?].

Figure ?? introduces the concept of distributed layer-3 e-mail processing. Our proposed approach promises (a) e-mail classification speedup and (b) reduction of spam traffic. On a per-packet basis, e-mail features are extracted, selected, and aggregated to obtain the intra-packet score, which is tagged to the e-mail packet header. Spam traffic can be reduced by buffering and aggregating packets in transit so that spam can be removed before delivery. E-mail classification speedup can be achieved either by removing reassembled spam e-mails (without reprocessing) or by prioritizing the processing of potentially legitimate e-mails based on the pre-classification score, as in [?]. The e-mail server reprocesses the remaining e-mails using current anti-spam solutions.

Figure 1. Distributed layer-3 e-mail classification. Circles represent core and edge (shaded) routers. [Diagram labels: spammer; edge router; core router; e-mail server; end system; "E-mail packets classified and tagged on routers"; "Packets can be re-classified and re-tagged"; "Packet reassembly and tag aggregation".]

Our approach is viable for layer-3 hardware implementation. Layer-3 e-mail packet buffering, state tracking, and inter-packet aggregation can be added for in-transit spam removal. The approach is also backward compatible with systems that do not support it, since such systems can simply ignore the tags. Tags and source addresses gathered, for instance, at large corporate networks can feed diagnostic tools that study network loads, link traces, and node problems related to e-mail services.

3 Layer-3 Naïve Bayes E-mail Classification

Naïve Bayes is a supervised-learning probabilistic technique that requires learning from pre-defined samples. It assumes that all classification features (words, in the context of e-mail classification) are independent of each other in terms of occurrence and sequence. For spam detection, a two-class classification into spam ($c_0$) and legitimate ($c_1$) is assumed. We also assume the following in this paper:

1. Homogeneous naïve Bayes inference engines are used.
2. Tagged e-mail packets will not be reprocessed by the inference engines.
3. No in-transit spam removal.
4. Duplicate packets will be dropped during reassembly.

These assumptions are made to facilitate evaluating the baseline estimates for the deployment of this approach at gateway levels.

3.1 Intra-Packet Aggregation

We expand our discussion from our initial work on naïve Bayes spam detection in [?]. A fully reassembled e-mail can be represented by a vector of selected features $\mathbf{x}$ given as:

$$\mathbf{x} = (x_1, x_2, \cdots, x_{|\mathbf{x}|}) \tag{1}$$

where $|\mathbf{x}|$ is the number of selected features in $\mathbf{x}$ and $x_i$ is the $i$-th selected word from the e-mail. Assume that an e-mail requires $m$ e-mail packets to be transferred. Thus, $\mathbf{x}$ is composed of $m$ feature vectors $\mathbf{x}_i$, one for each e-mail packet $p_i$:

$$\mathbf{x} = (\mathbf{x}_1, \mathbf{x}_2, \cdots, \mathbf{x}_m) \tag{2}$$

with $\mathbf{x}_i$ given by:

$$\mathbf{x}_i = (x_{i,1}, x_{i,2}, \cdots, x_{i,|\mathbf{x}_i|}) \tag{3}$$

where $x_{i,k}$ is the $k$-th of the $|\mathbf{x}_i|$ features selected from e-mail packet $p_i$. The selected features are distributed evenly among the $m$ e-mail samples instead of being selected from one complete e-mail. For each packet $p_i$, the likelihoods of $\mathbf{x}_i$ occurring in the learning data set for the spam and legitimate classes are given by:

$$P(\mathbf{x}_i|c_0) = \prod_{k=1}^{|\mathbf{x}_i|} P(x_{i,k}|c_0) \tag{4}$$

$$P(\mathbf{x}_i|c_1) = \prod_{k=1}^{|\mathbf{x}_i|} P(x_{i,k}|c_1) \tag{5}$$

Dividing (??) by (??) and taking the binary logarithm gives the likelihood spam-to-legitimate logarithmic ratio $w_i$:

$$w_i = \sum_{k=1}^{|\mathbf{x}_i|} \left[ \lg P(x_{i,k}|c_0) - \lg P(x_{i,k}|c_1) \right] \tag{6}$$

The likelihood logarithmic ratio $w_i$ in Equation (??) is used as the intra-packet spam score and is tagged to the e-mail packet header. $w_i$ can be obtained using only subtractions and additions of binary logarithm values. The $w_i$ for packet $p_i$ can be computed independently of the other e-mail packets.

3.2 Inter-Packet Aggregation

The a posteriori probability of an e-mail being spam or being legitimate according to Bayes' theorem is given by:

$$P(c_0|\mathbf{x}) = \frac{P(c_0)\,P(\mathbf{x}|c_0)}{P(\mathbf{x})} \tag{7}$$

where $P(c_0)$ represents the a priori probability of spam occurring, obtained from the learning data sets. With $\mathbf{x}$ represented by the $m$ feature vectors $\mathbf{x}_i$, the a posteriori probability $P(c_0|\mathbf{x})$, and similarly $P(c_1|\mathbf{x})$, are given by:

$$P(c_0|\mathbf{x}) = \frac{P(c_0) \prod_{i=1}^{m} P(\mathbf{x}_i|c_0)}{\sum_{j=0}^{1} P(c_j) \prod_{i=1}^{m} P(\mathbf{x}_i|c_j)} = \frac{P(c_0) \prod_{i=1}^{m} \prod_{k=1}^{|\mathbf{x}_i|} P(x_{i,k}|c_0)}{\sum_{j=0}^{1} P(c_j) \prod_{i=1}^{m} \prod_{k=1}^{|\mathbf{x}_i|} P(x_{i,k}|c_j)} \tag{8}$$

$$P(c_1|\mathbf{x}) = \frac{P(c_1) \prod_{i=1}^{m} \prod_{k=1}^{|\mathbf{x}_i|} P(x_{i,k}|c_1)}{\sum_{j=0}^{1} P(c_j) \prod_{i=1}^{m} \prod_{k=1}^{|\mathbf{x}_i|} P(x_{i,k}|c_j)} \tag{9}$$

Dividing (??) by (??) and taking the binary logarithm gives the a posteriori spam-to-legitimate logarithmic ratio, $y$:

\begin{align}
y &= \lg\left(\frac{P(c_0|\mathbf{x})}{P(c_1|\mathbf{x})}\right) \tag{10}\\
  &= \lg P(c_0) - \lg P(c_1) + \sum_{i=1}^{m} \sum_{k=1}^{|\mathbf{x}_i|} \left[ \lg P(x_{i,k}|c_0) - \lg P(x_{i,k}|c_1) \right] \notag\\
  &= \lg P(c_0) - \lg P(c_1) + \sum_{i=1}^{m} w_i \tag{11}
\end{align}

From Equation (??), the a posteriori logarithmic ratio $y$ can be calculated easily by summing the likelihood spam-to-legitimate logarithmic ratios $w_i$ of all $m$ packets and the a priori spam-to-legitimate logarithmic ratio $\lg P(c_0) - \lg P(c_1)$ obtained during the learning phase. Substituting $1 - P(c_0|\mathbf{x})$ for $P(c_1|\mathbf{x})$ (valid for two-class classification) and solving Equation (??) gives the a posteriori probability for class $c_0$ as:

$$P(c_0|\mathbf{x}) = \frac{1}{1 + 2^{-y}} \tag{12}$$

$P(c_0|\mathbf{x})$ is thus estimated from the summation of $m$ tags (Equation (??)) followed by a sigmoid evaluation (Equation (??)). However, some inference error is expected because (a) the distribution of the (selected) features within any e-mail is unknown and (b) each packet is a smaller sample than the full e-mail and therefore offers fewer features.
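The two-stage computation of Equations (6), (11), and (12) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the word likelihoods, the prior, and the function names `packet_tag` and `aggregate` are hypothetical and are not taken from the experiments in this paper. The sketch tags each packet independently, aggregates the tags, and cross-checks the result against Bayes' theorem applied to the reassembled feature set.

```python
import math

# Hypothetical per-word likelihoods P(word | class), as if produced by a learning phase.
P_SPAM = {"unsubscribe": 0.20, "offer": 0.10, "free": 0.15, "meeting": 0.01}
P_LEGIT = {"unsubscribe": 0.02, "offer": 0.02, "free": 0.03, "meeting": 0.10}
PRIOR_SPAM = 0.35  # hypothetical a priori P(c0); P(c1) = 1 - P(c0)


def packet_tag(features):
    """Intra-packet score w_i (Eq. 6): sum of lg P(x|c0) - lg P(x|c1)."""
    return sum(math.log2(P_SPAM[f]) - math.log2(P_LEGIT[f]) for f in features)


def aggregate(tags, prior_spam=PRIOR_SPAM):
    """Inter-packet aggregation: y from Eq. 11, then the sigmoid of Eq. 12."""
    y = math.log2(prior_spam) - math.log2(1.0 - prior_spam) + sum(tags)
    if y >= 0:
        return 1.0 / (1.0 + 2.0 ** (-y))
    t = 2.0 ** y  # equivalent form that avoids overflow for very negative y
    return t / (1.0 + t)


# One e-mail carried in two packets, each with its own selected features.
packets = [["unsubscribe", "offer"], ["free", "meeting"]]
tags = [packet_tag(p) for p in packets]  # computed independently per packet
posterior = aggregate(tags)

# Cross-check: Bayes' theorem on the same features taken as one e-mail (Eq. 8).
words = [w for p in packets for w in p]
num = PRIOR_SPAM * math.prod(P_SPAM[w] for w in words)
den = num + (1.0 - PRIOR_SPAM) * math.prod(P_LEGIT[w] for w in words)
direct = num / den

print(tags, posterior, direct)
```

When the packets and the full e-mail use the same features, the two results agree up to rounding. The inference error discussed above arises because per-packet feature selection generally picks a different feature set than selection over the complete e-mail.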

4 Experimental Work

Annoyance Filter [?], a C++ implementation of a multivariate Bernoulli naïve Bayes e-mail classifier (one of the best classifier implementations investigated in [?]), is used to compare the proposed layer-3 approach with full e-mail classification. For the proposed approach, maximum transmission unit (MTU) sizes of 576, 1500, 3000, 4470, and 9000 bytes are used. The classification performance metrics (discussed in the next subsection) indicate the efficiency of the proposed approach relative to full e-mail classification.

Ten-fold cross-validation is used to reduce random variations caused by the finite data set. The SpamAssassin data set [?], which consists of 4361 legitimate and 2357 spam predefined e-mails, is randomly divided into ten equal-sized sets of roughly 436 legitimate and 236 spam e-mails each. One set is used for classification while the remaining nine sets are used for learning. For each MTU size, the number of features selected is set to 15, the same number of features as for full e-mail classification.
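The fold construction described above can be sketched as follows. This is a minimal illustration, not the procedure actually used with Annoyance Filter: the function name `ten_fold_splits`, the fixed seed, and the placeholder message identifiers are assumptions, and the per-class round-robin split is simply one way to obtain folds of roughly 436 legitimate and 236 spam e-mails.

```python
import random


def ten_fold_splits(legit, spam, k=10, seed=0):
    """Yield (learning_set, classification_set) pairs for k-fold cross-validation.

    Each class is shuffled and dealt round-robin into k folds, so every fold
    keeps roughly the same legitimate-to-spam proportion as the whole corpus.
    """
    rng = random.Random(seed)
    legit, spam = list(legit), list(spam)
    rng.shuffle(legit)
    rng.shuffle(spam)
    folds = [legit[i::k] + spam[i::k] for i in range(k)]
    for i in range(k):
        learning = [msg for j, fold in enumerate(folds) if j != i for msg in fold]
        yield learning, folds[i]


# Placeholder identifiers standing in for the 4361 legitimate and 2357 spam e-mails.
legit_ids = [("legit", n) for n in range(4361)]
spam_ids = [("spam", n) for n in range(2357)]

splits = list(ten_fold_splits(legit_ids, spam_ids))
for learning, classification in splits:
    n_legit = sum(1 for label, _ in classification if label == "legit")
    n_spam = len(classification) - n_legit
    print(len(learning), n_legit, n_spam)
```

Each e-mail appears in exactly one classification set, and the metrics reported for an MTU size would then be obtained over the ten runs.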

4.1 Efficiency Metrics

The classification data set consists of $n_l + n_s$ e-mails, where $n_l$ and $n_s$ are the numbers of legitimate and spam e-mails, respectively. Let $n_{s \to s}$, $n_{l \to l}$, $n_{s \to l}$, and $n_{l \to s}$ be the numbers of spam e-mails correctly classified as spam, legitimate e-mails correctly classified as legitimate, spam e-mails wrongly classified as legitimate, and legitimate e-mails wrongly classified as spam, respectively. The efficiency metrics used are described as follows:

1. Inference Error ($e_0$ and $e_1$)
Assume that $\mathbf{x}_l$ and $\mathbf{x}_s$ represent legitimate and spam e-mails in the classification set. The inference errors are evaluated separately for spam ($e_0$) and legitimate ($e_1$) e-mails and are given as:

$$e_0 = P(c_0|\mathbf{x}_s) - P'(c_0|\mathbf{x}_s) \tag{13}$$

$$e_1 = P(c_0|\mathbf{x}_l) - P'(c_0|\mathbf{x}_l) \tag{14}$$

where $P'(c_0|\mathbf{x}_s)$ and $P'(c_0|\mathbf{x}_l)$ are the a posteriori probabilities obtained from full e-mail classification for spam and legitimate e-mails, respectively.

TABLE I
Inference errors between the layer-3 and layer-7 approaches for MTU sizes of 576 bytes to 9000 bytes.

MTU (bytes)    e0 (×10^-2)    e1 (×10^-2)
576            -23.66         -0.03
1500           -11.36         -0.02
3000            -3.52         -0.02
4470            -2.31          0.05
9000            -0.73          0.00

2. Spam Precision (SP)
Spam precision is the fraction of e-mails classified as spam that are actually spam. SP is defined as:

$$SP = \frac{n_{s \to s}}{n_{s \to s} + n_{l \to s}} \tag{15}$$

3. Spam Recall (SR)
Spam recall is the fraction of spam e-mails that are correctly classified as spam. SR is defined as:

$$SR = \frac{n_{s \to s}}{n_s} \tag{16}$$

4. False Positive (FP)
False positive is the fraction of legitimate e-mails that are wrongly classified as spam. FP is given by:

$$FP = \frac{n_{l \to s}}{n_l} \tag{17}$$

Of all the metrics, SP and FP are the most crucial indicators of confidence in the classification approach. FP must be kept as low as possible for the classifier to be viable. Meanwhile, SR indicates the percentage of spam e-mails that can be correctly classified, and hence the reduction in e-mails to be processed by the e-mail server.
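As an illustration of Equations (15) to (17), the three ratio metrics can be computed from the four classification counts. The function name `spam_metrics` and the counts in the usage example are hypothetical and serve only to show the arithmetic; they are not results from this paper.

```python
def spam_metrics(n_ss, n_sl, n_ll, n_ls):
    """Return (SP, SR, FP) from classification counts.

    n_ss: spam classified as spam          n_sl: spam classified as legitimate
    n_ll: legitimate classified as legit   n_ls: legitimate classified as spam
    """
    n_s = n_ss + n_sl          # total spam e-mails
    n_l = n_ll + n_ls          # total legitimate e-mails
    flagged = n_ss + n_ls      # everything the classifier labelled as spam
    sp = n_ss / flagged if flagged else float("nan")   # Eq. (15)
    sr = n_ss / n_s                                    # Eq. (16)
    fp = n_ls / n_l                                    # Eq. (17)
    return sp, sr, fp


# Hypothetical counts for one classification set of 236 spam and 436 legitimate e-mails.
sp, sr, fp = spam_metrics(n_ss=200, n_sl=36, n_ll=435, n_ls=1)
print(f"SP={sp:.4f} SR={sr:.4f} FP={fp:.4f}")
```

The example shows why SP and FP move together: a single misclassified legitimate e-mail lowers SP slightly and raises FP above zero, while SR depends only on how much spam is caught.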

4.2 Simulation Results

Table ?? shows the inference errors $e_0$ and $e_1$ across different MTU sizes. Non-zero $e_0$ and $e_1$ affect the classification metrics (SP, SR, and FP), which may therefore differ from those of full e-mail classification. For MTU sizes of 576, 1500, and 3000 bytes, $e_0$ and $e_1$ are similar for 5, 10, or 15 selected features from each packet, so only one figure is given for each MTU size.

Table ?? shows the classification results with the classification threshold set to 0.95. All MTU sizes except 4470 bytes give 100% SP and 0% FP, which is equivalent to full e-mail classification. SR increases from 63% to 85%, approaching the 87% recorded for full e-mail classification. SR suffers at small MTU sizes because of the limited e-mail sample size and the quality of the selected features. This shows the general trend of layer-3 e-mail classification being comparable to full e-mail classification, especially for sufficiently large MTU sizes.
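The workload reduction of 42% to 57% quoted in the abstract is consistent with multiplying the spam share of the traffic by the spam recall at each MTU size. The paper does not spell out this formula, so the sketch below is an assumption: it takes a spam share of 0.67, treats every e-mail as equal work for the server, and assumes that every e-mail pre-classified as spam is removed from the server's load. The SR values are the ones reported for the layer-3 approach at a threshold of 0.95.

```python
SPAM_SHARE = 0.67  # assumed fraction of all e-mail traffic that is spam

# Spam recall (%) per MTU size in bytes, as reported for the layer-3 approach.
SPAM_RECALL = {576: 63.04, 1500: 73.35, 3000: 80.44, 4470: 82.39, 9000: 85.06}

# Fraction of total e-mail traffic the end server no longer has to process.
reduction = {mtu: SPAM_SHARE * sr / 100.0 for mtu, sr in SPAM_RECALL.items()}

for mtu, r in reduction.items():
    print(f"MTU {mtu:>4} bytes: workload reduced by {100 * r:.1f}% of total traffic")
```

Under these assumptions the smallest and largest MTU sizes give roughly 42% and 57%, matching the range stated in the paper.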

TABLE II
Comparison between the layer-3 and layer-7 approaches. Classification threshold is set to 0.95.

MTU (bytes)    SP (%)    SR (%)    FP (%)
E-mail         100.00    86.54     0.00
576            100.00    63.04     0.00
1500           100.00    73.35     0.00
3000           100.00    80.44     0.00
4470            99.95    82.39     0.02
9000           100.00    85.06     0.00

4.3 Analysis of False Positives

The layer-3 e-mail classification produces only one false positive among all 4361 legitimate e-mails for a classification threshold setting of 0.95. This single false positive is a forwarded article about a product, in both plain text and HTML. This legitimate e-mail is partitioned into four e-mail packets. The occurrence of words frequently used in spam (e.g., unsubscribe) and the large number of HTML words resulted in the misclassification.

Further analysis of the false positives that appear when the threshold is lowered to 0.9 and 0.75 indicates the following contributing factors: (a) HTML processing; (b) e-mail header processing; and (c) calculation precision. Treating HTML tags as word features triggers false positives for certain e-mails, especially where HTML tags occur frequently in learning. The second factor is a large ratio of e-mail header to e-mail body (i.e., in a very short e-mail). These e-mails are segmented into two packets that may give two different extreme values. The third factor is the 16-bit precision used in this experiment. For instance, a 16-bit precision $P(c_0|\mathbf{x})$ resulted in a probability of 1, whereas 32-bit precision resulted in a probability of $1.437 \times 10^{-11}$.

5 Conclusion

We proposed a novel layer-3 e-mail classification for spam control. Our ten-fold cross-validation experiment using the naïve Bayes technique shows SP and FP comparable to full e-mail classification, given a very high threshold setting. This implies that the confidence of layer-3 classification is comparable to that of full e-mail (layer-7) classification. While the proposed approach shows reduced SR because of limited MTU sizes, it is still able to classify more than 60% of the spam directly from inter-packet tag aggregation (i.e., at packet reassembly). With spam making up two-thirds of all e-mail traffic, the e-mail server's workload is reduced by 42% to 57%. The reduction in the number of spam e-mails that need to be processed leads to faster e-mail classification at end systems.

The layer-3 e-mail classification approach promises faster e-mail classification and reduced e-mail traffic. More detailed studies are needed to investigate the viability and practicality of this approach, either for in-line classification or at the gateway level. With specialized hardware support, layer-3 e-mail processing, although it cannot replace layer-7 anti-spam systems, can reduce the workload on end systems and help accommodate future increases in e-mail traffic.

Acknowledgments

The first author is funded by Malaysian Government scholarship JPA-UTM JPA(L)A-3238549. He is attached to the Faculty of Electrical Engineering, Universiti Teknologi Malaysia.

References

[1] J. Goodman, D. Heckerman, and R. Rounthwaite, “Stopping spam,” Scientific American, pp. 42–49, April 2005.
[2] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, “A Bayesian approach to filtering junk e-mail,” Learning for Text Categorization: Papers from the 1998 Workshop, Madison, WI, USA, AAAI Technical Report WS-98-05, 1998.
[3] (January 2006) SurfControl Riskfilter. [Online]. Available: http://www.surfcontrol.com/products/email/riskfilter/
[4] M. N. Marsono, M. W. El-Kharashi, and F. Gebali, “Binary LNS-based naïve Bayes hardware classifier for spam control,” Accepted, IEEE International Symposium on Circuits and Systems (ISCAS), Island of Kos, Greece, 21–24 May 2006.
[5] N. Weaver, V. Paxson, S. Staniford, and R. Cunningham, “A taxonomy of computer worms,” Proceedings of the 2003 ACM Workshop on Rapid Malcode, Washington, DC, USA, pp. 11–18, October 2003.
[6] A. Jesdanun (April 15, 2005), “As spam filters improve, attention shifts to containment,” [Online]. Available: http://www.technologyreview.com/TR/ wtr 14284,323,p1.html
[7] M. Necker, D. Contis, and D. Schimmel, “TCP-stream reassembly and state tracking in hardware,” Proceedings of the 10th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, Napa, CA, USA, pp. 286–287, April 2002.
[8] R. D. Twining, M. M. Williamson, M. Mowbray, and M. Rahmouni, “Email prioritization: Reducing delays on legitimate mail caused by junk mail,” HP Digital Media Systems Laboratory, Bristol, Technical Report HPL-2004-5(R.1), May 24, 2004.
[9] J. Walker (2005), “Annoyance filter: Adaptive Bayesian junk mail filter,” [Online]. Available: http://www.fourmilab.ch/annoyance-filter/
[10] S. Holden (2004), “Spam filtering II,” [Online]. Available: http://sam.holden.id.au/writings/spam2/
[11] (Nov 2005) SpamAssassin public corpus. [Online]. Available: http://spamassassin.apache.org/publiccorpus/