Spam Email Filtering with Bayesian Belief Network: using Relevant Words

Xin Jin, Anbang Xu, Rongfang Bie*, Xian Shen and Min Yin

Abstract—In this paper, we report our work on a Bayesian Belief Network approach to spam email filtering (classifying email as spam or non-spam/legitimate). Our evaluation suggests that a Bayesian Belief Network based classifier outperforms the popular Naïve Bayes approach and two other well-known learners: decision tree and k-NN. The four algorithms are tested on two data sets with three feature selection methods (Information Gain, Gain Ratio and Chi Squared) for finding relevant words. 10-fold cross-validation results show that the Bayesian Belief Network performs best on both data sets. We suggest that this is because the "dependent learner" characteristic of Bayesian Belief Network classification is better suited to spam filtering. The performance of the Bayesian Belief Network classifier could be further improved by careful feature subset selection.

Index Terms—Data mining, classification, spam filtering, feature selection.

I. INTRODUCTION

The increasing popularity and low cost of electronic mail have encouraged direct marketers to flood the mailboxes of thousands of users with unsolicited messages. These messages are usually referred to as spam or, more formally, Unsolicited Bulk E-mail (UBE), and may advertise anything from vacations to get-rich schemes. The task of spam filtering is to remove unsolicited emails automatically from the email stream. These unsolicited emails have already caused many problems, such as filling mailboxes, engulfing important personal mail, wasting network bandwidth, and consuming users' time and energy to sort through them, not to mention all the other problems associated with spam (crashed mail servers, pornography adverts sent to children, and so on). Statistics collected by CAUBE.AU show that the volume of spam is increasing at an alarming rate, and some people report that they are even abandoning their email accounts because of it [3].

This work was supported by the National Science Foundation of China under Grants No. 60273015 and No. 10001006. Xin Jin is with the College of Information Science and Technology, Beijing Normal University, Beijing 100875, China (Email: [email protected]). Anbang Xu is with the Image Processing & Pattern Recognition Laboratory, Beijing Normal University, Beijing, China ([email protected]). * Corresponding Author. Rongfang Bie is with the College of Information Science and Technology, Beijing Normal University, Beijing 100875, China (Email: [email protected]). Xian Shen is with the College of Information Science and Technology, Beijing Normal University, Beijing, China ([email protected]). Min Yin is with the College of Information Science and Technology, Beijing Normal University, Beijing 100875, China (Email: [email protected]).

1-4244-0134-8/06/$20.00, 1-4244-0133-X/06/$20.00 © 2006 IEEE

It is therefore a challenge to develop spam filters that can effectively eliminate the large volumes of unwanted emails automatically before they enter a user's mailbox [1].

Spam filtering can be recast as a document classification task where the classes to be predicted are spam and legitimate. Several supervised machine-learning algorithms have been successfully applied to the mail filtering task: Naïve Bayes classifier [2,13,15,26], RIPPER rule induction algorithm [4], Memory Based Learning [5,13], Support Vector Machines [7,8], Decision Tree [10], and combinations of different learners [12]. Among these methods, the Naïve Bayes classifier has been found particularly attractive for email filtering because it performs surprisingly well [13]; many commercial anti-spam software products use Naïve Bayes classifiers as their filtering engines.

The Naïve Bayes email classifier assumes class conditional independence: given the class label of an email, the frequencies of the words in the email are conditionally independent of one another. In practice, however, dependencies can exist between words in emails. In this paper we propose the use of a Bayesian Belief Network, which can capture the conditional dependencies among words, for email filtering. Experiments on two benchmark datasets show that the Bayesian Belief Network does improve the filtering/classification performance.

This paper proceeds as follows. Section 2 presents the email representation and preprocessing method. Section 3 describes the Bayesian Belief Network method for spam email filtering. Section 4 presents three popular classifiers for comparison. Section 5 describes three feature selection methods. Section 6 reports the performance evaluation results of those classifiers using accuracy and F measure. Section 7 concludes the findings.

II. EMAIL REPRESENTATION AND PREPROCESSING

We represent each email as a bag of words/features. A feature vector V is composed of the various words from a dictionary formed by analyzing the emails. There is one feature vector per email. The ith component wi of the feature vector is the number of times the ith dictionary word appears in that email.

Two issues that need to be considered in preprocessing emails are word stemming and stop-word removal. The main advantages of applying word stemming and stop-word removal are the reduction of the feature space dimension and a possible improvement in the classifiers' prediction accuracy by alleviating the data sparseness problem. Word stemming refers to converting words to their morphological base forms; for example, both "kicking" and "kicked" are reduced to the root word "kick". Stop-word removal is a procedure to remove words that are found in a list of frequently used words such as "and, for, a, the, they, we". In addition, since most emails are in HTML format, HTML tags need to be removed before email filtering.

III. BAYESIAN BELIEF NETWORK

Bayesian Belief Networks (BBN or BN) have become a popular tool for modeling many kinds of statistical problems [14,23,24,27]. A BBN is a compact representation of a multivariate statistical distribution function. A BBN encodes the probability density function governing a set of random variables by specifying a set of conditional independence statements together with a set of conditional probability functions.

To use Bayesian Belief Networks for spam email filtering, we define the finite set X={X1, X2,…, Xn} of variables to be the words that appear in emails, where each variable/word Xi takes on integer values. For simplicity, instead of using the words themselves, we use capital letters, such as X, Y, Z, for variable names and lowercase letters x, y, z to denote specific values taken by those variables (i.e. the frequencies of the words in the emails). Sets of variables are denoted by boldface capital letters X, Y, Z, and assignments of values to the variables in these sets are denoted by boldface lowercase letters x, y, z.
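To make the variables Xi concrete, the following minimal sketch builds the word-count vectors described in Section II. It is an illustration only: the stop-word list, the crude suffix-stripping `toy_stem` function and the regex-based tag removal are assumptions standing in for whatever tools were actually used, which the text does not name.

```python
import re
from collections import Counter

# Hypothetical stop-word list (taken from the example in the text, not the real list).
STOP_WORDS = {"and", "for", "a", "the", "they", "we"}

def strip_html(text):
    """Remove HTML tags with a simple regex (adequate for a sketch only)."""
    return re.sub(r"<[^>]+>", " ", text)

def toy_stem(word):
    """Crude suffix stripping that stands in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokenize(email):
    """Lowercase, drop tags and stop words, then stem the remaining words."""
    words = re.findall(r"[a-z]+", strip_html(email).lower())
    return [toy_stem(w) for w in words if w not in STOP_WORDS]

def build_dictionary(emails):
    """Dictionary = sorted set of all stemmed words seen in the emails."""
    return sorted({w for e in emails for w in tokenize(e)})

def feature_vector(email, dictionary):
    """i-th component = number of times the i-th dictionary word occurs."""
    counts = Counter(tokenize(email))
    return [counts[w] for w in dictionary]

emails = ["<p>They kicked the ball</p>", "We are kicking and kicking"]
dictionary = build_dictionary(emails)
vectors = [feature_vector(e, dictionary) for e in emails]
print(dictionary)  # ['are', 'ball', 'kick']
print(vectors)     # [[0, 1, 1], [1, 0, 2]]
```

Each row of `vectors` is one assignment x of values to the variables X1,…,Xn, here with n = 3.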

Fig. 1. An example Bayesian Belief Network over the nodes {X1,…,X5}. Only the qualitative part of the BBN is shown.

A Bayesian Belief Network is an annotated directed acyclic graph that encodes a joint probability distribution of a set of variables X. An example of a Bayesian Belief Network over the variables X = (X1,…,X5) is shown in Fig. 1; only the qualitative part is given. The nodes with outgoing edges pointing into a specific node are the parents of that node. Xj is a descendant of Xi if and only if there exists a directed path from Xi to Xj in the graph. In Fig. 1, X1 and X2 are the parents of X3, written pa(X3) = {X1, X2} for short. Furthermore, pa(X4) = {X3}, and since there is no directed path from X4 to any of the other nodes, the set of descendants of X4 is empty and, accordingly, its non-descendants are {X1, X2, X3, X5}.

The edges of the graph represent the assertion that a variable is conditionally independent of its non-descendants in the graph given its parents in the same graph. The graph in Fig. 1, for instance, asserts that for all distributions compatible with it, X4 is conditionally independent of {X1, X2, X5} when conditioned on {X3}. All conditional independence statements can be read off a Bayesian Belief Network structure by using the rules of d-separation.

Formally, a Bayesian Belief Network for X is a pair B = ⟨G, Θ⟩. The first component, G, is a directed acyclic graph whose vertices correspond to the variables X1,…,Xn and whose edges represent direct dependencies between the variables. The graph G encodes the following set of independence statements: each variable Xi is independent of its non-descendants given its parents in G. The second component of the pair, Θ, represents the set of parameters that quantifies the network. It contains a parameter PB(xi | pa(xi)) for each possible value xi of Xi and pa(xi) of pa(Xi), where pa(Xi) denotes the set of parents of Xi in G. A Bayesian Belief Network B defines a unique joint probability distribution over X given by:

$$P_B(X_1,\ldots,X_n) = \prod_{i=1}^{n} P_B(X_i \mid \mathrm{pa}(X_i)) \qquad (1)$$
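As an illustration of Eq. (1), the sketch below evaluates the factorized joint distribution for a five-node network shaped like the one in Fig. 1. Several things here are assumptions made for the example: the variables are treated as binary (the paper uses word counts), the edge X3 → X5 is a guess because the text does not state the parents of X5, and every probability in `p_one` is made up.

```python
from itertools import product

# Parent sets: X1, X2 -> X3 and X3 -> X4 follow the text; X3 -> X5 is an
# assumption, since the text does not say which nodes are the parents of X5.
parents = {"X1": [], "X2": [], "X3": ["X1", "X2"], "X4": ["X3"], "X5": ["X3"]}

# Hypothetical parameters: P(X = 1 | parent values). All numbers are invented.
p_one = {
    "X1": {(): 0.3},
    "X2": {(): 0.6},
    "X3": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
    "X4": {(0,): 0.2, (1,): 0.7},
    "X5": {(0,): 0.5, (1,): 0.8},
}

def joint(assignment):
    """Eq. (1): product over nodes of P(X_i = x_i | pa(X_i))."""
    prob = 1.0
    for var, pa in parents.items():
        key = tuple(assignment[p] for p in pa)
        p1 = p_one[var][key]
        prob *= p1 if assignment[var] == 1 else 1.0 - p1
    return prob

names = list(parents)
# Summing the joint over all 2^5 assignments should give 1.
total = sum(joint(dict(zip(names, values)))
            for values in product([0, 1], repeat=len(names)))
all_ones = joint({name: 1 for name in names})
print(total, all_ones)
```

With these made-up parameters, the probability of the all-ones assignment is 0.3 × 0.6 × 0.9 × 0.7 × 0.8 = 0.09072.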

Learning a Bayesian Belief Network can be stated as follows: given a training set T={x[1],…,x[N]} of instances of X, find a network B that best matches T. The common approach to learning a Bayesian Belief Network structure B that best matches T is to introduce a score that evaluates the "fitness" of networks with respect to the training data (the score measures the quality of a network structure), and then to search for the best network according to this score. Several scores have been proposed, including Bayes, BDeu, Minimum Description Length (MDL), Akaike Information Criterion (AIC), Entropy, etc. In our study we use the score proposed by D. Heckerman, which is based on Bayesian considerations and which scores a network structure according to the posterior probability of the graph structure given the training data (up to a constant) [24]. The derivation of this score treats the problem as a density estimation problem. The aim is to construct networks that will assign high probability to previously unseen data from the same source. Since the "right" structure is presumably the one that generalizes better from the training data, the structural features of the networks are induced indirectly.

Finding the structure that maximizes the score is usually an intractable problem [22]. Thus, we usually resort to heuristic search to find a high-scoring structure. Several methods have been proposed for such a search, including genetic search, standard hill climbing, repeated hill climbing, local score search, TAN, tabu search, simulated annealing and K2 (hill climbing restricted by an order on the variables). In our study, we use K2, which is a popular heuristic search method.

IV. OTHER CLASSIFIERS

A. Naïve Bayes

Naïve Bayes has been found to perform surprisingly well in email filtering [13]. Consider the task of email filtering/classification in a Bayes learning framework.
A parametric model is assumed to have generated the data, and Bayes-optimal estimates of the model parameters are calculated using the training data. We classify a new email by using Bayes' rule to turn the generative model around and calculate the posterior probability that a class would have generated the email. Classification then becomes a simple matter of selecting the most probable class.

Assume that emails are generated by a mixture model parameterized by θ. The mixture model consists of mixture components C = {c1,…,cM} that correspond to the classes, where M is the number of classes (M = 2 for email filtering). Each component ci ∈ C is parameterized by a disjoint subset of θ. An email ej is generated by first selecting a mixture component ci according to the prior distribution P(ci|θ) and then having the component generate an email according to its own parameters, with distribution P(ej|ci;θ). The likelihood of an email is given by a sum of probabilities over all mixture components:

$$P(e_j \mid \theta) = \sum_{i} P(c_i \mid \theta)\, P(e_j \mid c_i; \theta) \qquad (2)$$

where i = 1, 2.

Each email has been manually annotated with its correct class. Since the true parameters θ of the mixture model are not known, we need to estimate the parameters from labeled training emails. If θ' denotes the estimated parameters, then given a set of training emails L = {e1,…,eN}, where N is the number of training samples, we use maximum likelihood to estimate the class prior parameters as the fraction of training emails in class ci:

$$\theta'_{c_i} = P(c_i \mid \theta') = \frac{\sum_{j=1}^{N} P(c_i \mid e_j)}{N} \qquad (3)$$

where P(ci|ej) is 1 if ej belongs to class ci and 0 otherwise.

In general, the email classification problem can be described as follows. Taking into account that an email belongs to only one class (spam or legitimate), for a given email e we search for the class ci that maximizes the posterior probability P(ci|e;θ'), by applying Bayes' rule:

$$P(c_i \mid e; \theta') = \frac{P(c_i \mid \theta')\, P(e \mid c_i; \theta')}{P(e \mid \theta')} \qquad (4)$$

Note that P(e|θ') is the same for all classes; thus e can be classified by computing

$$c_l = \arg\max_{c_i \in C} P(c_i \mid \theta')\, P(e \mid c_i; \theta') \qquad (5)$$
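The prior estimate of Eq. (3) and the decision rule of Eq. (5) can be sketched as follows. The text does not say how the class-conditional likelihood P(e|ci;θ') is modeled, so this example assumes a multinomial word model with Laplace (add-one) smoothing; that choice, the function names and the toy data are illustrative assumptions, not the configuration used in the experiments.

```python
import math

def train(vectors, labels, classes=("spam", "legitimate")):
    """Estimate class priors (Eq. 3) and smoothed word probabilities.

    Assumes every class occurs at least once in `labels`.
    """
    n_words = len(vectors[0])
    priors, word_probs = {}, {}
    for c in classes:
        rows = [v for v, y in zip(vectors, labels) if y == c]
        priors[c] = len(rows) / len(vectors)       # fraction of emails in class c
        totals = [sum(col) for col in zip(*rows)]  # per-word counts within class c
        denom = sum(totals) + n_words              # add-one smoothing (assumption)
        word_probs[c] = [(t + 1) / denom for t in totals]
    return priors, word_probs

def classify(vector, priors, word_probs):
    """Eq. (5): pick the class maximizing P(c) * P(e | c), computed in log space."""
    def log_score(c):
        return math.log(priors[c]) + sum(
            n * math.log(p) for n, p in zip(vector, word_probs[c]))
    return max(priors, key=log_score)

# Toy word-count vectors over the hypothetical dictionary [buy, meeting, today].
train_x = [[2, 0, 1], [1, 0, 0], [0, 2, 1], [0, 1, 0]]
train_y = ["spam", "spam", "legitimate", "legitimate"]
priors, word_probs = train(train_x, train_y)
print(classify([1, 0, 1], priors, word_probs))  # spam
print(classify([0, 2, 0], priors, word_probs))  # legitimate
```

Working in log space avoids numerical underflow when an email contains many words; it does not change which class attains the maximum in Eq. (5).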

See [16] for more information on estimating continuous distributions in Naïve Bayes classifiers.

B. Memory Based Learning

Memory based learning (also called instance based learning) is a non-parametric inductive learning paradigm that stores training instances in a memory structure on which predictions of new instances are based. The approach assumes that reasoning is based on direct reuse of stored experiences rather than on the application of knowledge (such as models or decision trees) abstracted from training. The similarity between the new instance and an example in memory is computed using a distance metric. The K-Nearest Neighbors (k-NN) algorithm is a classical memory based learner. In our experiment, we used IB1 [20], a k-NN classifier (k=1) that uses the Euclidean distance metric [14]. The main idea is to treat all emails as points in an m-dimensional space (where m is the number of distinct words in the email set) and to classify an unseen email by its nearest training email.

C. Decision Tree

Decision Tree (DT) is one of the most popular inductive learning algorithms [19]. The nodes of the tree correspond to attribute tests, the links to attribute values, and the leaves to the classes. To induce a DT, the most important attribute, according to an attribute selection criterion, is selected and placed at the root; one branch is made for each possible attribute value. This divides the examples into subsets, one for each possible attribute value. The process is repeated recursively for each subset until all instances at a node have the same classification, in which case a leaf is created. To classify an example, we start at the root of the tree and follow the path corresponding to the example's values until a leaf node is reached and the classification is obtained. To prevent overtraining, a DT is typically pruned. We have used REPTree, a fast DT learning algorithm implemented in Weka [18], for our experiments.

V. RELEVANT WORDS SELECTION

In the field of data mining, many have argued that maximum performance is often achieved not by using all available features but by using only a "good" subset of features. This is called feature selection. For junk email filtering, this means that we want to find a subset of words that helps to discriminate between spam and legitimate email. In this paper we investigate the power of three feature selection methods, Information Gain (IG), Gain Ratio (GR) and Chi-Squared Statistic (CSS), to find relevant words for email filtering.

Suppose that there are a total of m classes denoted by C={C1, C2,…, Cm} (in junk email filtering, we know that m=2; we may define C1 as spam email and C2 as legitimate email), and let there be N training emails represented by (a(1), b(1),…; t(1)),…, (a(N), b(N),…; t(N)), where a(i), b(i),… are vectors of n words and t(i)∈C is the class label. Of the N examples, $N_{C_k}$ belong to class Ck. A feature can split these examples into V partitions, each of which has $N^{(v)}$ examples. In a particular partition, the number of examples of class Ck is denoted by $N^{(v)}_{C_k}$. The relevance degree of the feature can be calculated by the following three methods. The higher the value is, the more relevant the feature is for junk email filtering.

Information Gain: Information Gain is based on the feature's impact on decreasing entropy [6] and can be calculated by:

$$\mathrm{InfoGain} = \Big[\sum_{k=1}^{m} -\frac{N_{C_k}}{N}\log\frac{N_{C_k}}{N}\Big] - \Big[\sum_{v=1}^{V}\frac{N^{(v)}}{N}\sum_{k=1}^{m} -\frac{N^{(v)}_{C_k}}{N^{(v)}}\log\frac{N^{(v)}_{C_k}}{N^{(v)}}\Big] \qquad (6)$$
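A minimal sketch of the Information Gain score of Eq. (6) is given below. It assumes that a word feature splits the emails into V = 2 partitions (word present or absent) and uses base-2 logarithms; the text fixes neither choice, and the base only rescales the score without changing the ranking of words.

```python
import math
from collections import Counter

def entropy(labels):
    """Class entropy: sum over classes of -(N_Ck / N) * log2(N_Ck / N)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_values, labels):
    """Eq. (6): class entropy minus the partition-weighted entropy."""
    n = len(labels)
    partitions = {}
    for v, y in zip(feature_values, labels):
        partitions.setdefault(v, []).append(y)   # one partition per feature value
    conditional = sum(len(part) / n * entropy(part)
                      for part in partitions.values())
    return entropy(labels) - conditional

labels = ["spam", "spam", "legitimate", "legitimate"]
# Hypothetical presence/absence indicators for two words across the four emails.
perfect_word = [1, 1, 0, 0]   # appears only in spam
useless_word = [1, 0, 1, 0]   # appears equally often in both classes
print(info_gain(perfect_word, labels))  # 1.0
print(info_gain(useless_word, labels))  # 0.0
```

Ranking all dictionary words by this score and keeping the top few gives the kind of relevant-word subset evaluated in the experiments.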

Gain Ratio: Gain Ratio was introduced by Quinlan in C4.5 [17]. Gain Ratio compensates for the number of features by normalizing by the information encoded in the split itself:

$$\mathrm{GainRatio} = \mathrm{InfoGain} \Big/ \Big[\sum_{k=1}^{m} -\frac{N_{C_k}}{N}\log\frac{N_{C_k}}{N}\Big] \qquad (7)$$

Chi Squared: The Chi Squared Statistic is based on comparing the obtained values of the frequency of a class because of the split to the a priori frequency of the class. More specifically,

$$\mathrm{ChiSquared} = \chi^2 = \sum_{k=1}^{m}\sum_{v=1}^{V}\frac{\big(N^{(v)}_{C_k} - \tilde{N}^{(v)}_{C_k}\big)^2}{\tilde{N}^{(v)}_{C_k}} \qquad (8)$$

where $\tilde{N}^{(v)}_{C_k} = (N^{(v)}/N)\,N_{C_k}$ denotes the a priori frequency. Clearly, a larger value of ChiSquared indicates that the split is more homogeneous, i.e., has a greater frequency of instances from a particular class.

VI. EXPERIMENTS

A. Corpora

The experiments are based on two benchmark email corpora, which are available from [25].

PU1 Corpus: The PU1 corpus consists of 1099 messages, 481 of which are marked as spam and 618 of which are labeled as legitimate, giving a spam rate of 43.77%. The messages in the PU1 corpus have header fields and HTML tags removed, leaving only the subject line and mail body text. To address privacy, each token is mapped to a unique integer.

Ling Spam Corpus: The Ling spam corpus includes 2412 legitimate messages from a linguistic mailing list and 481 spam messages collected by the author, giving a 16.63% spam rate. Like the PU1 corpus, four versions are available with header fields, HTML tags, and attachments removed. The Ling spam corpus was compiled from different sources: the legitimate messages came from a spam-free, topic-specific mailing list, and the spam mails were collected from a personal mailbox. The mail distribution is therefore less like that of a normal user's mail stream, which may make messages in the Ling spam corpus easily separable.

B. Performance Measures

Our experiments adopt two popular performance measures in the text classification domain: accuracy and F measure.

Accuracy: Accuracy is defined as the ratio of the number of correct predictions to the number of all predictions (both correct and incorrect):

$$\mathrm{Acc} = \frac{N_{cp}}{N_p} \times 100\% \qquad (9)$$

where Ncp is the number of correct predictions and Np is the number of all predictions (i.e. the number of test samples). For a perfect classification, Ncp = Np and Acc = 100%. Acc thus ranges from 0 to 100%, with 100% corresponding to the ideal; the higher the accuracy, the better.

F measure: F measure is defined as

$$F = \frac{2RP}{R + P} \qquad (10)$$

Recall (R) is the percentage of the emails for a given category that are classified correctly. Precision (P) is the percentage of the predicted emails for a given category that are classified correctly. It is normal practice to combine recall and precision into the F measure so that classifiers can be compared in terms of a single rating. F ranges from 0 to 1; the higher the F measure, the better.

C. Results

10-fold cross-validation is used for estimating prediction performance. Fig. 2 shows the best accuracy and F measure achieved by the Bayesian Belief Network and the other three classifiers (Naïve Bayes, Decision Tree, k-NN) on the Ling and PU1 email datasets. The Bayesian Belief Network achieves the best accuracy and F measure on both Ling and PU1. The Bayesian Belief Network is also the best for each of the three feature selection methods.

[Fig. 2 plot omitted: two bar charts, Accuracy (%) and F-measure, comparing BeliefNetwork, NaiveBayes, DT(REPTree) and k-NN(IB1) for the settings L.IG, L.CS, L.GR, P.IG, P.CS and P.GR.]

Fig. 2. Best accuracy and F measure achieved by the Bayesian Belief Network and the other three classifiers on the Ling and PU1 email datasets. L.IG means the result on the Ling dataset using Information Gain (IG) as the feature selection method, L.CS represents Ling + Chi Squared (CS), L.GR represents Ling + Gain Ratio (GR), P.IG represents PU1 + IG, and so on.
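The evaluation protocol behind these numbers (accuracy as in Eq. (9), F measure as in Eq. (10), and a 10-fold split) can be sketched as follows. This is a simplified illustration: it takes spam as the category for which recall and precision are computed, and it assigns samples to folds by index without shuffling or stratification; the text does not specify either detail.

```python
def accuracy(y_true, y_pred):
    """Eq. (9): correct predictions over all predictions, as a percentage."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)

def f_measure(y_true, y_pred, positive="spam"):
    """Eq. (10): F = 2RP / (R + P) for the `positive` category (assumed spam)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    predicted = sum(p == positive for p in y_pred)
    actual = sum(t == positive for t in y_true)
    if tp == 0:
        return 0.0  # recall and precision are both zero (or undefined)
    precision, recall = tp / predicted, tp / actual
    return 2 * recall * precision / (recall + precision)

def k_fold_indices(n, k=10):
    """Yield (train_idx, test_idx) pairs; sample i is tested in fold i % k."""
    for fold in range(k):
        test = [i for i in range(n) if i % k == fold]
        train = [i for i in range(n) if i % k != fold]
        yield train, test

y_true = ["spam", "spam", "spam", "legitimate", "legitimate"]
y_pred = ["spam", "spam", "legitimate", "spam", "legitimate"]
acc = accuracy(y_true, y_pred)   # 3 of 5 correct
f = f_measure(y_true, y_pred)    # precision = recall = 2/3
folds = list(k_fold_indices(20))
print(acc, round(f, 4), len(folds))
```

In a full run, a classifier would be trained on each `train` index set and scored on the matching `test` set, with the scores combined over the ten folds.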

For the Ling corpus, the Bayesian Belief Network reaches a maximum of 97.6% accuracy and a maximum F measure of 0.99. The three feature selection methods (Information Gain, Gain Ratio and Chi Squared) have similar impacts on the Bayesian Belief Network but varying impacts on the other three classifiers. Information Gain is the best for the other three classifiers.

For the PU1 corpus, the Bayesian Belief Network reaches a maximum of 94.9% accuracy and a maximum F measure of 0.96, both with Information Gain as the feature selection method. Information Gain is the best for the Bayesian Belief Network and Naïve Bayes, and Gain Ratio is the best for Decision Tree and k-NN.

Fig. 3 shows the accuracy and F measure curves of the Bayesian Belief Network and the other three classifiers for different word/feature sizes on the Ling email corpus. The number of selected words/features varies from 2 to 5000. With Information Gain feature/word selection, the Bayesian Belief Network reaches the maximum accuracy at 2000 words and the maximum F measure at 1000 words. With only a few dozen words selected by Information Gain, the Bayesian Belief Network can still achieve relatively high accuracy and F measure. This shows that a small number of relevant words selected by Information Gain can still distinguish between spam and legitimate emails. This is plausible in practice; for example, it is well known that words such as "buy, purchase, jobs, …" usually appear in spam emails and are thus powerful indicators of the email category.

Fig. 4 shows the accuracy and F measure curves of the Bayesian Belief Network and the other three classifiers for different word sizes on the PU1 email corpus. With Information Gain feature/word selection, the Bayesian Belief Network reaches both the maximum accuracy and the maximum F measure at 100 words. As with the Ling corpus, relevant word selection allows the classifier to reach high performance with a relatively small word size.

[Fig. 3 plots omitted: six panels, Accuracy (%) in the first row and F-measure in the second row, for Ling (InfoGain), Ling (ChiSquared) and Ling (GainRatio); each panel compares BeliefNetwork, NaiveBayes, DT(REPTree) and k-NN(IB1), with the number of selected words (2 to 5000) on the X-axis.]

Fig. 3. Accuracy (the first row) and F measure (the second row) curves of the Bayesian Belief Network and the other three classifiers (Naïve Bayes, Decision Tree using REPTree and k-NN using IB1) for different word/feature sizes on the Ling email corpus. The X-axis denotes the number of selected words. The three figures in each row correspond to the three feature selection methods: Information Gain (InfoGain), ChiSquared and GainRatio.

VII. CONCLUSION

In this paper, we report our work on a Bayesian Belief Network approach to spam filtering. We also investigate the use of three selection methods, Information Gain, Gain Ratio and Chi Squared, for finding relevant words for email filtering. Our evaluation on two benchmark email datasets suggests that a Bayesian Belief Network classifier outperforms the popular Naïve Bayes approach and two other well-known learners: k-NN and Decision Tree. We suggest that this is because the "dependent learner" characteristic of Bayesian Belief Network classification is better suited to spam filtering. We believe that the performance of learning based spam filters could be further improved by more careful feature subset selection.

REFERENCES

[1] Le Zhang, Jingbo Zhu, Tianshun Yao: "An Evaluation of Statistical Spam Filtering Techniques." ACM Transactions on Asian Language Information Processing, Vol. 3, No. 4, Pages 243-269, December, 2004.
[2] Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz. "A bayesian approach to filtering junk e-mail." In Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin, 1998.
[3] CAUBE.AU: http://www.caube.org.au/spamstats.html 2006.
[4] W.W. Cohen. "Learning rules that classify e-mail." In Spring Symposium on Machine Learning in Information Access, Stanford, California, 1996.
[5] Georgios Sakkis, Ion Androutsopoulos, Georgios Paliouras, Vangelis Karkaletsis, Constantine D. Spyropoulos, Panagiotis Stamatopoulos: "A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists." Information Retrieval, 6(1):49-73, 2003.
[6] J. Ross Quinlan: "Induction of decision trees." Machine Learning, 1:81-106, 1986.
[7] Harris Drucker, Donghui Wu, and Vladimir N. Vapnik. "Support vector machines for spam categorization." IEEE Trans. on Neural Networks, 10(5):1048–1054, 1999.
[8] Aleksander Kolcz and Joshua Alspector. "SVM-based filtering of e-mail spam with content-specific misclassification costs." In Proceedings of the TextDM'01 Workshop on Text Mining - held at the 2001 IEEE International Conference on Data Mining, 2001.
[9] Corinna Cortes and Vladimir Vapnik: "Support-vector Networks." Machine Learning, 20(3):273-297, 1995.
[10] Xavier Carreras and Lluis Marquez. "Boosting trees for anti-spam email filtering." In Proceedings of RANLP01, 4th International Conference on Recent Advances in Natural Language Processing, 2001.
[11] William W. Cohen: "Fast Effective Rule Induction." Machine Learning: Proceedings of the Twelfth International Conference (ML95), 1995.
[12] Georgios Sakkis, Ion Androutsopoulos, Georgios Paliouras, Vangelis Karkaletsis, Constantine D. Spyropoulos, and Panagiotis Stamatopoulos. "Stacking classifiers for anti-spam filtering of e-mail." Proc. 6th Conference on Empirical Methods in Natural Language Processing (EMNLP 2001), pages 44–50, 2001.
[13] Karl-Michael Schneider: "A Comparison of Event Models for Naïve Bayes Anti-Spam E-Mail Filtering." In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, 307-314, April, 2003.
[14] D. Heckerman. "A tutorial on learning with Bayesian networks." In Learning in Graphical Models. 1998.
[15] I. Androutsopoulos et al.: "Learning to Filter Spam E-mail: A Comparison of a Naïve Bayesian and a Memory-based Approach." In Proceedings of the Workshop on Machine Learning and Textual Information Access, Pages 1-13, 2000.
[16] George H. John and Pat Langley: "Estimating Continuous Distributions in Bayesian Classifiers." In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence. Pages 338-345. Morgan Kaufmann, San Mateo, 1995.
[17] J. R. Quinlan: "C4.5: Programs for Machine Learning." Morgan Kaufmann, San Mateo, California, 1993.
[18] I. Witten and E. Frank: "Data Mining – Practical Machine Learning Tools and Techniques with Java Implementation." Morgan Kaufmann, 2000.
[19] Ross Quinlan: "C4.5: Programs for Machine Learning." Morgan Kaufmann Publishers, San Mateo, CA, 1993.
[20] Aha, D., and D. Kibler: "Instance-based Learning Algorithms," Machine Learning, Vol. 6, 37-66, 1991.
[21] S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, K.R.K. Murthy: "Improvements to Platt's SMO Algorithm for SVM Classifier Design." Neural Computation, 13(3):637-649, 2001.
[22] D. M. Chickering. "Learning Bayesian Networks is NP complete." In AI&STAT V, 1996.
[23] N. Friedman, M. Goldszmidt, A. Wyner: "Data Analysis with Bayesian Networks: a Bootstrap Approach." In Proc. Fifteenth Conf. on Uncertainty in Artificial Intelligence (UAI), 1999.
[24] D. Heckerman, D. Geiger, and D. M. Chickering. "Learning Bayesian networks: The combination of knowledge and statistical data," Machine Learning, 20:197–243, 1995.
[25] Email Benchmark Corpus, http://www.aueb.gr/users/ion/publications.html 2006.
[26] Le Zhang, Jingbo Zhu, Yao Tianshun, "An Evaluation of Statistical Spam Filtering Techniques," ACM Trans. Asian Lang. Inf. Process. 3(4):243-269, 2004.
[27] Helge Langseth, "Bayesian Networks in Reliability: Some Recent Developments," Fourth International Conference on Mathematical Methods in Reliability Methodology and Practice, Santa Fe, New Mexico, 2004.

[Fig. 4 plots omitted: six panels, Accuracy (%) in the first row and F-measure in the second row, for PU1 (InfoGain), PU1 (ChiSquared) and PU1 (GainRatio); each panel compares BeliefNetwork, NaiveBayes, DT(REPTree) and k-NN(IB1), with the number of selected words (2 to 5000) on the X-axis.]

Fig. 4. Accuracy (the first row) and F measure (the second row) curves of Bayesian Belief Network and other three classifiers for different word/feature sizes on the PU1 email corpus with three feature selection methods. The X-axis denotes the number of selected words. The three figures in each row correspond to using the three feature selection methods: Information Gain (Info Gain), Chi Squared and Gain Ratio.

1-4244-0133-X/06/$20.00 © 2006 IEEE

243

Abstract—In this paper, we report our work on a Bayesian Belief Network approach to spam email filtering (classifying email as spam or nonspam/legitimate). Our evaluation suggests that a Bayesian Belief Network based classifier will outperform the popular Naïve Bayes approach and two other famous learners: decision tree and k-NN. These four algorithms are tested on two different data sets with three different feature selection methods (Information Gain, Gain Ratio and Chi Squared) for finding relevant words. 10-fold cross-validation results show that Bayesian Belief Network performs best on both datasets. We suggest that this is because the ‘dependant learner’ characteristics of Bayesian Belief Network classification are more suited to spam filtering. The performance of the Bayesian Belief Network classifier could be further improved by careful feature subset selection. Index Terms—Data mining, classification, spam filtering, feature selection.

T

I.

INTRODUCTION

he increasing popularity and low cost of electronic mail have intrigued direct marketers to flood the mailboxes of thousands of users with unsolicited messages. These messages are usually referred to as spam or, more formally, Unsolicited Bulk E-mail (UBE), and may advertise anything, from vacations to get-rich schemes. The task of spam filtering is to rule out unsolicited emails automatically from the email stream. These unsolicited emails have already caused many problems such as filling mailboxes, engulfing important personal mail, wasting network bandwidth, consuming users' time and energy to sort through it, not to mention all the other problems associated with spam (crashed mail-servers, pornography adverts sent to children, and so on). Statistics collected by CAUBE.AU show that the volume of spam is increasing at an alarming rate, and some people report that they are even abandoning their email accounts because of it [3]. This work was supported by the National Science Foundation of China under the Grant No. 60273015 and No. 10001006. Xin Jin is with the College of Information Science and Technology, Beijing Normal University, Beijing 100875, China (Email: [email protected]). Anbang Xu is with the Image Processing & Pattern Recognition Laboratory, Beijing Normal University, Beijing, China ([email protected]). * Corresponding Author. Rongfang Bie is with the College of Information Science and Technology, Beijing Normal University, Beijing 100875, China (Email: [email protected]). Xian Shen is with the College of Information Science and Technology, Beijing Normal University, Beijing, China ([email protected]). Min Yin is with the College of Information Science and Technology, Beijing Normal University, Beijing 100875, China (Email: [email protected]).

1-4244-0134-8/06/$20.00, 1-4244-0133-X/06/$20.00 © 2006 IEEE

Therefore it is challenging to develop spam filters that can effectively and automatically eliminate the large volume of unwanted email before it enters a user's mailbox [1].

Spam filtering can be recast as a document classification task where the classes to be predicted are spam and legitimate. Several supervised machine-learning algorithms have been successfully applied to the mail filtering task: Naïve Bayes classifier [2,13,15,26], RIPPER rule induction algorithm [4], Memory Based Learning [5,13], Support Vector Machines [7,8], Decision Tree [10], and combinations of different learners [12]. Among these methods, the Naïve Bayes classifier has been found particularly attractive for email filtering because it performs surprisingly well [13]; many commercial anti-spam products use Naïve Bayes classifiers as their filtering engines.

The Naïve Bayes email classifier makes the assumption of class conditional independence: given the class label of an email, the frequencies of the words in the email are conditionally independent of one another. In practice, however, dependencies can exist between words in emails. In this paper we propose the use of a Bayesian Belief Network, which can capture the conditional dependencies among words, for email filtering. Experiments on two benchmark datasets show that the Bayesian Belief Network does improve the filtering/classification performance.

This paper proceeds as follows. Section 2 presents the email representation and preprocessing method. Section 3 describes the Bayesian Belief Network method for spam email filtering. Section 4 presents three popular classifiers for comparison. Section 5 describes three feature selection methods. Section 6 reports performance evaluation results for those classifiers using accuracy and F measure. Section 7 concludes the findings.

II. EMAIL REPRESENTATION AND PREPROCESSING

We represent each email as a bag of words/features. A feature vector V is composed of the words from a dictionary formed by analyzing the emails. There is one feature vector per email. The ith component wi of the feature vector is the number of times the ith dictionary word appears in that email.

Two issues need to be considered in preprocessing emails: word stemming and stop-word removal. The main advantages of applying word stemming and stop-word removal are the reduction of the feature space dimension and a possible improvement in the classifiers' prediction accuracy by alleviating the data sparseness problem. Word stemming refers to converting words to their morphological base forms; for example, both “kicking” and “kicked” are reduced to the root word “kick”. Stop-word removal is a procedure to remove words that are found in a list of frequently used words such as “and, for, a, the, they, we”. In addition, since most emails are in HTML format, HTML tags need to be removed before email filtering.

III. BAYESIAN BELIEF NETWORK

Bayesian Belief Networks (BBN or BN) have become a popular tool for modeling many kinds of statistical problems [14,23,24,27]. A BBN is a compact representation of a multivariate statistical distribution function. A BBN encodes the probability density function governing a set of random variables by specifying a set of conditional independence statements together with a set of conditional probability functions.

To use Bayesian Belief Networks for spam email filtering, we define the finite set X = {X1, X2,…, Xn} of variables to be the words that appear in emails, where each variable/word Xi takes on integer values. For simplicity, instead of using the words themselves, we use capital letters, such as X, Y, Z, for variable names and lowercase letters x, y, z to denote specific values taken by those variables (i.e. the frequencies of the words in the emails). Sets of variables are denoted by boldface capital letters X, Y, Z and assignments of values to the variables in these sets are denoted by boldface lowercase letters x, y, z.
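To make the values taken by these variables concrete, the following is a minimal sketch of how an email could be turned into the word-count vector of Section II. The tag-stripping regular expression, the toy suffix stripper (a stand-in for a real stemmer such as Porter's), the stop-word list (only the six examples quoted above) and all function names are assumptions for illustration, not the preprocessing code used in the experiments.

```python
import re
from collections import Counter

# Hypothetical stop-word list: only the example words quoted in the text.
STOP_WORDS = {"and", "for", "a", "the", "they", "we"}

def strip_html(text):
    """Crudely remove anything that looks like an HTML tag."""
    return re.sub(r"<[^>]+>", " ", text)

def toy_stem(word):
    """Toy suffix stripper standing in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def tokenize(email_text):
    """Lowercase, strip tags, drop stop words, then stem the remaining words."""
    words = re.findall(r"[a-z]+", strip_html(email_text).lower())
    return [toy_stem(w) for w in words if w not in STOP_WORDS]

def build_dictionary(emails):
    """Sorted list of all distinct stems seen in the email collection."""
    return sorted({tok for e in emails for tok in tokenize(e)})

def to_count_vector(email_text, dictionary):
    """One integer per dictionary word: how often it occurs in the email."""
    counts = Counter(tokenize(email_text))
    return [counts[w] for w in dictionary]

emails = ["<p>We kicked the ball</p>", "Kicking and buying: buy now, buy for free"]
dictionary = build_dictionary(emails)
vectors = [to_count_vector(e, dictionary) for e in emails]
```

Each entry of `vectors` plays the role of one assignment x of values to the variables X1,…,Xn, with n equal to the dictionary size.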

Fig. 1. An example Bayesian Belief Network over the nodes {X1,…,X5}. Only the qualitative part of the BBN is shown.

A Bayesian Belief Network is an annotated directed acyclic graph that encodes a joint probability distribution over a set of variables X. An example of a Bayesian Belief Network over the variables X = (X1,…,X5) is shown in Fig. 1; only the qualitative part is given. The nodes with outgoing edges pointing into a specific node are the parents of that node. Xj is a descendant of Xi if and only if there exists a directed path from Xi to Xj in the graph. In Fig. 1, X1 and X2 are the parents of X3, written pa(X3) = {X1, X2} for short. Furthermore, pa(X4) = {X3}, and since there is no directed path from X4 to any of the other nodes, the set of descendants of X4 is empty and, accordingly, its non-descendants are {X1, X2, X3, X5}. The edges of the graph represent the assertion that a variable is conditionally independent of its non-descendants in the graph given its parents in the same graph. The graph in Fig. 1 asserts, for instance, that for all distributions compatible with it, X4 is conditionally independent of {X1, X2, X5} when conditioned on {X3}. All conditional independence statements can be read off a Bayesian Belief Network structure by using the rules of d-separation.

Formally, a Bayesian Belief Network for X is a pair B = ⟨G, Θ⟩. The first component, namely G, is a directed acyclic graph whose vertices correspond to the variables X1,…,Xn, and whose edges represent direct dependencies between the variables. The graph G encodes the following set of independence statements: each variable Xi is independent of its non-descendants given its parents in G. The second component of the pair, namely Θ, represents the set of parameters that quantifies the network. It contains a parameter PB(xi | pa(xi)) for each possible value xi of Xi and each configuration pa(xi) of pa(Xi), where pa(Xi) denotes the set of parents of Xi in G. A Bayesian Belief Network B defines a unique joint probability distribution over X given by:

$$P_B(X_1,\ldots,X_n) = \prod_{i=1}^{n} P_B(X_i \mid pa(X_i)). \tag{1}$$
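As a small illustration of Eq. (1), the sketch below evaluates the factorized joint probability for the part of Fig. 1 whose parent sets are stated in the text, pa(X3) = {X1, X2} and pa(X4) = {X3}. Several things here are assumptions: X5 is left out because its parents are not given in the text, X1 and X2 are treated as root nodes, the variables are taken to be binary for brevity (in the paper they are word frequencies), and every probability value is made up.

```python
from itertools import product

# Parent sets from the text; X1 and X2 are assumed to have no parents.
parents = {"X1": (), "X2": (), "X3": ("X1", "X2"), "X4": ("X3",)}

# Hypothetical conditional probability tables:
# cpt[node][parent_values] = P(node = 1 | parents = parent_values).
cpt = {
    "X1": {(): 0.3},
    "X2": {(): 0.6},
    "X3": {(0, 0): 0.1, (0, 1): 0.5, (1, 0): 0.4, (1, 1): 0.9},
    "X4": {(0,): 0.2, (1,): 0.7},
}

def joint_probability(assignment):
    """Eq. (1): product over all nodes of P(x_i | pa(x_i))."""
    prob = 1.0
    for node, pa in parents.items():
        p_one = cpt[node][tuple(assignment[p] for p in pa)]
        prob *= p_one if assignment[node] == 1 else 1.0 - p_one
    return prob

example = {"X1": 1, "X2": 0, "X3": 1, "X4": 1}
p_example = joint_probability(example)  # P(X1=1) P(X2=0) P(X3=1|1,0) P(X4=1|1)

# Sanity check: the factorized probabilities of all 2^4 assignments sum to one.
total = sum(joint_probability(dict(zip(parents, values)))
            for values in product((0, 1), repeat=len(parents)))
```

The point of the factorization is economy: this network needs 1 + 1 + 4 + 2 = 8 parameters, whereas an unrestricted joint distribution over four binary variables needs 15.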

Given a training set T = {x[1],…,x[N]} of instances of X, the learning task is to find a network B that best matches T. The common approach is to introduce a score that evaluates the “fitness” of networks with respect to the training data (the score measures the quality of a network structure), and then to search for the best network according to this score. Several scores have been proposed, including Bayes, BDeu, Minimum Description Length (MDL), Akaike Information Criterion (AIC) and Entropy. In our study we use the score proposed by D. Heckerman, which is based on Bayesian considerations and scores a network structure according to the posterior probability of the graph structure given the training data (up to a constant) [24]. The derivation of this score treats the problem as a density estimation problem: the aim is to construct networks that will assign high probability to previously unseen data from the same source. Since presumably the “right” structure is the one that generalizes best from the training data, the structural features of the networks are induced indirectly.

Finding the structure that maximizes the score is usually an intractable problem [22]. Thus, we usually resort to heuristic search to find a high-scoring structure. Several methods have been proposed for such search, including genetic search, standard hill climbing, repeated hill climbing, local score search, TAN, tabu search, simulated annealing and K2 (hill climbing restricted by an order on the variables). In our study, we use K2, which is a popular heuristic search method.

IV. OTHER CLASSIFIERS

A. Naïve Bayes

Naïve Bayes has been found to perform surprisingly well in email filtering [13]. Consider the task of email filtering/classification in a Bayes learning framework. A parametric model is assumed to have generated the data, and Bayes-optimal estimates of the model parameters are calculated using the training data. We classify a new email using Bayes rule to turn the generative model around and calculate the posterior probability that a class would have generated the email. Classification then becomes a simple matter of selecting the most probable class.

Assume that emails are generated by a mixture model parameterized by θ. The mixture model consists of mixture components C = {c1,…,cM} that correspond to the classes, where M is the number of classes (for email filtering, M = 2). Each component ci ∈ C is parameterized by a disjoint subset of θ. An email ej is generated by first selecting a mixture component ci according to the prior distribution P(ci|θ) and then having the component generate an email according to its own parameters, with distribution P(ej|ci;θ). The likelihood of an email is given by a sum of probability over all mixture components:

$$P(e_j \mid \theta) = \sum_{i} P(c_i \mid \theta)\, P(e_j \mid c_i; \theta) \tag{2}$$

where i = 1, 2. Each email has been manually annotated with its correct class. Since the true parameters θ of the mixture model are not known, we need to estimate them from labeled training emails. If θ' denotes the estimated parameters, then given a set of training emails L = {e1,…,eN}, where N is the number of training samples, we use maximum likelihood to estimate the class prior parameters as the fraction of training emails in class ci:

$$\theta'_{c_i} = P(c_i \mid \theta') = \frac{\sum_{j=1}^{N} P(c_i \mid e_j)}{N} \tag{3}$$

where P(ci|ej) is 1 if ej belongs to class ci and 0 otherwise.

In general, the email classification problem can be described as follows. Taking into account that an email belongs to only one class (spam or legitimate), for a given email e we search for the class ci that maximizes the posterior probability P(ci|e;θ'), by applying Bayes rule:

$$P(c_i \mid e; \theta') = \frac{P(c_i \mid \theta')\, P(e \mid c_i; \theta')}{P(e \mid \theta')} \tag{4}$$

Note that P(e|θ') is the same for all classes, thus e can be classified by computing

$$c_l = \arg\max_{c_i \in C} P(c_i \mid \theta')\, P(e \mid c_i; \theta') \tag{5}$$
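The estimation of the class priors in Eq. (3) and the decision rule in Eq. (5) can be sketched as follows. The section does not spell out the form of P(e|ci;θ'), so this sketch assumes a multinomial word model with add-one (Laplace) smoothing and works in log space to avoid underflow; that modeling choice, the function names and the toy data are assumptions and not necessarily the configuration used in the experiments.

```python
import math

def train_naive_bayes(vectors, labels):
    """Estimate class priors (Eq. 3) and smoothed per-class word probabilities."""
    classes = sorted(set(labels))
    n_words = len(vectors[0])
    priors, word_probs = {}, {}
    for c in classes:
        rows = [v for v, y in zip(vectors, labels) if y == c]
        priors[c] = len(rows) / len(vectors)        # fraction of training emails in c
        totals = [sum(col) for col in zip(*rows)]   # per-word counts summed over class c
        denom = sum(totals) + n_words               # add-one smoothing (assumption)
        word_probs[c] = [(t + 1) / denom for t in totals]
    return priors, word_probs

def classify(vector, priors, word_probs):
    """Eq. (5): pick the class maximizing log P(c) + sum_w count_w * log P(w | c)."""
    def log_score(c):
        return math.log(priors[c]) + sum(
            n * math.log(p) for n, p in zip(vector, word_probs[c]))
    return max(priors, key=log_score)

# Toy count vectors over a hypothetical dictionary ["buy", "meeting", "free"].
train_x = [[3, 0, 1], [2, 0, 2], [0, 2, 0], [0, 3, 1]]
train_y = ["spam", "spam", "legitimate", "legitimate"]
priors, word_probs = train_naive_bayes(train_x, train_y)
prediction = classify([1, 0, 1], priors, word_probs)
```

The denominator P(e|θ') of Eq. (4) never appears in the code because, as noted above, it is the same for every class and does not change the argmax.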

See [16] for more information on estimating continuous distributions in Naïve Bayes classifiers.

B. Memory Based Learning

Memory based learning (also called instance based learning) is a non-parametric inductive learning paradigm that stores training instances in a memory structure on which predictions of new instances are based. The approach assumes that reasoning is based on direct reuse of stored experiences rather than on the application of knowledge (such as models or decision trees) abstracted from training. The similarity between the new instance and an example in memory is computed using a distance metric. The K-Nearest Neighbors (k-NN) algorithm is a classical memory based learner. In our experiments, we used IB1 [20], a k-NN classifier (k=1) that uses the Euclidean distance metric [14]. The main idea is to treat all emails as points in an m-dimensional space (where m is the number of distinct words in the email set); an unseen email is assigned the class of its nearest training email.

C. Decision Tree

Decision Tree (DT) is one of the most popular inductive learning algorithms [19]. The nodes of the tree correspond to attribute tests, the links to attribute values, and the leaves to the classes. To induce a DT, the most important attribute, according to an attribute selection criterion, is selected and placed at the root; one branch is made for each possible attribute value. This divides the examples into subsets, one for each possible attribute value. The process is repeated recursively for each subset until all instances at a node have the same classification, in which case a leaf is created. To classify an example we start at the root of the tree and follow the path corresponding to the example's values until a leaf node is reached and the classification is obtained. To prevent overtraining, a DT is typically pruned. We used REPTree, a fast DT learning algorithm implemented in Weka [18], for our experiments.

V. RELEVANT WORDS SELECTION

In the field of data mining many have argued that maximum performance is often achieved not by using all available features, but by using only a “good” subset of features. This is called feature selection. For junk email filtering, this means that we want to find a subset of words which helps to discriminate between spam and legitimate email. In this paper we investigate the power of three feature selection methods, Information Gain (IG), Gain Ratio (GR) and the Chi-Squared Statistic (CSS), to find relevant words for email filtering.

Suppose that there are a total of m classes denoted by C = {C1, C2,…, Cm} (in junk email filtering m = 2, and we may define C1 as spam email and C2 as legitimate email). Let there be N training emails represented by (a(1), b(1),…; t(1)),…, (a(N), b(N),…; t(N)), where a(i), b(i),… are vectors of n words and t(i) ∈ C is the class label. Of the N examples, $N_{C_k}$ belong to class Ck. A feature can split these examples into V partitions, each of which has $N^{(v)}$ examples. In a particular partition, the number of examples of class Ck is denoted by $N_{C_k}^{(v)}$. The relevance degree of the feature can be calculated by the following three methods. The higher the value is, the more relevant the feature is for junk email filtering.

Information Gain: Information Gain is based on the feature's impact on decreasing entropy [6] and can be calculated by:

$$\mathrm{InfoGain} = \left[\sum_{k=1}^{m} -\frac{N_{C_k}}{N}\log\frac{N_{C_k}}{N}\right] - \left[\sum_{v=1}^{V}\frac{N^{(v)}}{N}\sum_{k=1}^{m} -\frac{N_{C_k}^{(v)}}{N^{(v)}}\log\frac{N_{C_k}^{(v)}}{N^{(v)}}\right] \tag{6}$$

Gain Ratio: Gain Ratio was introduced by Quinlan in C4.5 [17]. Gain Ratio compensates for the number of partitions a feature produces by normalizing by the information encoded in the split itself:

$$\mathrm{GainRatio} = \mathrm{InfoGain} \Big/ \left[\sum_{v=1}^{V} -\frac{N^{(v)}}{N}\log\frac{N^{(v)}}{N}\right] \tag{7}$$
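All of the relevance scores in this section depend only on the table of counts of class Ck within partition v. The sketch below computes Information Gain and Gain Ratio from such a table and, for convenience, also the Chi Squared statistic that the text defines next in Eq. (8). It assumes a word is discretized into two partitions (absent, present), uses base-2 logarithms, and takes the Gain Ratio denominator to be the entropy of the split as in Quinlan's C4.5; the function names and the counts are hypothetical.

```python
import math

def entropy(counts):
    """Shannon entropy (base 2) of a list of counts; empty cells contribute zero."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def relevance_scores(table):
    """table[v][k] = number of emails of class k that fall in partition v."""
    n = sum(sum(row) for row in table)
    class_totals = [sum(col) for col in zip(*table)]   # counts per class
    part_totals = [sum(row) for row in table]          # counts per partition
    # Information Gain: class entropy minus expected entropy after the split.
    info_gain = entropy(class_totals) - sum(
        (nv / n) * entropy(row) for row, nv in zip(table, part_totals) if nv > 0)
    # Gain Ratio: normalize by the entropy of the split itself.
    split_info = entropy(part_totals)
    gain_ratio = info_gain / split_info if split_info > 0 else 0.0
    # Chi Squared: observed counts against a priori (expected) counts.
    chi_squared = 0.0
    for row, nv in zip(table, part_totals):
        for observed, nck in zip(row, class_totals):
            expected = nv / n * nck
            if expected > 0:
                chi_squared += (observed - expected) ** 2 / expected
    return info_gain, gain_ratio, chi_squared

# Hypothetical counts: rows = (word absent, word present), columns = (spam, legitimate).
useful_word = [[10, 45], [40, 5]]
useless_word = [[25, 25], [25, 25]]
ig, gr, chi2 = relevance_scores(useful_word)
ig0, gr0, chi0 = relevance_scores(useless_word)
```

A word whose presence is distributed identically across the two classes scores zero on all three measures, while a word concentrated in one class scores high; ranking words by any of the scores and keeping the top ones gives the relevant word subset.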

Chi Squared: The Chi Squared Statistic compares the observed frequency of a class within each partition of the split with the a priori frequency of the class. More specifically,

$$\mathrm{ChiSquared} = \chi^2 = \sum_{k=1}^{m}\sum_{v=1}^{V}\frac{\left(N_{C_k}^{(v)} - \tilde{N}_{C_k}^{(v)}\right)^2}{\tilde{N}_{C_k}^{(v)}} \tag{8}$$

where $\tilde{N}_{C_k}^{(v)} = (N^{(v)}/N)\,N_{C_k}$ denotes the a priori frequency. Clearly, a larger value of ChiSquared indicates that the split is more homogeneous, i.e., has a greater frequency of instances from a particular class.

VI. EXPERIMENTS

A. Corpora

The experiments are based on two benchmark email corpora, which are available from [25].

PU1 Corpus: The PU1 corpus consists of 1099 messages, 481 of which are marked as spam and 618 as legitimate, giving a spam rate of 43.77%. The messages in the PU1 corpus have header fields and HTML tags removed, leaving only the subject line and mail body text. To address privacy, each token is mapped to a unique integer.

Ling Spam Corpus: The Ling spam corpus includes 2412 legitimate messages from a linguistic mailing list and 481 spam messages collected by the author, giving a 16.63% spam rate. Like the PU1 corpus, four versions are available with header fields, HTML tags, and attachments removed. Because the Ling spam corpus was compiled from different sources (the legitimate messages came from a spam-free, topic-specific mailing list, and the spam mails were collected from a personal mailbox), its mail distribution is less like that of a normal user's mail stream, which may make messages in the Ling spam corpus easily separable.

B. Performance Measures

Our experiments adopt two popular performance measures in the text classification domain: accuracy and F measure.

Accuracy: Accuracy is defined as the ratio of the number of correct predictions to the number of all predictions (both correct and incorrect):

$$\mathrm{Acc} = \frac{N_{cp}}{N_p} \times 100\% \tag{9}$$

where Ncp is the number of correct predictions and Np is the number of all predictions (i.e. the number of test samples). For a perfect classification, Ncp = Np and Acc = 100%. Acc therefore ranges from 0 to 100%, with 100% corresponding to the ideal; the higher the accuracy, the better.

F measure: F measure is defined as

$$F = \frac{2RP}{R + P} \tag{10}$$

Recall (R) is the percentage of the emails for a given category that are classified correctly. Precision (P) is the percentage of the predicted emails for a given category that are classified correctly. It is normal practice to combine recall and precision into the F measure so that classifiers can be compared in terms of a single rating. F ranges from 0 to 1; the higher the F measure, the better.

C. Results

10-fold cross-validation is used for estimating prediction performance. Fig. 2 shows the best accuracy and F measure achieved by Bayesian Belief Network and the other three classifiers (Naïve Bayes, Decision Tree, k-NN) on the Ling and PU1 email datasets. Bayesian Belief Network achieves the best accuracy and F measure on both Ling and PU1. Bayesian Belief Network is also the best for each of the three feature selection methods.
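As a concrete reading of the two performance measures, Eq. (9) and Eq. (10), the following sketch computes accuracy and the F measure from lists of true and predicted labels. It assumes that recall and precision enter Eq. (10) as fractions between 0 and 1 (so that F also lies between 0 and 1) and that F is reported for the spam category; the function names, the label strings and the toy predictions are illustrative only.

```python
def accuracy(true_labels, predicted_labels):
    """Eq. (9): percentage of predictions that are correct."""
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return 100.0 * correct / len(true_labels)

def f_measure(true_labels, predicted_labels, category="spam"):
    """Eq. (10) for one category, with recall R and precision P as fractions."""
    true_pos = sum(t == category and p == category
                   for t, p in zip(true_labels, predicted_labels))
    actual = sum(t == category for t in true_labels)          # emails truly in category
    predicted = sum(p == category for p in predicted_labels)  # emails assigned to category
    if true_pos == 0:
        return 0.0  # convention chosen here when R and P are both zero or undefined
    recall = true_pos / actual
    precision = true_pos / predicted
    return 2 * recall * precision / (recall + precision)

truth = ["spam", "spam", "spam", "legit", "legit", "legit", "legit", "legit"]
preds = ["spam", "spam", "legit", "spam", "legit", "legit", "legit", "legit"]
acc = accuracy(truth, preds)
f_spam = f_measure(truth, preds)
```

In this toy case six of eight predictions are correct, while recall and precision for spam are both 2/3, which illustrates why the F measure can be noticeably lower than accuracy when the spam class is the minority.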


[Figure 2: two bar charts, Accuracy (%) and F-measure, comparing BeliefNetwork, NaiveBayes, DT(REPTree) and k-NN(IB1) over the settings L.IG, L.CS, L.GR, P.IG, P.CS and P.GR.]

Fig. 2. Best accuracy and F measure achieved by Bayesian Belief Network and the other three classifiers on the Ling and PU1 email datasets. L.IG means the result on the Ling dataset using Information Gain (IG) as the feature selection method, L.CS represents Ling + Chi Squared (CS), L.GR represents Ling + Gain Ratio (GR), P.IG represents PU1 + IG, and so on.

For the Ling corpus, Bayesian Belief Network reaches a maximum accuracy of 97.6% and a maximum F measure of 0.99. The three feature selection methods (Information Gain, Gain Ratio and Chi Squared) have similar impacts on Bayesian Belief Network but varying impacts on the other three classifiers, for which Information Gain is the best.

For the PU1 corpus, Bayesian Belief Network reaches a maximum accuracy of 94.9% and a maximum F measure of 0.96, both with Information Gain as the feature selection method. Information Gain is the best for Bayesian Belief Network and Naïve Bayes, and Gain Ratio is the best for Decision Tree and k-NN.

Fig. 3 shows the accuracy and F measure curves of Bayesian Belief Network and the other three classifiers for different word/feature sizes on the Ling email corpus. The number of selected words/features varies from 2 to 5000. With Information Gain word selection, Bayesian Belief Network reaches the maximum accuracy at 2000 words and the maximum F measure at 1000 words. With only a few dozen words selected by Information Gain, Bayesian Belief Network can still achieve relatively high accuracy and F measure. This shows that a small number of relevant words selected by Information Gain can still distinguish between spam and legitimate emails. This is plausible in practice: for example, it is well known that words such as “buy, purchase, jobs, …” usually appear in spam emails and are thus powerful indicators of the email category.

Fig. 4 shows the accuracy and F measure curves of Bayesian Belief Network and the other three classifiers with different word sizes on the PU1 email corpus. With Information Gain word selection, Bayesian Belief Network reaches both the maximum accuracy and the maximum F measure at 100 words. As with the Ling corpus, relevant word selection lets the classifier achieve high performance with a relatively small number of words.

[Figure 3: line plots for the Ling corpus in two rows, Accuracy(%) and F-measure, with panels Ling (InfoGain), Ling (ChiSquared) and Ling (GainRatio). X-axis: number of selected words, from 2 to 5000. Series: BeliefNetwork, NaiveBayes, DT(REPTree), k-NN(IB1).]

Fig. 3. Accuracy (the first row) and F measure (the second row) curves of Bayesian Belief Network and the other three classifiers (Naïve Bayes, Decision Tree using REPTree and k-NN using IB1) for different word/feature sizes on the Ling email corpus. The X-axis denotes the number of selected words. The three figures in each row correspond to the three feature selection methods: Information Gain (InfoGain), ChiSquared and GainRatio.

VII. CONCLUSION

In this paper, we report our work on a Bayesian Belief Network approach to spam filtering. We also investigate the use of three selection methods: Information Gain, Gain Ratio and Chi Squared, for finding relevant words for email filtering. Our evaluation on two benchmark email datasets suggests that a Bayesian Belief Network classifier will outperform the popular Naïve Bayes approach and two other famous learners: k-NN and Decision Tree. We suggest that this is because the “dependent learner” characteristic of Bayesian Belief Network classification is more suited to spam filtering. We believe that the performance of the learning based spam filters could be further improved by more careful feature subset selection.

REFERENCES

[1] Le Zhang, Jingbo Zhu, Tianshun Yao: “An Evaluation of Statistical Spam Filtering Techniques.” ACM Transactions on Asian Language Information Processing, Vol. 3, No. 4, pages 243-269, December 2004.
[2] Mehran Sahami, Susan Dumais, David Heckerman, and Eric Horvitz: “A Bayesian approach to filtering junk e-mail.” In Learning for Text Categorization: Papers from the 1998 Workshop, Madison, Wisconsin, 1998.
[3] CAUBE.AU: http://www.caube.org.au/spamstats.html, 2006.
[4] W. W. Cohen: “Learning rules that classify e-mail.” In Spring Symposium on Machine Learning in Information Access, Stanford, California, 1996.
[5] Georgios Sakkis, Ion Androutsopoulos, Georgios Paliouras, Vangelis Karkaletsis, Constantine D. Spyropoulos, Panagiotis Stamatopoulos: “A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists.” Information Retrieval, 6(1):49-73, 2003.
[6] J. Ross Quinlan: “Induction of decision trees.” Machine Learning, 1:81-106, 1986.
[7] Harris Drucker, Donghui Wu, and Vladimir N. Vapnik: “Support vector machines for spam categorization.” IEEE Trans. on Neural Networks, 10(5):1048-1054, 1999.
[8] Aleksander Kolcz and Joshua Alspector: “SVM-based filtering of e-mail spam with content-specific misclassification costs.” In Proceedings of the TextDM'01 Workshop on Text Mining, held at the 2001 IEEE International Conference on Data Mining, 2001.
[9] Corinna Cortes and Vladimir Vapnik: “Support-vector Networks.” Machine Learning, 20(3):273-297, 1995.
[10] Xavier Carreras and Lluis Marquez: “Boosting trees for anti-spam email filtering.” In Proceedings of RANLP01, 4th International Conference on Recent Advances in Natural Language Processing, 2001.
[11] William W. Cohen: “Fast Effective Rule Induction.” Machine Learning: Proceedings of the Twelfth International Conference (ML95), 1995.
[12] Georgios Sakkis, Ion Androutsopoulos, Georgios Paliouras, Vangelis Karkaletsis, Constantine D. Spyropoulos, and Panagiotis Stamatopoulos: “Stacking classifiers for anti-spam filtering of e-mail.” Proc. 6th Conference on Empirical Methods in Natural Language Processing (EMNLP 2001), pages 44-50, 2001.
[13] Karl-Michael Schneider: “A Comparison of Event Models for Naïve Bayes Anti-Spam E-Mail Filtering.” In Proceedings of the 10th Conference of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, pages 307-314, April 2003.
[14] D. Heckerman: “A tutorial on learning with Bayesian networks.” In Learning in Graphical Models, 1998.
[15] I. Androutsopoulos et al.: “Learning to Filter Spam E-mail: A Comparison of a Naïve Bayesian and a Memory-based Approach.” In Proceedings of the Workshop on Machine Learning and Textual Information Access, pages 1-13, 2000.
[16] George H. John and Pat Langley: “Estimating Continuous Distributions in Bayesian Classifiers.” In Proceedings of the Eleventh Conference on Uncertainty in Artificial Intelligence, pages 338-345. Morgan Kaufmann, San Mateo, 1995.
[17] J. R. Quinlan: “C4.5: Programs for Machine Learning.” Morgan Kaufmann, San Mateo, California, 1993.
[18] I. Witten and E. Frank: “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementation.” Morgan Kaufmann, 2000.
[19] Ross Quinlan: “C4.5: Programs for Machine Learning.” Morgan Kaufmann Publishers, San Mateo, CA, 1993.
[20] D. Aha and D. Kibler: “Instance-based Learning Algorithms.” Machine Learning, Vol. 6, pages 37-66, 1991.
[21] S. S. Keerthi, S. K. Shevade, C. Bhattacharyya, K. R. K. Murthy: “Improvements to Platt's SMO Algorithm for SVM Classifier Design.” Neural Computation, 13(3):637-649, 2001.
[22] D. M. Chickering: “Learning Bayesian Networks is NP complete.” In AI&STAT V, 1996.
[23] N. Friedman, M. Goldszmidt, A. Wyner: “Data Analysis with Bayesian Networks: a Bootstrap Approach.” In Proc. Fifteenth Conf. on Uncertainty in Artificial Intelligence (UAI), 1999.
[24] D. Heckerman, D. Geiger, and D. M. Chickering: “Learning Bayesian networks: The combination of knowledge and statistical data.” Machine Learning, 20:197-243, 1995.
[25] Email Benchmark Corpus, http://www.aueb.gr/users/ion/publications.html, 2006.
[26] Le Zhang, Jingbo Zhu, Yao Tianshun: “An Evaluation of Statistical Spam Filtering Techniques.” ACM Trans. Asian Lang. Inf. Process., 3(4):243-269, 2004.
[27] Helge Langseth: “Bayesian Networks in Reliability: Some Recent Developments.” Fourth International Conference on Mathematical Methods in Reliability Methodology and Practice, Santa Fe, New Mexico, 2004.

[Figure 4: line plots for the PU1 corpus in two rows, Accuracy(%) and F-measure, with panels PU1 (InfoGain), PU1 (ChiSquared) and PU1 (GainRatio). X-axis: number of selected words, from 2 to 5000. Series: BeliefNetwork, NaiveBayes, DT(REPTree), k-NN(IB1).]

Fig. 4. Accuracy (the first row) and F measure (the second row) curves of Bayesian Belief Network and other three classifiers for different word/feature sizes on the PU1 email corpus with three feature selection methods. The X-axis denotes the number of selected words. The three figures in each row correspond to using the three feature selection methods: Information Gain (Info Gain), Chi Squared and Gain Ratio.
