Spam Filtering with Several Novel Bayesian Classifiers

Chuanliang Chen 1, Yingjie Tian 2, Chunhua Zhang 3,†
1 Department of Computer Science, Beijing Normal University, Beijing 100875, China
2 Research Centre on Fictitious Economy & Data Science, Chinese Academy of Sciences, Beijing 100080, China
3 School of Information, Renmin University of China
E-mail: [email protected], [email protected]
† Corresponding Author.

Abstract

In this paper, we report our work on spam filtering with three novel Bayesian classification methods: Aggregating One-Dependence Estimators (AODE), Hidden Naïve Bayes (HNB), and Locally Weighted learning with Naïve Bayes (LWNB). Four other traditional classifiers, Naïve Bayes, k Nearest Neighbor (kNN), Support Vector Machine (SVM), and C4.5, are also evaluated for comparison. Four feature selection methods, Gain Ratio, Information Gain, Symmetrical Uncertainty, and ReliefF, are used to select relevant words for spam filtering. Results of experiments on two corpora show the promising capabilities of Bayesian classifiers for spam filtering, especially AODE.

1. Introduction

In recent years, many studies have addressed the issue of spam filtering with the aid of machine learning methods. Many supervised learning algorithms have been successfully applied to spam filtering: Naïve Bayes [5,6,7,8], Support Vector Machine [9,10], Memory Based Learning methods [11], Decision Tree [12], and other well-known algorithms. Among them, Naïve Bayes is particularly attractive for spam filtering [6]. The reasons may be its simplicity, which makes it easy to implement, its linear computational complexity, and its accuracy, which in spam filtering is comparable to that of more elaborate learning algorithms [4]. The Naïve Bayes classifier has been the filtering engine of many commercial anti-spam software products. Therefore, in this paper, we focus on improving the predictive ability of Naïve Bayes by introducing three novel Bayesian classification methods [1,2,3]. Besides the three Bayesian classifiers, we also apply four feature selection methods to extract relevant words from e-mails, in order to reduce the complexity of the classifiers while preserving their performance.

The remainder of this paper is organized as follows. In Section 2, we describe the e-mail representation and preprocessing. We briefly review the three Bayesian classifiers in Section 3. We report the results of our experiments and the corresponding analysis in Section 4. Finally, Section 5 concludes our work.

2. Preprocessing of Corpus

2.1. Message Representation

Each e-mail in the corpora is represented as a set of words. Every e-mail is encoded as a feature vector with N elements, where the ith element is a binary variable indicating whether the ith word of the dictionary occurs in the e-mail. During preprocessing, we perform word stemming, stop-word removal, and Document Frequency Thresholding (DFT) in order to reduce the dimension of the feature space; HTML tags are also removed from the e-mails. Finally, we keep the first 5,000 tokens of the dictionary according to their Mutual Information with the class variable: we sort the tokens by their mutual information with the class variable and select the 5,000 tokens with the highest scores.
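As an illustration of this representation, the following sketch builds the dictionary and encodes e-mails as binary vectors. It is not the authors' code: the tokenizer, the document-frequency threshold (min_df = 3), and all identifier names are our assumptions; only the binary encoding and the selection of the 5,000 tokens with the highest mutual information follow the description above.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # stemming and stop-word removal are omitted; only lowercase words are kept
    return re.findall(r"[a-z]+", text.lower())

def mutual_information(word, token_sets, labels):
    """I(word; class) for the binary word-occurrence variable."""
    n = len(token_sets)
    mi = 0.0
    for w_val in (True, False):
        n_w = sum(1 for s in token_sets if (word in s) == w_val)
        for c_val in set(labels):
            n_c = sum(1 for c in labels if c == c_val)
            n_wc = sum(1 for s, c in zip(token_sets, labels)
                       if (word in s) == w_val and c == c_val)
            if n_wc and n_w and n_c:
                mi += (n_wc / n) * math.log((n * n_wc) / (n_w * n_c))
    return mi

def build_dictionary(emails, labels, min_df=3, size=5000):
    token_sets = [set(tokenize(e)) for e in emails]
    df = Counter(t for s in token_sets for t in s)
    candidates = [t for t, f in df.items() if f >= min_df]      # DF thresholding
    ranked = sorted(candidates,
                    key=lambda t: mutual_information(t, token_sets, labels),
                    reverse=True)
    return ranked[:size]                                        # top tokens by MI

def to_binary_vector(email, dictionary):
    tokens = set(tokenize(email))
    return [1 if t in tokens else 0 for t in dictionary]
```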




2.2. Feature Selection Methods

Four feature selection methods are used in this paper to select relevant words: Information Gain, Gain Ratio, Symmetrical Uncertainty, and ReliefF. Their definitions are briefly stated below. In the following formulas, m is the number of classes (in the spam filtering domain, m = 2) and C_i denotes the ith class. V is the number of cells into which a feature can split the training set. Let N be the total number of samples and N_{C_i} the number of samples in class C_i; in the vth cell, N^{(v)} is the number of samples and N^{(v)}_{C_i} the number of samples that belong to class C_i.

Information Gain: Information Gain is based on a feature's impact on decreasing entropy and is defined as follows:

InfoGain = \left[ \sum_{i=1}^{m} -\frac{N_{C_i}}{N} \log \frac{N_{C_i}}{N} \right] - \sum_{v=1}^{V} \frac{N^{(v)}}{N} \left[ \sum_{i=1}^{m} -\frac{N^{(v)}_{C_i}}{N^{(v)}} \log \frac{N^{(v)}_{C_i}}{N^{(v)}} \right].

Gain Ratio: Gain Ratio was first used in C4.5 and is defined as:

GainRatio = InfoGain \Big/ \left[ \sum_{i=1}^{m} -\frac{N_{C_i}}{N} \log \frac{N_{C_i}}{N} \right].

Symmetrical Uncertainty: A shortcoming of Information Gain is its bias. The Symmetrical Uncertainty (SU) criterion compensates for the inherent bias of Information Gain by dividing it by the sum of the entropies of the class labels C and the feature X:

SU = 2 \times InfoGain / (Ent(C) + Ent(X)),

where the entropy of X (and analogously of C) is obtained by

Ent(X) = -\sum_{x \in X} p(x) \log_2 p(x),

and p(x) is the marginal probability distribution of X.

ReliefF: The key idea of ReliefF is to estimate features according to how well their values distinguish among instances that are near to each other. More details can be found in [13].
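To make the SU criterion concrete, the sketch below scores the binary word features of Section 2.1 and keeps the top-ranked ones. The function and variable names are ours, and base-2 logarithms are used throughout to match Ent(X) above; this is an illustrative sketch, not the implementation used in the experiments.

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain(feature, labels):
    """Ent(C) minus the expected class entropy after splitting on the feature."""
    n = len(labels)
    gain = entropy(labels)
    for v in set(feature):
        subset = [c for f, c in zip(feature, labels) if f == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

def symmetrical_uncertainty(feature, labels):
    denom = entropy(labels) + entropy(feature)
    return 2.0 * info_gain(feature, labels) / denom if denom else 0.0

def top_k_by_su(X, labels, k=100):
    """X: one binary feature vector per e-mail; returns indices of the k best columns."""
    columns = list(zip(*X))
    scores = [symmetrical_uncertainty(col, labels) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)[:k]
```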

3. Three Bayesian Classifiers

3.1. Aggregating One-Dependence Estimators

AODE is an ensemble learning method, in which a collection of models is built and their predictions are combined [3]. AODE selects a restricted class of one-dependence classifiers and aggregates the predictions of all qualified classifiers within this class; the selected class consists of all one-dependence classifiers in which a single attribute is the parent of all other attributes [3]. When classifying an object x = (x_1, ..., x_n), AODE excludes models for which the training data contain fewer than m samples of the value x_i of the parent attribute, where m is a predefined threshold on the number of samples required for reliable statistical inference [3]; we set it to 30. Writing F(x_i) for the frequency of the value x_i in the training data, the kernel equation of AODE's strategy for estimating class probabilities is

\hat{P}(y, \mathbf{x}) = \frac{\sum_{i: 1 \le i \le n \wedge F(x_i) \ge m} \hat{P}(y, x_i)\, \hat{P}(\mathbf{x} \mid y, x_i)}{\left| \{ i : 1 \le i \le n \wedge F(x_i) \ge m \} \right|},

and AODE selects the class

\arg\max_{y} \left( \sum_{i: 1 \le i \le n \wedge F(x_i) \ge m} \hat{P}(y, x_i) \prod_{j=1}^{n} \hat{P}(x_j \mid y, x_i) \right).

When \neg\exists i : 1 \le i \le n \wedge F(x_i) \ge m, AODE defaults to Naïve Bayes [3].
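The following sketch illustrates this aggregation for the binary word features used in this paper, with the parent-frequency threshold m = 30 from above. The Laplace-style smoothing of the probability estimates and all identifier names are our own simplifications, not the exact estimators of [3].

```python
from collections import defaultdict

class AODE:
    """Sketch of AODE over binary word features; m is the parent-frequency threshold."""

    def __init__(self, m=30):
        self.m = m

    def fit(self, X, y):
        self.N = len(X)
        self.classes = sorted(set(y))
        self.f = defaultdict(int)     # F(x_i): (attribute, value) -> count
        self.yx = defaultdict(int)    # (class, attribute, value) -> count
        self.yxx = defaultdict(int)   # (class, i, v_i, j, v_j) -> count
        self.cls = defaultdict(int)   # class -> count, for the Naive Bayes fallback
        for xs, c in zip(X, y):
            self.cls[c] += 1
            for i, vi in enumerate(xs):
                self.f[(i, vi)] += 1
                self.yx[(c, i, vi)] += 1
                for j, vj in enumerate(xs):
                    self.yxx[(c, i, vi, j, vj)] += 1
        return self

    def predict(self, xs):
        parents = [i for i, vi in enumerate(xs) if self.f[(i, vi)] >= self.m]
        scores = {}
        for c in self.classes:
            if not parents:
                # no attribute value is frequent enough: default to Naive Bayes
                p = (self.cls[c] + 1.0) / (self.N + len(self.classes))
                for j, vj in enumerate(xs):
                    p *= (self.yx[(c, j, vj)] + 1.0) / (self.cls[c] + 2.0)
            else:
                # sum over qualifying one-dependence models; dividing by the number
                # of parents is constant across classes, so it is omitted
                p = 0.0
                for i in parents:
                    vi = xs[i]
                    q = (self.yx[(c, i, vi)] + 1.0) / (self.N + 2.0 * len(self.classes))
                    for j, vj in enumerate(xs):
                        if j != i:
                            q *= (self.yxx[(c, i, vi, j, vj)] + 1.0) / (self.yx[(c, i, vi)] + 2.0)
                    p += q
            scores[c] = p
        return max(scores, key=scores.get)
```

For example, AODE(m=30).fit(train_vectors, train_labels).predict(test_vector) returns the predicted label for one e-mail; the pairwise counts make training quadratic in the number of selected words, which is one reason feature selection matters here.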

3.2. Hidden Naïve Bayes

The basic idea of HNB for spam filtering is to create a hidden parent for each word/attribute, which combines the influences from all other words/attributes.

Figure 1. The structure of HNB

Fig. 1 gives the structure of HNB, which was originally proposed in [2]. In HNB, attribute dependencies are represented by the hidden parents of the attributes. C is the class node and is also the parent of all attribute nodes. Each attribute A_i has a hidden parent A_{hp_i} (i = 1, 2, ..., n), drawn as a dashed circle. The arc from the hidden parent A_{hp_i} to A_i is drawn as a dashed directed line, to distinguish it from regular arcs. The joint distribution is therefore defined as

P(A_1, \ldots, A_n, C) = P(C) \prod_{i=1}^{n} P(A_i \mid A_{hp_i}, C),

where

P(A_i \mid A_{hp_i}, C) = \sum_{j=1, j \ne i}^{n} W_{ij} \cdot P(A_i \mid A_j, C)

and \sum_{j=1, j \ne i}^{n} W_{ij} = 1. More details about calculating W_{ij} and HNB can be found in [2].
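The paper defers the computation of the weights W_ij to [2]. As a hedged illustration only, the sketch below estimates them from the conditional mutual information I(A_i; A_j | C) and normalizes each row to sum to one, as required above; the estimator and the names are our assumptions about [2], not a statement of the authors' implementation.

```python
import math
from collections import defaultdict

def conditional_mutual_information(i, j, X, y):
    """I(A_i; A_j | C) estimated from frequency counts."""
    n = len(X)
    c_ijc = defaultdict(int)
    c_ic = defaultdict(int)
    c_jc = defaultdict(int)
    c_c = defaultdict(int)
    for xs, c in zip(X, y):
        c_ijc[(xs[i], xs[j], c)] += 1
        c_ic[(xs[i], c)] += 1
        c_jc[(xs[j], c)] += 1
        c_c[c] += 1
    cmi = 0.0
    for (vi, vj, c), n_ijc in c_ijc.items():
        # P(a_i, a_j, c) * log[ P(a_i, a_j | c) / (P(a_i | c) * P(a_j | c)) ]
        cmi += (n_ijc / n) * math.log((n_ijc * c_c[c]) / (c_ic[(vi, c)] * c_jc[(vj, c)]))
    return cmi

def hidden_parent_weights(X, y):
    """W[i][j] with sum_j W[i][j] = 1 for each attribute i, as required above."""
    n_attr = len(X[0])
    W = [[0.0] * n_attr for _ in range(n_attr)]
    for i in range(n_attr):
        cmis = [conditional_mutual_information(i, j, X, y) if j != i else 0.0
                for j in range(n_attr)]
        total = sum(cmis)
        W[i] = [v / total if total else 1.0 / (n_attr - 1) for v in cmis]
        W[i][i] = 0.0   # no self-dependence
    return W
```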

3.3. Locally Weighted learning with Naïve Bayes

The idea of LWNB is to learn a local model at prediction time. In LWNB, Naïve Bayes is learned locally in the same way that linear regression is used in locally weighted linear regression: a local Naïve Bayes model is fit to a subset of the data in the neighborhood of the instance whose class value is to be predicted [1]. The training samples in this neighborhood are weighted, with more distant samples receiving less weight. As in [1], we use a linear weighting function in our experiments. The classification is then obtained from these locally fitted Naïve Bayes models.
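A minimal sketch of this procedure for binary word vectors is given below. The Euclidean distance, the particular linear weighting (decreasing to zero at the farthest selected neighbor), the Laplace smoothing, and the identifier names are our assumptions; the experiments in Section 4 use a neighborhood of three e-mails.

```python
import math
from collections import defaultdict

def lwnb_predict(x, X, y, k=3):
    """Classify one binary word vector x with a Naive Bayes model fitted to its
    k nearest training e-mails."""
    dists = [math.dist(x, xs) for xs in X]
    neighborhood = sorted(range(len(X)), key=lambda i: dists[i])[:k]
    d_max = dists[neighborhood[-1]] or 1.0
    weights = {i: 1.0 - dists[i] / d_max for i in neighborhood}  # linear weighting

    # weighted Naive Bayes counts over the neighborhood only
    cls_w = defaultdict(float)
    feat_w = defaultdict(float)      # (class, attribute, value) -> weighted count
    for i, w in weights.items():
        cls_w[y[i]] += w
        for j, v in enumerate(X[i]):
            feat_w[(y[i], j, v)] += w
    total = sum(cls_w.values())

    def log_score(c):
        s = math.log((cls_w[c] + 1.0) / (total + len(cls_w)))
        for j, v in enumerate(x):
            s += math.log((feat_w[(c, j, v)] + 1.0) / (cls_w[c] + 2.0))
        return s

    return max(cls_w, key=log_score)
```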

4. Experiments and Analysis

Our experiments are based on two popular corpora, PU1 and PU2 (available at http://www.aueb.gr/users/ion/publications.html). In all PU corpora, attachments, HTML tags, and header fields other than the subject were removed, leaving only the subject line and the mail body text. To address privacy concerns, each token of a corpus is encoded as a unique integer.

PU1 Corpus: The PU1 corpus consists of 1099 messages, of which 481 are spam and 618 are legitimate; the spam rate is 43.77%.

PU2 Corpus: The PU2 corpus contains fewer messages than PU1, 721 in total. Among them, 579 messages are labeled legitimate and 142 spam.

For reasons of space, we present only the results of the seven classifiers measured by accuracy; other performance measures were also used in our experiments. The following results are obtained through 10-fold cross-validation. In our experiments, the k of kNN is set to 1, SVM is run with a linear kernel function, and we select three neighbors when running LWNB.
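The sketch below spells out this protocol under our assumptions: the feature scores are recomputed inside each training fold, the classifier is any object exposing fit and predict (for instance the AODE sketch in Section 3.1), and accuracy is pooled over the ten folds. The paper does not state whether feature selection was refit per fold, so treat that detail as our reading.

```python
import random

def select_top_columns(X, y, score_fn, k=100):
    """Rank feature columns with score_fn (e.g. symmetrical uncertainty) and keep k."""
    columns = list(zip(*X))
    scores = [score_fn(list(col), y) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)[:k]

def cross_validate(X, y, make_classifier, score_fn, n_words=100, folds=10, seed=0):
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    correct = 0
    for f in range(folds):
        test = set(idx[f::folds])                       # every folds-th shuffled index
        train = [i for i in idx if i not in test]
        cols = select_top_columns([X[i] for i in train], [y[i] for i in train],
                                  score_fn, k=n_words)  # selection inside the fold
        project = lambda xs: [xs[j] for j in cols]
        clf = make_classifier().fit([project(X[i]) for i in train],
                                    [y[i] for i in train])
        correct += sum(clf.predict(project(X[i])) == y[i] for i in test)
    return correct / len(X)
```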

Tables 1 and 2 summarize the results of the seven classifiers on PU1 and PU2, where the top 100 relevant words are selected by each of the four feature selection methods. From Tables 1 and 2, we find that AODE performs best among the seven classifiers, for every feature selection method. We also find that selecting relevant words by SU is competitive with the other three methods. Since there has been little research on SU for spam filtering, we further study the capability and stability of SU and of the seven classifiers by running the classifiers on different numbers of relevant words selected according to their SU scores. We sort the words of each corpus by their SU scores and then observe how the performance of the seven classifiers changes with the number of top-ranked words retained. The upper limits on the numbers of selected words in Fig. 2 are determined by the number of words with positive, i.e. informative, SU scores: about 1150 words in PU1 and 616 words in PU2.

Table 1. Comparison of accuracy (%) on the PU1 corpus when selecting the top 100 relevant words.

            HNB    AODE   LWNB   SVM    C4.5   NB     kNN
InfoGain    94.13  96.24  93.39  95.96  92.66  91.74  92.39
GainRatio   92.02  92.75  92.39  93.21  91.19  92.84  91.74
ReliefF     92.20  94.59  92.02  93.85  91.47  89.54  90.46
SU          94.50  96.06  93.30  95.32  92.11  93.03  90.83

Table 2. Comparison of accuracy (%) on the PU2 corpus when selecting the top 100 relevant words.

            HNB    AODE   LWNB   SVM    C4.5   NB     kNN
InfoGain    94.23  95.21  91.13  93.52  90.56  92.82  90.28
GainRatio   92.82  91.27  91.83  91.41  88.45  91.97  90.56
ReliefF     93.80  94.93  92.39  94.23  89.01  93.66  92.82
SU          94.23  94.93  90.14  93.24  92.11  93.24  90.56

From Fig. 2, we can see that the performances of AODE, HNB, and LWNB are more stable than those of the other classifiers such as kNN and SVM. However, the performance of LWNB is not that satisfying, and LWNB is perhaps not well suited to the task of spam filtering. Compared to LWNB, the other Bayesian classifiers perform very satisfactorily, especially AODE, whose performance is the best and most stable. Though HNB does not outperform AODE, its capability for spam filtering is also competitive: it has explicit semantics and is therefore more understandable than AODE. Moreover, the weights in HNB can be assigned by human experts, which allows effective interaction between anti-spam experts and the spam filtering program [2]. The second best classifier is SVM, but its high computational complexity limits it to small-scale problems; the complexity of AODE and HNB is lower than that of SVM, which makes them more practical. The ability of SU to select relevant words is also promising.


Figure 2. Accuracy curves of seven classifiers with different numbers of relevant words according to their SU scores on PU1 (left) and PU2 (right) corpora. NB represents Naïve Bayes.

From Fig. 2, the classifiers already achieve good performance when the number of relevant words reaches 100 on PU1 and 50 on PU2, respectively.

5. Conclusions

In this paper, we apply three novel Bayesian classification methods, AODE, HNB, and LWNB, to spam filtering. Four other traditional methods are also evaluated for comparison. We also use four feature selection methods, Information Gain, Gain Ratio, Symmetrical Uncertainty, and ReliefF, to select relevant words of e-mails in our experiments. Our empirical study shows that the abilities of AODE and HNB for spam filtering are more competitive than those of traditional methods in different respects: AODE for its high accuracy and efficiency, and HNB for its explicit semantics and excellent discriminability. For LWNB, however, our experimental results suggest that it is not well suited to spam filtering. Moreover, we find in our experiments that the ability of SU to select relevant words is also promising, a topic on which there has been little research.

6. Acknowledgement

The work described in this paper is supported by the Science Foundation of Renmin University of China (No. 06XNB055) and the National Natural Science Foundation of China (No. 10601064, No. 70621001, No. 70601033).

References

[1] E. Frank, M. Hall, and B. Pfahringer, "Locally Weighted Naïve Bayes," In Proc. of the Conference on Uncertainty in Artificial Intelligence, Morgan Kaufmann, pp. 249-256, 2003.

[2] H. Zhang, L. Jiang, and J. Su, "Hidden Naive Bayes," In Proc. of the 20th National Conference on Artificial Intelligence, AAAI Press, pp. 919-924, 2005.
[3] G.I. Webb, J.R. Boughton, and Z.H. Wang, "Not So Naïve Bayes: Aggregating One-Dependence Estimators," Machine Learning, 58, pp. 5-24, 2005.
[4] I. Androutsopoulos, G. Paliouras, and E. Michelakis, "Learning to filter unsolicited commercial e-mail," Technical Report 2004/2, NCSR "Demokritos", 2004.
[5] M. Sahami, S. Dumais, D. Heckerman, and E. Horvitz, "A Bayesian approach to filtering junk e-mail," In Learning for Text Categorization, Madison, Wisconsin, 1998.
[6] K.M. Schneider, "A Comparison of Event Models for Naïve Bayes Anti-Spam E-Mail Filtering," In Proc. of the 10th Conference of the European Chapter of the Association for Computational Linguistics, Budapest, Hungary, pp. 307-314, April 2003.
[7] I. Androutsopoulos et al., "Learning to Filter Spam Email: A Comparison of a Naïve Bayesian and a Memory-based Approach," In Proc. of the Workshop on Machine Learning and Textual Information Access, pp. 1-13, 2000.
[8] L. Zhang, J.B. Zhu, and T.S. Yao, "An Evaluation of Statistical Spam Filtering Techniques," ACM Trans. Asian Lang. Inf. Process., 3(4):243-269, 2004.
[9] H. Drucker, D.H. Wu, and V.N. Vapnik, "Support vector machines for spam categorization," IEEE Trans. on Neural Networks, 10(5):1048-1054, 1999.
[10] A. Kolcz and J. Alspector, "SVM-based filtering of e-mail spam with content-specific misclassification costs," In Proc. of the IEEE International Conference on Data Mining, 2001.
[11] G. Sakkis, I. Androutsopoulos, G. Paliouras et al., "A Memory-Based Approach to Anti-Spam Filtering for Mailing Lists," Information Retrieval, 6(1):49-73, 2003.
[12] X. Carreras and L. Marquez, "Boosting trees for anti-spam email filtering," In Proc. of RANLP'01, 4th International Conference on Recent Advances in Natural Language Processing, 2001.
[13] I. Kononenko, "Estimating Attributes: Analysis and Extensions of Relief," In Proc. of ECML'94, pp. 171-182, Springer-Verlag New York, Inc., 1994.