17th IEEE Mediterranean Electrotechnical Conference, Beirut, Lebanon, 13-16 April 2014.

Ham or Spam? A Comparative Study of Some Content-Based Classification Algorithms for Email Filtering

Salwa Adriana Saab, Nicholas Mitri, Mariette Awad
Faculty of Electrical and Computer Engineering
American University of Beirut, Lebanon
Email: {sis08, ngm04, mariette.awad}@aub.edu.lb

Abstract—Spam emails are spreading widely and now constitute a significant share of everyone's daily inbox. Being a source of financial loss and inconvenience for their recipients, spam emails must be filtered and separated from legitimate ones. This paper presents a survey of some popular filtering algorithms that rely on text classification to decide whether an email is unsolicited or not. A comparison among them is performed on the SpamBase dataset to identify the best classification algorithm in terms of accuracy, computational time, and precision/recall rates.

I. INTRODUCTION

Spam is defined as unsolicited and unwanted email sent for financial gain or simply to cause harm or annoyance to users. Such emails may be used to distribute viruses or fake announcements that cost responders an average of 25 USD per reply [1]. It has been estimated that only 1 in 40,000 users replies to spam emails. Moreover, the fact that 48 billion of the 80 billion emails sent daily are spam underscores both the urgency and the importance of developing effective classification procedures for received emails [1].

Filtering spam is one of the focal applications of advances in pattern recognition and data mining, and heavy research has been conducted to produce algorithms capable of distinguishing spam from legitimate emails. Emails are usually filtered based on their content, which includes text and images, or on their header fields, which provide information about the sender. In this paper, we compare several proposed content-based filtering algorithms that rely on text classification to decide whether an email is spam or not.

The rest of the paper is organized as follows. Section II presents a survey of the main classification algorithms in the literature. Section III provides a general background on the classification algorithms used. Section IV presents and analyzes the experimental results, and Section V concludes with future work.

II. RELATED WORK

Many algorithms have been proposed for classifying spam and legitimate emails. N. Radha and R. Lakshmi [2] compared the performance of Naïve Bayes (NB), Multi-Layer Perceptron (MLP), J48, and Linear Discriminant Analysis (LDA) algorithms. Using the WEKA software, they achieved a prediction accuracy of 93% for J48, slightly exceeding MLP's 92%, though at the expense of increased computational time. Using RapidMiner, MLP accuracy surpassed that of LDA by 1% and that of NB by 3%.

S. Youn and D. McLeod [3] explored how dataset size affects classification performance. For a dataset of 1000 emails, support vector machines (SVM), NB, and J48 achieved accuracies of 92.7%, 97.2%, and 95.8% respectively. When the size increased to 5000, however, the accuracy of SVM dropped by 1.8% and that of NB by 0.7%, whereas that of J48 increased by 1.8%. They also deduced that accuracy increases with increasing feature size.

The authors of [4] applied weighted SVM to spam filtering. Unweighted SVM ignores the relative importance of each sample, which often leads to imbalanced classification and less precise results. They tested their algorithm on 400 emails from the Chinese corpus ZH1, half of which were spam. Results revealed that as the weight of the legitimate email class increased from 1 to 10, with the spam weight fixed at 1, precision increased from 97.47% to 99.44%. Furthermore, [5] modified SVM into a relaxed online SVM that trains only on actual errors. This reduced the number of iterations and minimized the computational cost of the algorithm. Its precision on the benchmark email datasets trec05p-1 and trec06p was very close to that of the online SVM while reducing CPU execution time.

In [6], the authors combined Best Stepwise feature selection with a Euclidean nearest neighbor classifier to create a Naïve Euclidean approach, in which each email is represented in a D-dimensional Euclidean space. Using SpamBase from the UCI repository and 10-fold cross validation, they achieved an accuracy of 82.31%, compared to 60.6% for the Zero rule. R. Laurutis et al. [7] applied artificial neural networks (ANN) to classify spam emails. Their main contribution was to replace the frequency of words in the content with descriptive properties of elusive patterns created by spammers. Their data corpus contained around 1800 spam and 2800 legitimate emails. Their experiments revealed that the ANN provided a maximum precision of 90.57% after being trained with 57 email parameters.

The authors of [8] compared SVM to the Rocchio classifier. The Rocchio classifier is based on normalized TF-IDF modeling of the training vectors. The dot product of the test and prototype vectors is computed to classify a document as spam or non-spam, and the classification threshold is chosen to minimize the training error after rank-ordering the dot products of the prototype vector with the whole training set. Compared to a binary SVM with an error rate of 0.213, the Rocchio classifier reached a 0.327 error rate using a dictionary of mixed upper and lower cases. J. Provost [9] evaluated the rule-learning RIPPER algorithm against NB. RIPPER generates keyword-spotting rules over set- and bag-valued attributes and classifies according to predefined rules that capture the impact of certain words appearing in the header fields or content of an email. Experiments performed on junk email provided by several users and on legitimate email from the author's inbox achieved 90% accuracy after training on 400 emails, whereas NB reached 95% after only 50 training emails.

In [10], B. Medlock proposed smoothed n-gram language modeling and interpolation, which assumes that the probability of a specific word in a sequence depends solely on the previous n-1 words. Separate language models were built for legitimate and spam email, the probability that each model generated the text of a message was computed, and Bayes' rule was then applied to find the class with the highest probability for the message. Accuracies of 98.84% and 97.48% were obtained for the adaptive bigram and unigram language model classifiers respectively, applied to the GenSpam corpus of 9072 legitimate and 32332 spam emails.

Finally, we discuss similar comparative studies specific to the SpamBase corpus used herein. In [11], Kiran et al. analyzed the performance of several classifiers in identifying spam emails on the SpamBase corpus using the WEKA toolset. Ensemble classifiers were also employed in a set of experiments measuring classification accuracy, precision, and recall. Different validation techniques, including half splits, leave-one-out, and 10-fold cross validation, gave consistent results. An ensemble of 25 decision trees was shown to achieve the best accuracy rate of 96.4%. In the most recent relevant work (2013), Sharma et al. [12] evaluated SpamBase using 24 classifiers from the WEKA toolset. Ten-fold cross validation was used, and accuracy, precision, and recall are reported for each algorithm. Random Committee achieved the best result with 94.28% accuracy.
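Many of the surveyed approaches share the same bag-of-words text representation. For illustration, the following is a minimal sketch of a Naïve Bayes filter of the kind benchmarked in [2], [3], and [9]; scikit-learn and the toy corpus are our own assumptions here, not the tools or datasets used in those works.

```python
# Minimal bag-of-words Naive Bayes spam filter sketch.
# The toy corpus and scikit-learn are illustrative assumptions,
# not the tools or datasets used in the surveyed works.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

emails = [
    "win a free prize now",         # spam
    "limited offer claim money",    # spam
    "meeting agenda for monday",    # legitimate
    "lunch with the project team",  # legitimate
]
labels = [1, 1, 0, 0]  # 1 = spam, 0 = legitimate

vectorizer = CountVectorizer()   # bag-of-words term counts
X = vectorizer.fit_transform(emails)

clf = MultinomialNB().fit(X, labels)
print(clf.predict(vectorizer.transform(["free money offer"])))  # -> [1]
```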

III. PRELIMINARIES

We first provide a brief background on the classification algorithms employed herein before proceeding with a comparison of how they perform on SpamBase.

A. Support Vector Machines

SVM offer a principled approach to machine learning (ML) problems because of their mathematical foundation in statistical learning theory. SVM construct their solution as a weighted sum of support vectors (SVs), which are only a subset of the training input; the SVs are the samples most influential in constructing the optimal decision boundary. Beyond minimizing an error cost function based on the training data, as other discriminant ML techniques do, SVM impose an additional constraint on the optimization problem: the separating hyperplane must be situated at a maximum distance from the different classes. This constraint forces the optimization step to find a hyperplane that eventually generalizes better, since it lies at an equal and maximum distance from the classes. The SVs and their corresponding weights are found through an exhaustive optimization step that uses Lagrangian relaxation and solves the Karush-Kuhn-Tucker (KKT) conditions to determine the parameters of this unique hyperplane. For linearly non-separable problems, SVM use the kernel trick: a kernel function satisfying Mercer's theorem maps the data to a feature space where it becomes, at worst, pseudo-linearly separable [13].
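For concreteness, the sketch below trains a soft-margin SVM with an RBF kernel, the configuration used in our experiments in Section IV.A; scikit-learn is an assumed stand-in for the MATLAB implementation actually used, and its gamma parameterization of the kernel width differs from the sigma quoted later.

```python
# RBF-kernel soft-margin SVM sketch. scikit-learn stands in for the
# MATLAB implementation used in the experiments; random data stands
# in for SpamBase. Note that sklearn's gamma parameterizes the kernel
# width differently from the sigma value quoted in Section IV.A.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 57))    # placeholder for the 57 SpamBase features
y = rng.integers(0, 2, size=200)  # placeholder spam (1) / legitimate (0) labels

clf = SVC(kernel="rbf", C=20, gamma=0.0014)  # C chosen as in Section IV.A
clf.fit(X, y)
print(clf.n_support_)  # number of support vectors retained per class
```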

B. Local Mixture Support Vector Machines (LM-SVM)

LM-SVM is a variation of traditional SVM proposed earlier by the authors of this paper [14]. It preprocesses the dataset by reducing its size using local mixture measures before the data is fed into the optimization stage of SVM. This is achieved by first clustering the data and then filtering the resulting clusters according to the aforementioned measures, which act as fitness qualifiers. LM-SVM can thus be viewed as a layer overlaid on top of traditional SVM: it benefits from all of SVM's capabilities while using its reduction scheme to cut down training and testing times. Some degradation in performance is naturally incurred and varies according to the threshold that controls the reduction rate.
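The sketch below conveys the cluster-then-filter idea; plain k-means and the simple class-mixture score used as the fitness qualifier are simplifying assumptions on our part and do not reproduce the exact local mixture measures defined in [14].

```python
# Schematic LM-SVM reduction: cluster the training set, score each
# cluster by how mixed its labels are, keep only points from mixed
# (boundary) clusters, then train a standard SVM on the reduced set.
# Plain k-means and this mixture score are simplifying assumptions,
# not the exact local mixture measures of [14].
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def lm_reduce(X, y, k=60, threshold=0.5):
    clusters = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    keep = np.zeros(len(X), dtype=bool)
    for c in range(k):
        members = clusters == c
        if not members.any():
            continue
        # 0 for pure clusters, 1 for evenly mixed (boundary) clusters.
        mix = 2 * min(y[members].mean(), 1 - y[members].mean())
        if mix >= 1 - threshold:   # smaller threshold -> stronger reduction
            keep |= members
    return X[keep], y[keep]

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 57))
y = rng.integers(0, 2, size=500)

X_red, y_red = lm_reduce(X, y, k=60, threshold=0.5)
SVC(kernel="rbf", C=20).fit(X_red, y_red)  # SVM now trains on fewer points
```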

C. Decision Trees (DT)

Decision trees are multistage decision systems that split the feature space into regions associated with the various classes. The tree hierarchy is built through candidate questions at the node level based on predefined splitting criteria, which are usually strongly related to the notion of node impurity and its gradual decrease as the tree is traversed. Stopping criteria are used to control the growth of the tree, while various pruning methods ensure that decent generalization is preserved. When a new feature vector arrives, sequential decisions are made by traversing the tree from top to bottom, and the final decision is made at the leaf level, where class assignment rules are applied.
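As a brief example, the sketch below fits a tree and then prunes it; scikit-learn's cost-complexity pruning is an assumed stand-in for the level-based and re-substitution-error pruning applied in Section IV.C.

```python
# Decision tree sketch: fit an impurity-driven tree, then prune it to
# preserve generalization. Cost-complexity pruning is an assumed
# stand-in for the pruning schemes used in Section IV.C.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 57))
y = rng.integers(0, 2, size=300)

full_tree = DecisionTreeClassifier(criterion="gini").fit(X, y)
pruned = DecisionTreeClassifier(criterion="gini", ccp_alpha=0.01).fit(X, y)

# Pruning trades a smaller tree for (ideally) better generalization.
print(full_tree.get_depth(), "->", pruned.get_depth())
```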

D. Artificial Neural Networks (ANN)

ANN have had a history riddled with highs and lows since their inception in the 1960s. Created as a simplified analogue of the human brain, ANN have undergone multiple stages of evolution: with each generation, newly available insights from the neuroscience community inspire incremental enhancements, both structural and functional, though few iterations have stood the test of time. The basic structure of an ANN consists of an input layer and an output layer with hidden layers in between, each with a varying or constant number of neurons that operate as simple computing elements to mimic biological signal propagation while minimizing the empirical risk during the training phase. Generally speaking, SVM distinguish themselves from ANN in that they do not suffer from the classical problems of multiple local minima, the curse of dimensionality, and over-fitting.
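The sketch below builds the specific architecture used later in Section IV.D, a two-layer feed-forward network with a tanh hidden layer and a sigmoidal output; scikit-learn's MLPClassifier is an assumed substitute for the ANN:DTU MATLAB toolbox adopted there.

```python
# Two-layer feed-forward network sketch: tanh hidden layer and a
# sigmoidal output, matching the architecture described in Section
# IV.D. MLPClassifier is an assumed substitute for the ANN:DTU
# MATLAB toolbox used in the actual experiments.
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 57))
y = rng.integers(0, 2, size=300)

net = MLPClassifier(hidden_layer_sizes=(5,),  # h = 5 hidden units
                    activation="tanh",
                    max_iter=2000)
net.fit(X, y)
print(net.predict_proba(X[:3])[:, 1])  # sigmoidal output = spam probability
```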

IV. EXPERIMENTS

The evaluation conducted here uses the SpamBase dataset, which consists of 4597 instances with 57 features (39.4% spam, 60.6% non-spam). Performance is evaluated based on classification accuracy along with the total time taken for training and testing. Moreover, we focus on the precision of the spam class, i.e., the ratio of the number of true spam emails to the total number of emails predicted as spam. A good classifier should achieve high precision, since classifying legitimate emails as spam is a critical issue that cannot be tolerated. In addition, we determine the recall of the spam class, which reflects the percentage of spam correctly classified out of the total number of true spam emails. We use the 10-fold cross validation (CV) technique to determine these measures because it minimizes the effect of randomly choosing the training and testing data and repeats the evaluation process 10 times to provide more accurate values. In what follows, we treat each classifier individually, varying specific parameters and recording the aforementioned measurements. All experiments used MATLAB 2009b on a 2.4 GHz quad processor with the SpamBase database [15].
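The evaluation loop can be summarized as follows; this is a minimal sketch of the 10-fold CV and spam-class precision/recall measurements described above, with scikit-learn again standing in for the MATLAB environment.

```python
# 10-fold cross validation reporting accuracy and spam-class
# precision/recall, as defined above. scikit-learn and random data
# stand in for the MATLAB environment and the SpamBase corpus.
import numpy as np
from sklearn.model_selection import cross_validate
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 57))
y = rng.integers(0, 2, size=500)  # 1 = spam, 0 = legitimate

scores = cross_validate(
    SVC(kernel="rbf", C=20), X, y, cv=10,
    scoring={"acc": "accuracy",
             "prec": "precision",   # spam-class precision (label 1)
             "rec": "recall"},      # spam-class recall
)
print(np.mean(scores["test_acc"]),
      np.mean(scores["test_prec"]),
      np.mean(scores["test_rec"]))
```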

 

A. Support Vector Machines

Running SVM for classification is usually coupled with a search for optimal parameters. Having chosen the radial basis function (RBF) kernel for the SVM implementation, a traditional grid search over a small grid yielded a choice of 0.0014 for the kernel parameter sigma, which was then used in a line search for the regularization term C. Table I shows the results of the line search for different values of C, recording the CV accuracy, average precision and recall, training time, and prediction time. While the randomness of the partitioning leaves some uncertainty as to the optimal choice, there is a discernible trend in the computational time recordings: as C increases, the training time increases while the testing time decreases. We therefore compromise between the two while looking for a value of C that achieves good precision and generalized accuracy. That value is taken to be C = 20, as highlighted in the table.

TABLE I. EXPERIMENTAL RESULTS FOR LINE SEARCH ON SVM

  C     Acc. (%)   Precision (%)   Recall (%)   Training Time (s)   Testing Time (s)
  1     91.35      92.75           84.67        1.42                0.1204
  2     92.37      93.31           86.87        1.26                0.1037
  5     92.96      93.16           88.64        1.13                0.0862
  10    93.35      93.33           89.52        1.07                0.0781
  20    93.57      93.42           90.02        1.09                0.0717
  40    93.65      93.23           90.46        1.099               0.0663
  80    93.74      93.20           90.73        1.23                0.0631
  100   93.61      93.13           90.46        1.35                0.0626
  120   93.85      93.51           90.68        1.38                0.0616
  150   93.87      93.32           90.95        1.48                0.0605
  200   93.94      93.28           91.17        1.598               0.0594
  250   93.87      93.32           90.95        1.79                0.0595
  300   93.85      93.42           90.79        2.10                0.0581
  As reported in [11]: Acc. 93.40
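The line search itself reduces to a simple loop over candidate C values; the sketch below uses the grid from Table I, with scikit-learn as the assumed stand-in.

```python
# Sketch of the line search over the regularization term C with the
# RBF kernel width fixed, as described above. The C grid is the one
# in Table I; scikit-learn and random data are stand-ins.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 57))
y = rng.integers(0, 2, size=500)

for C in [1, 2, 5, 10, 20, 40, 80, 100, 120, 150, 200, 250, 300]:
    acc = cross_val_score(SVC(kernel="rbf", C=C), X, y, cv=10).mean()
    print(f"C={C:<4}  10-fold CV accuracy = {acc:.4f}")
```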


B. Local Mixture Support Vector Machines

We presented LM-SVM as an extension of the kernel SVM, and it thus inherits the same parameter choices. LM-SVM nevertheless has its own set of parameters whose values also need to be optimized: the number of clusters k and the sigma variable associated with the kernel k-means clustering procedure used. Empirical tests coupled with a wide grid search produced a choice of 0.0014 for sigma and 60 for k. Finally, the threshold variable T that controls the rate of reduction was varied, producing the results in Table II. It can be seen that at a threshold of 0.5, while accuracy and precision suffer a slight drop, training and testing times are reduced by 90% and 88% respectively compared to the chosen (highlighted) SVM run. This speedup can be leveraged on the server side, where the high throughput of forwarded emails demands fast classification of potential spam in an energy-aware computing environment.

TABLE II. EXPERIMENTAL RESULTS FOR LM-SVM

  T     Acc. (%)   Precision (%)   Recall (%)   Training Time (s)   Testing Time (s)
  1     92.13      92.83           89.79        0.864               0.0604
  0.9   90.67      90.78           86.03        0.573               0.0428
  0.7   90.16      90.43           85.02        0.464               0.0340
  0.5   90.15      91.12           91.12        0.103               0.0080
  0.3   90.00      90.14           90.71        0.066               0.0059
  0.2   89.17      89.66           89.33        0.025               0.0023

C. Decision Trees

The same evaluation procedure was followed for decision trees. When trained on the whole set and tested on a partition, the DT model achieved a precision as high as 98%, significantly higher than any of the other classifiers. Under the 10-fold CV scheme used thus far, accuracy and precision take a significant dip, as shown in Table III. The experimental measurements are recorded across runs with varying levels of pruning. Pruning is used to ensure that no over-fitting of the data occurs and that good generalization is therefore maintained. Two pruning techniques are implemented here: level-based pruning and criteria-based pruning, with re-substitution error as the chosen criterion.

Table III shows the average accuracy, precision, recall, and computational times for DT model training and testing at pruning levels 0 through 28, at which point performance suffers significantly. Accuracy and precision follow a general arc: performance first increases and then suffers due to the loss of too many nodes. We chose a pruning level near the top of this trend, level 14, while keeping in mind the uncertainty caused by the use of CV and random partitioning. At this level, an accuracy of 92.08% and a precision of 91.51% are achieved, with training time comparable to SVM but 99% and 92% reductions in testing time compared to SVM and LM-SVM respectively. Results reported in [11] and [12] are included. While the accuracy rates are comparable, the precision in [12] is notably higher, which can be attributed to different CV partitioning or to pruning techniques that the authors used but did not specify.

TABLE III. EXPERIMENTAL RESULTS FOR DECISION TREE

  Pruning Level   Acc. (%)   Precision (%)   Recall (%)   Training Time (s)   Testing Time (ms)
  0               91.76      89.83           89.18        1.040               0.920
  2               92.19      90.21           89.96        1.032               1.027
  4               92.02      90.39           89.24        1.040               0.828
  6               92.13      90.23           89.74        1.006               0.771
  8               92.43      90.71           90.01        1.057               0.6993
  10              92.52      91.24           89.63        1.058               0.700
  12              92.19      91.2            88.74        1.046               0.610
  14              92.08      91.51           88.08        1.049               0.617
  16              91.56      91.06           87.14        1.055               0.551
  18              90.91      90.62           85.82        1.019               0.529
  20              90.54      90.62           84.77        1.028               0.519
  22              89.32      89.29           82.84        1.011               0.515
  24              89.65      90.24           82.68        1.025               0.485
  26              85.21      88.56           71.75        1.018               0.458
  28              80.56      82.06           64.86        1.002               0.432
  As reported in [12]: Acc. 91.71, Precision 95.62
  As reported in [11]: Acc. 92.58, Precision 91.14

D. Artificial Neural Networks

For the ANN, we adopted the ANN:DTU MATLAB toolbox [16], which implements a two-layer feed-forward neural network with a hyperbolic tangent function for the hidden layer and a sigmoidal function for the output layer. Table IV shows the observed accuracy, precision, recall, and computational time as the number of hidden units h was varied. The model built with 5 units in the hidden layer is where the classifier's generalization is at its best, before accuracy slides down due to over-fitting. While the training time is orders of magnitude higher than that of any other classifier, the accuracy is the best overall, and the precision is second only to SVM.

TABLE IV. EXPERIMENTAL RESULTS FOR ANN

  h    Acc. (%)   Precision (%)   Recall (%)   Time (s)
  1    93.50      91.50           92.05        8.25
  2    93.52      92.43           91.00        23.64
  5    94.02      92.96           91.78        70.81
  10   93.78      92.82           91.28        66.65
  20   93.74      92.81           91.17        250.34
  As reported in [12]: Acc. 93.28, Precision 94.332, Recall 94.5
  As reported in [11]: Acc. 90.80

E. Feature Reduction

Finally, since all experiments thus far have used the full dimensionality of the feature space, we considered reducing the feature vector size by exploiting a main ingredient of the DT model building process, namely the information gain (IG). Typically, at every level of the DT hierarchy, the IG associated with each feature is computed, and the feature with the highest IG is chosen as the next ancestor node. IG can therefore be seen not only as a measure of the decrease in impurity but also as an indicator of the significance of an attribute or feature. Here, we calculated the IG vector for the root only and sorted it in ascending order. The previous experiments were then re-run using the optimal parameters found earlier but at increasing levels of feature reduction. The stopping criterion was the level at which accuracy dropped below the 90% mark. Table V shows the results: for every classifier, the performance measurements for the full and reduced feature vectors are presented as a pair. DT and ANN were able to cut 50 features and still achieve roughly 90% accuracy and precision. LM-SVM allowed the least reduction, yet retained and even slightly increased its precision.

TABLE V. EFFECT OF FEATURE REDUCTION

  Classifier   No. of Features   Accuracy (%)   Precision (%)   Recall (%)
  SVM          57                93.57          93.42           90.02
               12                90.41          91.62           83.29
  LM-SVM       57                90.15          91.12           91.12
               21                90.01          91.82           86.2
  DT           57                92.08          91.51           88.08
               7                 90.32          89.77           85.16
  ANN          57                94.02          92.96           91.78
               7                 90.56          89.97           85.6
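The root-level IG ranking can be sketched as follows; the entropy-based computation follows the standard IG definition, and the median split for SpamBase's continuous features is a simplifying assumption on our part.

```python
# Root-level information gain (IG) ranking for feature reduction, as
# described above. Thresholding each continuous feature at its median
# is a simplifying assumption; the entropy computation is standard.
import numpy as np

def entropy(y):
    p = np.bincount(y, minlength=2) / len(y)
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def root_info_gain(x, y):
    split = x > np.median(x)  # binary split at the median
    gain = entropy(y)
    for side in (split, ~split):
        if side.any():
            gain -= side.mean() * entropy(y[side])  # weighted child entropy
    return gain

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 57))
y = rng.integers(0, 2, size=500)

ig = np.array([root_info_gain(X[:, j], y) for j in range(X.shape[1])])
ranked = np.argsort(ig)          # ascending: least informative first
X_reduced = X[:, ranked[-12:]]   # e.g., keep the 12 most informative features
print(X_reduced.shape)
```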

V. CONCLUSION

We have presented a short comparative study of SVM, LM-SVM, ANN, and DT classifiers on the SpamBase database, reporting accuracy, precision, recall, and computational requirements. The results were consistent with the theoretical strengths and limitations of each approach. An interesting extension to this work would be to evaluate the effectiveness of unsupervised classifiers in correctly clustering emails as spam or non-spam.

ACKNOWLEDGMENT

This work was funded by MER (partnership with Intel and KACST) and the ECE department at AUB.

REFERENCES

[1] Miszalska, I., Zabierowski, W., & Napieralski, A. (2007). "Selected Methods of Spam Filtering in Email." In 9th International Conference on CAD Systems in Microelectronics (CADSM'07), pp. 507-513. IEEE.
[2] Scholar, M. (2010). "Supervised learning approach for spam classification analysis using data mining tools." Organization, 2(08), 2760-2766.
[3] Youn, S., & McLeod, D. (2007). "A comparative study for email classification." In Advances and Innovations in Systems, Computing Sciences and Software Engineering, pp. 387-391. Springer Netherlands.
[4] Xiao-li, C., Pei-yu, L., Zhen-fang, Z., & Ye, Q. (2009). "A method of spam filtering based on weighted support vector machines." In IEEE International Symposium on IT in Medicine & Education (ITIME'09), Vol. 1, pp. 947-950. IEEE.
[5] Sculley, D., & Wachman, G. M. (2007). "Relaxed online SVMs for spam filtering." In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 415-422.
[6] Chan, T. Y., Ji, J., & Zhao, Q. "Learning to Detect Spam: Naive-Euclidean Approach." International Journal of Signal Processing, 1.
[7] Puniškis, D., Laurutis, R., & Dirmeikis, R. (2006). "An artificial neural nets for spam e-mail recognition." Elektronika ir Elektrotechnika (Electronics and Electrical Engineering), 5(69), 73-76.
[8] Drucker, H., Wu, D., & Vapnik, V. N. (1999). "Support vector machines for spam categorization." IEEE Transactions on Neural Networks, 10(5), 1048-1054.
[9] Provost, J. (1999). "Naïve-Bayes vs. Rule-Learning in Classification of Email." University of Texas at Austin.
[10] Medlock, B. (2006). "An Adaptive, Semi-Structured Language Model Approach to Spam Filtering on a New Corpus." In CEAS.
[11] Kiran, S. R. (2009). "Spam or not spam -- That is the question."
[12] Sharma, S., & Arora, A. (2013). "Adaptive Approach for Spam Detection." International Journal of Computer Science Issues (IJCSI), 10(4).
[13] Theodoridis, S., & Koutroumbas, K. (2009). Pattern Recognition. USA: Academic Press.
[14] Rizk, Y., Mitri, N., & Awad, M. (2013). "A Local Mixture Based SVM for an Efficient Supervised Binary Classification." In International Joint Conference on Neural Networks (IJCNN), Dallas, TX, August 4-9, 2013.
[15] "Spam Email Datasets." Internet: http://csmining.org/index.php/spam-email-datasets-.html, 2010 [Dec. 10, 2011].
[16] ANN:DTU Toolbox. http://cogsys.imm.dtu.dk/toolbox/ann/index.html#nc_binclass