A Comparative Study of Various Supervised Feature Selection Methods for Spam Classification

Shrawan Kumar Trivedi
Information Systems, BML Munjal University, School of Management,
Gurgaon, Haryana 122413, India
[email protected]

Shubhamoy Dey
Information Systems, Indian Institute of Management Indore,
Prabandh Shikhar, Rau, Indore 453556, India
[email protected]

ABSTRACT Classification of spam from a bunch of email files is a challenging research area in the text mining domain, and machine learning based approaches have been widely experimented with in the literature with enormous success. For effective learning of a classifier, a small number of informative features is important. This research presents a comparative study of various supervised feature selection methods: Document Frequency (DF), Chi-Squared (χ²), Information Gain (IG), Gain Ratio (GR), Relief F (RF), and One R (OR). Two corpora (Enron and SpamAssassin) are selected for this study, where Enron is the main corpus and SpamAssassin is used for validation of the results. A Bayesian classifier is used to classify the given corpora with the help of the features selected by the above feature selection techniques. The results of this study show that RF is an excellent feature selection technique in terms of classification accuracy and false positive rate, whereas DF and χ² were not so effective. The Bayesian classifier has proven its worth in this study in terms of good accuracy and low false positives.

Categories and Subject Descriptors I.5.2 [Pattern Recognition]: Design Methodology—classifier design and evaluation, feature evaluation and selection; I.2.7 [Natural Language Processing] – Text analysis

General Terms Algorithms, Performance, Experimentation, Application

Keywords Spam classification; Feature selection; Bayesian Classifier; False Positive Rate; Classification Accuracy; F-Value.

ICTCS '16, March 4-5, 2016, Udaipur, India. Copyright 2016 ACM.

1. INTRODUCTION Nowadays, email is seen as a rapid and cheap tool of communication between people and organisations. Conversely, spam, or unsolicited email, is the bane of such communication. A recent study indicates that more than 70% of


the business emails are spam [1]. Such growth leads to serious problems: unnecessarily filling users' mailboxes with unwanted email, engulfing good emails, consuming huge network bandwidth and storage space, and wasting time in sorting them [2]. A number of techniques, such as challenge-response, blacklisting, white-listing and grey-listing based approaches, have been tested in the research for the above problems, but none of them performed well. Nowadays, the content based filtering approach is being experimented with widely in the literature, in which machine learning techniques are gaining substantial success. Such approaches use learning of a classifier, where a spam filtering model is created by training the classifier. In this technique, a small number of informative features of the existing instances is recommended for training the model. For a good classifier, a well discriminating subset of the features is necessary. Feature selection techniques are used to select features that are capable of discriminating between email files of different classes. Feature selection is categorised into three parts, i.e. supervised [3], unsupervised [4] and semi-supervised feature selection [5]. However, this research concentrates only on supervised feature selection methods, which are widely used in machine learning approaches. Supervised feature selection techniques are employed after pre-processing (stop word removal and lemmatisation). These techniques help to find informative terms within the complete set of terms. The use of a few good features (i.e. terms) to represent documents has been shown to be effective for classifier performance. In these methods, different rule based algorithms are used for representing the documents using an informative feature subset.
In this research, a comparative study of various feature selection approaches, namely Document Frequency (DF), Chi-Squared (χ²), Information Gain (IG), Gain Ratio (GR), Relief F (RF), and One R (OR), is performed. A Bayesian classifier is the choice for testing the above feature selection approaches. Two publicly available corpora, Enron and SpamAssassin, are preferred for this research, where Enron is the main corpus and SpamAssassin is used for validating the results of the first corpus.

2. STRUCTURE OF SPAM CLASSIFIER

2.1 Corpora
This research incorporates two different corpora, taken from two different sources. Our main analysis is done on the "Enron email" corpus; thereafter the "SpamAssassin" corpus is engaged for validation of the results found from the first corpus. A description of the corpora is given below:

2.1.1 Enron email corpus:

Out of the six existing versions of the Enron corpus, versions 3, 4, 5, and 6 are selected to create 6000 legitimate (Ham) and 6000 unsolicited (Spam) files by random sampling. The spam files are produced from versions 4, 5, and 6. To obtain the same number of ham files, version 3 has also been taken together with versions 4, 5, and 6. The rationale behind selecting these versions was the complexity imbibed in their email spam files [6, 7, 8, 9].

Document Frequency (DF)
Document frequency evaluates all the email documents of a corpus in search of informative features. In this method, features whose document frequency is lower or higher than certain estimated threshold frequencies are removed: both very rare and extremely frequent features are considered non-informative, or weak in improving classification accuracy. The selected features are then used to test the examples with the highest probability. This is one of the simplest, most scalable and effective techniques for feature selection.

2.1.2 SpamAssassin:

Another corpus, "SpamAssassin", is also taken for this study. This corpus contains older as well as recent unsolicited emails (Spam) collected from non-spam-trap sources. Out of the whole spam set, 2350 spam email files are taken for this research. The corpus also contains both easy and hard legitimate (Ham) files. To maintain the same proportion, the easy and hard emails are mixed to generate 2350 ham email files.

2.2 Pre-Processing of the Corpora:

Pre-processing is used to transform email files: strings of characters are converted into a representation suitable for the classification algorithm. Initially, the text is transformed by the following steps:

- HTML (or other) tags removal
- Stop-words removal
- Lemmatization

The above steps are done in the Feature Extraction process.
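The three pre-processing steps can be sketched as below (a minimal illustration; the stop-word list is a small assumed subset, and the suffix-stripping "lemmatiser" is a crude stand-in for a real lemmatisation library):

```python
import re

STOP_WORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "are"}  # illustrative subset

def crude_lemma(token):
    # Crude suffix stripping as a stand-in for proper lemmatisation.
    if token.endswith("ing") and len(token) > 5:
        return token[:-3] + "e"   # improving -> improve
    if token.endswith("s") and len(token) > 3:
        return token[:-1]         # prices -> price
    return token

def preprocess(raw_email):
    text = re.sub(r"<[^>]+>", " ", raw_email)            # 1. HTML (or other) tags removal
    tokens = re.findall(r"[a-z]+", text.lower())         # tokenise
    tokens = [t for t in tokens if t not in STOP_WORDS]  # 2. stop-words removal
    return [crude_lemma(t) for t in tokens]              # 3. lemmatization

print(preprocess("<p>The prices are improving</p>"))  # ['price', 'improve']
```

In practice a dictionary-based lemmatiser would replace `crude_lemma`; the point here is only the order of the three transformations.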

2.2.1 Feature Extraction:
In this process, the information in the email files is extracted to develop an associated feature dictionary. A string-to-word-vector mechanism is applied for this process, which includes HTML (or other) tag removal, stop-word removal (removing words which appear often, such as articles, prepositions and conjunctions) and lemmatization (reducing words to their basic form, e.g. "improving" to "improve"). The extracted feature set is then taken forward to the dimensionality reduction process.

2.2.2 Dimensionality Reduction:
The major problem of email classification is the high dimensionality of the feature space, where each dimension corresponds to a unique word found across many files. This large feature space creates difficulty for standard classification methods due to costly computation and unreliable results. Therefore, the large feature space is reduced. This reduction process is known as dimensionality reduction and is done by feature selection mechanisms.

2.2.3 Feature Selection Process:
Feature selection methods are used to remove less-informative features/words and to select informative features from email files, reducing the computational complexity. Many feature selection methods are popular in classification research, such as Document Frequency (DF), Information Gain (IG), Chi-Squared (χ²), Gain Ratio (GR), Relief F (RF) and One R (OR).

Information Gain (IG)
Information gain is used to evaluate highly informative features in the machine learning domain. It computes the information obtained for predicting the class of an email document, where the classes are decided from the presence or absence of a word/feature in that document. Information gain is calculated by measuring the reduction in overall entropy after inclusion of a new feature. Entropy is defined as the expected value of the information required to classify an instance. The method is based on the association between the feature and the class. Let X and Y be discrete random variables/features. The entropy of Y before and after observing X is calculated as:

H(Y) = -\sum_{y \in Y} p(y) \log_2 p(y)    (1)

H(Y|X) = -\sum_{x \in X} p(x) \sum_{y \in Y} p(y|x) \log_2 p(y|x)    (2)

Information gain is the amount of additional information about Y provided by X, by which the entropy of Y decreases. It is computed by the following formula:

IG = H(Y) - H(Y|X) = H(X) - H(X|Y) = H(Y) + H(X) - H(Y, X)    (3)

Information gain is a symmetrical measure, hence its value for Y after observing X is the same as its value for X after observing Y.

Gain Ratio (GR)
This technique is an extension of Information Gain (IG). The weakness of IG is its bias towards selecting features with a larger number of distinct values, even when they carry very little information.

IG = H(Y) - H(Y|X) = H(X) - H(X|Y)    (4)

To compensate for the bias of IG, Gain Ratio (GR) is used, which is a non-symmetrical measure (Hall & Smith, 1998) [10]:

GR = IG / H(X)    (5)

From (5), when the variable Y is to be predicted, IG is normalised by dividing by the entropy of X. The normalisation gives values between 0 and 1. A value of 1 indicates that knowing X completely predicts Y, and a value of 0 indicates that X and Y have no relation to each other. Gain Ratio (GR) thus differs from IG by not favouring features with a larger number of values.
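The entropy, information gain and gain ratio computations above can be checked with a small sketch (toy data; the binary word-presence feature is hypothetical):

```python
from collections import Counter
from math import log2

def entropy(values):
    """H(Y) = -sum p(y) log2 p(y)."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def info_gain(xs, ys):
    """IG = H(Y) - H(Y|X) for a discrete feature X and class labels Y."""
    n = len(ys)
    h_cond = 0.0
    for v in set(xs):
        subset = [y for x, y in zip(xs, ys) if x == v]
        h_cond += (len(subset) / n) * entropy(subset)
    return entropy(ys) - h_cond

def gain_ratio(xs, ys):
    """GR = IG / H(X): 1 means X fully predicts Y, 0 means no relation."""
    return info_gain(xs, ys) / entropy(xs)

# Presence (1) / absence (0) of a word vs. spam/ham class: here the feature
# fully separates the classes, so both IG and GR equal 1.0.
xs = [1, 1, 1, 0, 0, 0]
ys = ["spam", "spam", "spam", "ham", "ham", "ham"]
print(info_gain(xs, ys), gain_ratio(xs, ys))  # 1.0 1.0
```

With a less discriminating feature the conditional entropy H(Y|X) stays closer to H(Y), and both scores drop towards 0.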

Chi-Squared (χ²)
This is a well-known and commonly used technique for selecting features (Liu & Setiono, 1995) [11]. The Chi-Squared (χ²) method selects valuable features from the feature space with respect to the class by analysing the value of the chi-square statistic. The method tests an initial (null) hypothesis which assumes that the two variables are independent:

\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}    (6)

where O_{ij} is the observed frequency and E_{ij} is the expected frequency justified by the null hypothesis. A higher value of χ² gives stronger evidence against the initial hypothesis.

Relief F (RF)
This algorithm randomly selects instances from the email data and observes their nearest neighbours to adjust a feature weighting vector. In this way, it gives larger weight to features that discriminate an instance well from neighbours of different classes. In particular, it attempts to find the best estimate of W_f from the given probabilities to assign the weight of each feature f:

W_f = P(different value of f | nearest instance from a different class) - P(different value of f | nearest instance from the same class)    (7)

One Rule (OR)
This algorithm was developed by Robert Holte (1993) [12], a professor at the University of Ottawa. The technique works by taking a set of examples with several features and classes. It iteratively selects a single best feature and bases its rules solely on that feature. The algorithm is as follows.

For each feature f_x (rule for feature f_x):
1. For each value v_x from the domain of f_x:
2. Select the set of examples where feature f_x has value v_x.
3. Let c_x be the most frequent class within that set.
4. Add the condition "for feature f_x with value v_x the class will be c_x".

2.3 Feature Representation
The term weighting feature representation method is the choice of this research. Suppose that each email file is represented as a column vector D_x defined by the words extracted from the email files, i.e. D_x = (w_1, w_2, w_3, ...), where w_i is the i-th word/feature of the email document d_x. The combination of all email documents and words forms an M × N matrix, where M is the number of distinct features and N is the number of email instances. This is known as the term-document relationship, and an entry a_{ij} of the matrix is defined as the degree of relationship between term i and instance j.

2.4 Machine Learning Classifier
2.4.1 Bayesian classifier
This idea was proposed by Lewis in 1998 [13], who introduced the term P(c_i|d_j), defined as the probability that a document represented by a vector d_j = (w_{1j}, w_{2j}, ..., w_{nj}) of terms falls within a certain category c_i. This probability is calculated by Bayes' theorem:

P(c_i|d_j) = P(c_i) * P(d_j|c_i) / P(d_j)    (8)

where P(d_j) is the probability that an arbitrarily selected document is represented by the document vector d_j, and P(c_i) is the probability that an arbitrarily selected document falls in the class c_i. The discussed classification method is usually known as "Bayesian classification".

3. EXPERIMENTAL DESIGN
3.1 Software:
This study used JAVA and MATLAB environments on the Windows 7 operating system for testing the concerned classifiers.

3.2 System Design:
After pre-processing of the concerned corpora, the feature selection methods are used to select the most informative features. 1180 features for Enron and 1210 features for SpamAssassin are selected for testing the Bayesian classifier, which gives a clear picture of the relative performance of the feature selection techniques. The instances of the corpora are split for training and testing: 66% of the instances are taken for training on the selected features, and the remaining 34% are used for testing.

3.3 Evaluation Metrics:
Classification Accuracy, F-Value and False Positive Rate are used in this research (Table 1).

Table 1. Performance Instruments

Instrument           | Related Formula
Accuracy             | Accuracy = (N_Hamc + N_Spamc) / (N_Hamc + N_Hamm + N_Spamc + N_Spamm)
F-Value              | F = (2 * Precision * Recall) / (Precision + Recall)
False Positive Rate  | FPrate = N_Hamm / (N_Hamm + N_Hamc)

In the table above, the formulae of the performance measures are shown, where N_Hamc denotes the total number of correctly classified ham emails, N_Hamm denotes the number of misclassified ham emails, N_Spamc is the number of correctly classified spam emails and N_Spamm denotes the total number of misclassified spam emails.
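The Bayesian classifier and the metrics of Table 1 can be sketched end-to-end (a standard multinomial Naive Bayes with Laplace smoothing as a stand-in for the paper's Bayesian classifier, trained on a hypothetical toy corpus; not the authors' implementation):

```python
from collections import Counter
from math import log

def train_nb(docs, labels):
    """Collect per-class word counts for estimating P(c) and smoothed P(w|c)."""
    word_counts = {c: Counter() for c in set(labels)}
    class_counts = Counter(labels)
    vocab = set()
    for doc, c in zip(docs, labels):
        tokens = doc.split()
        word_counts[c].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def predict(model, doc):
    """Pick the class maximising log P(c) + sum log P(w|c), per Bayes' theorem."""
    word_counts, class_counts, vocab = model
    n = sum(class_counts.values())
    def log_posterior(c):
        total = sum(word_counts[c].values())
        s = log(class_counts[c] / n)  # log P(c)
        for t in doc.split():         # log P(w|c) with Laplace smoothing
            s += log((word_counts[c][t] + 1) / (total + len(vocab)))
        return s
    return max(class_counts, key=log_posterior)

def metrics(gold, pred):
    """Accuracy, F-value (spam as positive class) and FP rate, as in Table 1."""
    ham_c = sum(1 for g, p in zip(gold, pred) if g == p == "ham")
    spam_c = sum(1 for g, p in zip(gold, pred) if g == p == "spam")
    ham_m = sum(1 for g, p in zip(gold, pred) if g == "ham" and p == "spam")
    spam_m = sum(1 for g, p in zip(gold, pred) if g == "spam" and p == "ham")
    accuracy = (ham_c + spam_c) / len(gold)
    precision = spam_c / (spam_c + ham_m) if spam_c + ham_m else 0.0
    recall = spam_c / (spam_c + spam_m) if spam_c + spam_m else 0.0
    f_value = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    fp_rate = ham_m / (ham_m + ham_c) if ham_m + ham_c else 0.0
    return accuracy, f_value, fp_rate

model = train_nb(
    ["free prize money", "win money now", "project meeting notes", "lunch meeting today"],
    ["spam", "spam", "ham", "ham"],
)
gold = ["spam", "ham"]
pred = [predict(model, "free money now"), predict(model, "project meeting")]
print(metrics(gold, pred))  # (1.0, 1.0, 0.0) on this toy split
```

The FP rate isolates how often legitimate mail is flagged as spam, which is why the paper treats it separately from accuracy.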

4. RESULT & ANALYSIS The analysis of this study is performed in two different categories. The first part analyses the classification accuracy achieved with different feature selection techniques, which is done by two different measures. The second part shows the sensitivity of the classifier and the feature selection methods towards accurate classification, for which the False Positive Rate is preferred.

4.1 Analysis of best feature selection method with classification accuracy:

In this part, the classification accuracy and F-value of the Bayesian classifier have been calculated to evaluate the most informative feature selection methods; the values of the two measures are close to each other. This part is subdivided into two sections: first the Enron corpus results are analysed, and then SpamAssassin is taken for validation of the results coming from Enron.

4.1.1 Analysis for Enron Corpus:
The results of the Bayesian classifier tested on the Enron corpus are shown in Tables 2 and 3 and Figures 1 and 2. After analysis, two observations have been identified:

Observation 1: For a small number of features, Relief F (RF) is identified as the best feature selection mechanism, with 83.7% classification accuracy from only 380 features, whereas Document Frequency (DF) and Gain Ratio (GR) are the second and third best feature selection techniques with classification accuracy of 74.3% and 62.4% respectively. On the other side, Information Gain (IG) and Chi-Squared (χ²) disappoint, with only 38.8% classification accuracy for both. As the number of features is increased, the results show that RF remains the best throughout the study, whereas the One Rule (OR) results are close to the best.

Observation 2: For the maximum number of features, the Bayesian classifier gives similar classification accuracy and F-value for all feature selection techniques. In this case, the classifier accuracy is 96.3% for all of them. This observation states that, for the highest number of features, almost all the feature selection techniques behave in a similar fashion.

Table 2. Percentage Accuracy for Enron corpus (columns: number of features)

FS    380   580   780   980   1180
χ²   49.8  72.5  85.2  92.9  96.3
GR   62.8  79.0  87.9  94.0  96.3
IG   49.3  71.9  85.0  93.1  96.3
OR   63.5  81.7  88.1  92.4  96.3
RF   84.0  85.5  90.3  93.0  96.3
DF   75.5  81.4  88.5  96.2  96.3

Table 3. Percentage F-Value for Enron corpus (columns: number of features)

FS    380   580   780   980   1180
χ²   38.8  72.1  85.2  93.0  96.3
GR   62.4  78.8  89.1  94.1  96.3
IG   38.8  71.4  84.9  93.1  96.3
OR   59.4  81.5  88.1  92.4  96.3
RF   83.7  85.3  90.2  93.0  96.3
DF   74.3  80.8  88.4  96.3  96.3

Fig 1. Percentage Accuracy for Enron corpus

Fig 2. Percentage F-Value for Enron corpus

4.1.2 Analysis with SpamAssassin Corpus:
The same analysis done on the SpamAssassin corpus (Tables 4 and 5 and Figures 3 and 4) strongly supports the results coming from the Enron corpus. The observations from this corpus are given below:

Observation 1: Again, for the maximum number of features, all the feature selection techniques perform in a similar fashion as for the Enron corpus. In this case, the classification accuracy and F-value of the Bayesian classifier is 95.2% for all feature selection methods.

Observation 2: Similarly, for a small number of features, RF proves to be an excellent feature selection technique amongst the others, and GR is the second best method, with 79.6% and 66.6% classification accuracy respectively for only 410 features. Chi-Squared (χ²) again shows its weakness, with 39.1% accuracy for 410 features. For feature numbers greater than 410 and smaller than 1210, RF is again the best technique, whereas the OR results are close to the best.

Table 4. Percentage Accuracy for SpamAssassin corpus (columns: number of features)

FS    210   410   610   810  1010  1210
χ²   49.4  51.8  79.7  90.7  96.2  95.2
GR   49.4  66.6  83.5  90.5  94.8  95.2
IG   49.4  51.2  79.9  90.7  95.7  95.2
OR   51.6  61.9  82.5  89.9  96.4  95.2
RF   76.9  79.7  82.3  92.5  93.2  95.2
DF   49.4  51.2  81.1  90.7  95.7  95.2

Table 5. Percentage F-Value for SpamAssassin corpus (columns: number of features)

FS    210   410   610   810  1010  1210
χ²   32.7  39.1  79.4  90.7  96.2  95.2
GR   32.7  66.6  83.5  90.5  94.8  95.2
IG   32.7  42.2  79.9  90.7  95.7  95.2
OR   36.3  57.1  82.4  89.9  96.4  95.2
RF   76.9  79.6  82.2  92.5  93.2  95.2
DF   32.7  42.2  80.0  90.7  95.7  95.2

Fig 3. Percentage Accuracy for SpamAssassin corpus

Fig 4. Percentage F-Value for SpamAssassin corpus

4.2 Analysis of best feature selection method with False Positive Rate:

Misclassification is a serious problem for a machine learning classifier, and the robustness of classifiers depends not only on classification accuracy but also on a low false alarm rate. To identify a good classifier for building a robust spam filtering system, the False Positive Rate (FP Rate) is calculated. The observations from the FP rate (Tables 6 and 7 and Figures 5 and 6) are given below:

Observation 1: The results on the Enron corpus (Table 6, Figure 5) demonstrate that the False Positive Rate of the Bayesian classifier converges to the same value for all the feature selection techniques: for the maximum number of features, the FP Rate is low, i.e. 1.1%, for all feature selection methods. The same test done on the SpamAssassin corpus (Table 7, Figure 6) strongly validates the results of the Enron corpus; the FP value of the Bayesian classifier again shows similar behaviour, at 5.1% for all feature selection mechanisms.

Observation 2: For a small number of features selected on the Enron corpus (Table 6, Figure 5), One Rule (OR) and Relief F (RF) have shown their potential sensitivity towards accurate classification, with low False Positive Rates of 3.1% and 3.4% respectively for only 380 features. For feature sizes above 380, RF performs excellently, whereas OR is the second best feature selection mechanism. Document Frequency (DF) and Gain Ratio (GR) give poor results and prove to be the worst techniques for accurate classification, with high False Positive Rates of 46.0% and 26.1% respectively. The same test done on the SpamAssassin corpus (Table 7, Figure 6) strongly validates the observation coming from the Enron corpus. In this case, for a low number of features, OR and RF again give the best results, with 13.0% and 5% FP rates respectively, much lower than the other feature selection techniques. DF and GR are again identified as the worst feature selection mechanisms for a small number of features, with 88.0% and 33.6% FP rates respectively for 210 features. The IG result is also poor, with 80% FP rate for 210 features, but this value reduces to an acceptable level as the number of features is increased.

Table 6. Percentage FP Rate for Enron corpus (columns: number of features)

FS    380   580   780   980  1180
χ²    6.4  14.1   7.9   1.1   3.9
GR   26.1   7.0   2.3  22.0   1.1
IG    6.4   5.3  36.3   2.3   1.1
OR    3.1   1.1   3.8   4.0   1.1
RF    3.4  10.3   6.1   8.3   1.1
DF   46.0  14.7   8.7   1.2   1.1

Table 7. Percentage FP Rate for SpamAssassin corpus (columns: number of features)

FS     210   410   610   810  1010  1210
χ²   100.0   3.3   9.5   7.1   5.3   5.1
GR   100.0  33.5  14.5  10.4   6.8   5.1
IG   100.0  88.0  13.6   7.4   6.1   5.1
OR     0.0   5.0  11.1   5.6   3.6   5.1
RF    19.9  13.0  12.4   7.8   6.7   5.1
DF   100.0  88.0  13.6   7.4   6.1   5.1

Fig 5. Percentage FP Rate for Enron corpus

Fig 6. Percentage FP Rate for SpamAssassin corpus

5. CONCLUSION
A subset of the most informative features is important for improving the learning capability and accuracy of classifiers. The aim of this research was to perform a comparative study of various supervised feature selection approaches to identify the best one, which has been successfully achieved. A good feature selection mechanism cannot be judged only by classifier performance; sensitivity towards accurate classification is also important. In view of all the metrics, Relief F (RF) has been identified as an excellent feature selection mechanism, since it gives high accuracy with a low False Positive Rate. In addition, OR performed well, but RF was the best in all dimensions. For the maximum number of features, almost all techniques select the same features with different rankings; hence the classifier is trained on the same features for all techniques, which leads to the same results. In the future, the same feature selection mechanisms can be tested on other corpora with different classifiers.

REFERENCES
[1]. Aladdin Knowledge Systems. Anti-spam white paper.
[2]. Trivedi, S. K., & Dey, S. (2013). Interplay between probabilistic classifiers and boosting algorithms for detecting complex unsolicited emails. Journal of Advances in Computer Networks, 1(2).
[3]. Song, L., Smola, A., Gretton, A., Borgwardt, K. M., & Bedo, J. (2007, June). Supervised feature selection via dependence estimation. In Proceedings of the 24th International Conference on Machine Learning (pp. 823-830). ACM.
[4]. Mitra, P., Murthy, C. A., & Pal, S. K. (2002). Unsupervised feature selection using feature similarity. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3), 301-312.
[5]. Zhao, Z., & Liu, H. (2007, April). Semi-supervised feature selection via spectral analysis. In SDM.
[6]. Trivedi, S. K., & Dey, S. (2013, December). An enhanced genetic programming approach for detecting unsolicited emails. In Computational Science and Engineering (CSE), 2013 IEEE 16th International Conference on (pp. 1153-1160). IEEE.
[7]. Trivedi, S. K., & Dey, S. (2013, October). Effect of feature selection methods on machine learning classifiers for detecting email spams. In Proceedings of the 2013 Research in Adaptive and Convergent Systems (pp. 35-40). ACM.
[8]. Trivedi, S. K., & Dey, S. (2013). Effect of various kernels and feature selection methods on SVM performance for detecting email spams. International Journal of Computer Applications, 66(21).
[9]. Trivedi, S. K., & Dey, S. (2014). Interaction between feature subset selection techniques and machine learning classifiers for detecting unsolicited emails. ACM SIGAPP Applied Computing Review, 14(1), 53-61.
[10]. Hall, M. A., & Smith, L. A. (1998). Practical feature subset selection for machine learning.
[11]. Liu, H., & Setiono, R. (1995, November). Chi2: Feature selection and discretization of numeric attributes. In Proceedings of the 7th IEEE International Conference on Tools with Artificial Intelligence (pp. 388-391). IEEE Computer Society.
[12]. Holte, R. C. (1993). Very simple classification rules perform well on most commonly used datasets. Machine Learning, 11(1), 63-90.
[13]. Lewis, D. D. (1998). Naive (Bayes) at forty: The independence assumption in information retrieval. In Machine Learning: ECML-98 (pp. 4-15). Springer Berlin Heidelberg.