Improving Statistical Bayesian Spam Filtering

University Id: 10532
Student Id: LY2002007
Topic Index: TP181
Security Level: Normal

Hunan University

MASTER’S THESIS

Improving Statistical Bayesian Spam Filtering Algorithms

By Raju Shrestha

College: Computer and Communication
Major: Computer Science and Technology
Research Field: Machine Learning
Supervisor: Prof. Yaping Lin

May 2005

Submission Date: April, 2005
Defense Date: May 22, 2005
Defense Committee Chairman: Prof. Li Renfa

Improving Statistical Bayesian Spam Filtering Algorithms

By Raju Shrestha

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF ENGINEERING IN COMPUTER SCIENCE AND TECHNOLOGY

Graduate School
Hunan University, Changsha, P.R. China
May 2005

© Copyright by Raju Shrestha, 2005

Hunan University

Statement of Originality

I hereby certify that this thesis is based on research carried out by myself independently under the guidance and supervision of my supervisor. Any ideas or quotations from the work of other individuals or collectives are fully acknowledged in accordance with the standard referencing practices of the discipline.

Author's Signature:          Date: May 24, 2005

Copyright Statement

Permission is herewith granted to Hunan University to circulate and to have copied for noncommercial purposes, at its discretion, this thesis upon the request of individuals or institutions. The author reserves other publication rights, and neither the thesis nor extensive extracts from it may be printed or otherwise reproduced without the author's written permission.

This thesis is: 1. Secure, and this authorization is valid after ____________. 2. Not Secure. (Please mark the corresponding check box.)

Author's Signature:          Date: May 24, 2005

Supervisor's Signature:      Date: May 24, 2005

Table of Contents

List of Tables
List of Figures
Abstract
Acknowledgements

1 Introduction
   1.1 Spam and its Types
   1.2 Anti-spamming Techniques
   1.3 Previous Works on Bayesian Spam Filtering
   1.4 Contributions
   1.5 Thesis Organization

2 Statistical Bayesian Spam Filtering Algorithms
   2.1 Spam Filtering Steps
   2.2 Naive Bayes (NB) Algorithm
   2.3 Paul Graham's (PB) Algorithm
   2.4 Gary Robinson's (GR) Algorithm
   2.5 Dealing with Small Probabilities and Normalization

3 Preprocessing and Feature Selection
   3.1 Preprocessing
   3.2 Feature Extraction or Tokenization

4 Filtering Based on Co-weighted Multi-estimations
   4.1 Main Idea and Algorithm Description
   4.2 Training Algorithm
   4.3 Classification Algorithm

5 Filtering Based on Co-weighted Multi-area Information
   5.1 Main Idea and Algorithm Description
   5.2 Training Algorithm
   5.3 Classification Algorithm

6 Dataset Collections and Evaluation Measures
   6.1 Corpora Collections
   6.2 Evaluation Measures

7 Experiments and Analysis
   7.1 Parameters Tuning
   7.2 Experiments with Co-weighted Multi-estimations
       7.2.1 Experiments and Results
       7.2.2 Analysis
   7.3 Experiments with Co-weighted Multi-area Information
       7.3.1 Experiments and Results
       7.3.2 Analysis

8 Conclusions and Future Work
   8.1 Conclusions
   8.2 Future Work

Appendix

A Implementation of Filter Application
   A.1 Data Structures
   A.2 Source Files
   A.3 Data Files

B Application User's Manual
   B.1 System Requirements
   B.2 Installation of the Application
   B.3 Running and Using the Application
       B.3.1 Dataset Preparer
       B.3.2 Trainer
       B.3.3 Classifier
       B.3.4 Tester

C Program Documentation
   C.1 Package and Class Summaries
       C.1.1 Class Summary
       C.1.2 Enum Summary
   C.2 Hierarchy For Package rsspambayes
       C.2.1 Class Hierarchy
       C.2.2 Enum Hierarchy
   C.3 Class Details
       C.3.1 Algorithm
       C.3.2 Category
       C.3.3 Classifier
       C.3.4 Counts
       C.3.5 DatasetPreparer
       C.3.6 FreqTable
       C.3.7 Frequencies
       C.3.8 GRobinsonBayes
       C.3.9 NaiveBayes
       C.3.10 PGrahamBayes
       C.3.11 RShresthaBayes1
       C.3.12 RShresthaBayes2
       C.3.13 Stats
       C.3.14 Tester
       C.3.15 Tokenizer
       C.3.16 Trainer
       C.3.17 Utils
   C.4 Enum Details
       C.4.1 Algorithms
       C.4.2 Areas
       C.4.3 EmailCats
       C.4.4 Headers
       C.4.5 HtmlTags
       C.4.6 Method Detail for All Enum Types

D List of Papers Published

Bibliography

List of Tables

6.1 Training and test dataset sizes
6.2 Confusion/Contingency Matrix
7.1 Test results (SR, SP, FPR and FNR) with individual and co-weighted multi-estimations
7.2 Weighted accuracy rates, error rates and total cost ratios with individual and co-weighted multi-estimations
7.3 Test results (SR, SP, FPR and FNR) with individual area-wise and co-weighted multi-area based estimations
7.4 Weighted accuracy rates, error rates and total cost ratios with individual area-wise and co-weighted multi-area based estimations
B.1 Valid parameters corresponding to six algorithms

List of Figures

1.1 Spam Types
2.1 Block diagram of Bayesian spam filter
7.1 Classification accuracies with individual and co-weighted multi-estimations, for λ = 9
7.2 Total cost ratios with individual and co-weighted multi-estimations, for λ = 9
7.3 Classification accuracies with individual and co-weighted multi-estimations, for λ = 99
7.4 Total cost ratios with individual and co-weighted multi-estimations, for λ = 99
7.5 Classification accuracies with individual area-wise and co-weighted multi-area based estimations, for λ = 9
7.6 Total cost ratios with individual area-wise and co-weighted multi-area based estimations, for λ = 9
7.7 Classification accuracies with individual area-wise and co-weighted multi-area based estimations, for λ = 99
7.8 Total cost ratios with individual area-wise and co-weighted multi-area based estimations, for λ = 99
B.1 Snapshot of Trainer in action
B.2 Snapshot of Classifier output
B.3 Example contents of a classifier output file
B.4 Example contents of a test output file

Abstract

The aim of this thesis is to improve the accuracy of Bayesian spam filtering, the most popular and widely used approach to spam filtering. Among the various possible routes to this aim, two approaches that improve filtering performance are presented. Three popular evolutions of Bayesian spam filtering algorithms are reviewed: Naive Bayes, Paul Graham's and Gary Robinson's. The proposed algorithms are formulated on top of those evolutions and incorporate novel ideas.

The first proposed approach is co-weighting of multiple probability estimations. Although all are based on Bayes' theorem, several ways of computing probability estimations have been proposed and used. Those estimations are examined, and a new, more effective combined estimation based on co-weighted multi-estimations is proposed. The approach is compared with the individual estimations.

The second approach is based on co-weighted multi-area information. Bayesian spam filters generally compute token probability estimations either without considering the email areas in which tokens occur (other than the body), or by treating the same token occurring in different areas as different tokens. In reality, however, occurrences of the same token in different areas are inter-related, and this relation can also play a role in classification. This novel idea is incorporated by co-weighting the multi-area information to obtain a more effective integrated probability estimation for each token, and this approach is also shown to improve filtering performance. It is compared with individual area-wise estimations and with the traditional separate estimations in all areas.


The filters are tested in thorough experiments with three well-known public corpora: Ling Spam, Spam Assassin and Annexia/Xpert, and are evaluated using several performance measures. Both proposed approaches exhibit significant improvement, stability, robustness and consistency in spam filtering.

Acknowledgements

I would like to thank Prof. Yaping Lin, my supervisor, for his valuable comments, suggestions and constant support during the research work. I am also thankful to Dr. Zhi Ping Chen for his guidance.

My sincere thanks go to His Majesty's Government of Nepal and the Government of the People's Republic of China for supporting my graduate studies in China. I would also like to thank the Foreign Affairs Office, Hunan University, for their coordination and support during my studies.

Of course, I am grateful to my parents for their patience and love during my study abroad.

Raju Shrestha
May, 2005


Chapter 1

Introduction

Spam, also known as junk email, is one of the greatest challenges facing the email world these days. Spam wastes not only users' time but also bandwidth and server space, and some content, such as pornography, is even harmful to under-aged recipients. A recent study from the University of Maryland estimated that the time wasted deleting junk email costs American businesses nearly $22 billion a year [1]. As the volume of spam has grown enormously in the past few years, the need to stop such email has been widely recognized, and many anti-spam filtering techniques have been proposed and put into use. However, the non-stop clever tricks of spammers necessitate further improvement of filtering approaches. In this chapter, the types of spam and the different ways of tackling them are introduced, with particular focus on statistical Bayesian filters.

1.1 Spam and its Types

Basically, spam is unsolicited email sent to large numbers of people to promote products and services. The major factors contributing to the proliferation of unsolicited spam email are: (1) bulk email is inexpensive to send, and (2) pseudonyms are inexpensive to obtain [2]. Being the fastest, most efficient and, most importantly, cheapest medium of communication, email is widely used these days for promoting all kinds of products and services. According to internet surveys, spam falls mainly into the "get rich schemes" and "adult" categories. A detailed picture of the types of spam is shown in Fig. 1.1.

Figure 1.1: Spam Types

1.2 Anti-spamming Techniques

Realizing the need to tackle the ever-growing problem of spam, various solutions have been proposed, and some are already in use. Anti-spam solutions can be classified into two broad categories: legal and technical. Attempts to introduce legal measures against spam mailing have had limited effect. A more effective solution is to develop tools that help recipients identify or automatically remove spam messages. Such tools, called anti-spam filters, vary in functionality from black lists [3] of frequent spammers to content-based filters. The latter are generally more powerful, as spammers often use fake addresses: content-based filters look at patterns appearing in the mail itself and classify it as spam or non-spam (legitimate, also called ham) on that basis.

In general, two technical approaches are in practice: one at the network level, implemented on the email server, and one at the user level. Filters of the first type have long been in use; they can also maintain blacklists of addresses and reject all messages from addresses on the list. But since the definition of spam and the requirements may differ from user to user, the second type of filter is becoming more popular and more effective.

The first generation of content-based filters were rule-based. Cohen [4] devised a rule-learning system called RIPPER which automatically generated keyword-spotting rules. Rule-based filters rely mostly on manually constructed pattern-matching rules that need to be tuned to each user's incoming messages, a task requiring time and expertise. Furthermore, the characteristics of spam change over time, requiring the rules to be maintained. This prompted the development of automated filters using machine learning algorithms, most of them adapted from text categorization. However, spam filtering cannot be treated as mere text classification: email is more structured, and the two types of errors are not equal, false positives being worse than false negatives. Moreover, live human spammers actively work to defeat filters with ever-new tricks, so spam filtering algorithms need rather different approaches and treatments. Although many machine learning based spam filtering approaches have been proposed, such as Support Vector Machines [5], k-NN [6], boosting trees [7] and Genetic Algorithms [8], the Bayesian filter is the most popular and widely used because of its efficient training, quick classification, easy extensibility, adaptive learning and few false positives. This motivated the further exploration of improvements to the approach in this thesis.


1.3 Previous Works on Bayesian Spam Filtering

Pantel [9] first employed the Naive Bayes algorithm to classify messages as spam or legitimate, using words as message features and the frequency counts of words in the spam and legitimate collections to generate the probabilities of a word appearing in spam and legitimate messages. Sahami [10] also used the Naive Bayes method, but with the mutual information measure as a feature selector to choose the words with the strongest resolving power between spam and legitimate messages, and with some domain-specific features of spam, such as specific phrases and overemphasized punctuation, as additional message attributes.

In a series of papers, Androutsopoulos [11, 12, 13, 14] extended the Naive Bayes filter by investigating the effect of different numbers of features and training-set sizes on the filter's performance. The accuracy of the Naive Bayes filter was shown to greatly outperform a typical keyword-based filter [12].

Paul Graham [15, 16] defined various tokenization rules treating tokens in different parts of emails, such as To, From, Subject and Return-Path, separately, and proposed a different way of computing token probabilities and the combined spam probability, still based on Bayes' rule. Gary Robinson [17] suggested enhancements to Graham's approach, proposing a Bayesian way of handling rare words, and in [18] further suggested using Fisher's inverse chi-square function for combining probability estimations. William Yerazunis [19] presented a way beyond the 99.9% accuracy plateau in spam filtering.


1.4 Contributions

This thesis investigates and explores different ways of estimating token probabilities with the goal of improving Bayesian spam filtering, and contributes two approaches that successfully achieve this goal.

• All previous algorithms use either token counts (token frequencies), spam and legitimate email counts (document frequencies), or the average number of occurrences of tokens in spam and legitimate emails (mixed) in their probability computations. The first proposed approach takes document, token and average frequencies into account together and co-weights them appropriately to give the resultant probability estimates.

• Furthermore, almost all previous algorithms consider tokens in the body text only, some also considering tokens in the subject area. Even those that consider tokens in other areas, such as different headers and HTML tags, treat the same token occurring in different areas separately and independently. In reality, those occurrences are inter-related. The second approach takes this relation into account: the occurrences of the same token in different areas are co-related by co-weighting the individual area-wise probability estimations for the token, giving a more effective resultant token probability estimation.

Feature extraction is one of the important steps in spam filtering, on which filter performance heavily depends. The thesis further contributes by defining its own preprocessing and feature extraction (tokenization) techniques.
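The concrete co-weighting schemes are developed in Chapters 4 and 5. As a rough illustration of the underlying idea, several per-token probability estimates (for example, token-frequency, document-frequency and average-frequency based ones) can be combined by a normalized weighted average; the function and the weights below are purely illustrative, not the thesis's actual formulas:

```python
def co_weighted_estimate(estimates, weights):
    """Combine several probability estimates for the same token into a
    single estimate via a normalized weighted average (illustrative only;
    the thesis derives its own co-weighting in Chapters 4 and 5)."""
    if len(estimates) != len(weights) or not weights:
        raise ValueError("need one weight per estimate")
    total = sum(weights)
    return sum(w * p for w, p in zip(weights, estimates)) / total

# Hypothetical token-frequency, document-frequency and average-frequency
# estimates for one token, with hypothetical weights:
p = co_weighted_estimate([0.90, 0.80, 0.85], [0.5, 0.3, 0.2])  # ≈ 0.86
```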


1.5 Thesis Organization

The rest of the thesis is organized as follows. Chapter 2 discusses the general implementation steps of statistical Bayesian spam filters and reviews three popular evolutions and variants of Bayesian spam filtering algorithms: Naive Bayes, Paul Graham's and Gary Robinson's. Chapter 3 presents the preprocessing and feature selection steps used by both approaches presented in this thesis. Chapter 4 describes the first approach to improving the Bayesian spam filter, presenting the proposed probability computation scheme along with the training and classification algorithms. Chapter 5, in the same manner, describes the second approach. Chapter 6 outlines the corpora collected to test the filter and the measures used to evaluate performance. Chapter 7 discusses the experiments and their analysis, including a comparative analysis of the different algorithms. Chapter 8 concludes the thesis and suggests future work.

Appendix A discusses the data structures, data files and source files used in the implementation of the filter application. Appendix B presents the application user's manual. Appendix C contains the complete program documentation generated from the source code of the application.

Chapter 2

Statistical Bayesian Spam Filtering Algorithms

Statistical Bayesian spam filtering algorithms are probability-driven learning algorithms based on Bayes' theorem with an independent-feature model, i.e. the Naive Bayes probability model. They extract features typical of spam (in the simplest case, a bag of words), assign each feature a probability score, and compute the spam and/or legitimate score for the whole email from the individual scores. Different variants of these algorithms have been proposed with different approaches, though all of them rest on the original Bayes probability model. In this chapter, the steps involved in filtering are first reviewed, and then three popular variants of statistical Bayesian filtering algorithms, including the original Naive Bayes, are introduced. At the end of the chapter, dealing with small probabilities during calculation is discussed.


2.1 Spam Filtering Steps

Most learning algorithms (Bayesian spam filtering included) involve three main steps:

1. A mechanism for extracting features from messages.

2. A mechanism for assigning weights to the extracted features.

3. A mechanism for combining the weights of the extracted features to determine whether the mail is spam.

In Bayesian spam filtering, words or tokens are used as features to implement step 1, the probabilities of features in the spam/legitimate collections implement step 2, and the Naive Bayes theorem or some variant of Bayes' rule implements step 3. Although this approach creates a promising system for catching spam, the continuing problem of false positives prompts a fresh look at each of the steps. Several variants of Bayesian spam filtering algorithms have been proposed with different ways of implementing steps 2 and 3; Sects. 2.2-2.4 introduce the three most popular variants. Since all three steps play important roles in filter performance, a machine learning algorithm that implements a unique mechanism for each of the three steps is devised here: Chap. 3 presents the feature extraction techniques used, and Chaps. 4 and 5 discuss the implementation of the last two steps for the two approaches introduced in Sect. 1.4.

Like all other supervised learning algorithms, Bayesian spam filtering systems go through two phases: training and classification.

• During the training phase, the filter is trained on a predefined number of spam and legitimate emails and collects all the information required for calculating the probability estimations (weights) described in the last two steps above.


• A given email is then classified as spam or legitimate based on the combined probability estimations for spam and legitimate, calculated using the information collected during the training phase and the features present in the test email. The classifier can also take various parameters, depending on the filter algorithm.

Fig. 2.1 shows the block diagram of the spam filtering system, with the Trainer and Classifier as its two vital components. The specific training and classification algorithms for the two proposed approaches are outlined in Chaps. 4 and 5.
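The two phases above can be sketched as a minimal class skeleton; the names and the crude regex tokenizer here are illustrative, not the thesis's actual implementation (which is described in Appendix A):

```python
import re
from collections import Counter

def tokenize(text):
    # Step 1: feature extraction -- a crude word tokenizer for illustration
    return re.findall(r"[a-z0-9$'!-]+", text.lower())

class BayesianSpamFilter:
    """Two-phase workflow: train() gathers the statistics needed for the
    probability estimations (weights); classify() combines them."""

    def __init__(self):
        self.token_counts = {"spam": Counter(), "legit": Counter()}
        self.email_counts = {"spam": 0, "legit": 0}

    def train(self, text, label):
        # Training phase: collect per-class token and email counts
        self.token_counts[label].update(tokenize(text))
        self.email_counts[label] += 1

    def classify(self, text):
        # Classification phase: combine per-token weights (steps 2 and 3);
        # the concrete scheme depends on the algorithm (Sects. 2.2-2.4)
        raise NotImplementedError
```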

Figure 2.1: Block diagram of Bayesian spam filter


2.2 Naive Bayes (NB) Algorithm

The Naive Bayes algorithm is a simplified version of Bayes' theorem under the assumption of feature independence. The computation of probabilities with this algorithm is briefly introduced here. By Bayes' theorem, the probability of a Class ∈ {Spam, Legitimate} given an Email is:

    P(Class|Email) = P(Class) · P(Email|Class) / P(Email)    (2.2.1)

where P(Class) is the prior probability of the Class, P(Email) is the probability of the Email, and P(Email|Class) is the conditional probability of the Email given the Class. Since computing the conditional probability P(Email|Class) directly is very complicated, Naive Bayes assumes that the individual features (tokens) are conditionally independent given the Class. The conditional probability is then:

    P(Email|Class) = ∏_i P(t_i|Class)    (2.2.2)

where t_i is the i-th token in the given Email and P(t_i|Class) is the probability of the token t_i given the Class. For a random email, P(Email) is a constant divisor common to every calculation and can therefore be disregarded. The posterior probabilities for Spam and Legitimate are then:

    P(Spam|Email) = P(Spam) · P(Email|Spam)    (2.2.3)

and

    P(Legitimate|Email) = P(Legitimate) · P(Email|Legitimate)    (2.2.4)


The conditional probability of a token t given the Class is calculated as in [20] using the following formula:

    P(t|Class) = (1 + no(t, Class)) / (Σ_i no(t_i, Class) + |V|)    (2.2.5)

where no(t, Class) is the number of occurrences of the token t in the Class, Σ_i no(t_i, Class) is the total number of occurrences of all tokens in the Class, and |V| is the size of the vocabulary, the distinct set of words built from all spam and legitimate emails during training. A new, unknown token (one never seen during training) is either ignored or assigned a constant probability such as 1/(Σ_i no(t_i, Class) · |V|). The filter then classifies an email as Spam or Legitimate according to whether P(Spam|Email) is greater than P(Legitimate|Email) or not.

Although feature independence is a poor assumption, many implementations have shown that the algorithm is fairly robust and powerful in filtering spam and outperforms many knowledge-based filters [13, 21].

2.3 Paul Graham's (PB) Algorithm

Paul Graham defined tokenizer rules in [15] and further improved them in [16]. The rules are summarized as follows:

• Alphanumeric characters, dashes, apostrophes, exclamation points, and dollar signs are part of tokens; everything else is a token separator. Tokens are case-sensitive.

• Periods and commas are constituents if they occur between two digits. A price range like $20-25 yields two tokens, $20 and $25.


• Tokens that occur within the To, From, Subject, and Return-Path lines, or within URLs, are marked accordingly.

The probability estimates for a given email being spam or legitimate, given that a token t appears in it, are calculated using the formulas:

    p(t) = P(Spam|t) = (tbad/nbad) / (tbad/nbad + 2·tgood/ngood)    (2.3.6)

and

    P(Legitimate|t) = 1 − P(Spam|t)    (2.3.7)

where tbad and tgood are the number of times the token t occurred in all the spam and legitimate emails respectively, and nbad and ngood are the numbers of spam and legitimate emails respectively. tgood is multiplied by 2 to bias towards legitimate emails. These token probability estimates are calculated for all the tokens present in the email, and the combined probability for spam is obtained using the Bayesian approach, under the assumption that these probabilities are independent of each other:

    P(Spam|Email) = ∏_i P(Spam|t_i) / (∏_i P(Spam|t_i) + ∏_i P(Legitimate|t_i))    (2.3.8)

Paul used only the 15 most interesting tokens, measured by how far their spam probability is from a neutral 0.5, in computing the combined probability for spam, and he ignored tokens whose total number of occurrences is less than 5. Moreover, he assigned a probability of 0.4 (obtained by trial and error) to an unknown token and 0.99 to tokens that occurred in one class but not in the other. A test email is then classified as spam if the combined probability exceeds a defined threshold of 0.9. In [15], Paul reported missing fewer than 5 per 1000 spams with zero false positives; the result was promising at the time the algorithm was published.
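Graham's per-token estimate (2.3.6) and the combination rule (2.3.8) restricted to the 15 most interesting tokens can be sketched as below. The clamping of extreme values to [0.01, 0.99] stands in for his special-casing of tokens seen in only one class; treat the details as an approximation of [15, 16], not a faithful reimplementation:

```python
def token_prob(tbad, tgood, nbad, ngood):
    """Eq. (2.3.6): per-token spam probability, with tgood doubled to
    bias towards legitimate mail. Returns None for rare tokens
    (fewer than 5 total occurrences), which Graham ignores."""
    if tbad + tgood < 5:
        return None
    p = (tbad / nbad) / (tbad / nbad + 2 * tgood / ngood)
    return min(0.99, max(0.01, p))  # stand-in for the 0.99 one-class rule

def combined_spam_prob(probs, n=15):
    """Eq. (2.3.8) over the n most interesting tokens, i.e. those whose
    probability lies farthest from the neutral 0.5."""
    top = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    prod = prod_c = 1.0
    for p in top:
        prod *= p
        prod_c *= 1.0 - p
    return prod / (prod + prod_c)
```

An email whose interesting tokens lean heavily spammy pushes the combined value towards 1, well past the 0.9 threshold.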


2.4 Gary Robinson's (GR) Algorithm

Gary Robinson [17] pointed out several drawbacks of Paul's algorithm and suggested improvements:

• The probability estimate calculated in Paul's algorithm is not exactly a probability, as it ignores the prior probabilities of spam and legitimate emails. In effect it approximates the probability (Gary termed it a guesstimate) that a randomly chosen email containing the token would be spam in a world where half the emails were spam and half legitimate. It still works well, because we want emails to be judged purely on their own characteristics (contents) rather than on the relative proportions of spam and legitimate mail.

• Paul's approach is subtly asymmetric in how it handles words indicating spamminess compared to words indicating non-spamminess. Gary suggested that the best performance comes from handling both kinds of evidence equally well, so the bias factor 2 can be removed.

• Gary criticized Paul's assignment of 0.4 to unknown tokens and 0.99 to tokens occurring in one class but not the other, just two extreme cases with no treatment of low- and high-data situations, as inconsistent and unsmooth. He proposed a consistent and smooth way of dealing with rare words by using a Bayesian approach to compute the token probability guesstimate, termed the degree of belief:

    f(t) = (s·x + n·p(t)) / (s + n)    (2.4.9)


Where p(t) is the Paul’s probability estimation (2.3.6) for the token t, s is the strength to be given to background information, x is the assumed probability for an unknown token and n is the number of emails containing the token t. The values for s and x are obtained through testing to optimize performance with the reasonable starting points of 1 and 0.5 for s and x, respectively. • In [18], Gary further suggested to use Fisher’s inverse chi-square function to compute combined probabilities using the formulas: H = C −1 (−2 ln

Y

f (t), 2n)

(2.4.10)

(1 − f (t)), 2n)

(2.4.11)

i

and S = C −1 (−2 ln

Y i

H and S are combined probabilities that allow rejecting the null hypotheses and assume instead the alternative hypotheses that the email is a ham and spam respectively. C −1 () is the Fisher’s inverse chi-square function used with 2n degrees of freedom. The combined indicator of spamminess or hamminess for the email as a whole is then obtained using the formula: I=

1+H −S 2

(2.4.12)

The email is classified as spam if the indicator value I is above some threshold value otherwise as legitimate. Emails with I values near 0.5 can also be classified as uncertain. It has been found that the performance of the filter is significantly improved with Gary’s modifications. Experiments with Bogofilter by Greg Louis [22] showed that Robinson’s method of calculation is more likely to yield correct results than is the original method proposed by Graham.
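The indicator computation (2.4.10)–(2.4.12) can be sketched in Java. The chi2Q helper evaluates the chi-square survival function for even degrees of freedom via its standard closed-form series; this is an illustrative sketch, not the thesis code, and the names are invented.

```java
public class RobinsonIndicator {
    // Survival function Q(x; dof) of the chi-square distribution for EVEN
    // dof, via the closed-form series Q = e^{-m} * sum_{i < dof/2} m^i / i!,
    // with m = x/2. This plays the role of C^{-1}() in (2.4.10)-(2.4.11).
    static double chi2Q(double x, int dof) {
        double m = x / 2.0;
        double term = Math.exp(-m), sum = term;
        for (int i = 1; i < dof / 2; i++) {
            term *= m / i;
            sum += term;
        }
        return Math.min(sum, 1.0);
    }

    // Indicator (2.4.12) from the degree-of-belief values f(t) of the
    // n interesting tokens: I near 1 means spam, near 0 means ham.
    public static double indicator(double[] f) {
        double lnF = 0.0, lnNotF = 0.0;
        for (double v : f) {
            lnF += Math.log(v);
            lnNotF += Math.log(1.0 - v);
        }
        int dof = 2 * f.length;
        double h = chi2Q(-2.0 * lnF, dof);     // H in (2.4.10)
        double s = chi2Q(-2.0 * lnNotF, dof);  // S in (2.4.11)
        return (1.0 + h - s) / 2.0;            // I in (2.4.12)
    }

    public static void main(String[] args) {
        System.out.println(indicator(new double[]{0.99, 0.95, 0.90})); // spammy tokens
        System.out.println(indicator(new double[]{0.01, 0.05, 0.10})); // hammy tokens
    }
}
```

Note the symmetry Robinson was after: replacing every f(t) with 1 − f(t) maps I to 1 − I, so spammy and hammy evidence are treated identically.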


The proposed new approaches build on top of the Naive Bayes probability model, Paul Graham's calculation and Gary Robinson's modification, with further improvements.

2.5

Dealing with Small Probabilities and Normalization

During implementation, the actual computation of probabilities may run into difficulties because the numbers involved are so small that they cause underflows. This is especially likely in the NB algorithm, which considers all tokens in an email; the problem is less acute in the PG and GR algorithms, which consider only a few interesting tokens when computing the combined estimates. One way to deal with small probabilities is to compute with logarithms instead of the probabilities themselves:

    log(P(Class|Email)) = log(P(Class) · P(Email|Class))    (2.5.13)

Furthermore, the probabilities are normalized, i.e. the spam and legitimate probabilities are made to sum to one, to make them more readable and uniform. Let Ps = log(P(Spam|Email)) and Pl = log(P(Legitimate|Email)); then the normalized probabilities for spam and legitimate are:

    P(Spam) = 1 / (1 + e^(Pl − Ps))    (2.5.14)

    P(Legitimate) = 1 / (1 + e^(Ps − Pl))    (2.5.15)
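The normalization above can be sketched in Java; the class and method names are illustrative, not from the thesis implementation.

```java
// Sketch of the log-domain normalization (2.5.14)-(2.5.15): given the
// log scores Ps = log P(Spam|Email) and Pl = log P(Legitimate|Email),
// the normalized spam probability is 1 / (1 + e^(Pl - Ps)).
public class LogNormalizer {
    public static double spamProbability(double ps, double pl) {
        return 1.0 / (1.0 + Math.exp(pl - ps));      // (2.5.14)
    }
    public static double legitimateProbability(double ps, double pl) {
        return 1.0 / (1.0 + Math.exp(ps - pl));      // (2.5.15)
    }

    public static void main(String[] args) {
        double ps = -120.0, pl = -125.0; // hypothetical log scores
        System.out.println(spamProbability(ps, pl));
    }
}
```

Only the difference Pl − Ps matters, so the large negative log scores never need to be exponentiated individually, which is exactly what avoids the underflow.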

Chapter 3
Preprocessing and Feature Selection

Feature extraction is one of the three important steps in spam filtering (see Sect. 2.1), upon which the filter's performance heavily depends. This chapter presents the preprocessing and feature extraction process used in both of the approaches proposed in this thesis.

3.1

Preprocessing

Due to the prevalence of headers, html and binary attachments in modern emails, preprocessing is required to allow effective feature extraction. The following preprocessing steps are used in this thesis:

• The whole email structure is divided into 4 areas:
  1. Normal header, comprising 'From', 'Reply-to', 'Return-path', 'To' and 'Cc',
  2. Subject header,
  3. Body, and
  4. Html tags, comprising three retained tags; font, text color and hyperlink text within these tags provide good clues for identifying spam.
• All other headers and html tags (except those mentioned above) are ignored and removed from the email text.
• Only MIME parts with content-type "text" or "message" are used for extraction, thus ignoring binary attachments.

The remaining text is then used for feature extraction.

3.2

Feature extraction or Tokenization

The proposed approaches consider tokens as the sole features for spam filtering. The following tokenizer rules are used:

• All terms consisting of alphanumeric characters, dash (-), underscore (_), apostrophe ('), exclamation (!), asterisk (*) and currency signs (like $, £) are considered valid tokens; tokens are case-sensitive.
• IP addresses, domain names and money values (numbers separated by commas and/or with currency symbols) are considered valid tokens. Pure numbers are ignored.
• A domain name is broken into sub-terms (e.g. www.hnu.net is broken into www.hnu.net, www.hnu, hnu.net, www, hnu and net) and the sub-terms are also considered valid tokens [23].
• Contents within the retained html tags are tokenized as usual, except that tag names and attribute names are ignored.
• One of the spammers' newest tricks, plain text interspersed with HTML tags inserted inside words (so that "You can buy valium here!" reads normally but the word "valium" is broken up in the source), is handled by stripping the tags to recover the text "You can buy valium here" and tokenizing it normally.
• Stemming and stop lists are not used, as performance is better without them.
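A simplified Java sketch of the tokenizer rules above. This is illustrative only: the real tokenizer handles IP addresses, money values and the html areas more carefully than this single regular expression does, and the class name is invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified sketch of the tokenizer rules: case-sensitive tokens over
// alphanumerics plus - _ ' ! * and currency signs, pure numbers dropped,
// and dotted names emitting their sub-terms as extra tokens.
public class Tokenizer {
    // \u00A3 is the pound sign; '-' is a literal at the end of the class
    private static final Pattern TOKEN =
            Pattern.compile("[A-Za-z0-9_'!*$\u00A3.-]+");

    public static List<String> tokenize(String text) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            String t = m.group().replaceAll("^\\.+|\\.+$", ""); // trim stray dots
            if (t.isEmpty() || t.matches("\\d+")) continue;     // ignore pure numbers
            out.add(t);
            if (t.contains(".") && t.matches(".*[A-Za-z].*"))   // looks like a domain
                subTerms(t, out);
        }
        return out;
    }

    // www.hnu.net -> www.hnu, hnu.net, www, hnu, net
    private static void subTerms(String domain, List<String> out) {
        String[] p = domain.split("\\.");
        if (p.length < 2) return;
        out.add(String.join(".", Arrays.copyOfRange(p, 0, p.length - 1)));
        out.add(String.join(".", Arrays.copyOfRange(p, 1, p.length)));
        out.addAll(Arrays.asList(p));
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Visit www.hnu.net now for $100 or 100 dollars"));
    }
}
```

The sample sentence yields the domain sub-terms and the money token "$100" while dropping the bare number "100", matching the rules above.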

Chapter 4
Filtering Based on Co-weighted Multi-estimations

This chapter presents the first approach to improving Bayesian spam filtering, based on co-weighted multi-estimations. Training and classification algorithms for the approach are also outlined.

4.1

Main Idea and Algorithm Description

The main idea in this approach lies in the fact that both document and token frequencies play a vital role in determining whether a random email is spam or not, and hence both need to be considered in the filtering system. So rather than computing probability estimates merely from either document or token frequencies, probabilities based on document frequencies, on token frequencies, and on the average number of token occurrences in spam and legitimate emails are all computed. The combined, integrated estimate is then obtained by co-weighting them with normalized weights.

Let nos(t) and nol(t) be the number of occurrences of the token t in spam and legitimate emails respectively, Ts and Tl the total number of occurrences of all tokens in spam and legitimate emails, ns(t) and nl(t) the number of spam and legitimate emails containing the token, and Ns and Nl the number of spam and legitimate emails respectively. Then for each token t in a given email, three individual probability estimates are calculated:

1. Estimation based on document frequencies:

    p1(t) = (ns(t)/Ns) / (ns(t)/Ns + nl(t)/Nl)    (4.1.1)

2. Estimation based on token frequencies:

    p2(t) = (nos(t)/Ts) / (nos(t)/Ts + nol(t)/Tl)    (4.1.2)

3. Estimation based on the average number of occurrences of a token in spam and legitimate emails:

    p3(t) = (nos(t)/Ns) / (nos(t)/Ns + nol(t)/Nl)    (4.1.3)

These three probability estimates correspond to PG's probability estimate (2.3.6), but without the bias factor. Next, for each of them the degree-of-belief estimates f1(t), f2(t) and f3(t) are computed using Gary's formula (2.4.9), replacing p(t) with p1(t), p2(t) and p3(t) respectively and using n = ns(t) + nl(t) for f1(t) and n = nos(t) + nol(t) for f2(t) and f3(t). As in Gary's algorithm, s and x are the belief factor and the assumed probability for an unknown token, whose values are determined while tuning the filter for optimal performance. The combined probability estimate is then calculated by co-weighting the three individual estimates in a co-operative training (COT) manner:

    f(t) = ω1·f1(t) + ω2·f2(t) + ω3·f3(t)    (4.1.4)

where ω1 + ω2 + ω3 = 1. The weight factors signify the relative importance of the three individual estimates. Since f(t) reflects the effect of both document and token frequencies as well as the average token occurrences in spam and legitimate emails, it gives a better probability estimate.

Furthermore, since considering a fixed number of interesting tokens as suggested in PG's algorithm is unrealistic and unreasonable, all tokens whose probability values lie more than a certain offset PROB_OFFSET above or below the neutral 0.5 are considered interesting [24]. If this yields fewer than a fixed number MIN_INT_TOKENS of tokens, the range is extended to obtain that number of interesting tokens. The values of the interesting tokens are then used to obtain the final indicator I of spamminess or hamminess using (2.4.12), where H and S are calculated by the Fisher's inverse chi-square functions (2.4.10) and (2.4.11) respectively. Finally, the email is classified as spam if I is greater than a threshold SPAM_THRESHOLD, and as legitimate otherwise.
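The per-token estimation pipeline, (4.1.1)–(4.1.3) fed through the degree of belief (2.4.9) and combined by (4.1.4), can be sketched in Java. This is an illustrative sketch with invented names; it assumes the token occurs at least once in the training data, so the ratios are defined.

```java
// Sketch of the combined co-weighted estimate for one token, following
// (4.1.1)-(4.1.4) and the degree of belief (2.4.9).
public class CoWeightedEstimator {
    // Robinson-style degree of belief (2.4.9): f = (s*x + n*p) / (s + n)
    static double belief(double p, double n, double s, double x) {
        return (s * x + n * p) / (s + n);
    }
    // Shared ratio form of (4.1.1)-(4.1.3): (a/A) / (a/A + b/B)
    static double ratio(double a, double A, double b, double B) {
        double sa = a / A, sb = b / B;
        return sa / (sa + sb);
    }

    public static double combined(int nsT, int nlT, int nosT, int nolT,
                                  int Ns, int Nl, long Ts, long Tl,
                                  double s, double x,
                                  double w1, double w2, double w3) {
        double p1 = ratio(nsT, Ns, nlT, Nl);     // (4.1.1) document frequencies
        double p2 = ratio(nosT, Ts, nolT, Tl);   // (4.1.2) token frequencies
        double p3 = ratio(nosT, Ns, nolT, Nl);   // (4.1.3) average occurrences
        double f1 = belief(p1, nsT + nlT, s, x);
        double f2 = belief(p2, nosT + nolT, s, x);
        double f3 = belief(p3, nosT + nolT, s, x);
        return w1 * f1 + w2 * f2 + w3 * f3;      // (4.1.4), with w1 + w2 + w3 = 1
    }

    public static void main(String[] args) {
        // token seen 20 times in 10 of 100 spams, never in 100 legitimate emails
        System.out.println(combined(10, 0, 20, 0, 100, 100, 1000, 1000,
                                    1.0, 0.5, 0.5, 0.3, 0.2));
    }
}
```

A token appearing only in spam gets p1 = p2 = p3 = 1, but the belief smoothing pulls each estimate back toward x in proportion to how little evidence there is, which is exactly the rare-word treatment Robinson argued for.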

4.2

Training Algorithm

The algorithm by which the filter is trained on a given dataset is as follows:
• Input the training dataset consisting of predefined spam and legitimate emails.
• For each email in the dataset:
  ∗ Preprocess it.
  ∗ For each email area a:
    ◦ Tokenize the text in that area.
    ◦ For each token t, compute and update nos(t), nol(t), Ts, Tl, ns(t), nl(t), Ns and Nl.
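The counter updates in the training pass can be sketched in Java as follows (an illustrative sketch; the field and method names are invented, and the area dimension is omitted for brevity):

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of the per-token statistics accumulated by the training pass:
// document frequencies ns(t)/nl(t), token frequencies nos(t)/nol(t),
// and the totals Ns, Nl, Ts, Tl used by the estimators (4.1.1)-(4.1.3).
public class TrainingStats {
    final Map<String, Integer> ns = new HashMap<>(), nl = new HashMap<>();   // emails containing t
    final Map<String, Integer> nos = new HashMap<>(), nol = new HashMap<>(); // occurrences of t
    int Ns = 0, Nl = 0;   // number of spam / legitimate emails
    long Ts = 0, Tl = 0;  // total token occurrences in spam / legitimate emails

    public void addEmail(List<String> tokens, boolean spam) {
        if (spam) Ns++; else Nl++;
        Set<String> seen = new HashSet<>();
        for (String t : tokens) {
            if (spam) { nos.merge(t, 1, Integer::sum); Ts++; }
            else      { nol.merge(t, 1, Integer::sum); Tl++; }
            if (seen.add(t)) { // count each email at most once for ns/nl
                if (spam) ns.merge(t, 1, Integer::sum);
                else      nl.merge(t, 1, Integer::sum);
            }
        }
    }

    public static void main(String[] args) {
        TrainingStats st = new TrainingStats();
        st.addEmail(Arrays.asList("buy", "buy", "now"), true);
        st.addEmail(Arrays.asList("meeting", "now"), false);
        System.out.println("nos(buy)=" + st.nos.get("buy") + ", ns(buy)=" + st.ns.get("buy"));
    }
}
```

The `seen` set is what distinguishes the document-frequency counters ns/nl from the token-frequency counters nos/nol: a token repeated within one email increments nos but not ns.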


4.3

Classification Algorithm

Based on the training data, the filter classifies an email as spam or legitimate. The classification algorithm is as follows:
• Input the test email.
• Preprocess it.
• For each email area:
  ∗ Tokenize the text in that area.
  ∗ For each token t in the tokenized text:
    ◦ Obtain p1(t), p2(t) and p3(t) using (4.1.1)–(4.1.3) and compute f1(t), f2(t) and f3(t) using (2.4.9).
    ◦ Then compute the combined probability estimate f(t) using (4.1.4).
• Obtain the interesting tokens from among all tokens in all areas of the email.
• Compute H and S using (2.4.10) and (2.4.11) respectively from the n interesting tokens.
• Compute the combined indicator I using (2.4.12).
• If I is greater than SPAM_THRESHOLD, classify the email as spam; otherwise classify it as legitimate.

My paper based on the approach presented in this chapter has been published in the proceedings of the International Symposium on Intelligent Computation and its Applications (ISICA-2005) [25], held in Wuhan, China, 4–6 April 2005, and I had the opportunity to attend the conference and present it.

Chapter 5
Filtering Based on Co-weighted Multi-area Information

This chapter presents the second approach to improving Bayesian spam filtering, based on co-weighted multi-area information, and also outlines training and classification algorithms for the approach.

5.1

Main Idea and Algorithm Description

The main idea in this approach lies in the fact that occurrences of the same token in different areas of an email are inter-related. Treating a token occurring in one area separately from its occurrences in other areas, as in all previous algorithms, would therefore not give a realistic estimate. This approach relates the individual area-wise token probability estimates by co-weighting them to obtain a combined, integrated estimate for the token. The estimation steps are described in detail below.

Let ns(t, a) and nl(t, a) be the number of occurrences of token t in area a of spam and legitimate emails respectively, and Ns and Nl the number of spam and legitimate emails respectively. Then the probability estimate for spam given the token t and the area a is computed as:

    p(t, a) = (ns(t,a)/Ns) / (ns(t,a)/Ns + nl(t,a)/Nl)    (5.1.1)

This estimate corresponds to PG's probability estimate (2.3.6). Next, GR's degree-of-belief estimate f(t, a) is computed using (2.4.9), replacing p(t) with p(t, a) and using n = ns(t, a) + nl(t, a). As in GR's algorithm, s and x are the belief factor and the assumed probability for an unknown token, whose values are determined while tuning the filter for optimal performance. The combined probability estimate for the token is then calculated by co-weighting the individual estimates corresponding to the different areas:

    f(t) = Σᵢ ω(t, aᵢ) · f(t, aᵢ)    (5.1.2)

where ω(t, aᵢ) is the weight factor for token t in area aᵢ. The weight factor for token t in area a is computed as the ratio of the number of occurrences of the token in that area to the total number of its occurrences in all areas of all spam emails:

    ω(t, a) = ns(t, a) / Σᵢ ns(t, aᵢ)    (5.1.3)

This gives normalized weight coefficients, since Σᵢ ω(t, aᵢ) = 1. Because the combined estimate f(t) relates the area-wise estimates according to the token's actual occurrences, it gives a better probability estimate. Next, the interesting tokens are obtained and the final indicator I of spamminess or hamminess is calculated as in the first approach (Chap. 4). Finally, the email is classified as spam if I is greater than the threshold SPAM_THRESHOLD, and as legitimate otherwise.
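The per-area estimation (5.1.1)–(5.1.3) can be sketched in Java. This is an illustrative sketch with invented names; in particular, the fallback to x for areas where the token never occurs is an assumption made for this sketch, not taken from the thesis.

```java
// Sketch of the multi-area estimate: per-area probabilities (5.1.1),
// degree-of-belief smoothing (2.4.9), and co-weighting (5.1.2)-(5.1.3).
public class MultiAreaEstimator {
    static double belief(double p, double n, double s, double x) {
        return (s * x + n * p) / (s + n); // (2.4.9)
    }

    // nsA[a], nlA[a]: occurrences of the token in area a of spam / legitimate emails
    public static double combined(int[] nsA, int[] nlA, int Ns, int Nl,
                                  double s, double x) {
        double totalSpam = 0;
        for (int v : nsA) totalSpam += v;
        double f = 0;
        for (int a = 0; a < nsA.length; a++) {
            double ps = (double) nsA[a] / Ns, pl = (double) nlA[a] / Nl;
            double p = (ps + pl) == 0 ? x : ps / (ps + pl);       // (5.1.1)
            double fa = belief(p, nsA[a] + nlA[a], s, x);
            double w = totalSpam == 0 ? 0 : nsA[a] / totalSpam;   // (5.1.3)
            f += w * fa;                                          // (5.1.2)
        }
        return f;
    }

    public static void main(String[] args) {
        // token: 8 occurrences in one area, 2 in another, only in spam emails
        System.out.println(combined(new int[]{8, 2}, new int[]{0, 0}, 100, 100, 1.0, 0.5));
    }
}
```

Because the weights are proportional to ns(t, a), the area where the token actually concentrates dominates the combined estimate, which is the co-relation the approach aims to capture.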


5.2

Training Algorithm

The algorithm is similar to the one in the first approach, but with different computations:
• Input the training dataset consisting of predefined spam and legitimate emails.
• For each email in the dataset:
  ∗ Preprocess it.
  ∗ For each email area a:
    ◦ Tokenize the text in that area.
    ◦ For each token t, compute and update ns(t, a), nl(t, a), Ns and Nl.

5.3

Classification Algorithm

The classification algorithm is as follows:
• Input the test email.
• Preprocess it.
• For each email area a, tokenize the text in that area.
• For each token t:
  ∗ For each email area a:
    ◦ Obtain p(t, a) using (5.1.1) and compute f(t, a) from it using (2.4.9).
    ◦ Compute ω(t, a) using (5.1.3).
  ∗ Then compute the combined probability estimate f(t) using (5.1.2).
• Obtain the interesting tokens from among all tokens in all areas of the email.
• Compute H and S using (2.4.10) and (2.4.11) respectively from the n interesting tokens.
• Compute the combined indicator I using (2.4.12).
• If I is greater than SPAM_THRESHOLD, classify the email as spam; otherwise classify it as legitimate.


A paper authored by me based on the approach presented in this chapter has been accepted for publication in the proceedings of the 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-2005) [26], to be held in Hanoi, Vietnam, 18–20 May 2005.

Chapter 6
Dataset Collections and Evaluation Measures

Three well-known, publicly available corpora are used in order to allow comparisons with other published work and to ensure that performance is tested across a varied range of email. This chapter introduces these corpora. Furthermore, the evaluation measures used in analyzing and comparing the experimental results (see Chap. 7) are also discussed here.

6.1

Corpora Collections

1. Ling Spam corpus, made available by Ion Androutsopoulos [11], which has been used in a considerable number of publications. It is composed of 481 spam and 2,412 legitimate emails, all taken from a linguistics mailing list. Emails in this collection contain subject and body text only.

2. Spam Assassin corpus, used to optimize the open-source SpamAssassin filter [27]. The 2003 version used in this thesis contains 1,897 spam (spam and spam-2) and 4,150 legitimate (easy ham, easy ham-2 and hard ham) emails. Hard ham includes 250 non-spam messages which are close in many respects to typical spam: use of html, unusual html markup, colored text, "spammish-sounding" phrases, etc. Constructed from the maintainers' personal email, it is probably the most representative publicly available corpus of a typical user's mail.

3. Annexia/Xpert corpus, a synthesis of 10,025 spam emails from the Annexia spam archives [28] and 22,813 legitimate emails from the XFree86 project's Xpert mailing list. 7,500 spam and 7,500 legitimate emails are randomly picked from this corpus.

From each corpus, training and test datasets are prepared by randomly picking two-thirds of the corpus as the training dataset and the remaining one-third as the test dataset. Table 6.1 shows the sizes of the training and test datasets for all three corpora.

Table 6.1: Training and test dataset sizes

Number of                         Ling Spam   Spam Assassin   Annexia/Xpert
Training   Spam emails                 321           1,265           5,000
Dataset    Legitimate emails         1,608           2,767           5,000
Test       Spam emails                 160             632           2,500
Dataset    Legitimate emails           804           1,383           2,500
Total      Spam emails                 481           1,897           7,500
Corpus     Legitimate emails         2,412           4,150           7,500
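The random two-thirds/one-third split described in Sect. 6.1 can be sketched in Java (illustrative; with integer arithmetic the cut point may differ by one email from the counts in Table 6.1, which round differently):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Sketch of the random 2/3 training, 1/3 test split applied to each corpus.
public class DatasetSplit {
    public static <T> List<List<T>> split(List<T> emails, long seed) {
        List<T> shuffled = new ArrayList<>(emails);
        Collections.shuffle(shuffled, new Random(seed)); // random pick
        int cut = 2 * shuffled.size() / 3;               // two-thirds for training
        return Arrays.asList(
                new ArrayList<>(shuffled.subList(0, cut)),                // training
                new ArrayList<>(shuffled.subList(cut, shuffled.size()))); // test
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 0; i < 900; i++) ids.add(i);
        List<List<Integer>> parts = split(ids, 42L);
        System.out.println(parts.get(0).size() + " training, " + parts.get(1).size() + " test");
    }
}
```

Shuffling once and cutting guarantees the two parts are disjoint and together cover the whole corpus, so no email appears in both training and test sets.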


Table 6.2: Confusion/Contingency Matrix

                                Actual
                          Spam (p)              Legitimate (n)
Predicted  Spam (p)       TP = N_S→S            FP = N_L→S
           Legitimate (n) FN = N_S→L            TN = N_L→L
Sum                       P = TP + FN = N_S     N = FP + TN = N_L

6.2

Evaluation Measures

Let N_S and N_L be the total number of spam and legitimate email messages to be classified by the filter, and N_X→Y the number of messages of class X that the filter classified as class Y, where X, Y ∈ {Spam (S), Legitimate (L)}. The confusion/contingency matrix is constructed as shown in Table 6.2, and the seven evaluation measures, in four categories, are then calculated as described below.

1. Weighted Accuracy (WAcc) and Weighted Error (WErr): Accuracy and error rates measure the percentages of correctly and incorrectly classified messages. Since the two error types, S→L and L→S, are not equally costly, weighted rather than absolute rates are used. Assuming that an L→S error is λ times more costly than an S→L error, these measures are calculated as:

    WAcc = (N_S→S + λ·N_L→L) / (N_S + λ·N_L)    (6.2.1)

    WErr = (N_S→L + λ·N_L→S) / (N_S + λ·N_L)    (6.2.2)

These measures are calculated in all experiments with two reasonable values of λ: 9 and 99 [11].


2. Total Cost Ratio (TCR): This compares the filter with the baseline of using no filter at all, in which legitimate messages are (correctly) never blocked and spam messages (mistakenly) always pass. The weighted accuracy and error rate of the baseline are:

    WAcc_b = λ·N_L / (λ·N_L + N_S)    (6.2.3)

    WErr_b = N_S / (λ·N_L + N_S)    (6.2.4)

The total cost ratio is then computed as:

    TCR = WErr_b / WErr = N_S / (λ·N_L→S + N_S→L)    (6.2.5)

Greater TCR indicates better performance; for TCR < 1, not using the filter is better. If cost is proportional to wasted time, TCR measures the time wasted manually deleting all spam messages when no filter is present (N_S), compared to the time wasted manually deleting any spam messages that pass the filter (N_S→L) plus the time needed to recover from mistakenly blocked legitimate messages (λ·N_L→S).

3. Spam Recall (SR) and Spam Precision (SP): Spam recall measures the percentage of spam messages that the filter manages to block (intuitively, its effectiveness), while spam precision measures the degree to which the blocked messages are indeed spam (the filter's safety):

    SR = TP / P = N_S→S / N_S    (6.2.6)

    SP = TP / (TP + FP) = N_S→S / (N_S→S + N_L→S)    (6.2.7)


4. False Positive Rate (FPR) and False Negative Rate (FNR): The false positive rate measures the percentage of legitimate messages incorrectly marked as spam, and the false negative rate measures the percentage of spam messages incorrectly marked as legitimate:

    FPR = FP / N = N_L→S / N_L    (6.2.8)

    FNR = FN / P = N_S→L / N_S    (6.2.9)

A filter that yields zero false positives while at the same time achieving the highest accuracy, total cost ratio, spam recall and spam precision and the minimum false negative rate is considered the best filter.
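All seven measures follow directly from the confusion-matrix counts of Table 6.2; a Java sketch (the class and parameter names are illustrative):

```java
// Evaluation measures computed from the confusion-matrix counts of
// Table 6.2: nSS = spam classified as spam, nSL = spam as legitimate,
// nLS = legitimate as spam, nLL = legitimate as legitimate.
public class Measures {
    public static double wAcc(int nSS, int nSL, int nLS, int nLL, double lambda) {
        return (nSS + lambda * nLL) / ((nSS + nSL) + lambda * (nLS + nLL)); // (6.2.1)
    }
    public static double wErr(int nSS, int nSL, int nLS, int nLL, double lambda) {
        return (nSL + lambda * nLS) / ((nSS + nSL) + lambda * (nLS + nLL)); // (6.2.2)
    }
    public static double tcr(int nSS, int nSL, int nLS, double lambda) {
        return (nSS + nSL) / (lambda * nLS + nSL);                          // (6.2.5)
    }
    public static double spamRecall(int nSS, int nSL)    { return (double) nSS / (nSS + nSL); } // (6.2.6)
    public static double spamPrecision(int nSS, int nLS) { return (double) nSS / (nSS + nLS); } // (6.2.7)
    public static double fpr(int nLS, int nLL) { return (double) nLS / (nLS + nLL); } // (6.2.8)
    public static double fnr(int nSL, int nSS) { return (double) nSL / (nSL + nSS); } // (6.2.9)

    public static void main(String[] args) {
        // 118 of 160 spams caught, no false positives, 804 legitimate emails
        System.out.println(spamRecall(118, 42));
        System.out.println(wAcc(118, 42, 0, 804, 9.0));
        System.out.println(tcr(118, 42, 0, 9.0));
    }
}
```

For example, 118 of 160 spams caught with no false positives on the Ling Spam test set (160 spam, 804 legitimate emails) gives SR = 0.7375, WAcc ≈ 0.99432 and TCR ≈ 3.80952 at λ = 9, consistent with the document-frequency rows of Tables 7.1 and 7.2.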

Chapter 7
Experiments and Analysis

This chapter discusses the experiments and analysis carried out for both of the filtering approaches put forward in Chaps. 4 and 5. The experiments are performed with the filter application (see Appendices A–C for details) that I developed in Java during the research work. All experiments are carried out five times on each of the three datasets, randomly picking training and test datasets as described in Sect. 6.1, and the average results are reported. Two separate experiments are carried out for each approach.

7.1

Parameters Tuning

Parameter tuning experiments are carried out through thorough and exhaustive tests of the filter on all three corpora in order to determine optimal values of the filter parameters SPAM_THRESHOLD, MIN_INT_TOKENS, PROB_OFFSET, s and x, with the aim of achieving the minimum false positives and false negatives together with the highest possible accuracy. It is observed that near-zero false positives can be achieved if the classifier is well tuned to the data, but at the cost of reduced accuracy. Performance varies widely across parameter combinations, and for some combinations different behavior is observed on different datasets. This is because the contents of the three corpora differ considerably in type and nature: Ling Spam has only one header field (the subject), while header contents play a vital role in the other two datasets, and the Spam Assassin dataset comprises hard hams containing unusual html markup, colored text and spammish-sounding phrases that tend to raise the number of false positives.

Moreover, the tuning is targeted at obtaining the same optimal parameter values for both filtering approaches, for consistent results and fair comparisons. The experiments therefore searched for a parameter combination giving a compromise of low false positives, high accuracy, and stable, consistent results across all three corpora and both filtering approaches. The exhaustive search found that the values 0.9, 19, 0.42, 0.67 and 0.35 for SPAM_THRESHOLD, MIN_INT_TOKENS, PROB_OFFSET, s and x respectively give balanced performance on all three datasets and for both approaches. Since false positives are much worse than false negatives, a SPAM_THRESHOLD of 0.9 reasonably biases the filter towards legitimates. The combination MIN_INT_TOKENS = 19 and PROB_OFFSET = 0.42 provides an effective number of interesting tokens for optimal performance, and the assumed probability x = 0.35 for unknown tokens, with belief factor s = 0.67, also biases slightly towards legitimates. This parameter combination is used in all other experiments.


7.2

Experiments with Co-weighted Multi-estimations

7.2.1

Experiments and Results

First, the filter is tested with the three individual estimations, based on document frequencies (DF), token frequencies (TF) and average token occurrences (ATO). Next, tests are performed with the proposed co-weighted multi-estimations (CWME). The search for the best weighting modulus for the multi-estimations is carried out by testing all possible combinations and analyzing the results. From this exhaustive test, the weighting modulus (ω1 = 0.5, ω2 = 0.3, ω3 = 0.2) was found to exhibit consistent, stable and optimal performance. The weight values reflect the importance, or effect, of the three estimations in the combined estimate and represent their individual behaviors well, as can be observed from the experimental results. All tests are carried out on all three datasets with the parameter combination obtained in the tuning experiments above, and performance measures are reported for the two values of λ: 9 and 99. Test results are given in Tables 7.1 and 7.2: measures independent of λ (SR, SP, FPR and FNR) are given in Table 7.1, and λ-dependent measures (WAcc_b, WErr_b, WAcc, WErr and TCR) separately in Table 7.2 for both values of λ. Comparative results based on weighted accuracy and total cost ratio are shown graphically in Figs. 7.1–7.4 below.


Table 7.1: Test results (SR, SP, FPR and FNR) with individual and co-weighted multi-estimations

Estimations                  Measures   Ling Spam   Spam Assassin   Annexia/Xpert
Estimation Based on          SR         0.73750     0.90032         0.96120
Document Frequencies         SP         1.00000     0.99476         1.00000
                             FPR        0.00000     0.00217         0.00000
                             FNR        0.26250     0.09968         0.03880
Estimation Based on          SR         0.70000     0.87500         0.95640
Token Frequencies            SP         1.00000     0.99640         1.00000
                             FPR        0.00000     0.00145         0.00000
                             FNR        0.30000     0.12500         0.04360
Estimation Based on          SR         0.81250     0.90348         0.96360
Average Token Occurrences    SP         0.99237     0.99477         1.00000
                             FPR        0.00124     0.00217         0.00000
                             FNR        0.18750     0.09652         0.03640
Estimation with Co-weighted  SR         0.76875     0.89399         0.96240
Multi-estimations            SP         1.00000     0.99647         1.00000
                             FPR        0.00000     0.00145         0.00000
                             FNR        0.23125     0.10601         0.03760


Table 7.2: Weighted accuracy rates, error rates and total cost ratios with individual and co-weighted multi-estimations

                              λ = 9                              λ = 99
Estimations      Measures  Ling     Spam      Annexia/   Ling     Spam      Annexia/
                           Spam     Assassin  Xpert      Spam     Assassin  Xpert
Baseline         WAcc_b    0.97837  0.95168   0.90000    0.99799  0.99541   0.99000
                 WErr_b    0.02163  0.04832   0.10000    0.00201  0.00459   0.01000
Estimation       WAcc      0.99432  0.99312   0.99612    0.99947  0.99738   0.99961
Based on DF      WErr      0.00568  0.00688   0.00388    0.00053  0.00262   0.00039
                 TCR       3.80952  7.02222   25.77320   3.80952  1.75556   25.77320
Estimation       WAcc      0.99351  0.99258   0.99564    0.99940  0.99799   0.99956
Based on TF      WErr      0.00649  0.00742   0.00436    0.00060  0.00201   0.00044
                 TCR       3.33333  6.51546   22.93578   3.33333  2.28159   22.93578
Estimation       WAcc      0.99473  0.99327   0.99636    0.99838  0.99740   0.99964
Based on ATO     WErr      0.00527  0.00673   0.00364    0.00162  0.00260   0.00036
                 TCR       4.10256  7.18182   27.47253   1.24031  1.76536   27.47253
Estimation       WAcc      0.99500  0.99350   0.99624    0.99954  0.99807   0.99962
with CWME        WErr      0.00500  0.00650   0.00376    0.00046  0.00193   0.00038
                 TCR       4.32432  7.43529   26.59574   4.32432  2.38491   26.59574


Figure 7.1: Classification accuracies with individual and co-weighted multi-estimations, for λ = 9

Figure 7.2: Total cost ratios with individual and co-weighted multi-estimations, for λ=9


Figure 7.3: Classification accuracies with individual and co-weighted multi-estimations, for λ = 99

Figure 7.4: Total cost ratios with individual and co-weighted multi-estimations, for λ = 99


7.2.2

Analysis

The observations made during the experiments, and their analysis, are outlined below.

• The performance of the filter based on document frequencies is average for the Ling Spam and Annexia/Xpert datasets, with zero false positives, but there are some false positives for Spam Assassin. This is because of the presence of hard, spammy words in the Spam Assassin dataset. Accuracy and total cost ratios are around the average for all datasets at λ = 9, but at λ = 99 both drop for Spam Assassin because of the false positives.

• The filter based on token frequencies alone yields the highest false negatives, though with fewer false positives for Spam Assassin. It is effective at containing false positives, but at the cost of false negatives. Consequently, its weighted accuracies and total cost ratios are lower at λ = 9 and relatively higher at λ = 99.

• The filter based on average token occurrences alone yields the fewest false negatives of the three individual estimations, but at the cost of some false positives for Ling Spam and Spam Assassin. This gives relatively higher accuracies and total cost ratios for the lower penalty λ = 9, which drop for the higher penalty λ = 99.

• Thus the performance of the filter based on each individual estimation fluctuates across the datasets. The document-frequency filter tries to balance false negatives against false positives; the token-frequency filter works in favor of reducing false positives at the cost of a few more false negatives; and the average-token-occurrence filter works in favor of reducing false negatives at the cost of a few more false positives. The proposed co-weighted multi-estimations approach, which combines the strengths of all three estimations by co-weighting them, handles all cases consistently, yielding higher accuracies and spam precisions together with fewer false positives. As the filter is tuned for high accuracy and low false positives, it gives slightly lower spam recall and slightly higher false negative rates, which is acceptable since false positives are generally considered far worse than false negatives. More importantly, the new approach of integrated multi-weighted estimations exhibits more stable and consistent behavior across all three corpora.

7.3

Experiments with Co-weighted Multi-area Information

7.3.1

Experiments and Results

The experiments are performed by testing the filter with four individual email areas: normal headers only, subject only, body only and html tags only; then with all areas but treating the same token occurring in different areas as different tokens; and finally with the proposed new approach of co-weighted multi-area information. As with the co-weighted multi-estimations, all tests are carried out with the parameter combination obtained during parameter tuning, and performance measures are reported for the two values of λ: 9 and 99. Test results are given in Tables 7.3 and 7.4. Comparative results based on weighted accuracy and total cost ratio are shown graphically in Figs. 7.5–7.8 below.


Table 7.3: Test results (SR, SP, FPR and FNR) with individual area-wise and co-weighted multi-area based estimations

Areas               Measures   Ling Spam   Spam Assassin   Annexia/Xpert
Normal Headers      SR         0.00000     0.62180         0.76680
Only                SP         NA          0.98500         1.00000
                    FPR        0.00000     0.00430         0.00000
                    FNR        1.00000     0.37820         0.23320
Subject Only        SR         0.43750     0.40510         0.60040
                    SP         1.00000     0.99220         1.00000
                    FPR        0.00000     0.00140         0.00000
                    FNR        0.56250     0.59490         0.39960
Body Only           SR         0.81880     0.84340         0.92880
                    SP         1.00000     0.98890         0.99610
                    FPR        0.00000     0.00430         0.00360
                    FNR        0.18120     0.15660         0.07120
Html Tags Only      SR         0.00000     0.42090         0.22280
                    SP         NA          0.97440         0.97720
                    FPR        0.00000     0.00510         0.00520
                    FNR        1.00000     0.57910         0.77720
All but Separate    SR         0.83120     0.88290         0.96920
Areas               SP         1.00000     0.99470         1.00000
                    FPR        0.00000     0.00220         0.00000
                    FNR        0.16880     0.11710         0.03080
Co-weighted         SR         0.83750     0.88610         0.97240
Multi-areas         SP         1.00000     0.99640         1.00000
                    FPR        0.00000     0.00140         0.00000
                    FNR        0.16250     0.11390         0.02760


Table 7.4: Weighted accuracy rates, error rates and total cost ratios with individual area-wise and co-weighted multi-area based estimations

                              λ = 9                              λ = 99
Areas            Measures  Ling     Spam      Annexia/   Ling     Spam      Annexia/
                           Spam     Assassin  Xpert      Spam     Assassin  Xpert
Baseline         WAcc_b    0.97837  0.95168   0.90000    0.99799  0.99541   0.99000
                 WErr_b    0.02163  0.04832   0.10000    0.00201  0.00459   0.01000
Normal Headers   WAcc      0.97837  0.97760   0.97668    0.99799  0.99394   0.99767
Only             WErr      0.02163  0.02240   0.02332    0.00201  0.00606   0.00233
                 TCR       1.00000  2.15700   4.28816    1.00000  0.75870   4.28816
Subject Only     WAcc      0.98783  0.96988   0.96004    0.99887  0.99583   0.99600
                 WErr      0.01217  0.03012   0.03996    0.00113  0.00417   0.00400
                 TCR       1.77778  1.60406   2.50250    1.77778  1.10105   2.50250
Body Only        WAcc      0.99608  0.98830   0.98964    0.99964  0.99496   0.99572
                 WErr      0.00392  0.01170   0.01036    0.00036  0.00504   0.00428
                 TCR       5.51724  4.13072   9.65251    5.51724  0.91198   2.33863
Html Tags Only   WAcc      0.97837  0.96720   0.91760    0.99799  0.99230   0.98708
                 WErr      0.02163  0.03280   0.08240    0.00201  0.00770   0.01292
                 TCR       1.00000  1.47319   1.21359    1.00000  0.59679   0.77399
All but          WAcc      0.99635  0.99228   0.99692    0.99966  0.99730   0.99969
Separate Areas   WErr      0.00365  0.00772   0.00308    0.00034  0.00270   0.00031
                 TCR       5.92593  6.25743   32.46753   5.92593  1.70350   32.46753
Co-weighted      WAcc      0.99648  0.99312   0.99724    0.99967  0.99804   0.99972
Multi-areas      WErr      0.00352  0.00688   0.00276    0.00033  0.00196   0.00028
                 TCR       6.15385  7.02222   36.23188   6.15385  2.34074   36.23188


Figure 7.5: Classification accuracies with individual area-wise and co-weighted multi-area based estimations, for λ = 9

Figure 7.6: Total cost ratios with individual area-wise and co-weighted multi-area based estimations, for λ = 9


Figure 7.7: Classification accuracies with individual area-wise and co-weighted multi-area based estimations, for λ = 99

Figure 7.8: Total cost ratios with individual area-wise and co-weighted multi-area based estimations, for λ = 99


7.3.2

Analysis

On analyzing the experiments, the following is observed:

• With individual area-wise estimations:

  ∗ The Ling Spam dataset contains no normal headers and no html tags. Tests based on normal headers only and html tags only therefore give the same result as the baseline: because of the filter's biased setup, it misclassifies all spams as legitimate, yielding SR = 0, FNR = 100% and TCR = 1 for both values of λ. Filtering based on an individual area is thus meaningless for datasets with no data in that area.

  ∗ Even in the Spam Assassin and Annexia/Xpert datasets, not all emails contain html tags. The filter classifies all emails without them as legitimate, so the html-only test yields much higher false negatives. The penalty for these errors is so high at λ = 99 that TCR < 1 for both datasets, indicating the ineffectiveness of the html-only test; at λ = 9 the TCR values are only slightly greater than 1.

  ∗ For Ling Spam, the body-only test performs better than the subject-only test, with higher accuracies, SR and TCR (above 1 in both cases), lower FNR, zero FPR and 100% SP. The scarcity of spammy tokens in the subject areas of spam emails leads to a higher number of false negatives in the subject-only test. Since FPR = 0, the performance trend is the same for both values of λ.

  ∗ For Spam Assassin, the normal-headers-only, body-only and html-tags-only tests yield higher false positives because of the heavy presence of spammy tokens in those areas, especially in the hard hams. This heavily penalizes their accuracies at λ = 99, where the subject-only test, despite its higher FNR and lower SR, achieves higher accuracy and TCR because of its fewer false positives. With the lesser penalty λ = 9, accuracies and TCRs are highest for body only, followed by normal headers only, subject only and lastly html tags only, in descending order.

  ∗ For Annexia/Xpert, the normal-headers-only and subject-only tests yield zero false positives but high false negatives, again because of the relatively low presence of spammy tokens in those areas of spam emails. The html-only test yields both high false positives and very high false negatives: the false positives arise from spammy tokens in the html tags of legitimate emails, and the very high false negatives from misclassifying spam emails without html tags as legitimate. With the lower value λ = 9, the best accuracy and TCR values come from the body-only test, followed by normal headers only, subject only and html tags only in descending order; with the higher value λ = 99, the order is normal headers only, subject only, body only and html tags only.

• With all areas combined, but treating the same token in different areas as different tokens, the test performs better than the individual area-wise estimations for both values of λ, in terms of all performance measures on all three datasets, due to the combined positive effect of all areas.


• With the co-weighted multi-area information approach, the performance improves further, with even fewer false positives as well as fewer false negatives on all three datasets, yielding better values of accuracy, SR, SP, FPR, FNR and TCR for both values of λ. The TCR value is almost the same for the Ling Spam and Annexia/Xpert datasets for both values of λ; however, the value is three times higher for λ = 9 than for λ = 99, because of the higher penalty on false positives with the larger λ.

Thus the experiments showed that the proposed approach of incorporating the correlation between tokens in different areas yields a significant improvement in the performance of the filter, while at the same time exhibiting more stable, robust and consistent performance across all three corpora.
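The cost-sensitive measures behind this analysis can be sketched in a few lines. The helper below is illustrative (the class and method names are our own, not from the thesis application) and assumes the standard definitions of weighted accuracy and total cost ratio with false-positive penalty λ; it shows why a single false positive can push TCR below 1 when λ = 99 while TCR stays well above 1 when λ = 9.

```java
// Sketch of cost-sensitive measures, assuming the standard definitions:
//   WAcc = (lambda*nLL + nSS) / (lambda*nL + nS)
//   TCR  = nS / (lambda*nFP + nFN)
public class CostMeasures {
    // nSS: spam classified as spam,        nFN: spam classified as legitimate
    // nLL: legitimate kept as legitimate,  nFP: legitimate classified as spam
    public static double weightedAccuracy(int nSS, int nFN, int nLL, int nFP,
                                          double lambda) {
        double nS = nSS + nFN, nL = nLL + nFP;
        return (lambda * nLL + nSS) / (lambda * nL + nS);
    }

    public static double totalCostRatio(int nSS, int nFN, int nFP, double lambda) {
        double nS = nSS + nFN;              // total spam messages
        return nS / (lambda * nFP + nFN);   // one FP costs lambda FNs
    }

    public static void main(String[] args) {
        // Illustrative counts: 100 spams, 10 missed, 1 false positive.
        double tcr9  = totalCostRatio(90, 10, 1, 9.0);   // 100 / (9*1 + 10)
        double tcr99 = totalCostRatio(90, 10, 1, 99.0);  // 100 / (99*1 + 10)
        System.out.println(tcr9 > 1 && tcr99 < 1);       // prints true
    }
}
```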

Chapter 8

Conclusions and Future Work

8.1 Conclusions

The Bayesian statistical filter is one of the best spam filters in terms of simplicity, efficiency, robustness and adaptability to spammers' new tricks. Even though it rests on the simplified and rather poor assumption of feature independence known as the Naive Bayes probability model, it works amazingly well in spam filtering. It has gone through a series of evolutions, especially in its probability estimations, from Naive Bayes to Paul Graham's and then Gary Robinson's algorithm, with significant improvements at each step. Still, the performance can be improved further and the filter made more stable. We presented two approaches in this thesis, and both showed significant improvements in the performance of the filter. The first approach co-relates three probability estimations based on token and document frequencies by co-weighting them linearly, with normalized weighting coefficients set by analyzing experiments on the individual and combined estimations. Experimental results on all three public corpora showed that the new algorithm is more stable and consistent across heterogeneous datasets than any single individual estimation, and also gives improved results on average.
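The linear co-weighting of the first approach can be illustrated by a minimal sketch. The names are hypothetical: p1, p2, p3 stand for the three token/document-frequency based estimations and w1, w2, w3 for the weighting coefficients, which must be normalized so the result remains a valid probability.

```java
// Sketch of linearly co-weighting three probability estimations:
//   p = w1*p1 + w2*p2 + w3*p3, with w1 + w2 + w3 = 1.
// The estimation values themselves would come from the frequency formulas.
public class CoWeighted {
    public static double coWeight(double p1, double p2, double p3,
                                  double w1, double w2, double w3) {
        double sum = w1 + w2 + w3;                 // normalize defensively
        return (w1 * p1 + w2 * p2 + w3 * p3) / sum;
    }

    public static void main(String[] args) {
        // Three hypothetical estimates for one token, combined linearly.
        double p = coWeight(0.90, 0.60, 0.75, 0.05, 0.70, 0.25);
        System.out.println(p);  // a value between the three inputs
    }
}
```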


The second approach builds on the fact that occurrences of the same token in different areas of an email are inter-related and that this co-relation also plays a significant role in the filtering process; it therefore incorporates this into a new statistical Bayesian filter based on co-weighted multi-area information. The new algorithm co-relates the area-wise token probability estimations using weight coefficients computed from the number of occurrences of the token in those areas. Experimental results showed a significant improvement in filtering performance over both the individual area-wise estimations and the use of separate estimations for all areas. Moreover, the performance is much more stable and consistent across all three datasets. Because of the relative homogeneity and fairly consistent nature of the emails, both approaches perform even better on a personal email system when customized to it by proper parameter tuning and adjustment.
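A minimal sketch of the second approach, assuming weights taken proportional to the token's occurrence counts in each area (the names and the exact weighting rule are illustrative, not the thesis implementation):

```java
// Sketch of combining area-wise probability estimates for one token,
// weighting each area by the token's share of occurrences in that area.
public class AreaCoWeight {
    public static double combine(double[] areaProbs, int[] areaCounts) {
        double total = 0, weighted = 0;
        for (int c : areaCounts) total += c;
        if (total == 0) return 0.5;                       // unseen token: neutral
        for (int a = 0; a < areaProbs.length; a++)
            weighted += (areaCounts[a] / total) * areaProbs[a];
        return weighted;
    }

    public static void main(String[] args) {
        // Token seen 8 times in the body, 2 in the headers, never in html tags.
        double p = combine(new double[]{0.9, 0.7, 0.5}, new int[]{8, 2, 0});
        System.out.println(p);  // pulled toward the body estimate
    }
}
```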

8.2 Future Work

Both approaches presented in this thesis use our own preprocessing and tokenizer rules. As filter performance depends largely on these as well, future developments may include integrating the approaches with phrase-based and/or other lexical analyzers and with richer feature extraction methods, which can be expected to achieve even better performance. The approaches could also be studied in combination with a sliding window of tokens [29]. Moreover, the performance varies widely with a number of parameters that still leave scope for fine tuning, so further study of the approach can be continued. The approach can also be studied with three-way classification, adding an undetermined/uncertain class for emails whose spam probability is near the neutral value 0.5. A further exploration would be to continue studying the proposed algorithms and to integrate the two approaches.

Appendix A

Implementation of Filter Application

In this appendix we discuss how the filtering application based on the algorithms described in Chaps. 4 and 5 is implemented. The data structures used to implement the application, the source files and the data files are described. The application is developed in Java 1.5 (the latest version at the time of development). One motivation for using Java as the implementation language is that it is platform independent, so the compiled program can run on many different architectures. Another is that it is a strict, imperative language, well suited to a task in which a large amount of data has to be accessed rapidly. The main motivation, however, is its well-developed API (library), which simplifies and speeds up the implementation effort.

A.1 Data Structures

This section describes all the data structures used to store and manipulate data in the application. All data structures are expressed in Java syntax.


1. Stats: It stores the overall statistical information collected during training of the filter.

class Stats {
    FreqTable freqTable;    // Frequency table structure to store frequency data
    int numSpams,           // Number of spam emails
        numLegts;           // Number of legitimate emails
}

2. FreqTable: It stores the table of area-wise token and email frequencies and also stores their totals.

class FreqTable {
    HashMap tokensmap;      /* Stores tokens and their area-wise frequencies
                               (occurrences) and the number of emails containing
                               each token. Implemented using Java's HashMap. */
    Frequencies totalFreqs; /* Stores the total occurrences of all tokens in the
                               different email areas in a Frequencies structure. */
}

3. Frequencies: It stores area-wise token and email frequencies using two Counts data structures.


class Frequencies {
    Counts emailCounts,  // Email counts
           occurCounts;  // Token occurrence counts
}

4. Counts: The Counts data structure stores area-wise count values for both spam and legitimate emails. It is used to hold both the token counts and the email counts in the Frequencies data structure.

class Counts {
    int[][][] numCounts; /* A three-dimensional array:
                            - the first dimension identifies 0-SPAM or 1-LEGITIMATE;
                            - the second dimension signifies the area, e.g.
                              0-HEADER, 1-BODY, 2-HTMLTAGS;
                            - the third dimension indexes the sub-area within the
                              area, e.g. 0-FONT, 1-A, 2-IMG tags for HTMLTAGS. */
}

The Counts data structure is made extensible so that it can handle any number of email areas and sub-areas. This allows further studies in spam filtering using the same application without any modification.


5. Category: It is used to store the category information. It supports multiple categories in addition to SPAM and LEGITIMATE, so that the application can later be extended to email categorization with multiple categories.

class Category {
    String name;    /* Name of the category: SPAM or LEGITIMATE. */
    double priProb, /* Prior probability for the category */
           calProb; /* Calculated probability for the category */
}

6. Stopwords: Stop words are stored using Java's HashSet, which makes it easy to test whether a word is a member of the stop-words list.
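The area-wise counting behind the Counts structure described above can be made concrete with a short sketch; the constants below are illustrative stand-ins for the application's enums.

```java
// Sketch of indexing the three-dimensional Counts array:
// [category][area][sub-area], with illustrative constant names.
public class CountsDemo {
    static final int SPAM = 0, LEGT = 1;
    static final int HEADER = 0, BODY = 1, HTMLTAGS = 2;
    static final int FONT = 0, A = 1, IMG = 2;

    public static void main(String[] args) {
        int[][][] numCounts = new int[2][3][3];

        numCounts[SPAM][HTMLTAGS][IMG]++;   // one spam occurrence inside an <img> tag
        numCounts[SPAM][BODY][0] += 5;      // five spam occurrences in the body
        numCounts[LEGT][HEADER][0]++;       // one legitimate occurrence in a header

        System.out.println(numCounts[SPAM][HTMLTAGS][IMG]);  // 1
    }
}
```

Because the dimensions are just array lengths, adding another area or sub-area only changes the allocation, which matches the extensibility claim above.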

A.2 Source Files

The complete list of source files that make up the application is given below with descriptions.

• DatasetPreparer.java prepares training and test datasets from a given corpus dataset.
• Tokenizer.java implements the tokenizer rules for tokenizing email text.
• Trainer.java trains the filter with the training dataset. The statistical information collected during training is saved into a data file (by default "data.txt").


• Stats.java saves trained data into a data file and loads it back when needed.
• FreqTable.java implements the data structure that stores the frequency table using a hash table.
• Frequencies.java implements the data structure that stores email and token frequencies using two Counts objects from the Counts class in Counts.java.
• Counts.java implements the data structure that stores area-wise token or email counts.
• Classifier.java classifies one or more email files using the specified algorithm and displays the results.
• Tester.java, like the classifier, is used to classify emails, but unlike the classifier it can perform a series of tests with changing parameters and saves the results into a test output file. It is used for tuning the filter.
• Algorithm.java implements the abstract algorithm structure containing the data and methods common to all algorithms: NaiveBayes, PGrahamBayes, GRobinsonBayes, RShresthaBayes1 and RShresthaBayes2.
• NaiveBayes.java implements the Naive Bayes spam filtering algorithm introduced in Sect. 2.2.
• PGrahamBayes.java implements Paul Graham's spam filtering algorithm described in Sect. 2.3.
• GRobinsonBayes.java implements Gary Robinson's spam filtering algorithm discussed in Sect. 2.4.


• RShresthaBayes1.java implements the first proposed approach, discussed in Chap. 4.
• RShresthaBayes2.java implements the second proposed approach, discussed in Chap. 5.
• PorterStemmer.java can be used for stemming tokens. Even though we have not used stemming and stop words, these features are made available in the application to allow further study with them.
• Category.java implements the category data structure that stores category (spam, legitimate, undetermined) information during the filtering process.
• Utils.java contains common utility methods that can be called by several other classes.
• Constants.java defines all constants and enumerated data global to the filtering application.


A.3 Data Files

Four kinds of data files are used in the application:

1. Email File: MH-format email files are used as input to the filter application for both training and classification.
2. Training Output File: The output of filter training is saved into a text or binary data file.
3. Classifier Output File: The classifier saves the result of the classification into a result file. The content of the file looks like the one shown in Fig. B.3.
4. Test Output File: The tester writes the result of the test into a test output report file, which helps in analyzing the results of the filter. The content of the report looks like the one shown in Fig. B.4.

Appendix B

Application User's Manual

This appendix outlines the complete user's manual for the spam filter application developed for this thesis.

B.1 System Requirements

Since the application is developed in Java 1.5, it only needs a computer system (running Windows, Unix or any other operating system) with Java Runtime Environment (JRE) 1.5 or above installed. The system requirements for Java 1.5 therefore apply here too.

B.2 Installation of the Application

The application is distributed as four executable jar files, DatasetPreparer.jar, Trainer.jar, Classifier.jar and Tester.jar, corresponding to the four tools available in the application. Installation is quite simple: just copy the files into any directory (say, C:/RsBSFilter/) and the application is ready to use.


B.3 Running and Using the Application

The application is made up of several tools, all distributed as jar files. To run a tool, enter a command from the command line following the syntax:

>java -jar toolsname.jar arguments

All the tools and their usage are discussed below.

B.3.1 Dataset Preparer

This tool, as the name indicates, is used to prepare training and test datasets from the given corpus dataset.

Usage: >java -jar DatasetPreparer.jar inpath [outpath [spamSize [legtSize]]]

inpath - the input path of the original corpus data, with two sub-directories spam and legt containing all the spam and legitimate emails of the corpus respectively.
outpath - the output path in which the resulting training and test datasets are placed, in train and test sub-directories. If not supplied, inpath is used as the outpath as well.
spamSize and legtSize - the numbers of spam and legitimate emails respectively to be used in the training dataset. The values should be less than the total numbers of emails in the original corpus dataset. If these optional parameters are not supplied, the tool automatically picks two-thirds of the spam/legitimate emails as the training data and the remaining one-third as the test data.


Examples:
* >java -jar DatasetPreparer.jar C:/CorpusData/LingSpam
* >java -jar DatasetPreparer.jar C:/CorpusData/LingSpam 400 1600

The dataset preparer randomly picks spamSize spam emails and legtSize legitimate emails from inpath/spam and inpath/legt, and places them in outpath/train/spam and outpath/train/legt respectively. The remaining spam and legitimate emails are used as the test dataset and placed in outpath/test/spam and outpath/test/legt. The train and test directories, if they do not exist, are created by the tool. The original corpus dataset is not affected, so it can be used again later.

B.3.2 Trainer

This tool is used to train the filter with the supplied training dataset containing a certain number of spam and legitimate emails.

Usage: >java -jar Trainer.jar path [datafile [stopwordsfile]]

path - the training data path created by the dataset preparer, which consists of two sub-directories spam and legt containing the spam and legitimate emails respectively.
datafile - file name (with or without path) of the output data file into which the trainer saves the result of the training. If not supplied, the default name data.txt is used.
stopwordsfile - name of a text file containing the list of stop words. If supplied, the tool excludes the stop words from the valid tokens during tokenization.


Figure B.1: Snapshot of Trainer in action

Examples:
* >java -jar Trainer.jar C:/TrainData/SpamAssassin data-sa.txt
* >java -jar Trainer.jar C:/TrainData/SpamAssassin C:/Output/data-sa.txt C:/Input/stopwords.txt

The trainer trains the filter using all the spam and legitimate emails of the training dataset and saves the result into the datafile. It shows the progress of the training on the screen. A snapshot of the trainer in action is shown in Fig. B.1.

B.3.3 Classifier

It is used to classify a given email, or a bulk of emails placed in one or more directories, and display the result.


Usage: >java -jar Classifier.jar emailfile|path datafile [algorithm [parameters [stopwordsfile]]]

emailfile|path - an email file, or a path containing the emails to be tested.
datafile - file name (with or without path) of the training data file created by the trainer. The classifier loads this data and uses it in the probability calculations during classification.
algorithm - specifies which algorithm to use for classifying. Its value can be one of six possible values: nb|pb|rb|sb|sb1|sb2, where nb is Naive Bayes, pb is Paul Graham's algorithm, rb is Gary Robinson's algorithm, and sb1 (ShresthaBayes1) and sb2 (ShresthaBayes2) are the first and second approaches presented in this thesis. The default algorithm is sb1.
parameters - supplies optional parameters to the classification algorithm, specified as a comma-separated list of key=value pairs: paramname1=value1,paramname2=value2, ... The valid parameter types depend upon the algorithm used; the complete list of valid parameters for the algorithms is given in Table B.1.
stopwordsfile - name of a text file containing the list of stop words. If supplied, the tool excludes the stop words from the valid tokens during tokenization.

Examples:
* >java -jar Classifier.jar C:/TestData/AnnexiaXpert data-ax.txt
* >java -jar Classifier.jar C:/TestData/AnnexiaXpert C:/Outputs/data-ax.txt sb1
* >java -jar Classifier.jar C:/TestData/AnnexiaXpert data-ax.txt sb1 "s=0.6,x=0.35,APPROACH=3"
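Parsing such a comma-separated key=value parameter string is straightforward; the sketch below is a hypothetical illustration of the idea, not the application's actual parseParameters() implementation.

```java
// Sketch of parsing a parameter string like "s=0.6,x=0.35,APPROACH=3"
// into a key -> value map (names here are illustrative).
import java.util.HashMap;

public class ParamParser {
    public static HashMap<String, String> parse(String parameters) {
        HashMap<String, String> map = new HashMap<>();
        for (String pair : parameters.split(",")) {
            String[] kv = pair.trim().split("=", 2);   // split on first '=' only
            if (kv.length == 2) map.put(kv[0].trim(), kv[1].trim());
        }
        return map;
    }

    public static void main(String[] args) {
        HashMap<String, String> p = parse("s=0.6,x=0.35,APPROACH=3");
        System.out.println(p.get("x"));  // 0.35
    }
}
```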


Table B.1: Valid parameters corresponding to six algorithms

Algorithm        Parameter           Default        Description
nb               PPROB_SPAM          N_S/(N_S+N_L)  Prior probability of spam
                 PROB_UNKNOWN_TOKEN  0.9            Prob. set to unknown tokens
                 PROB_MAX            0.99           Max. prob. for a token occurring in only
                                                    one training set (spam or legitimate)
pb               SPAM_THRESHOLD      0.9            Threshold value
                 BIAS_TO_LEGT        2              Bias multiplier to legitimate
                 MIN_IGNORE_COUNT    5              Min. token occurrences required
                 NUM_INTTOKENS_USE   15             No. of interesting tokens
rb, sb1, sb2     SPAM_THRESHOLD      0.9            Threshold value
                 PROB_OFFSET         0.42           Prob. offset from neutral 0.5
                 NUM_INTTOKENS_USE   19             No. of interesting tokens
                 s                   0.6            Degree of belief required
                 x                   0.35           Prob. of unknown tokens
sb1 (additional) APPROACH            3              1, 2 or 3 for the three approaches
                 w1                  0.05           Weight factor for approach 1
                 w2                  0.70           Weight factor for approach 2
                 w3                  0.25           Weight factor for approach 3
sb2 (additional) AREAMODE            2              0-One Area, 1-Separate, 2-Combined
                 AREAINDEX           0              Used to index sub-area when AREAMODE=0


Figure B.2: Snapshot of Classifier output

The classifier reads the test emails and classifies them one by one as spam or legitimate using the specified algorithm and parameters; the result is shown on the screen and also saved into a file. The report file is created with the file name format datafilename-algorithm-result.txt (e.g. data-ls.txt-NAIVE_B-result.txt). Snapshots of the screen output and the file output are shown in Figs. B.2 and B.3 respectively. The output file also contains the probability estimates for spam, the confusion matrix and other statistical information.

B.3.4 Tester

This tool is used to test one or more datasets, optionally with varying parameters, which is useful for tuning the filter's parameter values for optimal performance.

Usage: >java -jar Tester.jar algorithms datasets [fxdparameters [varparameters [stopwordsfile]]]


Figure B.3: An example of the contents of the classifier output file

algorithms - comma-separated list of algorithms with which to test the filter. As in the classifier, there are six possible values for an algorithm: nb|pb|rb|sb|sb1|sb2.
datasets - list of dataset paths with corresponding training data file names, in the format: datasetpath1=datafile1,datasetpath2=datafile2, ...
fxdparameters - supplies the list of fixed parameters in the format: paramname1=value1,paramname2=value2, ... The valid parameters are those introduced in Table B.1, depending upon the algorithm.
varparameters - supplies the list of variable parameters with their ranges of values to be tested, in the format: paramname=initialvalue:finalvalue:stepvalue, ...
stopwordsfile - name of a text file containing the list of stop words. If supplied, the tool excludes the stop words from the valid tokens during tokenization.


Examples:
* >java -jar Tester.jar "sb1,sb2" "C:/TestData/LingSpam=C:/Outputs/data-ls.txt, C:/TestData/SpamAssassin=C:/Outputs/data-sa.txt"
* >java -jar Tester.jar "sb1,sb2" "C:/TestData/LingSpam=C:/Outputs/data-ls.txt, C:/TestData/SpamAssassin=C:/Outputs/data-sa.txt, C:/TestData/AnnexiaXpert=C:/Outputs/data-ax.txt" "SPAM_THRESHOLD=0.9,NUM_INTTOKENS_USE=19" "s=0.1:1.0:0.1,x=0.3:0.7:0.05"

The tester performs testing according to the following algorithmic steps:

• For each algorithm in algorithms:
  ∗ For each dataset in datasets:
    ◦ Set the fixed parameters from fxdparameters.
    ◦ For each variable parameter in varparameters:
      ? For each value from initialvalue to finalvalue, in increments of stepvalue, test the filter and save the test report into the test output file testreport.txt.
    ◦ Analyze the filter's performance and obtain the optimal value for the variable parameter.
    ◦ Use it as an additional fixed parameter for the next test.
  ∗ Save all the parameter values into the test report file.

The contents of the tester output file are shown in Fig. B.4. The tester tool is thus very useful for tuning and analyzing the filter's performance.
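The sweep over one variable parameter ("paramname=initialvalue:finalvalue:stepvalue") amounts to a simple loop that keeps the best-scoring value. The sketch below is illustrative only, with evaluate() standing in for an actual filter test run.

```java
// Sketch of the tester's sweep over one variable parameter (e.g. s=0.1:1.0:0.1),
// keeping the value that scores best. evaluate() is a placeholder score function,
// not the thesis code; here it simply peaks at s = 0.6.
public class ParamSweep {
    static double evaluate(double s) {
        return -(s - 0.6) * (s - 0.6);   // stand-in for a real filter evaluation
    }

    public static void main(String[] args) {
        double best = Double.NEGATIVE_INFINITY, bestS = Double.NaN;
        // Small epsilon guards against floating-point drift at the range end.
        for (double s = 0.1; s <= 1.0 + 1e-9; s += 0.1) {
            double score = evaluate(s);
            if (score > best) { best = score; bestS = s; }
        }
        System.out.printf("optimal s = %.1f%n", bestS);  // reports the peak
    }
}
```

The real tester then fixes the optimal value found and moves on to the next variable parameter, as in the steps above.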


Figure B.4: An example of the contents of the test output file

Appendix C

Program Documentation

This appendix provides complete program documentation for the filter application, generated from the program source using Java's javadoc utility.

C.1 Package and Class Summaries

The whole application is packaged into one package named rsspambayes. Summary reports on the classes and enumeration types belonging to the package are given in the following subsections.

C.1.1 Class Summary

Algorithm - The abstract Algorithm class defines the essential data and methods for Bayesian spam filtering algorithms.
Category - The Category class defines the category information for an email.
Classifier - The Classifier class is the application used to classify test email(s) as spam or legitimate.
Counts - The Counts class represents the spam and legitimate counts in various areas of email.
DatasetPreparer - This class defines the DatasetPreparer application used to prepare training and test datasets from given corpus data.
FreqTable - The FreqTable class represents the token and email frequencies for training data.
Frequencies - It represents token and email frequencies.
GRobinsonBayes - The GRobinsonBayes class implements Gary Robinson's spam filtering algorithm with our definition of preprocessing and feature extraction.


NaiveBayes - It implements the Naive Bayes spam filtering algorithm with our definition of preprocessing and feature extraction.
PGrahamBayes - This class implements Paul Graham's spam filtering algorithm with our definition of preprocessing and feature extraction.
RShresthaBayes1 - The RShresthaBayes1 class implements our algorithm based on co-weighted multi-estimations.
RShresthaBayes2 - This algorithm is based on our approach of co-weighted multi-area information.
Stats - The Stats class represents the statistical information collected during the training session.
Tester - The Tester class is the application used for parameter tuning and testing the filter application.
Tokenizer - The Tokenizer class implements the preprocessing and tokenization of emails.
Trainer - The Trainer class is the trainer application used to train the spam filter with the defined training dataset.
Utils - The Utils class defines all utility methods used by other classes in the application.

C.1.2 Enum Summary

Algorithms - This enumeration defines all algorithms supported by the application.
Areas - Defines the major areas of email that are considered by the application.
EmailCats - Defines the possible email categories, like spam and legitimate.
Headers - Defines the email headers considered by the filter.
HtmlTags - Defines categories of all html tags.


C.2 Hierarchy For Package rsspambayes

C.2.1 Class Hierarchy

o java.lang.Object
    o rsspambayes.Algorithm
        o rsspambayes.GRobinsonBayes
        o rsspambayes.NaiveBayes
        o rsspambayes.PGrahamBayes
        o rsspambayes.RShresthaBayes1
        o rsspambayes.RShresthaBayes2
    o rsspambayes.Category (implements java.io.Serializable)
    o rsspambayes.Classifier
    o rsspambayes.Counts (implements java.io.Serializable)
    o rsspambayes.DatasetPreparer
    o rsspambayes.FreqTable (implements java.io.Serializable)
    o rsspambayes.Frequencies (implements java.io.Serializable)
    o rsspambayes.PorterStemmer (implements java.io.Serializable)
    o rsspambayes.Stats (implements java.io.Serializable)
    o rsspambayes.Tester
    o rsspambayes.Tokenizer
    o rsspambayes.Trainer
    o rsspambayes.Utils

C.2.2 Enum Hierarchy

o java.lang.Object
    o java.lang.Enum (implements java.lang.Comparable, java.io.Serializable)
        o rsspambayes.EmailCats
        o rsspambayes.Areas
        o rsspambayes.Algorithms
        o rsspambayes.Headers
        o rsspambayes.HtmlTags


C.3 Class Details

Details of all the classes are discussed in this section in alphabetical order.

C.3.1 Algorithm

java.lang.Object
    rsspambayes.Algorithm

Direct Known Subclasses:
    GRobinsonBayes, NaiveBayes, NaiveBayesOriginal, PGrahamBayes, RShresthaBayes1, RShresthaBayes2

public abstract class Algorithm
extends java.lang.Object

The abstract Algorithm class declares the essential data and methods for all Bayesian spam filtering algorithms to be inherited from it.

Field Detail

Category emailCat
    Email category information for an email under test.

Stats trnStat
    Training statistical data.

Constructor Detail

public Algorithm(Stats stats)
    Constructs an algorithm object with the given training Stats.
    Parameters:
        stats - training statistical data.

Method Detail

abstract Category classify(FreqTable emlFTbl)
    Classifies an email with the given frequency table.

abstract java.lang.String getParameters()
    Gets the comma-separated list of key=value pairs of all parameters of the algorithm.

abstract void parseParameters(java.util.ArrayList parsmap)
    Parses the algorithm's parameters from the given list of parameters and sets their values.

public void setParameters(java.lang.String parameters)
    Sets the values of parameters from the given list of parameters.
    Parameters:
        parameters - comma-separated list of key=value pairs of parameters.

Methods inherited from class java.lang.Object
    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

C.3.2 Category

java.lang.Object
    rsspambayes.Category

All Implemented Interfaces:
    java.io.Serializable

public class Category
extends java.lang.Object
implements java.io.Serializable

The Category class defines the category information for an email.

Field Detail

private double calProb
    Calculated probability.

private java.lang.String name
    Name of the category.

private double priProb
    Prior probability.

public java.lang.String resultInsight
    The insight into the result of the calculation, which facilitates analysis of the algorithm.

Constructor Detail

public Category()
    Constructs a category object without initializing the information.

public Category(java.lang.String name, double priProb)
    Constructs a category object and initializes it with the given name and prior probability.
    Parameters:
        name - name of the category.
        priProb - prior probability value.

Method Detail

public double getCalculatedProb()
    Gets the calculated probability value of the category.
    Returns: probability value.

public java.lang.String getName()
    Gets the name of the category.
    Returns: category name.

public double getPriorProb()
    Gets the prior probability value of the category.
    Returns: probability value.

public void setCalculatedProb(double prob)
    Sets the calculated probability value of the category.
    Parameters:
        prob - probability value.

public void setName(java.lang.String name)
    Sets the name of the category.
    Parameters:
        name - category name.

public void setPriorProb(double prob)
    Sets the prior probability value of the category.
    Parameters:
        prob - probability value.

Methods inherited from class java.lang.Object
    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

C.3.3 Classifier

java.lang.Object
    rsspambayes.Classifier

public class Classifier
extends java.lang.Object

The Classifier class is the application used to classify test email(s) as spam or legitimate.

Field Detail

Algorithms alg
    Algorithm type used for classification.

Algorithm classifierAlg
    Algorithm object which performs the classification.

int[][] confMatrix
    Contingency or confusion matrix.

java.lang.String dataFile
    Training data file from which the training stats are read.

final int LEGT
    Index value for legitimate in the confusion matrix.

java.io.PrintWriter outStream
    Output stream for writing to the output result file.

java.lang.String parameters
    Parameters used by the classification algorithm.

java.lang.String resultFile
    Classifier output result file, which contains the classification results of all test emails along with probability values and the confusion matrix.

final int SPAM
    Index value for spam in the confusion matrix.

final boolean stem
    Flag indicating whether to use stemming. By default it is false.

java.lang.String stopFile
    File containing the list of stop words.

java.util.HashSet stopWords
    Set of stop words.

Tokenizer tokenizer
    Tokenizer object which tokenizes the email to be classified.

Stats trnStat
    Training statistical data.

Constructor Detail

public Classifier()
    Constructs a default classifier.

public Classifier(java.lang.String stpFile)
    Constructs a classifier with the given stop file.
    Parameters:
        stpFile - stop file.

public Classifier(java.lang.String stpFile, java.lang.String datFile)
    Constructs a classifier with a stop file and a training data file.
    Parameters:
        stpFile - stop file.
        datFile - data file.

public Classifier(java.lang.String stpFile, java.lang.String datFile, Algorithms algorithm)
    Constructs a classifier with a stop file, data file and the algorithm to be used.
    Parameters:
        stpFile - stop file.
        datFile - data file.
        algorithm - algorithm used for classification.

public Classifier(java.lang.String stpFile, java.lang.String datFile, Algorithms algorithm, java.lang.String parameters)
    Constructs a classifier with a stop file, data file and algorithm, along with parameters for the algorithm.
    Parameters:
        stpFile - stop file.
        datFile - data file.
        algorithm - algorithm used for classification.
        parameters - comma-separated list of key=value pairs of parameters.

Method Detail classify public Category classify(java.lang.String fileName)

Classifies the given email file using preset algorithm and parameters. Parameters: fileName - email file.

76

classify public Category classify(java.lang.String fileName, Algorithms algorithm)

Classifies given email using the given algorithm. Parameters:

fileName - email file. algorithm - classification algorithm to be used.

classify public Category classify(java.lang.String fileName, Algorithms algorithm, java.lang.String parameters)

Classifies given email using the given algorithm along with the given custom parameters. Parameters:

fileName - email file. algorithm - classification algorithm to be used. parameters - comma separated list of key = value pairs of parameters.

classifyAll public void classifyAll(java.lang.String path, java.lang.String datFile)

Classifies all the spam and legitimate files in the given path using the training stats data in the supplied data file. Parameters: path – location containing spam and legitimate emails. datFile - training data file.

classifyFiles private int classifyFiles(java.lang.String path, boolean isSpam)

Classifies spam (or legitimate) files in spam or (legitimate) sub directories of the given path. Parameters: path - location containing emails. isSpam - identifies whether to classify spam or legitimate files. Returns: number of files successfully classified.

getAllParameters public java.lang.String getAllParameters()

Gets all parameters used for the set classification algorithm. Returns: comma separated list of key = value pairs of parameters.


getCustomParameters public java.lang.String getCustomParameters()

Gets all customized parameters either set using setParameters() or passed during construction of the classifier. Returns: comma separated list of key = value pairs of parameters.

loadStats public void loadStats()

Loads training stats data from the set training data file.

loadStats public void loadStats(java.lang.String datFile)

Loads training stats data from the given data file.

main public static void main(java.lang.String[] args)

Main program entry point for the classifier application. Usage: Classifier path datFile [nb|pb|sb|sb1|sb2 [parameters [stopwordsfile]]]

Parameters: args - command line arguments for the application.

setAlgorithm public void setAlgorithm(Algorithms algorithm)

Sets algorithm to be used for classification. Parameters:

algorithm - algorithm type value.

setDataFile public void setDataFile(java.lang.String datFile)

Sets training data file to be used by the classifier. Parameters:

datFile - data file.

setParameters public void setParameters(java.lang.String parameters)

Sets parameters list to be used by the classification algorithm. Parameters:

parameters - comma separated list of key = value pairs of parameters.


Methods inherited from class java.lang.Object clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

C.3.4 Counts java.lang.Object rsspambayes.Counts

All Implemented Interfaces: java.io.Serializable class Counts extends java.lang.Object implements java.io.Serializable

The Counts class represents the spam and legitimate counts in various areas of an email. An area signifies a broad part of the email, generally: Header, Body and Html tags, and each area may comprise one or more sub-areas. The number of areas and their corresponding sub-areas are definable. For example:
• Header with 2 sub-headers: Normal Header, Subject Header.
• Body text.
• Html tags, considering 3 tags.

Field Detail numCounts int[][][] numCounts

Structure to store the area-wise counts for spam and legitimates.

Constructor Detail Counts public Counts(int[] areaSizes)

Constructs the count structure according to the supplied areas and their sizes. Parameters:

areaSizes - number of areas and corresponding number of sub-areas.
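As an illustration, the category × area × sub-area layout of the numCounts field can be mimicked with a small self-contained sketch. The class and method names here are hypothetical, not part of the rsspambayes package; the area sizes follow the example above (header with 2 sub-headers, body, 3 HTML tags):

```java
// Minimal sketch of the numCounts int[][][] structure:
// dimensions are [emailCat][area][subArea]; 0 = spam, 1 = legitimate.
class CountsSketch {
    final int[][][] numCounts;

    CountsSketch(int[] areaSizes) {
        numCounts = new int[2][areaSizes.length][];
        for (int cat = 0; cat < 2; cat++)
            for (int a = 0; a < areaSizes.length; a++)
                numCounts[cat][a] = new int[areaSizes[a]];
    }

    // Mirrors incCounts(emailCat, area, areaIndex).
    void inc(int cat, int area, int subArea) { numCounts[cat][area][subArea]++; }

    // Grand total over all categories, areas and sub-areas (cf. getCounts()).
    int total() {
        int sum = 0;
        for (int[][] cat : numCounts)
            for (int[] area : cat)
                for (int c : area) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        CountsSketch counts = new CountsSketch(new int[] {2, 1, 3});
        counts.inc(0, 0, 1);   // spam, header area, subject sub-header
        counts.inc(1, 1, 0);   // legitimate, body
        System.out.println(counts.total());   // prints 2
    }
}
```

The jagged third dimension is what lets each area carry a different number of sub-areas while keeping all counts in one structure.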

Method Detail getAreaSize public int getAreaSize(Areas area)

Gets the number of sub-areas in the given area. Parameters: area - email area. Returns: number of sub-areas in the area.


getAreaSizes public int[] getAreaSizes()

Gets all the email-areas in consideration and sizes of their corresponding sub-areas. Returns: array containing the number of sub-areas corresponding to the areas, represented by the array index.

getCounts public int getCounts()

Gets the grand total count value in all email categories. Returns: total count.

getCounts public int getCounts(Areas area)

Gets the total count value in specified area of all email categories. Parameters: area - email area. Returns: total count.

getCounts public int getCounts(Areas area, int areaIndex)

Gets the total count value in specified sub-area for all email categories. Parameters: area - email area. areaIndex - index representing the sub-area in the area. Returns: total count.

getCounts public int getCounts(EmailCats emailCat)

Gets the total count value for the given email category (summed up from all areas). Parameters:

emailCat - email category.

Returns: total count.

getCounts public int getCounts(EmailCats emailCat, Areas area)

Gets total count value for the given email category and in the given area (summed up from all sub-areas). Parameters:

emailCat - email category. area - email area.

Returns: total count.

getCounts public int getCounts(EmailCats emailCat, Areas area, java.util.HashSet ignore)

Gets total count value for given email category summed up in all sub-areas except those in the ignore list. Parameters:

emailCat - email category. area - email area. ignore - set of sub-areas to be ignored.

Returns: total count.

getCounts public int getCounts(EmailCats emailCat, Areas area, int areaIndex)

Gets the count value for the given email category in the given specific area. Parameters:

emailCat - email category. area - email area. areaIndex - index representing the sub-area in the area.

Returns: count value.

getCounts public int getCounts(EmailCats emailCat, java.util.HashMap ignore)

Gets total count value in the given email category (spam/legitimate) by summing up all the areas and sub-areas except those in the ignore list. Parameters:

emailCat - email category. ignore - hash map of sub-areas to be ignored. It contains (area, HashSet

(subarea indices)) .

Returns: total count.


getCounts public int getCounts(java.util.HashMap ignore)

Gets the total count value in all email categories and areas, i.e. grand total of counts obtained by summing up all the areas and sub-areas except those in the ignore list. Parameters:

ignore - hash map (area, HashSet(subarea indices)) of areas and sub-areas to be ignored.

Returns: total count.

incCounts public void incCounts(EmailCats emailCat, Areas area, int areaIndex)

Increases the count for the given email category in the specific area and sub-area by 1. Parameters:

emailCat - email category. area - email area. areaIndex - index representing the sub-area in the area.

incCounts public void incCounts(EmailCats emailCat, Areas area, int areaIndex, int count)

Increases the count for the given email category in the specific sub-area by the given number. Parameters:

emailCat - email category (spam or legitimate). area - email area. areaIndex - index representing the sub-area in the area. count - count value to be incremented.

merge public void merge(Counts anotherCounts)

Merges the count values from another Counts object. Parameters:

anotherCounts - Counts object to be merged.

Methods inherited from class java.lang.Object clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait


C.3.5 DatasetPreparer java.lang.Object rsspambayes.DatasetPreparer class DatasetPreparer extends java.lang.Object

This class defines the DatasetPreparer application used to prepare training and test data set from given corpus data.

Field Detail TRAIN_TEST_RATIO static final double TRAIN_TEST_RATIO

Training-to-test data ratio; the default value is 2/3.

Constructor Detail DatasetPreparer DatasetPreparer()

Default constructor.

Method Detail main public static void main(java.lang.String[] args)

Main program entry point for the DatasetPreparer application. Usage: DatasetPreparer inpath [outpath [spamSize [legtSize]]]

Parameters: args - command line arguments for the application.

prepareTrainTestFiles public static void prepareTrainTestFiles(java.lang.String inPath, java.lang.String outPath, java.lang.String subPath, int size)

Prepares training data files by randomly choosing the given number of spam and legitimate email files from the /spam and /legt sub-directories in the input path and placing them into the /Train sub-directory of the output path. The remaining files are placed into the /Test sub-directory to use as the test data set. Parameters:

inPath - input path containing original corpus data. outPath - output path in which the prepared training and test datasets are placed.
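The random split described above can be sketched in a few lines. This is a self-contained illustration only, not the DatasetPreparer implementation; the class, method name and seed are assumptions:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

class SplitSketch {
    // Shuffle the file names and take the first fraction for training;
    // the rest form the test set (cf. TRAIN_TEST_RATIO = 2/3).
    static List<List<String>> split(List<String> files, double ratio, long seed) {
        List<String> copy = new ArrayList<>(files);
        Collections.shuffle(copy, new Random(seed));
        int trainSize = (int) Math.round(copy.size() * ratio);
        List<List<String>> out = new ArrayList<>();
        out.add(new ArrayList<>(copy.subList(0, trainSize)));           // train
        out.add(new ArrayList<>(copy.subList(trainSize, copy.size()))); // test
        return out;
    }

    public static void main(String[] args) {
        List<String> files = List.of("m1", "m2", "m3", "m4", "m5", "m6");
        List<List<String>> sets = split(files, 2.0 / 3.0, 42L);
        // With 6 files and a 2/3 ratio: 4 train, 2 test.
        System.out.println(sets.get(0).size() + " train, " + sets.get(1).size() + " test");
    }
}
```

Shuffling once and slicing guarantees the two sets are disjoint and exhaust the corpus.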

Methods inherited from class java.lang.Object clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

C.3.6 FreqTable java.lang.Object rsspambayes.FreqTable

All Implemented Interfaces: java.io.Serializable public class FreqTable extends java.lang.Object implements java.io.Serializable

The FreqTable class represents the token and email frequencies.

Field Detail areaSizes int[] areaSizes

Email areas and their sizes (number of sub-areas)

tokensmap java.util.HashMap tokensmap

Tokens and corresponding frequencies.

totalFreqs Frequencies totalFreqs

Sum total of frequencies in different areas of different category of emails.

Constructor Detail FreqTable public FreqTable(int[] areaSizes)

Constructs frequency table according to given areaSizes. Parameters:

areaSizes - list of size of all the areas represented by array index.

Method Detail contains public boolean contains(java.lang.String token)

Checks if the frequency table contains the token. Parameters:

token - token to check.

Returns: result of the check.

getFrequencies public Frequencies getFrequencies(java.lang.String token)

Gets the Frequencies object for the given token. Parameters:

token - token text.

Returns: Frequency object. If the frequency table doesn’t contain the token, it returns null.

getTokensMap public java.util.HashMap getTokensMap()

Gets the tokens map. Returns: hash map of tokens in the frequency table.

getTotalFrequencies public Frequencies getTotalFrequencies()

Gets the total frequencies object containing total token frequencies and total email frequencies. Returns: Frequencies object containing total frequencies.

insert public void insert(java.lang.String token)

Inserts a token into the frequency table. Parameters:

token - token to be inserted.

insert public void insert(java.lang.String token, EmailCats emailCat, Areas area, int areaIndex)

Inserts the token into the frequency table and increments its occurrence count by 1, but does not increment the email count. Parameters:

token - token to be inserted. emailCat - email category in which the token occurred. area - email area. areaIndex - sub-area in which the token occurred.


insert public void insert(java.lang.String token, EmailCats emailCat, Areas area, int areaIndex, int occurCount)

Inserts the token into the frequency table and increments its occurrence count by the given count value, but does not increment the email count. Parameters:

token - token to be inserted. emailCat - email category in which the token occurred. area - email area. areaIndex - sub-area in which the token occurred. occurCount - token occurrence count.

insert public void insert(java.lang.String token, EmailCats emailCat, Areas area, int areaIndex, int emailCount, int occurCount)

Inserts the token into the frequency table and sets its frequency values in the given sub-area of the given area in the given email category. Parameters:

token - token to be inserted. emailCat - email category in which the token occurred. area - email area. areaIndex - sub-area in which the token occurred. emailCount - email count. occurCount - token occurrence count.

iterator public java.util.Iterator iterator()

Used to obtain an iterator for the tokens in the frequency table. Returns: Iterator object.

merge public void merge(FreqTable fTbl)

Merges the given frequency table. Parameters: fTbl - frequency table to be merged.


merge public void merge(FreqTable fTbl, boolean incEmailCount)

Merges the given frequency table and, if incEmailCount is true, increments the email count for every token in it by 1. Parameters: fTbl - frequency table to be merged. incEmailCount - flag indicating whether or not to increment email counts.

print public void print()

Prints frequency values for each token on the console (useful for debugging and testing).

size public int size()

Gets the size of the frequency table, i.e. the number of tokens in it. Returns: frequency table size. Methods inherited from class java.lang.Object clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

C.3.7 Frequencies java.lang.Object rsspambayes.Frequencies

All Implemented Interfaces: java.io.Serializable class Frequencies extends java.lang.Object implements java.io.Serializable

Represents token and email frequencies. The frequency values of both email and token are stored and computed using Counts objects.

Field Detail emailCounts Counts emailCounts

Area-wise email counts for spam and legitimate.


occurCounts Counts occurCounts

Area-wise token occurrences count for spam and legitimate.

USE_BOTH static final int USE_BOTH

Constant field indicating use of both email and token counts.

USE_EMAIL_COUNTS static final int USE_EMAIL_COUNTS

Constant field indicating use of only email counts.

USE_TOKEN_COUNTS static final int USE_TOKEN_COUNTS

Constant field indicating use of only token counts.

Constructor Detail Frequencies public Frequencies(int[] areaSizes)

Allocates storage for frequencies according to given area sizes. Parameters:

areaSizes - number of sub-areas in every area, identified by array index.

Frequencies public Frequencies(int[] areaSizes, int use)

Constructs Frequencies object according to given area sizes and use of count types. Parameters: areaSizes - number of sub-areas in every area identified by array index. use - indicates use of email count, token count or both.

Method Detail getAreaSize public int getAreaSize(Areas area)

Gets the number of sub-areas in the given area. Parameters: area - email area. Returns: number of sub-areas in area.


getAreaSizes public int[] getAreaSizes()

Gets all the email-areas in consideration and sizes of their corresponding sub-areas. Returns: array containing number of sub-areas corresponding to area, represented by the array index.

getEmailCounts public Counts getEmailCounts()

Gets email counts. Returns: Counts object.

getNumEmails public int getNumEmails()

Gets the total number of emails in all email categories and areas, i.e. the grand total of emails. Returns: number of emails.

getNumEmails public int getNumEmails(EmailCats emailCat)

Gets the total number of emails in the given email category (spam/legitimate). Parameters:

emailCat - email category.

Returns: number of emails.

getNumEmails public int getNumEmails(EmailCats emailCat, Areas area)

Gets total number of emails for the given email category and the area, summed up in all its sub-areas. Parameters:

emailCat - email category. area - email area.

Returns: number of emails.

getNumEmails public int getNumEmails(EmailCats emailCat, Areas area, java.util.HashSet ignore)

Gets the total number of emails for the given email category and the given area, summed up in all sub-areas except those in the ignore list. Parameters:

emailCat - email category. area - email area. ignore - set of sub-areas to be ignored.

Returns: number of emails.

getNumEmails public int getNumEmails(EmailCats emailCat, Areas area, int areaIndex)

Gets the number of emails in the given specific area and for the given email category. Parameters:

emailCat - email category. area - email area. areaIndex - sub-area.

Returns: number of emails.

getNumEmails public int getNumEmails(EmailCats emailCat, java.util.HashMap ignore)

Gets the total number of emails in the given email category by summing up all the areas and sub-areas except those in the ignore list. Parameters:

emailCat - email category. ignore - hash map of ignore areas: (area,HashSet(subarea indices)).

Returns: number of emails.

getNumEmails public int getNumEmails(java.util.HashMap ignore)

Gets the total number of emails in all email categories and areas, i.e. the grand total of emails obtained by summing up all the areas and sub-areas except those in the ignore list. Parameters:

ignore - hash map of ignore areas: (area,HashSet(subarea indices)).

Returns: number of emails.


getNumOccurs public int getNumOccurs()

Gets the total number of occurrences in all email categories and areas, i.e. the grand total of occurrences. Returns: number of occurrences.

getNumOccurs public int getNumOccurs(Areas area)

Gets the total number of occurrences in the given area for all email categories. Parameters: area - email area. Returns: number of occurrences.

getNumOccurs public int getNumOccurs(Areas area, int areaIndex)

Gets the total number of occurrences in the given area and sub-area for all email categories. Parameters: area - email area. areaIndex - sub-area. Returns: number of occurrences.

getNumOccurs public int getNumOccurs(EmailCats emailCat)

Gets the total number of occurrences in the given email category. Parameters:

emailCat - email category.

Returns: number of occurrences.

getNumOccurs public int getNumOccurs(EmailCats emailCat, Areas area)

Gets total number of occurrences for the given email category and given area. Parameters:

emailCat - email category. area - email area.

Returns: number of occurrences.


getNumOccurs public int getNumOccurs(EmailCats emailCat, Areas area, java.util.HashSet ignore)

Gets total number of occurrences for the given area and email category, summed up in all sub-areas except those in the ignore list. Parameters:

emailCat - email category. area - email area. ignore - set of sub-areas to be ignored.

Returns: number of occurrences.

getNumOccurs public int getNumOccurs(EmailCats emailCat, Areas area, int areaIndex)

Gets number of occurrences for the given email category and the given specific area. Parameters:

emailCat - email category. area - email area. areaIndex - sub-area.

Returns: number of occurrences.

getNumOccurs public int getNumOccurs(EmailCats emailCat, java.util.HashMap ignore)

Gets the total number of occurrences in the given email category by summing up all the areas and sub-areas except those in the ignore list. Parameters:

emailCat - email category. ignore - hash map (area, HashSet(subarea indices)) of areas to be ignored. Returns: number of occurrences.

getNumOccurs public int getNumOccurs(java.util.HashMap ignore)

Gets the total number of occurrences in all email categories and areas, i.e. grand total of occurrences, obtained by summing up all the areas and sub-areas except those in the ignore list. Parameters:

ignore - hash map (area, HashSet(subarea indices)) of areas to be ignored. Returns: number of occurrences.

getOccurCounts public Counts getOccurCounts()

Gets the token occurrences counts.

Returns: Counts object.

incEmailsAndOccurs public void incEmailsAndOccurs(EmailCats emailCat, Areas area, int areaIndex, int emailCount, int occurCount)

Increases the email and token occurrence counts for the given email category in the specific area and sub-area by the given amounts. Parameters:

emailCat - email category (SPAM or LEGITIMATE). area - email area. areaIndex - index representing the sub-area in the area. emailCount - email count value to be incremented. occurCount - token occurrences count value to be incremented.

incNumEmails public void incNumEmails(EmailCats emailCat, Areas area, int areaIndex)

Increases the email count in the specific sub-area by 1. Parameters:

emailCat - email category. area - email area. areaIndex - index representing the sub-area in the area.

incNumEmails public void incNumEmails(EmailCats emailCat, Areas area, int areaIndex, int count)

Increases the email count for the given email category in the specific area and sub-area by the given count value. Parameters:

emailCat - email category. area - email area. areaIndex - index representing the sub-area in the area. count - count value to be incremented.

incNumOccurs public void incNumOccurs(EmailCats emailCat, Areas area, int areaIndex)

Increases the occurrence count in the specific sub-area for the given email category by 1.

Parameters:

emailCat - email category. area - email area. areaIndex - sub-area.

incNumOccurs public void incNumOccurs(EmailCats emailCat, Areas area, int areaIndex, int count)

Increase the number of occurrence counts for the given email category and the specific area and sub-area by given count value. Parameters:

emailCat - email category. area - email area. areaIndex - sub-area. count - count value to be incremented.

merge public void merge(Frequencies anotherFreqs)

Merges the frequency values from another Frequencies object. Parameters:

anotherFreqs - Frequencies object to be merged.

print public void print()

Prints frequency values on the console, useful for testing and debugging.

print public void print(java.util.HashMap ignore)

Prints frequency values on the console, with ignore areas. Parameters:

ignore - hash map (area, HashSet(subarea indices)) of areas to be ignored.

Methods inherited from class java.lang.Object clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

C.3.8 GRobinsonBayes java.lang.Object rsspambayes.Algorithm rsspambayes.GRobinsonBayes

class GRobinsonBayes extends Algorithm

GRobinsonBayes class implements Gary Robinson’s spam filtering algorithm with our definition of preprocessing and feature extraction.

Field Detail APPROACH int APPROACH

Approach of computing combined probability estimate.

NUM_INTTOKENS_USE int NUM_INTTOKENS_USE

Number of the most interesting tokens to use in computations.

PPROB_LEGT double PPROB_LEGT

Prior-probability for legitimate.

PPROB_SPAM double PPROB_SPAM

Prior-probability for spam.

PROB_INTTOKEN_OFFSET double PROB_INTTOKEN_OFFSET

Probability offset from neutral 0.5 to be considered for interesting tokens. For example: offset value of 0.45 means above 0.5+0.45=0.95 and below 0.5-0.45=0.05.

s double s

Gary Robinson’s strength of belief parameter.

SPAM_THRESHOLD double SPAM_THRESHOLD

Threshold probability value used to make decision on spam.

x double x

Gary Robinson’s assumed probability for unknown token. Fields inherited from class rsspambayes.Algorithm emailCat, trnStat


Constructor Detail GRobinsonBayes public GRobinsonBayes(Stats stat)

Constructs the algorithm object using given training stats. Parameters: stat - training statistics.

GRobinsonBayes public GRobinsonBayes(Stats stat, java.lang.String parmString)

Constructs the algorithm object using given training stats and custom parameters list. Parameters: stat - training stats data. parmString - comma separated list of key = value pairs of parameters.

Method Detail calcGRobinsonSpamProb private double calcGRobinsonSpamProb(FreqTable emlFTbl)

Calculates spam probabilities using Gary Robinson’s approach. Parameters:

emlFTbl - frequency table representing the email under test.

Returns: spam probability.

calcProb private double calcProb(java.util.List topTokens)

Calculates the spam probability using Gary Robinson's approaches:

1. Geometric-mean formula:
S = 1 - exp(sum(ln(1 - f(w))) / n)
H = 1 - exp(sum(ln(f(w))) / n)
I = (1 + (S - H) / (S + H)) / 2

2. Fisher-Robinson's chi-square method:
S = CInv(-2 * sum(ln(1 - f(w))), 2n)
H = CInv(-2 * sum(ln(f(w))), 2n)
I = (1 + (H - S)) / 2

Parameters:

topTokens - list of top/interesting tokens.

Returns: combined spam probability.
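The geometric-mean combination above can be written out directly; this self-contained sketch (class and method names are illustrative) combines per-token spam probabilities f(w) into the indicator I. The chi-square variant additionally needs an inverse chi-square function (CInv) and is omitted here:

```java
class RobinsonCombineSketch {
    // Geometric-mean combination of per-token spam probabilities f(w):
    //   S = 1 - exp(sum(ln(1 - f(w))) / n)   ("spamminess")
    //   H = 1 - exp(sum(ln(f(w))) / n)       ("hamminess")
    //   I = (1 + (S - H) / (S + H)) / 2
    static double combine(double[] f) {
        double sumLn1mF = 0, sumLnF = 0;
        for (double p : f) {
            sumLn1mF += Math.log(1 - p);
            sumLnF += Math.log(p);
        }
        int n = f.length;
        double S = 1 - Math.exp(sumLn1mF / n);
        double H = 1 - Math.exp(sumLnF / n);
        return (1 + (S - H) / (S + H)) / 2;
    }

    public static void main(String[] args) {
        // Strongly spammy tokens push I towards 1.
        System.out.println(combine(new double[] {0.99, 0.95, 0.90}));
        // All-neutral tokens give exactly 0.5.
        System.out.println(combine(new double[] {0.5, 0.5}));  // prints 0.5
    }
}
```

Working in log space keeps the products of many small probabilities numerically stable.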

classify public Category classify(FreqTable emlFTbl)

Classifies the test email represented by given frequency table.

Specified by:

classify in class Algorithm

Parameters:

emlFTbl - frequency table for the email.

Returns: classification result, category the email belongs to.

computeSpamTokenProb private double computeSpamTokenProb(java.lang.String token, Areas area, int areaIndex)

Computes the f(t) guesstimate value for the token in the given email area using the formula: f(t) = (s * x + n * p(t)) / (s + n). Parameters:

token - token for which to compute spam probability. area - email area. areaIndex - sub-area.

Returns: spam probability for the token.
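The formula is a one-liner; this self-contained sketch (names are illustrative) shows its two limiting behaviours, which is the point of Robinson's adjustment:

```java
class GuesstimateSketch {
    // f(t) = (s*x + n*p(t)) / (s + n), Robinson's degree-of-belief estimate.
    // s = strength of the prior, x = assumed probability for an unseen token,
    // n = number of emails containing the token, p = raw spam probability.
    static double f(double s, double x, int n, double p) {
        return (s * x + n * p) / (s + n);
    }

    public static void main(String[] args) {
        // Unseen token (n = 0) falls back entirely on the prior x.
        System.out.println(f(1.0, 0.5, 0, 0.0));     // prints 0.5
        // With abundant evidence the estimate approaches the raw p.
        System.out.println(f(1.0, 0.5, 1000, 0.9));  // close to 0.9
    }
}
```

This interpolation between the prior x and the data-driven p(t) is what keeps rare tokens from getting extreme probabilities.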

getAllTokensProbs private java.util.HashMap getAllTokensProbs(FreqTable emlFTbl)

Calculates spam probabilities for all tokens in the given email. Parameters:

emlFTbl - frequency table representing the email under test.

Returns: hash map of tokens and corresponding spam probabilities.

getParameters public java.lang.String getParameters()

Description copied from class: Algorithm Gets the comma separated list of key = value pairs of parameters. Specified by:

getParameters in class Algorithm

getTopSpamTokensProbs private java.util.List getTopSpamTokensProbs(java.util.HashMap allTokens)

Gets top/interesting spam tokens, measured by the distance from the neutral probability, i.e. 0.5. Parameters:

allTokens - hash map of all tokens and corresponding spam probabilities.

Returns: list of top/interesting tokens.
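Selecting interesting tokens by distance from the neutral 0.5 can be sketched as follows; this is a self-contained illustration with hypothetical names, not the library's own code:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopTokensSketch {
    // Rank tokens by |p - 0.5|, descending, and keep the n most interesting
    // ones (cf. NUM_INTTOKENS_USE).
    static List<String> topTokens(Map<String, Double> probs, int n) {
        List<String> tokens = new ArrayList<>(probs.keySet());
        tokens.sort((a, b) -> Double.compare(
                Math.abs(probs.get(b) - 0.5), Math.abs(probs.get(a) - 0.5)));
        return tokens.subList(0, Math.min(n, tokens.size()));
    }

    public static void main(String[] args) {
        Map<String, Double> probs = new HashMap<>();
        probs.put("viagra", 0.99);   // strongly spammy
        probs.put("meeting", 0.05);  // strongly legitimate
        probs.put("the", 0.51);      // nearly neutral, dropped
        System.out.println(topTokens(probs, 2));  // prints [viagra, meeting]
    }
}
```

Note that tokens far below 0.5 are just as "interesting" as tokens far above it; both carry strong evidence.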


isValidParameter public static boolean isValidParameter(java.lang.String param)

Checks whether the given text string is a valid parameter for the algorithm. Parameters:

param - parameter to be checked.

Returns: validation test result.

parseParameters void parseParameters(java.util.ArrayList parameters)

Parses parameters from the list of parameters. Specified by:

parseParameters in class Algorithm

Parameters:

parameters - comma separated list of key = value pairs of parameters.

Methods inherited from class rsspambayes.Algorithm setParameters

Methods inherited from class java.lang.Object clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

C.3.9 NaiveBayes java.lang.Object rsspambayes.Algorithm rsspambayes.NaiveBayes class NaiveBayes extends Algorithm

This class implements the Naive Bayes spam filtering algorithm with our definition of preprocessing and feature extraction.

Field Detail APPROACH int APPROACH

Approach of computing combined probability for spam and legitimate.

PPROB_LEGT double PPROB_LEGT

Prior probability of legitimate.


PPROB_SPAM double PPROB_SPAM

Prior probability of spam.

w1 double w1

Weight factor for document based estimation.

w2 double w2

Weight factor for token based estimation.

w3 double w3

Weight factor for the estimation based on average occurrences of token. Fields inherited from class rsspambayes.Algorithm emailCat, trnStat

Constructor Detail NaiveBayes public NaiveBayes(Stats stat)

Constructs the algorithm object using given training stats. Parameters: stat - training statistics.

NaiveBayes public NaiveBayes(Stats stat, java.lang.String parmString)

Constructs the algorithm object using given training stats and custom parameters list. Parameters: stat - training statistics. parmString - comma separated list of key = value pairs of parameters.

Method Detail calcNaiveBayesProb private double calcNaiveBayesProb(FreqTable emlFTbl)

Calculates normalized probabilities for all the categories (spam/legitimate). Parameters:

emlFTbl - frequency table representing the email under test.

Returns: the probability of spam.

calcProb private double calcProb(FreqTable emlFTbl, EmailCats emailCat, double priorProb)

Calculates the logarithm of probability values for the given email category and using supplied prior probability for the category. Parameters:

emlFTbl - frequency table representing the test email. emailCat - email category. priorProb - prior probability.

Returns: logarithm of probability value.

classify public Category classify(FreqTable emlFTbl)

Classifies the test email represented by its frequency table. Specified by: classify in class Algorithm Parameters:

emlFTbl - frequency table for the email.

Returns: classification result, the category to which the email belongs.

computeTokenProb private double computeTokenProb(java.lang.String token, EmailCats emailCat, Areas area, int areaIndex, int te)

Computes the probability for the given token occurring in the given email category and the specified area and sub-area. Parameters:

token - token for which to compute spam probability. area - email area. areaIndex - sub-area. te - token occurrences in the test email.

Returns: spam probability for the token.

getNaiveBTokenProb private double getNaiveBTokenProb(double tokenOccurs, double allTokensOccurs, double vocSize)

Computes the Naive Bayes probability for a token using Tom Mitchell's formula as follows:
if tokenOccurs > 0, i.e. the token is present in the training data:
tokenProb = (1 + tokenOccurs) / (allTokensOccurs + vocSize)
else:
tokenProb = 1 / (allTokensOccurs * vocSize), a constant

Parameters:

tokenOccurs - number of occurrences of the token. allTokensOccurs - total number of occurrences of all tokens. vocSize - size of the vocabulary.

Returns: token probability.
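The Laplace-smoothed estimate described above can be verified with a tiny self-contained sketch (class and method names are illustrative):

```java
class LaplaceSketch {
    // Laplace-smoothed token estimate; when the token never occurred in
    // training, the documented fallback constant is used instead.
    static double tokenProb(double tokenOccurs, double allTokensOccurs, double vocSize) {
        if (tokenOccurs > 0)
            return (1 + tokenOccurs) / (allTokensOccurs + vocSize);
        return 1 / (allTokensOccurs * vocSize);
    }

    public static void main(String[] args) {
        // Seen token: (1 + 4) / (96 + 4) = 0.05
        System.out.println(tokenProb(4, 96, 4));    // prints 0.05
        // Unseen token: 1 / (100 * 10) = 0.001
        System.out.println(tokenProb(0, 100, 10));  // prints 0.001
    }
}
```

Adding 1 to the numerator and the vocabulary size to the denominator guarantees no token ever receives a zero probability, which would otherwise annihilate the product in the Naive Bayes classifier.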

getParameters public java.lang.String getParameters()

Description copied from class: Algorithm Gets the comma separated list of key = value pairs of parameters. Specified by:

getParameters in class Algorithm

isValidParameter public static boolean isValidParameter(java.lang.String param)

Checks whether the given text string is a valid parameter for the algorithm. Parameters:

param - parameter to be checked.

Returns: the result of the check.

parseParameters void parseParameters(java.util.ArrayList parameters)

Parses parameters from the list of parameters. Specified by:

parseParameters in class Algorithm

Parameters:

parameters - comma separated list of key = value pairs of parameters.

Methods inherited from class rsspambayes.Algorithm setParameters

Methods inherited from class java.lang.Object clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait


C.3.10 PGrahamBayes java.lang.Object rsspambayes.Algorithm rsspambayes.PGrahamBayes class PGrahamBayes extends Algorithm

This class implements the Paul Graham’s spam filtering algorithm with our definition of preprocessing and feature extraction.

Field Detail APPROACH int APPROACH

Approach of computing combined probability estimate.

BIAS_TO_LEGT double BIAS_TO_LEGT

Bias multiplier towards legitimate; Paul Graham used 2.

MIN_IGNORE_COUNT int MIN_IGNORE_COUNT

Minimum total token occurrence count required for consideration.

NUM_INTTOKENS_USE int NUM_INTTOKENS_USE

Number of most interesting tokens to use in computations.

PPROB_LEGT double PPROB_LEGT

Prior-probability for legitimate.

PPROB_SPAM double PPROB_SPAM

Prior-probability for spam.

PROB_MAX double PROB_MAX

Max. probability for the token found only in one training set either spam or legitimate.

PROB_UNKNOWNTOKEN double PROB_UNKNOWNTOKEN

Probability for unknown token.


SPAM_THRESHOLD double SPAM_THRESHOLD

Threshold probability value used to make decision on spam.

w1 double w1

Weight factor for document based estimation.

w2 double w2

Weight factor for token based estimation.

w3 double w3

Weight factor for the estimation based on average occurrences of the token. Fields inherited from class rsspambayes.Algorithm emailCat, trnStat

Constructor Detail PGrahamBayes public PGrahamBayes(Stats stat)

Constructs the algorithm object using given training stats. Parameters: stat - training statistics.

PGrahamBayes public PGrahamBayes(Stats stat, java.lang.String parmString)

Constructs the algorithm object using given training stats and custom parameters list. Parameters: stat - training stats data. parmString - comma separated list of key = value pairs of parameters.

Method Detail calcPGrahamSpamProb private double calcPGrahamSpamProb(FreqTable emlFTbl)

Calculates normalized probabilities for all the categories. Parameters:

emlFTbl - frequency table representing the email under test.

Returns: spam probability.

calcProb private double calcProb(java.util.List topTokens)

Calculates the spam probability using Paul Graham’s formula. Parameters:

topTokens - list of top/interesting tokens.

Returns: combined spam probability.

classify public Category classify(FreqTable emlFTbl)

Classifies the test email represented by its frequency table. Specified by:

classify in class Algorithm

Parameters:

emlFTbl - frequency table for the email.

Returns: classification result.

computeSpamTokenProb private double computeSpamTokenProb(java.lang.String token, Areas area, int areaIndex, int te)

Computes the token probability for spam. For a given token in the specified email area, the probability values are calculated as follows:
if (spam.tokenCount(token) == 0 && legt.tokenCount(token) != 0)
spamProb(token) = 1 - PROB_MAX
else if (spam.tokenCount(token) != 0 && legt.tokenCount(token) == 0)
spamProb(token) = PROB_MAX
else if (spam.tokenCount(token) != 0 && legt.tokenCount(token) != 0)
l = BIAS_TO_LEGT * legt.tokenCount(token)
s = spam.tokenCount(token)
spamProb(token) = (s / numSpams) / ((s / numSpams) + (l / numLegts))
else
spamProb(token) = PROB_UNKNOWNTOKEN
legtProb(token) = 1 - spamProb(token)

Parameters:

token - token for which to compute spam probability. area - email area. areaIndex - sub-area. te - token occurrences in the test email.

Returns: spam probability for the token.
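The branch logic above can be sketched as a small Java method. The constant values and the class/method names here are illustrative defaults, not values taken from the thesis:

```java
class TokenProb {
    static final double PROB_MAX = 0.99;          // assumed cap value
    static final double PROB_UNKNOWNTOKEN = 0.4;  // assumed unknown-token value
    static final double BIAS_TO_LEGT = 1.0;       // default bias (see field docs)

    // spamCount/legtCount: occurrences of the token in spam/legitimate training
    // mail; numSpams/numLegts: sizes of the two training sets.
    static double spamProb(int spamCount, int legtCount,
                           int numSpams, int numLegts) {
        if (spamCount == 0 && legtCount != 0)
            return 1.0 - PROB_MAX;            // seen only in legitimate mail
        if (spamCount != 0 && legtCount == 0)
            return PROB_MAX;                  // seen only in spam
        if (spamCount != 0 && legtCount != 0) {
            double l = BIAS_TO_LEGT * legtCount;
            double s = spamCount;
            return (s / numSpams) / ((s / numSpams) + (l / numLegts));
        }
        return PROB_UNKNOWNTOKEN;             // never seen in training
    }
}
```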


getAllTokensProbs private java.util.HashMap getAllTokensProbs(FreqTable emlFTbl)

Calculates spam probabilities for all tokens in the given email. Parameters:

emlFTbl - frequency table representing the email under test.

Returns: hash map of tokens and corresponding spam probabilities.

getParameters public java.lang.String getParameters()

Description copied from class: Algorithm Gets the comma separated list of key = value pairs of parameters. Specified by:

getParameters in class Algorithm

getPGrahamTokenProb private double getPGrahamTokenProb(double bad, double good, double nbad, double ngood)

Calculates spam probability for the token using Paul Graham’s approach. Parameters: bad - bad count. good - good count. nbad - total bad count. ngood - total good count. Returns: spam probability.
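Assuming the method follows Graham’s published formula from “A Plan for Spam”, a sketch looks like this (the doubling of the good count is Graham’s bias against false positives; names are illustrative):

```java
class GrahamProb {
    // bad/good: token counts in spam/legitimate mail;
    // nbad/ngood: total numbers of spam/legitimate mails.
    static double tokenProb(double bad, double good, double nbad, double ngood) {
        double b = Math.min(1.0, bad / nbad);
        double g = Math.min(1.0, 2.0 * good / ngood);   // doubled good count
        double p = b / (b + g);
        return Math.max(0.01, Math.min(0.99, p));       // clamp to [0.01, 0.99]
    }
}
```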

getTopSpamTokensProbs private java.util.List getTopSpamTokensProbs(java.util.HashMap allTokens)

Gets top/interesting spam tokens, measured by the distance from the neutral probability, i.e. 0.5. Parameters:

allTokens - hash map of all tokens and corresponding spam probabilities.

Returns: list of top/interesting tokens.
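Selecting the interesting tokens amounts to sorting by distance from 0.5 and taking the first n entries. A minimal sketch (class and method names are illustrative):

```java
import java.util.*;

class TopTokens {
    // Pick the n tokens whose spam probability is farthest from neutral 0.5.
    static List<Map.Entry<String, Double>> top(Map<String, Double> probs, int n) {
        List<Map.Entry<String, Double>> all = new ArrayList<>(probs.entrySet());
        all.sort((a, b) -> Double.compare(
                Math.abs(b.getValue() - 0.5),    // most interesting first
                Math.abs(a.getValue() - 0.5)));
        return all.subList(0, Math.min(n, all.size()));
    }
}
```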

isValidParameter public static boolean isValidParameter(java.lang.String param)

Checks whether the given string is a valid parameter for the algorithm. Parameters:

param - parameter to be checked.

Returns: result of the check.

parseParameters void parseParameters(java.util.ArrayList parameters)

Parses parameters from the list of parameters. Specified by:

parseParameters in class Algorithm

Parameters:

parameters - comma separated list of key = value pairs of parameters.
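A minimal sketch of how such a comma separated “key = value” parameter string can be parsed; class and method names here are illustrative, not the thesis implementation:

```java
import java.util.*;

class ParamParser {
    // Splits "A = 1, B = 2" into an ordered key/value map, trimming
    // whitespace around keys and values.
    static Map<String, String> parse(String parmString) {
        Map<String, String> map = new LinkedHashMap<>();
        if (parmString == null || parmString.isEmpty()) return map;
        for (String pair : parmString.split(",")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2)
                map.put(kv[0].trim(), kv[1].trim());
        }
        return map;
    }
}
```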

Methods inherited from class rsspambayes.Algorithm setParameters

Methods inherited from class java.lang.Object clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

C.3.11 RShresthaBayes1 java.lang.Object rsspambayes.Algorithm rsspambayes.RShresthaBayes1 class RShresthaBayes1 extends Algorithm

The RShresthaBayes1 implements our algorithm based on co-weighted multi-estimations.

Field Detail APPROACH int APPROACH

Approach of computing combined probability estimate.

BIAS_TO_LEGT double BIAS_TO_LEGT

Bias multiplier to legitimate. Its default value is 1.

NUM_INTTOKENS_USE int NUM_INTTOKENS_USE

Number of the most interesting tokens to use in computations.


PPROB_LEGT double PPROB_LEGT

Prior-probability for legitimate.

PPROB_SPAM double PPROB_SPAM

Prior-probability for spam.

PROB_INTTOKEN_OFFSET double PROB_INTTOKEN_OFFSET

Probability offset from neutral 0.5 beyond which a token is considered interesting. For example, an offset of 0.45 means tokens with probability above 0.5+0.45=0.95 or below 0.5-0.45=0.05 are considered interesting.

s double s

Gary Robinson’s strength of belief parameter.

SPAM_THRESHOLD double SPAM_THRESHOLD

Threshold probability value used to make decision on spam.

w1 double w1

Weight factor for document based estimation.

w2 double w2

Weight factor for token based estimation.

w3 double w3

Weight factor for the estimation based on average occurrences of the token.

x double x

Gary Robinson’s assumed probability for an unknown token. Fields inherited from class rsspambayes.Algorithm emailCat, trnStat


Constructor Detail RShresthaBayes1 public RShresthaBayes1(Stats stat)

Constructs the algorithm object using given training stats. Parameters: stat - training statistics.

RShresthaBayes1 public RShresthaBayes1(Stats stat, java.lang.String parmString)

Constructs the algorithm object using given training stats and custom parameters list. Parameters: stat - training statistics. parmString - comma separated list of key = value pairs of parameters.

Method Detail calcProb private double calcProb(java.util.List topTokens)

Calculates the combined spam probability using the Fisher-Robinson chi-square method: S = CInv(-2*sum(ln(1-f(w))), 2n), H = CInv(-2*sum(ln(f(w))), 2n), I = (1 + (H - S))/2

Parameters:

topTokens - list of top/interesting tokens.

Returns: combined spam probability.
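A runnable sketch of this combining step. The chi2Q helper below is the standard closed-form series for the inverse chi-square with even degrees of freedom and is assumed to behave like Utils.chiSquareInverse; names are illustrative:

```java
class FisherCombine {
    // Inverse chi-square (survival function) for even df:
    //   Q = e^{-m} * sum_{i < df/2} m^i / i!,  with m = chi/2.
    static double chi2Q(double chi, int df) {
        double m = chi / 2.0, term = Math.exp(-m), sum = term;
        for (int i = 1; i < df / 2; i++) {
            term *= m / i;
            sum += term;
        }
        return Math.min(sum, 1.0);
    }

    // I = (1 + (H - S)) / 2, following the formulas quoted above, where
    // f[i] is the per-token spam probability.
    static double indicator(double[] f) {
        double sumLn1mF = 0, sumLnF = 0;
        for (double p : f) {
            sumLn1mF += Math.log(1.0 - p);
            sumLnF += Math.log(p);
        }
        int df = 2 * f.length;
        double S = chi2Q(-2.0 * sumLn1mF, df);
        double H = chi2Q(-2.0 * sumLnF, df);
        return (1.0 + (H - S)) / 2.0;
    }
}
```

Uniformly spammy tokens drive I toward 1, uniformly legitimate ones toward 0, and neutral tokens leave it at exactly 0.5.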

calcRShresthaSpamProb private double calcRShresthaSpamProb(FreqTable emlFTbl)

Calculates probabilities using the R. Shrestha’s co-weighted multi-estimations. Parameters:

emlFTbl - frequency table representing the email under test.

Returns: spam probability.

classify public Category classify(FreqTable emlFTbl)

Classifies the test email represented by the given frequency table. Specified by:

classify in class Algorithm

Parameters:

emlFTbl - frequency table for the email.

Returns: classification result.

computeSpamTokenProb private double computeSpamTokenProb(java.lang.String token, Areas area, int areaIndex, int te)

Computes the f(t) estimate for the token in the given email area using the formula: f(t) = (s*x + n*p(t)) / (s + n). Parameters:

token - token for which to compute spam probability. area - email area. areaIndex - sub-area. te - token occurrences in the test email.

Returns: spam probability for the token.
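The formula above is Gary Robinson’s smoothed estimate and translates directly into code (names are illustrative):

```java
class RobinsonDegree {
    // f(t) = (s*x + n*p) / (s + n), where s is the strength-of-belief
    // parameter, x the assumed probability for unknown tokens, n the number
    // of training emails containing the token, and p the raw estimate.
    static double f(double s, double x, int n, double p) {
        return (s * x + n * p) / (s + n);
    }
}
```

For an unseen token (n = 0) the estimate falls back to x; as n grows, it converges to the raw estimate p.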

getAllTokensProbs private java.util.HashMap getAllTokensProbs(FreqTable emlFTbl)

Calculates spam probabilities for all tokens in the given email. Parameters:

emlFTbl - frequency table representing the email under test.

Returns: hash map of tokens and corresponding spam probabilities.

getParameters public java.lang.String getParameters()

Description copied from class: Algorithm Gets the comma separated list of key = value pairs of parameters. Specified by:

getParameters in class Algorithm

getPGrahamTokenProb private double getPGrahamTokenProb(double bad, double good, double nbad, double ngood)

Calculates spam probability for the token using Paul Graham’s approach. Parameters: bad - bad count. good - good count. nbad - total bad count. ngood - total good count.

Returns: spam probability.

getTopSpamTokensProbs private java.util.List getTopSpamTokensProbs(java.util.HashMap allTokens)

Gets top/interesting spam tokens, measured by the distance from the neutral probability, i.e. 0.5. Parameters:

allTokens - hash map of all tokens and corresponding spam probabilities.

Returns: list of top/interesting tokens.

isValidParameter public static boolean isValidParameter(java.lang.String param)

Checks whether the given is a valid parameter for the algorithm. Parameters:

param - parameter to be checked.

Returns: validation test result.

parseParameters void parseParameters(java.util.ArrayList parameters)

Parses parameters from the list of parameters. Specified by:

parseParameters in class Algorithm

Parameters:

parameters - comma separated list of key = value pairs of parameters.

Methods inherited from class rsspambayes.Algorithm setParameters

Methods inherited from class java.lang.Object clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

C.3.12 RShresthaBayes2 java.lang.Object rsspambayes.Algorithm rsspambayes.RShresthaBayes2 class RShresthaBayes2 extends Algorithm

This algorithm is based on our approach of co-weighted multi-area information. It extends RShresthaBayes1 by giving different weights to a word depending on the area (normal header, subject, body, HTML tags) of the email in which it appears. Unlike the previous algorithms, where the same token appearing in different areas is treated as distinct tokens, here the token is treated as one and the same, and the weighted sum of its area-wise probabilities takes part in the final calculation of the spam probability.
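A minimal sketch of the weighted-sum combination just described. The normalization over the areas the token actually occurs in, and the neutral fallback, are assumptions of this sketch, not details taken from the thesis:

```java
class AreaWeighted {
    // areaProbs[i]: spam probability of the token estimated from area i;
    // weights[i]: the weight assigned to that area.
    static double combined(double[] areaProbs, double[] weights) {
        double wSum = 0, pSum = 0;
        for (int i = 0; i < areaProbs.length; i++) {
            pSum += weights[i] * areaProbs[i];
            wSum += weights[i];
        }
        return wSum == 0 ? 0.5 : pSum / wSum;   // normalize; neutral fallback
    }
}
```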

Field Detail APPROACH int APPROACH

Approach of computing combined probability estimate.

AREA Areas AREA

Email area.

AREAINDEX int AREAINDEX

Index representing sub-area in the area.

AREAMODE int AREAMODE

Defines the area mode: 0 - ONEAREA (use only the area given by AREA and AREAINDEX), 1 - SEPARATE (all areas, treated separately), 2 - COMBINED (all areas combined).

areaWeights double[][] areaWeights

Area-wise weight values.

BIAS_TO_LEGT double BIAS_TO_LEGT

Bias multiplier to legitimate. Its default value is 1.

NUM_INTTOKENS_USE int NUM_INTTOKENS_USE

Number of the most interesting tokens to use in computations.

PPROB_LEGT double PPROB_LEGT

Prior-probability for legitimate.

PPROB_SPAM double PPROB_SPAM

Prior-probability for spam.


PROB_INTTOKEN_OFFSET double PROB_INTTOKEN_OFFSET

Probability offset from neutral 0.5 beyond which a token is considered interesting. For example, an offset of 0.45 means tokens with probability above 0.5+0.45=0.95 or below 0.5-0.45=0.05 are considered interesting.

s double s

Gary Robinson’s strength of belief parameter.

SPAM_THRESHOLD double SPAM_THRESHOLD

Threshold probability value used to make decision on spam.

w1 double w1

Weight factor for document based estimation.

w2 double w2

Weight factor for token based estimation.

w3 double w3

Weight factor for the estimation based on average occurrences of the token.

WT_BD double WT_BD

Weight value for the body. (For manual testing purpose only)

WT_NH double WT_NH

Weight value for the normal header. (For manual testing purpose only)

WT_OH double WT_OH

Weight value for the other headers. (For manual testing purpose only)

WT_SH double WT_SH

Weight value for the subject header. (For manual testing purpose only)

WT_TG double WT_TG

Weight value for the HTML tags. (For manual testing purpose only)


x double x

Gary Robinson’s assumed probability for unknown tokens. Fields inherited from class rsspambayes.Algorithm emailCat, trnStat

Constructor Detail RShresthaBayes2 public RShresthaBayes2(Stats stat)

Constructs the algorithm object using given training stats. Parameters: stat - training statistics.

RShresthaBayes2 public RShresthaBayes2(Stats stat, java.lang.String parmString)

Constructs the algorithm object using given training stats and custom parameters list. Parameters: stat - training statistics. parmString - comma separated list of key = value pairs of parameters.

Method Detail calcProb private double calcProb(java.util.List topTokens)

Calculates the combined spam probability using the Fisher-Robinson chi-square method: S = CInv(-2*sum(ln(1-f(w))), 2n), H = CInv(-2*sum(ln(f(w))), 2n), I = (1 + (H - S))/2

Parameters:

topTokens - list of top/interesting tokens.

Returns: combined spam probability.

calcRShresthaSpamProb private double calcRShresthaSpamProb(FreqTable emlFTbl)

Calculates probabilities using the R. Shrestha’s co-weighted multi-area information. Parameters:

emlFTbl - frequency table representing the email under test.

Returns: spam probability.


classify public Category classify(FreqTable emlFTbl)

Classifies the test email represented by the given frequency table. Specified by:

classify in class Algorithm

Parameters:

emlFTbl - frequency table for the email.

Returns: classification result.

getAllTokensProbs private java.util.HashMap getAllTokensProbs(FreqTable emlFTbl)

Calculates spam probabilities for all tokens in the given email. Parameters:

emlFTbl - frequency table representing the test email.

Returns: hash map of tokens and corresponding spam probabilities.

getAreawiseTokenProb private double getAreawiseTokenProb(java.lang.String token, Areas area, int areaIndex, int te, int approach)

Computes the f(t) estimate for the token in the specified area of the email using the given approach: f(t) = (s*x + n*p(t)) / (s + n). Parameters:

token - token for which to compute spam probability. area - email area. areaIndex - sub-area. te - token occurrences in the test email. approach - estimation approach to use in computation.

Returns: spam probability for the token.

getCombinedAreaTokenProb private java.util.HashMap getCombinedAreaTokenProb(FreqTable emlFTbl, java.lang.String token)

Computes token probability based on co-weighted multi-area estimations. Parameters:

emlFTbl - frequency table representing the test email. token - token for which probability to be computed.

Returns: hash map of token and its spam probability.


getIntToArea private Areas getIntToArea(int areaValue)

Converts an integer area into Areas type. Parameters:

areaValue - integer area value.

Returns: Areas type value corresponding to integer area value.

getOneAreaTokenProb private java.util.HashMap getOneAreaTokenProb(Areas area, int areaIndex, FreqTable emlFTbl, java.lang.String token)

Computes token probability based on the specified area of occurrence. Parameters: area - email area. areaIndex - sub-area. emlFTbl - frequency table representing the test email. token - token for which the probability is to be computed. Returns: hash map of token and its spam probability.

getParameters public java.lang.String getParameters()

Description copied from class: Algorithm Gets the comma separated list of key = value pairs of parameters. Specified by:

getParameters in class Algorithm

Returns: comma separated list of key = value pairs of parameters.

getPGrahamTokenProb private double getPGrahamTokenProb(double bad, double good, double nbad, double ngood)

Calculates spam probability for the token using Paul Graham’s approach. Parameters: bad - bad count. good - good count. nbad - total bad count. ngood - total good count. Returns: spam probability.


getSeparateAreaTokenProbs private java.util.HashMap getSeparateAreaTokenProbs(FreqTable emlFTbl, java.lang.String token)

Computes token probability based on all but separate areas of occurrences. Parameters:

emlFTbl - frequency table representing the test email. token - token for which probability to be computed.

Returns: hash map of token and its spam probability.

getTokenProb private double getTokenProb(java.lang.String token, Areas area, int areaIndex, int te)

Computes the f(t) estimate for the token in the ith email area using the formula: f(t) = (s*x + n*p(t)) / (s + n). Parameters:

token - token for which to compute spam probability. area - email area. areaIndex - sub-area. te - token occurrences in the test email.

Returns: spam probability for the token.

getTopSpamTokensProbs private java.util.List getTopSpamTokensProbs(java.util.HashMap allTokens)

Gets top/interesting spam tokens, measured by the distance from the neutral probability, i.e. 0.5. Parameters:

allTokens - hash map of all tokens and corresponding spam probabilities.

Returns: list of top/interesting tokens.

isValidParameter public static boolean isValidParameter(java.lang.String param)

Checks whether the given string is a valid parameter for the algorithm. Parameters:

param - parameter to be checked.

Returns: result of the check.


parseParameters void parseParameters(java.util.ArrayList parameters)

Parses parameters from the list of parameters. Specified by:

parseParameters in class Algorithm

Parameters:

parameters - list of key = value pairs of parameters.

Methods inherited from class rsspambayes.Algorithm setParameters

Methods inherited from class java.lang.Object clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

C.3.13 Stats java.lang.Object rsspambayes.Stats

All Implemented Interfaces: java.io.Serializable public class Stats extends java.lang.Object implements java.io.Serializable

The Stats class represents the statistical information collected during the training session. It is also used to load the trained data from and save into a data file.

Field Detail areaSizes int[] areaSizes

Email areas and their sub-areas used.

freqTable FreqTable freqTable

Frequency table which stores email and token frequencies.

numLegts public int numLegts

Number of legitimate emails used to train.


numSpams public int numSpams

Number of spam emails used to train.

Constructor Detail Stats public Stats()

Constructs and initializes the object.

Method Detail addEmailStats public void addEmailStats(FreqTable emlFTbl)

Adds the email statistics from frequency table of an email, incrementing email count. Parameters:

emlFTbl - frequency table representing an email whose stats is to be added.

Initialize private void Initialize()

Initializes the stats.

load public void load(java.lang.String file)

Loads trained data from a file. Parameters: file - file name.

load public void load(java.lang.String file, boolean isText)

Loads trained data from either a text file or a binary file. Parameters: file - file name isText - if true, loads from the text file, otherwise from a binary file

loadFromBinary private Stats loadFromBinary(java.lang.String file) throws java.io.IOException, java.lang.ClassNotFoundException

Loads trained data from a binary file. Parameters: file - file name.


Returns: the Stats object. Throws:

java.io.IOException - on any file i/o error. java.lang.ClassNotFoundException - on any object read error.
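Since Stats implements java.io.Serializable, the binary load/save presumably relies on Java object serialization. A minimal round-trip sketch (class and field names are illustrative; unlike the thesis methods, this sketch rethrows I/O errors unchecked for brevity):

```java
import java.io.*;

class BinaryStatsDemo implements Serializable {
    int numSpams, numLegts;

    // Write the object to a binary file via ObjectOutputStream.
    static void save(BinaryStatsDemo s, String file) {
        try (ObjectOutputStream out =
                 new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(s);
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    // Read the object back via ObjectInputStream.
    static BinaryStatsDemo load(String file) {
        try (ObjectInputStream in =
                 new ObjectInputStream(new FileInputStream(file))) {
            return (BinaryStatsDemo) in.readObject();
        } catch (IOException e) { throw new UncheckedIOException(e); }
        catch (ClassNotFoundException e) { throw new RuntimeException(e); }
    }

    // Save to a temp file and load again, for demonstration.
    static BinaryStatsDemo roundTrip(BinaryStatsDemo s) {
        try {
            File f = File.createTempFile("stats", ".bin");
            f.deleteOnExit();
            save(s, f.getPath());
            return load(f.getPath());
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }
}
```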

loadFromText private void loadFromText(java.lang.String file) throws java.io.IOException

Loads trained data from a text file. Parameters: file - file name. Throws:

java.io.IOException - on any file i/o error.

save public void save(java.lang.String file)

Saves the training data into a file. Parameters: file - file name.

save public void save(java.lang.String file, boolean isText)

Saves stats data into either a text or binary file. Parameters: file - file name isText - if true, saves into text file, otherwise saves into a binary file

saveToBinary private void saveToBinary(java.lang.String file) throws java.io.IOException

Saves training data into a binary file. Parameters: file - file name Throws:

java.io.IOException - on any error on file i/o.

saveToText private void saveToText(java.lang.String file) throws java.io.IOException

Saves training data into a text file.

Parameters: file - file name Throws:

java.io.IOException - on any error on file i/o.

Methods inherited from class java.lang.Object clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

C.3.14 Tester java.lang.Object rsspambayes.Tester class Tester extends java.lang.Object

The Tester class is the application used for parameter tuning and testing the filter application.

Field Detail classifier Classifier classifier

Classifier object used for classifying emails.

confMatrix int[][] confMatrix

Contingency/Confusion matrix to store test result.

falsePs java.lang.String falsePs

False positive emails.

LEGT final int LEGT

Index value for legitimate in confusion matrix.

outStream java.io.PrintWriter outStream

Output stream to write the test result.

SPAM final int SPAM

Index value for spam in confusion matrix.


stem final boolean stem

Flag to use or not to use stemming. By default it is false.

TOTAL final int TOTAL

Index value for totals in confusion matrix.

UNDT final int UNDT

Index value for undetermined in confusion matrix.

Constructor Detail Tester public Tester()

Constructs the default Tester.

Tester public Tester(java.lang.String measureWts, java.lang.String corpusWts)

Constructs Tester using given measure and corpus weights for analysis based on corpus and measure weights. Parameters:

measureWts - weights of different performance measures. corpusWts - weights for different corpus.

Method Detail computeMeasures private double computeMeasures(java.lang.String varParamValue)

Computes all the test measures: accuracy, error, spam recall, spam precision, false positives and false negatives, and saves the result into the report file. Parameters:

varParamValue - variable parameter value with which the test is to be performed.

Returns: combined weighted measure.

getAlgorithm private static Algorithms getAlgorithm(java.lang.String salg)

Converts from text algorithm name to Algorithms type. Parameters: salg - algorithm code (nb|pb|rb|sb1|sb2).

Returns: corresponding Algorithms type value.

getSummedCombMeasure private double[] getSummedCombMeasure(double[] sum, double[] measure, int weight)

Computes the weighted sum of combined measures. Parameters: sum - previous sum values. measure - performance measure values. weight - weight value. Returns: summed combined measures.

main public static void main(java.lang.String[] args)

Main program entry point for the Tester application. Usage: Tester algorithms datasets [fxdparameters [varparameters [description [outfile [stopwordsfile]]]]]

Parameters: args - command line arguments for the application.

printHeadings private void printHeadings(java.lang.String fixedParams, java.lang.String[] varParams)

Prints heading labels in the test report output file. Parameters:

fixedParams - comma separated list of fixed parameters. varParams - comma separated list of variable parameters.

test private void test(java.lang.String testDataPath, Algorithms algorithm, java.lang.String parameters)

Performs tests on all spam and legitimate emails in the given path using the given algorithm and custom parameters. Parameters:

testDataPath - path containing test data set. algorithm - to be used. parameters - comma separated list of key = value pairs of custom parameters.


testAll public void testAll(java.lang.String[][] dataSets, Algorithms[] algorithms, java.lang.String desc, java.lang.String fixedParams, java.lang.String[][] varParams, double[][] varParamValues, java.lang.String resultFile)

Tests all emails for given datasets using the algorithms and parameters supplied. Parameters: dataSets - list of datasets in the format: algorithms - comma separated list of algorithms like nb,pb,gb,sb1,sb2. desc - description of the test. fixedParams - comma separated list of custom key = value pairs of fixed parameters. varParams - comma separated list of variable parameters. varParamValues - values of the corresponding variable parameters in varParams in the format: start_value:final_value:step_value. resultFile - file name to save the result of the test.

testFiles private void testFiles(java.lang.String path, int type, Algorithms algorithm, java.lang.String parameters)

Performs tests on spam or legitimate emails, depending upon the type, in the given path using the specified algorithm and custom parameters. Parameters: path - path containing test data set. type - type (spam or legitimate) of emails to test. algorithm - to be used. parameters - comma separated list of key = value pairs of custom parameters.

testForMultipleVarParams private double[] testForMultipleVarParams(java.lang.String testDataPath, Algorithms algorithm, java.lang.String fixedParams, java.lang.String[] varParams, double[] wtConds)

Tests for more than one variable parameter changing simultaneously. Parameters: testDataPath - data path containing test dataset. algorithm - algorithm to be used in testing. fixedParams - fixed parameters. varParams - variable parameters. wtConds - array of three weighting conditions: wtConds[0] - start value, wtConds[1] - total value, wtConds[2] - step value. Returns: combined measures.


testForSingleVarParams private double[] testForSingleVarParams(java.lang.String testDataPath, Algorithms algorithm, java.lang.String fixedParams, java.lang.String varParams, double[] varParamValues)

Tests for single variable parameter to vary. Parameters: testDataPath - data path containing test dataset. algorithm - algorithm to be used in testing. fixedParams - fixed parameters. varParams - variable parameters. varParamValues - contains starting value, final value and step value in order. Returns: combined measures. Methods inherited from class java.lang.Object clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

C.3.15 Tokenizer java.lang.Object rsspambayes.Tokenizer class Tokenizer extends java.lang.Object

The Tokenizer class implements the preprocessing and tokenization of emails.

Field Detail freqTable FreqTable freqTable

Frequency table representing the email being tokenized.

inputStream java.io.BufferedReader inputStream

Input stream to read email file.

MAX_TOKEN_LENGTH static final int MAX_TOKEN_LENGTH

Maximum token length to be considered. Default value is 55.


stopWords java.util.HashSet stopWords

Set of stop words.

toStem boolean toStem

Flag to indicate whether to use stemming or not.

Constructor Detail Tokenizer public Tokenizer()

Constructs default tokenizer.

Tokenizer public Tokenizer(java.util.HashSet stopWords, boolean toStem)

Constructs tokenizer with stop words and stemming options. Parameters:

stopWords - set of stop words to be ignored. toStem - flag indicating whether to use stemming or not.

Method Detail containsDomainTitle private boolean containsDomainTitle(java.lang.String text)

Checks whether given text contains any domain names like “hnu.net”, “microsoft.com”. Parameters: text - block of text. Returns: result of the check.

containsHTML private boolean containsHTML(java.lang.String text)

Checks whether given text contains html tag(s). Parameters: text - string of text to check. Returns: result of the check.

headerStart private boolean headerStart(java.lang.String lineText)

Checks whether a line of text is the start of an email header.

Parameters:

lineText - text string.

Returns: true if it is a start of a header, false otherwise.

insertToFreqTable public void insertToFreqTable(java.lang.String token, EmailCats emailCat, Areas area, int areaIndex)

Inserts into the frequency table the frequency information for the given token occurring in the specified area of an email belonging to the given category. Parameters:

token - token value. emailCat - email category. area - email area. areaIndex - sub-area.

isHeader private boolean isHeader(java.lang.String token)

Checks whether the token is a header label. Parameters:

token - token to be tested.

Returns: true if the token is a header label, false otherwise.

isIgnoredSingleChar private boolean isIgnoredSingleChar(java.lang.String token)

Checks if the token is an ignorable single character like ‘,’ ‘.’, ‘:’ etc. Parameters:

token - token to be checked.

Returns: result of the check.

isNumeric private boolean isNumeric(java.lang.String text)

Checks whether the input string contains valid numeric data (digits and a decimal point). Parameters: text - block of text. Returns: result of the check.


isTooLong private boolean isTooLong(java.lang.String token)

Checks if a token is too long. A token longer than MAX_TOKEN_LENGTH characters is considered too long. Parameters:

token - token to be checked.

Returns: result of the check.

removeEncodedText private java.lang.String removeEncodedText(java.lang.String text)

Removes binary/encoded text (like attachments etc.), if any. Parameters: text - string of text. Returns: text with binary text removed.

removeHeaderLabel private java.lang.String removeHeaderLabel(java.lang.String text)

Removes header label (first token ending with ‘:’) from the text. Parameters: text - string of text. Returns: text without header label.

removeTagAttributes private java.lang.String removeTagAttributes(java.lang.String text, java.lang.String wordsToRemove)

Removes all words in the given wordlist from the text. It can be used, for example, to remove tag names and their attributes. Parameters: text - block of text. wordsToRemove - comma separated list of words to be removed. Returns: remaining text.

removeTimes private java.lang.String removeTimes(java.lang.String text)

Removes all date and time values from the block of text. Parameters: text - block of text. Returns: remaining text after dates and times removed.


tokenize public FreqTable tokenize(java.lang.String fname, EmailCats emailCat) throws java.io.IOException

Tokenizes an email file belonging to the given category. Parameters:

fname - email file name. emailCat - email category (spam or legitimate).

Returns: frequency table representing the email. Throws:

java.io.IOException - on any error during reading email file.

tokenize public static java.util.ArrayList tokenize(java.lang.String file, java.lang.String delimiters) throws java.io.IOException

Tokenizes a stop word file containing stop words separated by given delimiters. Parameters: file - stop word file. delimiters - delimiter(s) used in separating stop words. Returns: list of stop words. Throws:

java.io.IOException – on any i/o error.

tokenizeBody private void tokenizeBody(EmailCats emailCat) throws java.io.IOException

Tokenizes body area of the email belonging to the given category. Parameters:

emailCat - email category.

Throws:

java.io.IOException - on any i/o error.

tokenizeDomainNames private java.lang.String tokenizeDomainNames(java.lang.String text, EmailCats emailCat, Areas area, int areaIndex)

Tokenizes domain names. It collects all domain names (like msn.com.uk) in the text and, for each domain name, finds the combinations of its parts (like msn.com.uk, msn.com, com.uk, msn, com, uk) and adds them as additional tokens.

Parameters: text - block of text. emailCat - email category. area - email area. areaIndex - sub-area. Returns: the remaining text after removing all domain names.
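Generating those part combinations is a matter of taking every contiguous run of the dot-separated parts. A sketch (names are illustrative):

```java
import java.util.*;

class DomainCombos {
    // All contiguous part combinations of a domain name, longest first:
    // "msn.com.uk" -> msn.com.uk, msn.com, com.uk, msn, com, uk.
    static List<String> combos(String domain) {
        String[] parts = domain.split("\\.");
        List<String> out = new ArrayList<>();
        for (int len = parts.length; len >= 1; len--)
            for (int i = 0; i + len <= parts.length; i++)
                out.add(String.join(".",
                        Arrays.copyOfRange(parts, i, i + len)));
        return out;
    }
}
```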

tokenizeHeaders private void tokenizeHeaders(EmailCats emailCat) throws java.io.IOException

Tokenizes headers of the email. Parameters:

emailCat - email category.

Throws:

java.io.IOException - on any i/o error.

tokenizeHtmlTags private java.lang.String tokenizeHtmlTags(java.lang.String text, EmailCats emailCat)

Tokenizes text within HTML tags occurring in the email. Parameters: text - blocks of text containing HTML tags. emailCat - email category. Returns: remaining text with all HTML tags and the text within them removed.

tokenizeIPs private java.lang.String tokenizeIPs(java.lang.String text, EmailCats emailCat, Areas area, int areaIndex)

Tokenizes IP addresses from the given text. Parameters: text - block of text. emailCat - email category. area - email area. areaIndex - sub-area. Returns: the remaining text after removing all IPs.
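A hypothetical sketch of extracting IPv4 addresses from a block of text with a regular expression; the pattern and names are assumptions of this sketch, not the thesis implementation:

```java
import java.util.*;
import java.util.regex.*;

class IpTokens {
    // Simple dotted-quad pattern; does not range-check each octet.
    static final Pattern IP =
        Pattern.compile("\\b(?:\\d{1,3}\\.){3}\\d{1,3}\\b");

    static List<String> extract(String text) {
        List<String> ips = new ArrayList<>();
        Matcher m = IP.matcher(text);
        while (m.find())
            ips.add(m.group());   // each matched IP becomes a token
        return ips;
    }
}
```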


tokenizeMoneys private java.lang.String tokenizeMoneys(java.lang.String text, EmailCats emailCat, Areas area, int areaIndex)

Tokenizes money values. Pure numbers are ignored, but numbers with currency symbols (‘$’, ‘€’, ‘£’, ‘¥’) and commas (‘,’), like $10,000, are retained. Parameters: text - block of text. emailCat - email category. area - email area. areaIndex - sub-area. Returns: remaining text after removing all money values.

tokenizeSpecials public java.lang.String tokenizeSpecials(java.lang.String text, EmailCats emailCat, Areas area, int areaIndex)

Tokenizes special tokens like IPs, domain names, money values. Parameters: text - block of text. emailCat - email category. area - email area. areaIndex - sub-area. Returns: remaining text after removing all special tokens.

tokenizeText private void tokenizeText(java.lang.String text, EmailCats emailCat, Areas area, int areaIndex)

Tokenizes the given text with the predefined delimiters and updates the frequency table according to the area of occurrence of the token. Parameters: text - block of text. emailCat - email category. area - email area. areaIndex - sub-area. Methods inherited from class java.lang.Object clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait


C.3.16 Trainer java.lang.Object rsspambayes.Trainer public class Trainer extends java.lang.Object

The Trainer class is the trainer application used to train the spam filter with the given training dataset.

Field Detail dataFile java.lang.String dataFile

Training data file.

stats Stats stats

Training stats data.

stem final boolean stem

Flag to indicate whether or not to use stemming. By default it is false.

stopFile java.lang.String stopFile

File containing list of stop words.

stopWords java.util.HashSet stopWords

Set of stop words.

tokenizer Tokenizer tokenizer

Tokenizer object used to tokenize emails.

Constructor Detail Trainer public Trainer(java.lang.String stpFile, java.lang.String datFile)

Constructs Trainer with stop words and sets data output file. Parameters:

stpFile - file name containing stop words. datFile - file name of the output data.


Method Detail getStats private FreqTable getStats(java.lang.String emailFile, EmailCats emailCat)

Gets the frequency stats for the given email belonging to a specified category. Parameters:

emailFile - email file name. emailCat - email category.

Returns: frequency table representing the email.

main public static void main(java.lang.String[] args)

Main program entry point for the Trainer application. Usage: Trainer path [datafile [stopwordsfile]]

Parameters: args - command line arguments for the application.

saveStats public void saveStats()

Saves the training stats into data output file.

train public void train(java.util.ArrayList fileList, EmailCats emailCat)

Trains from the given list of email files, all belonging to a defined category. Parameters:

fileList - list of email files. emailCat - email category.

train public void train(java.lang.String fileName, EmailCats emailCat)

Trains the filter with the given email file belonging to specified category. Parameters:

fileName - email file name. emailCat - email category to which the email belongs.

Methods inherited from class java.lang.Object clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait


C.3.17 Utils java.lang.Object rsspambayes.Utils public class Utils extends java.lang.Object

The Utils class defines all utility methods that can be used in other classes. All methods of this class are static.

Constructor Detail Utils public Utils()

Constructs default Utils object.

Method Detail arrayToString public static java.lang.String arrayToString(java.lang.String[] a, char sepchar)

Converts an array of strings to one string. Elements are separated by the given separator character. Parameters: a - array of strings. sepchar - character to separate strings. Returns: combined string.

chiSquareInverse public static double chiSquareInverse(double chi, double df)

Computes Fisher’s inverse chi-square value. Parameters: chi - value of chi. df - degrees of freedom; must be even (which always holds, since the caller passes 2*n). Returns: inverse chi-square value.
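For orientation: with even degrees of freedom, the inverse chi-square used in Fisher-style probability combining reduces to a partial sum of Poisson terms. The thesis source is not reproduced here; a re-implementation might compute it as follows (class name FisherChi is illustrative):

```java
public class FisherChi {
    /** Fisher's inverse chi-square: the probability of a chi-square
     *  statistic at least this extreme arising by chance. For even df
     *  this is sum over i < df/2 of e^{-m} * m^i / i!, where m = chi/2. */
    public static double chiSquareInverse(double chi, double df) {
        double m = chi / 2.0;
        double term = Math.exp(-m);   // first term: e^{-m}
        double sum = term;
        for (int i = 1; i < (int) (df / 2); i++) {
            term *= m / i;            // next term: e^{-m} * m^i / i!
            sum += term;
        }
        return Math.min(sum, 1.0);    // clamp rounding error above 1
    }

    public static void main(String[] args) {
        // chi = 0 means nothing unusual was observed: probability 1.
        System.out.println(chiSquareInverse(0.0, 4));
    }
}
```

A very large chi with small df yields a probability near zero, which is what lets the combined score separate spam-like from legitimate-like token evidence.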

copyFile public static boolean copyFile(java.io.File src, java.io.File dst)

Copies a file from one place to another.

Parameters: src - source file. dst - destination file.

copyFiles public static void copyFiles(java.util.ArrayList files, java.lang.String destPath)

Copies given list of files into the specified destination folder. Parameters:

files - list of files to be copied. destPath - destination path to which files are to be copied.

formatDouble public static java.lang.String formatDouble(double value, int min_max)

Formats a double value with the specified number of decimal places. Parameters:

value - double value to be formatted. min_max - number of decimal places.

Returns: formatted string.

formatDouble public static java.lang.String formatDouble(double value, int min, int max)

Formats a double value with the minimum of ‘min’ and the maximum of ‘max’ number of decimal places. Parameters:

value - double value to be formatted. min - minimum number of decimal places. max - maximum number of decimal places.

Returns: formatted string.
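A plausible re-implementation of this overload using java.text.NumberFormat is sketched below. The actual thesis implementation may differ; Locale.US is assumed here to pin the decimal separator, and the class name DoubleFmt is illustrative.

```java
import java.text.NumberFormat;
import java.util.Locale;

public class DoubleFmt {
    /** Formats a value with between min and max decimal places.
     *  NumberFormat pads up to min digits and rounds (half-even by
     *  default) beyond max digits. */
    public static String formatDouble(double value, int min, int max) {
        NumberFormat nf = NumberFormat.getNumberInstance(Locale.US);
        nf.setMinimumFractionDigits(min);
        nf.setMaximumFractionDigits(max);
        nf.setGroupingUsed(false); // no thousands separators
        return nf.format(value);
    }

    public static void main(String[] args) {
        System.out.println(formatDouble(3.14159, 2, 4)); // 3.1416
        System.out.println(formatDouble(2.5, 2, 4));     // 2.50
    }
}
```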

getAllFiles public static java.util.ArrayList getAllFiles(java.lang.String path)

Gets all files in the given path. Parameters: path - path from which files are to be listed. Returns: list of files.


getFilesRecurs private static java.util.ArrayList getFilesRecurs(java.io.File aStartingDir) throws java.io.FileNotFoundException

Recursively walks a directory tree and returns a list of all files found under the given starting directory. Parameters:

aStartingDir - a valid directory, which can be read.

Returns: list of files. Throws: java.io.FileNotFoundException - if the given directory cannot be found.

getMinPos public static int getMinPos(double[] array)

Gets the index (position) of the minimum value in the given array of decimal values. Parameters:

array - list of decimal values.

Returns: index position of the minimum value in the array.

getTokenCombinations public static java.util.ArrayList getTokenCombinations (java.lang.String text, char delimiter)

Takes a token text like “microsoft.com.np” and returns the sub-token combinations: microsoft.com.np, microsoft.com, com.np, microsoft, com, np. Parameters: text - token text. delimiter - delimiter character separating sub-tokens. Returns: list of all possible sub-token combinations.
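The listed combinations are exactly the contiguous runs of sub-tokens, longest first. A hypothetical iterative re-implementation (the thesis code uses the recursive helper getTokenCombRecurs instead) could look like:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class TokenCombos {
    /** Returns every contiguous run of sub-tokens, longest runs first,
     *  e.g. "microsoft.com.np" yields microsoft.com.np, microsoft.com,
     *  com.np, microsoft, com, np. */
    public static List<String> tokenCombinations(String text, char delimiter) {
        String[] parts = text.split(Pattern.quote(String.valueOf(delimiter)));
        List<String> result = new ArrayList<>();
        for (int len = parts.length; len >= 1; len--) {      // run length
            for (int start = 0; start + len <= parts.length; start++) {
                StringBuilder sb = new StringBuilder(parts[start]);
                for (int k = start + 1; k < start + len; k++) {
                    sb.append(delimiter).append(parts[k]);   // rejoin the run
                }
                result.add(sb.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokenCombinations("microsoft.com.np", '.'));
        // [microsoft.com.np, microsoft.com, com.np, microsoft, com, np]
    }
}
```

Generating these sub-token combinations lets a partial domain such as "com.np" accumulate evidence even when the full host name was never seen in training.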

getTokenCombRecurs private static java.util.ArrayList getTokenCombRecurs (java.util.ArrayList tokens, int i, char delimiter)

Helper recursive procedure to obtain all possible token combinations. Parameters:

tokens - list of tokens. i - index position, which starts from zero. delimiter - delimiter character separating sub-tokens.

Returns: list of all sub-tokens.


getValueCombinations public static java.util.ArrayList getValueCombinations(int numVars, double total, double step)

Gets all possible combinations of variable values for the given number of variables, with each variable starting from zero. Parameters:

numVars - number of variables. total - conditional sum of the values of all parameters. step - incremental step of changing values.

Returns: list of all possible combinations of values.

getValueCombinations public static java.util.ArrayList getValueCombinations(int numVars, double start, double total, double step)

Gets all possible combinations of variable values for the given number of variables. Parameters:

numVars - number of variables. start - start value. total - sum of the values of all variables, the limiting condition on the variable values. step - incremental step of changing values.

Returns: list of all possible combinations of values.

getValueCombRecurs private static java.util.ArrayList getValueCombRecurs(double[] values, double start, double total, double step, int index, double sumSoFar)

Computes all possible combinations of values (summing to ‘total’) in increments of ‘step’. Parameters:

values - array of length equal to the number of combination variables (initial contents are irrelevant), e.g. wts[] = {0,0}. total - condition that the sum of all variable values equals this ‘total’. step - step increment by which values vary. index - current index in the array for the recursive computation. sumSoFar - sum of the combination values from array index 0 up to ‘index’.

Returns: array list of all combinations like: {{0.0,1.0},{0.2,0.8},{0.4,0.6}......{1.0,0.0}}
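To make the contract concrete, here is a sketch that generates such combinations. It uses integer step counts internally to avoid floating-point drift; the thesis version recurses over a double[] directly, so class and helper names here are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

public class ValueCombos {
    /** All combinations of numVars non-negative values, in increments
     *  of 'step', whose sum equals 'total',
     *  e.g. (2, 1.0, 0.2) -> {0.0,1.0},{0.2,0.8},...,{1.0,0.0}. */
    public static List<double[]> valueCombinations(int numVars, double total, double step) {
        int totalSteps = (int) Math.round(total / step);
        List<double[]> result = new ArrayList<>();
        recurse(new int[numVars], 0, totalSteps, step, result);
        return result;
    }

    private static void recurse(int[] counts, int index, int remaining,
                                double step, List<double[]> result) {
        if (index == counts.length - 1) {
            counts[index] = remaining;        // last variable takes the rest
            double[] values = new double[counts.length];
            for (int i = 0; i < counts.length; i++) values[i] = counts[i] * step;
            result.add(values);
            return;
        }
        for (int c = 0; c <= remaining; c++) { // try every count for this slot
            counts[index] = c;
            recurse(counts, index + 1, remaining - c, step, result);
        }
    }

    public static void main(String[] args) {
        for (double[] v : valueCombinations(2, 1.0, 0.2))
            System.out.println(v[0] + " " + v[1]);
    }
}
```

Such an enumeration is what a grid search over co-weighting parameters (weights constrained to sum to 1) would iterate over.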


listToString public static java.lang.String listToString(java.util.List a, char sepchar)

Converts an array list of strings to one string with elements separated by the given separator character. Parameters: a - array list of elements (strings). sepchar - character to separate elements. Returns: combined string.

moveFiles public static void moveFiles(java.util.ArrayList files, java.lang.String destPath)

Moves the given list of files into the specified destination folder. Parameters:

files - list of files to be moved. destPath - destination path to which files are to be moved.

readChar public static char readChar()

Reads a character from the console. Returns: character read.

readInt public static int readInt()

Reads an integer value from the console. Returns: integer read.

readLine public static java.lang.String readLine()

Reads a line of text from the console. Returns: the text string read.

readParameters public static java.util.ArrayList readParameters(java.io.File parmFile)

Reads a list of key = value parameter pairs line-by-line from the given file and forms an array list of parameters. Parameters:

parmFile - parameter file.

Returns: parameters list.

readParameters public static java.util.ArrayList readParameters (java.lang.String parmString)

Reads list of parameters from the given string. Parameters:

parmString - key = value pair of parameters (separated by ‘,’ ‘;’ ‘\t’ ‘\n’ or ‘ ’)

Returns: the list of parameters.

readParameters public static java.util.ArrayList readParameters (java.lang.String parmString, char sepchar)

Reads a list of parameters separated by the given separator character from the given string. The values are assumed to be numeric. Parameters: parmString - string containing key = value pairs of parameters. sepchar - separator character. Returns: list of parameters.

readParameters public static java.util.ArrayList readParameters (java.lang.String parmString, char sepchar, boolean isNumValue)

Reads list of parameters separated by given separator character from the given string. It also takes an argument identifying value type. Parameters:

parmString - string containing key = value pair of parameters. sepchar - separator character. isNumValue - flag to indicate whether the value is numeric or not.

Returns: list of parameters.
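A minimal sketch of the key = value splitting these overloads perform is below. The class name ParamParser is hypothetical, and the thesis version additionally interprets values as numeric or string per the isNumValue flag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ParamParser {
    /** Splits "key = value" pairs on the given separator character and
     *  returns them as String[2] {key, value} entries. */
    public static List<String[]> readParameters(String parmString, char sepchar) {
        List<String[]> params = new ArrayList<>();
        for (String pair : parmString.split(Pattern.quote(String.valueOf(sepchar)))) {
            int eq = pair.indexOf('=');
            if (eq < 0) continue;                       // skip malformed entries
            String key = pair.substring(0, eq).trim();  // text before '='
            String value = pair.substring(eq + 1).trim(); // text after '='
            params.add(new String[] { key, value });
        }
        return params;
    }

    public static void main(String[] args) {
        for (String[] p : readParameters("alpha = 0.5; beta = 0.3", ';'))
            System.out.println(p[0] + " -> " + p[1]);
        // alpha -> 0.5 / beta -> 0.3
    }
}
```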

renameFiles public static void renameFiles(java.lang.String path, java.lang.String prefix)

Renames all files in the path with the given prefix string. It is useful in preparing datasets. Parameters: path - files path. prefix - prefix string.


validateDirectory private static void validateDirectory(java.io.File aDirectory) throws java.io.FileNotFoundException

Checks whether the given path exists, is a directory, and is readable. Parameters:

aDirectory - directory to check.

Throws:

java.io.FileNotFoundException - if the given directory is not found, is unreadable, or is not a directory.

Methods inherited from class java.lang.Object clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

C.4 Enum Details Details of all the enumerated types are outlined in this section in alphabetic order.

C.4.1 Algorithms java.lang.Object java.lang.Enum rsspambayes.Algorithms

All Implemented Interfaces: java.io.Serializable, java.lang.Comparable enum Algorithms extends java.lang.Enum

This enumeration defines all Bayesian spam filtering algorithms supported by the application.

Enum Constant Detail GROBN_B public static final Algorithms GROBN_B

Gary Robinson’s algorithm.

NAIVE_B public static final Algorithms NAIVE_B

Naive Bayes algorithm.

PAULG_B public static final Algorithms PAULG_B

Paul Graham’s algorithm.


RSTHA1_B public static final Algorithms RSTHA1_B

RShrestha’s (i.e., my) algorithm based on co-weighted multi-estimations.

RSTHA2_B public static final Algorithms RSTHA2_B

RShrestha’s algorithm based on co-weighted multi-area information.

C.4.2 Areas java.lang.Object java.lang.Enum rsspambayes.Areas

All Implemented Interfaces: java.io.Serializable, java.lang.Comparable enum Areas extends java.lang.Enum

Defines the three major areas of email.

Enum Constant Detail BODY public static final Areas BODY

Body area of the email.

HEADER public static final Areas HEADER

Headers area of the email.

HTMLTAG public static final Areas HTMLTAG

Html tags area of the email.

C.4.3 EmailCats java.lang.Object java.lang.Enum rsspambayes.EmailCats

All Implemented Interfaces: java.io.Serializable, java.lang.Comparable enum EmailCats extends java.lang.Enum

Defines possible email categories.


Enum Constant Detail LEGT public static final EmailCats LEGT

Legitimate category.

SPAM public static final EmailCats SPAM

Spam category.

UNDF public static final EmailCats UNDF

Undefined category, included for future extension of the application.

C.4.4 Headers java.lang.Object java.lang.Enum rsspambayes.Headers

All Implemented Interfaces: java.io.Serializable, java.lang.Comparable enum Headers extends java.lang.Enum

Defines major headers considered by the filter.

Enum Constant Detail NHEADERS public static final Headers NHEADERS

Normal headers, comprising “From”, “Reply-To”, “Return-Path”, “To” and “Cc”.

OTHERS public static final Headers OTHERS

All other headers.

SUBJECT public static final Headers SUBJECT

Subject header.

C.4.5 HtmlTags java.lang.Object java.lang.Enum rsspambayes.HtmlTags

All Implemented Interfaces: java.io.Serializable, java.lang.Comparable enum HtmlTags extends java.lang.Enum

Defines categories of html tags.

Enum Constant Detail A public static final HtmlTags A

The <a> (anchor) tag.

COMMENT public static final HtmlTags COMMENT

Html comment tag.

FONT public static final HtmlTags FONT

The <font> tag.

IMG public static final HtmlTags IMG

The <img> tag.

OTHERS public static final HtmlTags OTHERS

All other html tags.

C.4.6 Method Detail for All Enum Types Methods of all enum types are the same. Details on those common methods are outlined below.

Method Detail valueOf public static HtmlTags valueOf(java.lang.String name)

Returns the enum constant of this type with the specified name. The string must match exactly an identifier used to declare an enum constant in this type. (Extraneous white space characters are not permitted.) Parameters: name - the name of the enum constant to be returned.

Returns: the enum constant with the specified name. Throws:

java.lang.IllegalArgumentException - if this enum type has no constant

with the specified name

values public static final HtmlTags[] values()

Returns an array containing the constants of this enum type, in the order they’re declared. This method may be used to iterate over the constants as follows: for(HtmlTags c : HtmlTags.values()) System.out.println(c);

Returns: an array containing the constants of this enum type, in the order they’re declared.

Methods inherited from class java.lang.Enum clone, compareTo, equals, getDeclaringClass, hashCode, name, ordinal, toString, valueOf

Methods inherited from class java.lang.Object finalize, getClass, notify, notifyAll, wait, wait, wait

Appendix D List of Papers Published

1. Shrestha Raju, Lin Yaping, Chen Zhiping: Bayesian Spam Filtering Based on Co-weighting Multi-estimations. In: Proceedings of The International Symposium on Intelligent Computation and its Applications (ISICA-2005), China University of Geosciences (2005), Wuhan, China, Pages 500-508.

I attended the conference and presented the paper in its session.


2. Shrestha Raju, Lin Yaping: Improved Bayesian Spam Filtering Based on Co-weighed Multi-area Information. In: Proceedings of The 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-2005). Volume 3518 of LNCS/LNAI., Hanoi, Vietnam, Springer-Verlag GmbH (2005), Pages 650-660.

3. Wang Lei, Lin Yaping, Yang Xiao-lin, and Shrestha Raju, A Hybrid Algorithm based on Local Optimal Clusters for Ad-hoc Networks, The Third International Conference on Intelligent Multimedia Computing and Networking (IMMCN-2003), In conjunction with The Seventh Joint Conference on Information Sciences (JCIS-2003), September 26-30, 2003, North Carolina, USA.

Bibliography

[1] National Technology Readiness Survey (NTRS). Technical report, Robert H. Smith School of Business and Rockbridge Associates Inc., University of Maryland (2005)
[2] Cranor, L.F., LaMacchia, B.A.: Spam! Communications of the ACM, 41(8) (1998)
[3] Paganini, M.: ASK: Active Spam Killer. In: FREENIX Track: USENIX Annual Technical Conference, San Antonio, Texas, USA (2003) 51–62
[4] Cohen, W.: Learning rules that classify email. In: AAAI Spring Symposium on Machine Learning in Information Access (1996)
[5] Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys, 34(1) (2002) 1–47
[6] Yang, Y.: An evaluation of statistical approaches to text categorization. Journal of Information Retrieval, 1(1/2) (1999) 67–88
[7] Drucker, H., Wu, D., Vapnik, V.: Support vector machines for spam categorization. IEEE Transactions on Neural Networks, 10 (1999) 1048–1054
[8] Trudgian, D.C.: Spam classification using nearest neighbour techniques. In: Intelligent Data Engineering and Automated Learning (IDEAL 2004) (2004) 578–585
[9] Carreras, X., Marquez, L.: Boosting trees for anti-spam email filtering. In: Proceedings of RANLP-2001, 4th International Conference on Recent Advances in Natural Language Processing, Tzigov Chark, BG (2001)
[10] Katirai, H.: Filtering junk e-mail: A performance comparison between genetic programming and naive Bayes (1999) citeseer.ist.psu.edu/katirai99filtering.html
[11] Pantel, P., Lin, D.: SpamCop: A spam classification and organization program. In: Proceedings of the AAAI-98 Workshop on Learning for Text Categorization (1998)
[12] Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian approach to filtering junk e-mail. In: Learning for Text Categorization, Madison, Wisconsin (1998)
[13] Androutsopoulos, I., Koutsias, J., Chandrinos, K.V., Paliouras, G., Spyropoulos, C.D.: An evaluation of naive Bayesian anti-spam filtering. In: Potamias, G., Moustakis, V., van Someren, M., editors, Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning (ECML 2000), Barcelona, Spain (2000) 9–17
[14] Androutsopoulos, I., Koutsias, J., Chandrinos, K.V., Paliouras, G., Spyropoulos, C.D.: An experimental comparison of naive Bayesian and keyword-based anti-spam filtering with personal e-mail messages. In: Belkin, N.J., Ingwersen, P., Leong, M.K., editors, Proceedings of SIGIR-2000, 23rd ACM International Conference on Research and Development in Information Retrieval, Athens, Greece, ACM Press, New York, US (2000) 160–167
[15] Androutsopoulos, I., Koutsias, J., Chandrinos, K.V., Paliouras, G., Spyropoulos, C.D.: Learning to filter spam e-mail: A comparison of a naive Bayesian and a memory-based approach. In: Zaragoza, H., Gallinari, P., Rajman, M., editors, Proceedings of the Workshop on Machine Learning and Textual Information Access, 4th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD 2000), Lyon, France (2000) 1–13
[16] Sakkis, G., Androutsopoulos, I., Paliouras, G., Karkaletsis, V., Spyropoulos, C., Stamatopoulos, P.: Stacking classifiers for anti-spam filtering of e-mail. In: Lee, L., Harman, D., editors, Empirical Methods in Natural Language Processing (EMNLP 2001), Pittsburgh, PA (2001) 44–50
[17] Graham, P.: A plan for spam (2002) http://www.paulgraham.com/spam.html
[18] Graham, P.: Better Bayesian filtering. In: Proceedings of the First Annual Spam Conference, MIT (2003) http://www.paulgraham.com/better.html
[19] Robinson, G.: Spam detection (2003) http://radio.weblogs.com/0101454/stories/2002/09/16/spamDetection.html
[20] Robinson, G.: A statistical approach to the spam problem. Linux Journal, Issue 107 (2003) http://www.linuxjournal.com/article.php?sid=6467
[21] Yerazunis, W.S.: The spam-filtering accuracy plateau at 99.9% accuracy and how to get past it. In: Proceedings of the MIT Spam Conference, Cambridge, Massachusetts (2004)
[22] Naive Bayes classifier. Wikipedia, The Free Encyclopedia (2005) http://en.wikipedia.org/wiki/Naive_Bayesian_classification
[23] Rish, I.: An empirical study of the naive Bayes classifier. In: Proceedings of the IJCAI-01 Workshop on Empirical Methods in AI (2001)
[24] Domingos, P., Pazzani, M.J.: Beyond independence: Conditions for the optimality of the simple Bayesian classifier. In: Proceedings of the Thirteenth International Conference on Machine Learning, Bari, Italy, Morgan Kaufmann (1996) 105–112
[25] Webb, G.I., Pazzani, M.J.: Adjusted probability naive Bayesian induction. In: Proceedings of the Eleventh Australian Joint Conference on Artificial Intelligence, Brisbane, Australia, Springer (1998) 285–295
[26] Joachims, T.: A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. Technical report, School of Computer Science, Carnegie Mellon University (1996)
[27] McCallum, A., Nigam, K.: A comparison of event models for naive Bayes text classification. In: AAAI-98 Workshop on Learning for Text Categorization (1998)
[28] Rennie, J.D., Shih, L., Teevan, J., Karger, D.R.: Tackling the poor assumptions of naive Bayes text classifiers. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003) (2003)
[29] Eyheramendy, S., Lewis, D., Madigan, D.: On the naive Bayes model for text categorization. In: Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics (2003)
[30] Mitchell, T.M.: Machine Learning. McGraw-Hill (1997) 177–184
[31] Provost, J.: Naive-Bayes vs. rule-learning in classification of email. Technical report, Department of Computer Science, The University of Texas at Austin (1999)
[32] Louis, G.: Greg’s bogofilter page (2003) http://www.bgl.nu/bogofilter/index.html
[33] Yang, Y., Pedersen, J.O.: A comparative study on feature selection in text categorization. In: Proceedings of the 14th International Conference on Machine Learning (ICML-97) (1997) 412–420
[34] Burton, B.: SpamProbe - Bayesian spam filtering tweaks (2003) http://spamprobe.sourceforge.net/paper.html
[35] Hauser, S.: Statistical spam filter works (2003) http://www.tc.umn.edu/~hause011/article/Statistical_spam_filter.html
[36] Shrestha, R., Lin, Y., Chen, Z.: Bayesian spam filtering based on co-weighting multi-estimations. In: Proceedings of The International Symposium on Intelligent Computation and its Applications (ISICA-2005), Wuhan, China, China University of Geosciences (2005) 500–508
[37] Shrestha, R., Lin, Y.: Improved Bayesian spam filtering based on co-weighed multi-area information. In: Proceedings of The 9th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD-2005). Volume 3518 of LNCS/LNAI, Hanoi, Vietnam, Springer-Verlag GmbH (2005) 650–660
[38] Mason, J.: The SpamAssassin public corpus (2003) http://www.spamassassin.org/publiccorpus
[39] Jones, R.W.: Great Spam Archive (2003) http://www.annexia.org
[40] Yerazunis, W.S.: Sparse binary polynomial hashing and the CRM114 discriminator. In: Proceedings of the MIT Spam Conference, Cambridge, Massachusetts (2003)