Schematizing a Global SPAM Indicative Probability

NIKOLAOS KORFIATIS∗
MARIOS POULOS†
SOZON PAPAVLASSOPOULOS†

∗ Department of Management Science and Technology, Athens University of Economics and Business, Athens, Greece
† Department of Archive and Library Sciences, Ionian University, Corfu, Greece

Abstract

In this paper we propose a middleware infrastructure to address the problem of filtering unsolicited mail messages (known as SPAM). Our approach uses Bayesian classification of SPAM messages, built upon categorization models that map a probability to a word. Text analysis is applied not only to unsolicited mail but also to legitimate mail messages, which makes it easier to extract a cumulative inference about the nature of an e-mail message. Our proposed architecture extends these models using advances in Collaborative Filtering methods, expressed via Peer-to-peer networks, which will help to build more effective and accurate anti-spam filters.

Key-Words: e-mail, SPAM, Privacy, Peer-to-peer, Bayesian Classifiers

1 Introduction

SPAM [1], also known as mass commercial unsolicited e-mail, is a fast-growing phenomenon affecting all levels of internet users. From end users to large enterprises such as Internet Service Providers (ISPs), SPAM is the most common type of e-mail that a typical internet user receives every day. Socio-technical aspects of SPAM range from bandwidth costs to security and privacy matters. Furthermore, the development of sophisticated software crawlers, which make it easier for SPAMers to acquire the e-mail addresses of people who have made them public via a website or by participating in an internet community such as the USENET news, poses a threat to the use of e-mail as the primary means of computer-mediated communication.

SPAM protection currently follows two approaches. The first is the legal measures approach, which is now being applied in the US and the EU as a way to punish senders who are responsible for a large number of unwanted e-mails being sent to internet users, in violation of their privacy rights. The other side of the coin is cost-sensitive applications of already developed techniques from fields such as information retrieval or text categorization. Following this side, we take a collaborative filtering approach that uses the concept of node interconnections for information exchange, which is the main architecture of a peer-to-peer network. Collaborative filtering reflects the method of exchanging preferences and annotations regarding the same corpora of documents and information. Following the axiom that SPAM is not sent only to certain types of users, which makes it a global "phenomenon", we address the need for a collaborative filtering infrastructure that will make accurate recommendations about the intention of an e-mail message.

In the next paragraphs we categorize current SPAM filtering techniques by the part of the mail message that they target. Next we analyse our modelling approach, which uses a certain type of Peer-to-peer Network Architecture to realize the system, so that internet users can benefit by reducing the amount of SPAM they receive every day and also by reducing the cost of misclassifying legitimate messages as SPAM.

2 Current Approaches on Spam Filtering

Technically speaking, SPAM filtering is a cost-sensitive application of text categorization. We characterize SPAM filtering as "cost sensitive" since the cost, in terms of information loss, of misclassifying a message as SPAM cannot be predicted. Taking this into consideration, many classification methods have emerged, influenced by fields such as natural language processing or Information Retrieval. SPAM filtering methods can be classified by the part of the e-mail message that is targeted. A typical e-mail message consists of two parts:

• The mail header, which contains information about the origin of the e-mail message, such as the route that it followed to reach the user's mail server

• The content of the message, which is going to be read by the user

Following the above taxonomy, several families of SPAM filtering have emerged, depending on the part of the mail message they target (Table 1).

Table 1: Classification Methods and targeted part of the Mail Message

Method                            Message Part
Structured Text Filters           Body
Verification Filters              Header
Distributed Adaptive Blacklists   Header
Rule Based Rankings               Body
Bayesian Classifiers              Body/Header

2.1 Header Based Filtering

Header based filtering is the basic method for large-scale implementations of SPAM protection. When an e-mail message is received by the mail transfer agent (MTA), the header is separated from the body. The header contains fields such as the sender and the recipient of the message. It is then analyzed using a lexical parser in order to identify the values of the fields. Having the header values, we can apply two different types of filtering:

• Verification filtering (also known as "whitelist" filtering)

• Trusted filtering (also known as "blacklist" filtering)

Verification and Trusted filtering are the polar values of the same method, in which the header values are validated against a vector constructed by the user. Given a domain $S$ of mail messages $S_i$ and a set of predefined categories $C = \{SPAM, LEGITIMATE\}$, we consider a vector $\vec{S} = \langle S_1, S_2, S_3, \ldots \rangle$ that represents the messages and a vector $\vec{P} = \langle P_1, P_2, P_3, \ldots \rangle$ that refers to the classifiers for $C$.

The task of categorizing an e-mail message as SPAM may be formalized as approximating the target function $\Phi : S \times C \to \{T, F\}$. When $\Phi(S_i, C_i) = T$, $S_i$ represents a positive example of $C_i$: the classification of the mail message according to $\vec{P}$ is the same as the user's preferred classification. Otherwise, $F$ represents a negative example of the classification process, where the user's preferences differ from the automated classification.

In both cases (Verification, Trusted filtering) the target function uses the classification vector $\vec{P}$ to process the message. The difference between the two methods lies in the construction of the classification vector. Header based filtering is very efficient for the user, since the message can be filtered before it arrives at the user's mailbox by automated processes based on the classification method discussed above. On the other hand, it cannot be considered a trusted method of SPAM filtering, since it classifies messages based on a very small portion of the message, which senders can often change using network programming techniques.

2.2 Message/Content Based Filtering

While header based filtering represents the majority of SPAM filtering policies, especially in large organizations, there are also several implementations of SPAM protection in commercial and Open Source mail clients such as Outlook and Mozilla [2] that target the content part of the mail message. The basic underlying procedure for classifying a mail message as SPAM is "rule based filtering". Typically a user can construct rules, that is, sets of classifiers that are activated when a new message arrives.

In the simple form, the user names a set of words and the rule engine (part of the e-mail client software) tries to match these words against the content of the mail message. As with header based filtering, the message is categorized as SPAM only if it matches the rule.

In a more advanced form of content filtering, a combination of rules is applied, and a message is classified based on the overall score it collects when several rules are applied to it. This score is then validated against a predefined threshold, usually defined by the user, and the message is classified as SPAM when the overall score exceeds the threshold.

Using message content as the targeted part of the SPAM filtering method gives the advantage of an accurate mechanism that classifies mail messages based on user preferences about classification terms, thus making the filtering method more targeted to the individual user's characteristic mail messages. The main disadvantage of this type of filtering model is that it is built upon certain characteristics of SPAM messages that are not global, and interaction with the user is always needed to enter new values or modify existing ones in the rule vector.

2.3 Mixed Mode Filtering

Mixed mode filtering combines the filtering methods discussed above. Typically, mixed mode filtering accumulates classifier decisions by applying filtering methods to both the header and the content part of the e-mail message. Examinations of existing SPAM mail corpora show that a typical SPAM message contains suspicious terms in both parts of the message. Radical implementations of this filtering policy use a mixed mode of interaction with the user in order to find how strongly each message part influences the classification decision.

3 The Bayesian Filtering Approaches

A special type of mixed mode filtering is a Bayesian filter that analyses both the header and the content parts of the message. This type of filtering uses a probability model to characterize a message based on the total probability accumulated by specific terms in the header and content parts of the e-mail message. Graham [3] suggested building Bayesian probability models of SPAM and non-spam words. The general pattern is that some words occur more frequently in known SPAM, and other words occur more frequently in legitimate messages. Using this approach we can generate probabilities for each attribute of the e-mail message and, following a supervised learning period, create a probability distribution of the terms that occur in both SPAM and legitimate messages.

Naive Bayes classifiers have recently been shown to be extremely accurate for SPAM classification [4]. This family of filters works in mixed mode by analyzing both content and header values of spam corpora. A message $s$ is represented by a vector $\vec{x} = \langle x_1, x_2, \ldots, x_n \rangle$, where $x_1, x_2, \ldots, x_n$ are the values of the attributes $X_1, X_2, \ldots, X_n$. In our case attributes correspond to words, i.e. each attribute shows whether a particular word (e.g. "offer") appears in the message. From Bayes' theorem and the theorem of total probability, the probability that a mail message $s$ with vector $\vec{x} = \langle x_1, \ldots, x_n \rangle$ belongs to category $c$ is

$$P(C = c \mid \vec{X} = \vec{x}) = \frac{P(C = c) \cdot \prod_{i=1}^{n} P(X_i = x_i \mid C = c)}{\sum_{\kappa} P(C = \kappa) \cdot \prod_{i=1}^{n} P(X_i = x_i \mid C = \kappa)}$$
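As a minimal sketch of the naive Bayes posterior described in this section, the following Python fragment computes P(C = c | X = x) for binary word attributes. It assumes that each attribute X_i is 1 when the word appears in the message and 0 otherwise; the function name `posterior`, the two example words and all probability values are hypothetical and chosen only for illustration, not taken from any corpus.

```python
from math import prod


def posterior(x, priors, likelihoods):
    """Return P(C = c | X = x) for every category c (naive Bayes).

    x           -- dict: word -> 1 if the word appears in the message, else 0
    priors      -- dict: category -> P(C = c)
    likelihoods -- dict: category -> dict: word -> P(X_i = 1 | C = c)
    """
    joint = {}
    for c, prior in priors.items():
        # P(X_i = x_i | C = c) is p when the word is present, 1 - p otherwise.
        terms = [
            likelihoods[c][word] if present else 1.0 - likelihoods[c][word]
            for word, present in x.items()
        ]
        joint[c] = prior * prod(terms)
    # The denominator sums the joint probability over all categories.
    total = sum(joint.values())
    return {c: value / total for c, value in joint.items()}


# Hypothetical values, for illustration only.
priors = {"SPAM": 0.5, "LEGITIMATE": 0.5}
likelihoods = {
    "SPAM": {"offer": 0.8, "meeting": 0.1},
    "LEGITIMATE": {"offer": 0.1, "meeting": 0.6},
}
message = {"offer": 1, "meeting": 0}

post = posterior(message, priors, likelihoods)
print(post)
```

With these made-up numbers the message containing "offer" but not "meeting" receives most of the probability mass for the SPAM category. A practical filter would estimate the priors and likelihoods from labelled corpora and would usually work with logarithms to avoid numerical underflow on long messages.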

In the formula for $P(C = c \mid \vec{X} = \vec{x})$, $\kappa$ ranges over the categories $\{SPAM, LEGITIMATE\}$. This simple formula is the basis for many spam-filtering approaches that have been developed in order to improve the accuracy of SPAM classification for e-mail users (e.g. SPAMBayes∗). The approach can be customized through the values of the document vector. The overall "SPAM probability" of a novel message, based on the collection of words it contains, can be computed as follows:

$$S_d = \frac{\prod_i P_i(Term_i)}{\prod_i P_i(Term_i) + \prod_i \left(1 - P_i(Term_i)\right)}$$
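The combination of per-term probabilities into an overall score can be sketched as follows, assuming that S_d is the product of the per-term spam probabilities P_i(Term_i) divided by the sum of that product and the product of their complements. The function name `spam_score` and the example values are hypothetical. The clamping of each probability to [0.01, 0.99] is an added assumption, not part of the formula, and is there only to avoid a zero denominator.

```python
from math import prod


def spam_score(term_probs, eps=0.01):
    """Combine per-term spam probabilities P_i(Term_i) into one score S_d.

    eps is an illustrative assumption: each probability is clamped to
    [eps, 1 - eps] so that the denominator can never become zero.
    """
    clamped = [min(max(p, eps), 1.0 - eps) for p in term_probs]
    spam = prod(clamped)                       # product of P_i(Term_i)
    legit = prod(1.0 - p for p in clamped)     # product of 1 - P_i(Term_i)
    return spam / (spam + legit)


# Hypothetical per-term probabilities for a three-word message.
score = spam_score([0.9, 0.8, 0.2])
print(round(score, 3))
```

For the three hypothetical terms the two products are 0.144 and 0.016, so the score is 0.144 / 0.160 = 0.9. A single strongly legitimate term (0.2) lowers the score but does not outweigh two strongly spam-indicative terms.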


$S_d$ can change during supervised learning. This method offers the following benefits:

1. It can generate a filter automatically from corpora of categorized messages rather than requiring human effort in rule development.


2. It can be customized to individual users’ characteristic spam and legitimate messages.


3. A probability set can be built, and well-known methods from decision theory can be applied to improve the accuracy of the filter (for example decision trees).

As mentioned before, in our system the SPAM indicative probability comes from a supervised learning process which constructs a user profile that is applied not only to a corpus of SPAM messages but also to a corpus of legitimate messages, thus making a more coherent probabilistic model of the indicative probabilities of the message terms.

We have reviewed SPAM filtering techniques that are widely implemented in a large variety of e-mail software. We now schematize a collaborative filtering mediator that uses peer-to-peer networks to permit an exchange of these filters, thus producing a global "SPAM indicative probability" that can be used in workgroups and in cases where SPAM message terms often correlate.

∗ http://spambayes.sourceforge.net

4 Exchanging SPAM Indicative Probabilities over Peer to Peer Networks

Recently the concept of Peer-to-peer networks has emerged as a new kind of decentralized architecture in which nodes of equal roles and capabilities exchange information and services directly with each other [5]. The best-known characteristic of a Peer-to-peer network is its decentralized architecture, which gives advantages such as no single point of failure or independence of escalation. Most types of Peer-to-peer networks are built upon an anonymous policy that permits anyone to join the network and exchange certain types of information with other peers. In our approach we use a type of peer-to-peer network that requires a lightweight authentication process implemented by invitation exchange between peers. This type of peer-to-peer network, also known as a trusted network, can also be used efficiently in security applications [6].

By joining end users of the same workgroup in a filter exchange platform, we are able to combine their filter characteristics in order to apply a probability classification scheme that becomes more effective as peers join the network and exchange their indicative probabilities.

Figure 1: Peer to Peer deployed graph (a Query Agent and several Peers exchanging P(Xi, TermXi) values)

4.1 Overall System Architecture

As can be seen from Figure 1, we define two types of nodes in our Peer-to-peer network: Query Agents and Simple Peers. Query agents handle the service requests that come from the client application. The user agent then makes a reverse lookup with the total probability distribution of the message terms that comes from the network. We now define a parameter $\lambda$ that expresses the classification criterion as

$$\lambda = \sum_{i=1}^{n} \left( \frac{P(C = spam \mid \vec{X}_i = \vec{x}_i)}{P(C = spam \mid \vec{X}_{global} = \vec{x}_{global})} \right)$$

having $\lambda \in (0, 1]$, and assuming that $\lambda$ follows the normal distribution (norm(0, 1)). For certain values of $\lambda$, as can be seen from Table 2, a supervised interaction with the user is required.

Table 2: λ Values

λ%       Threshold   Supervision
68%      Low         High
95%      Medium      Medium
99.7%    High        Low

Figure 2: Client-Network Architecture

4.2 Querying & LoopBack Service

The second component of our architecture is the LoopBack service, which is used to supervise the classification parameter. We now define the cost of classifying a legitimate message as SPAM as a vector space $C(legit \to SPAM)$.

Let $\vec{W}$ be a vector representation of a simple whitelist filter. We declare $R_I$ as the header value of a misclassified message $r_i$, having $r_i \in C(legit \to SPAM)$. So let $\vec{W} = \langle R_1, R_2, \ldots, R_n \rangle$. The Query Agent now takes a query in the form

$$Q(s) = P(C = SPAM \mid \vec{X} = \vec{x}_{global})$$

having $\vec{x} \notin R_I$. Adding more values to the whitelist vector requires extra effort to query certain nodes each time about the rule $R_i$, so we are currently examining the concept of creating a certain type of peer (supernode) in our architecture that stores a query history, making the querying process more efficient. In this architecture the classification can only be as accurate as the indicative probability in the network is about the specific term.

Similar architectures regarding SPAM can be found in Zhou [7], based on the approximate object location method, which identifies spam by mediating the matching of e-mail header fields and constructing a fingerprint verification that can be used by the Mail Transfer Agent. The wide use of spoofing techniques, where senders hide their address or use false addresses, makes the above architecture sensitive to SPAM attacks that use this method to bypass filters.

5 Ongoing and Future Work

This system is research in progress. We are currently evaluating the accuracy of the proposed system by creating a functional prototype deployed on top of the JXTA API [8]. Considering the phenomenon of SPAM as an emerging problem for all the users of the internet community, we would be happy to collaborate with researchers who have some interest in extending our prototype.

References

[1] Cranor L.F. and LaMacchia B.A. Spam! Communications of ACM, vol. 41(8), 1998, pp. 74–83.
[2] Mozilla Spam Filtering. http://www.mozilla.org/mailnews/spam.html.
[3] Graham P. A Plan for SPAM. Online, August 2002. http://www.paulgraham.com/spam.html.

[4] Sahami M., Dumais S., Heckerman D., and Horvitz E. A Bayesian Approach to Filtering Junk E-Mail. In Learning for Text Categorization: Papers from the 1998 Workshop. AAAI Technical Report WS-98-05, Madison, Wisconsin, 1998.
[5] Androutsellis-Theotokis S. A Survey of Peer-to-Peer File Sharing Technologies. Tech. Rep. WHP-2002-03, Athens University of Economics and Business, Athens, Greece, 2003.
[6] Vlachos V., Androutsellis-Theotokis S., and Spinellis D. Security applications of peer-to-peer networks. Computer Networks, vol. 45(2), June 2004, pp. 195–205.
[7] Zhou F., Zhuang L., Zhao B.Y., Huang L., Joseph A.D., and Kubiatowicz J. Approximate Object Location and SPAM Filtering on Peer-to-Peer Systems. In Endler M. and Schmidt D. (eds.), Proceedings of ACM/IFIP/USENIX International Middleware Conference (Middleware 2003), vol. 2672 of Lecture Notes in Computer Science. Springer Verlag, Rio de Janeiro, Brazil, 2003.
[8] Gong L. Project JXTA: A technology overview. Tech. rep., SUN Microsystems, April 2001.
