Using J48 Tree Partitioning for Scalable SVM in ... - Semantic Scholar

Computer and Information Science; Vol. 8, No. 2; 2015 ISSN 1913-8989 E-ISSN 1913-8997 Published by Canadian Center of Science and Education

Using J48 Tree Partitioning for Scalable SVM in Spam Detection Mohammad-Hossein Nadimi-Shahraki1, Zahra S. Torabi1 & Akbar Nabiollahi1 1

Faculty of Computer Engineering, Najafabad branch, Islamic Azad University, Najafabad, Isfahan, Iran

Correspondence: Zahra. S Torabi, Faculty of Computer Engineering, Najafabad branch, Islamic Azad University, Najafabad, Isfahan, Iran. Tel: 98-314-229-2440. E-mail: [email protected] Received: January 31, 2015

Accepted: February 9, 2015

Online Published: April 25, 2015

doi:10.5539/cis.v8n2p37

URL: http://dx.doi.org/10.5539/cis.v8n2p37

Abstract Support Vector Machines (SVM) is a state-of-the-art, powerful algorithm in machine learning which has strong regularization attributes. Regularization points to the model generalization to the new data. Therefore, SVM can be very efficient for spam detection. Although the experimental results represent that the performance of SVM is usually more than other algorithms, but its efficiency is decreased when the number of feature of spam is increased. In this paper, a scalable SVM is proposed by using J48 tree for spam detection. In the proposed method, dataset is firstly partitioned by using J48 tree, then, features selection are applied in each partition in parallel. Consistently, selected features are used in the training phase of SVM. The propose method is evaluated conducted some benchmark datasets and the results are compared with other algorithms such as SVM and GA-SVM. The experimental results show that the proposed method is scalable when the number of features are increased and has higher accuracy compared to SVM and GA-SVM. Keywords: spam detection, support vector machine (SVM), partitioning, J48 Tree 1. Introduction Spammers are continuously pioneering new methods to bypass email anti-spam filter solutions forcing companies to invest in spam filtering that can keep up with the evolving methods. Up to now, various methods have been presented in order to fight and spam detection (Jindal & Liu, 2007). According to this reason there is need to use hybrid multi-layer architectures method for Spam detection problems. Most of the hybrid multi-layer architectures filters include a combination of methods, such as applying key words, rule based filters, black and white list and other data mining for detecting the spams that are more important (Cook et al., 2006, Nakulas et al., 2009). In the last decade one of the most important issues to detect spam based on content is machine learning and data mining algorithm (Chiroma et al., 2014). SVM is supervised learning models that recognize patterns and analyze data in machine learning. In many studies SVM performance in terms of error rate, speed and accuracy are higher than other algorithms such as Naïve Bayesian, Ripper, Rocchio, and Boosting Decision trees and non-parametric classification such as Neural Networks, Nearest Neighbor (Drucker et al., 1999; Liu et al., 2002; FENG & ZHOU, 2013). On the other hand, when the size of dataset increases due to the complexity of computation, it encounter to the lack of time and memory space (Tsang et al. 2005; FENG & ZHOU, 2013). In this paper, we proposed a method to solve this problem. Partitioning is one of the method is used in modern database like IBM DB2 (McInerney, 2006), Microsoft SQL Server 2012 (Microsoft, 2012) and Oracle Database 11g (Oracle.DB) to improve manageability, performance and availability when we have big data. Then, a simple and liner algorithm is necessary to partition the dataset into minimum number of sub-tree. For an approach to be effective in the real-world continuous attributes deal sensibly with missing values, numeric attributes and pruning to deal with for noisy data. J48 is one of best well-known and widely used learning algorithm include the features is mentioned. In this paper, we are going to offer a good method to remove the problem of SVM and applying it for detecting spam. We propose an approach which decomposes a large dataset into smaller ones and trains SVM algorithm on each part of them. This method can decrease the whole training time because the complexity of training time on SVM algorithm is in the order of N2, where n is the number of features in training dataset (Chang et al., 2010). If each part deals with k samples, then to solve the complexity of the difficulties is in the order of [(N/k) ×k2= N.k], if N

Using J48 Tree Partitioning for Scalable SVM in ... - Semantic Scholar

Using J48 Tree Partitioning for Scalable SVM in ... - Semantic Scholar

Suggest Documents

Comparison between J48 Decision Tree, SVM and ... - ssrg-journals

Using a Binary Space Partitioning Tree for

Fingerprint Gender Classification using Univariate Decision Tree (J48)

Scalable Transactions in the Cloud: Partitioning ... - Semantic Scholar

Face Membership Authentication Using SVM Classification Tree ...

Face Membership Authentication Using SVM Classification Tree ...

bridging the semantic gap using ranking svm for ... - Semantic Scholar

Managing Verification Activities Using SVM - Semantic Scholar

Succinct interval-splitting tree for scalable similarity ... - Semantic Scholar

scalable algorithms for parallel tree search - Semantic Scholar

SVMTOCP: A Binary Tree Base SVM Approach ... - Semantic Scholar

Application of J48 Decision Tree for the Identification of ... - Sciforum

Using a multi-objective genetic algorithm for SVM ... - Semantic Scholar

Using Multidimensional ADTPE and SVM for ... - Semantic Scholar

Trends of Tree-Based,Set-Partitioning ... - Semantic Scholar

TreeJuxtaposer: Scalable Tree Comparison using Focus+ ... - CiteSeerX

Binary Tree SVM Based Framework for Mining

Data Partitioning for Semantic Web - Semantic Scholar

Scalable Video Summarization Using Skeleton ... - Semantic Scholar

Scalable Distributed Reasoning using MapReduce - Semantic Scholar

Power Scalable Processing Using Distributed ... - Semantic Scholar

Scalable Bayesian Optimization Using Deep ... - Semantic Scholar

Scalable Probabilistic Computing Models using ... - Semantic Scholar

Scalable Video Summarization Using Skeleton ... - Semantic Scholar