Journal of Convergence Information Technology Volume 5, Number 1, February 2010

Content Mining and Network Analysis of Microblog Spam

Shen Yang*1, Li Shuchen*2, Ye Xiaoxiao*3, He Fangping*4
*1 School of Information Management, Wuhan University, Hubei, China
*2 International School of Software, Wuhan University, Hubei, China
*3 International School of Software, Wuhan University, Hubei, China
*4 International School of Software, Wuhan University, Hubei, China

doi: 10.4156/jcit.vol5.issue1.16

Abstract

The number of microblog users is growing rapidly, and so is the volume of spam. We first give the microblog a formal definition and then divide spam into two types: news and advertisements. We collect 1,760,314 microblog messages (188 MB) for content mining. Using ROST Content Mining, we carry out macro-level topology statistics, time series mining, and related analyses. We find that the microblog user group exhibits small-world characteristics. Its same-degree correlation coefficient is negative; the probability that a news microblog gains followers is 0.0002, while the rate of second spread is 0.011. We put forward a recursive filtering method to estimate the rate of spread over successive rounds, and we introduce a cross-relation method that converts nodes that are difficult to analyze as a network into a form suitable for social network analysis.

Keywords

Microblog, Content mining, Social network, Information dissemination, Recursive filtering method

1. Microblogs And Spam Information

Microblogs, which began to flourish in March 2006, are growing geometrically. They are informal mini-blogs for exchanging information. Users can post short messages whenever and wherever they want, using tools such as mobile phones, QQ, Skype, and web APIs. Tencent Taotao (6000 million users), Twitter, Fanfou, and other microblogs limit each message to 140 characters. With the rapid development of microblogs, spam has followed, and a large number of spam microblogs have formed. The network environment deteriorates because of spammers, who are sharply criticized by internet users for publishing large quantities of spam and turning users into passive readers.

We divide spam microblogs into two types by their features. The first category is news. These microblogs post messages frequently, either automatically or manually. Although news has some value, at large scale it interferes with the information users actually want; that is why we label large quantities of automatically posted news as spam. The second category is advertisements. Spammers first register many microblog accounts and then spread advertisements manually or automatically on a large scale. The content is usually of poor quality and includes links and contact information, posted alongside hot keywords so that visitors tend to click them and the spammers gain more advertising traffic (Fig. 1).

Figure 1. Worldwide ranking trend of the representative microblog website Twitter

2. Related Work

Research on spam microblogs is still at an early stage, whereas research on spam in general, spam short messages, and spam blogs is quite mature.

Research on spam: Carlos Castillo and colleagues proposed a spam detection system in 2007; it builds classifiers based on link and content features, combined with the topological features of the web graph. In 2008, Balakumar M and Vaidehi V argued that, because an email may be legitimate to one user but not to another, spam filters should be personalized. They therefore interpreted email content with an ontology and applied Bayes classification. Masahiro Uemura and Toshihiro Tabata implemented a Bayesian filtering technique to deal with the growing volume of image spam. Yang Li and Jun Wu identified spam by comparing the similarity between newly extracted URLs and URLs in a history database.

Research on spam short messages: in 2007, Gordon V. Cormack pointed out that spam filtering techniques need to be improved before they can be applied to SMS, where the number of spam short messages is growing. Cormack also held that content-based spam filtering can be applied effectively to mobile phone (SMS) communication, blog comments, and brief emails.

Research on spam blogs: in 2007, Akshay Java and colleagues observed that internet slang, spelling and grammar mistakes, and spam comments add to the difficulty of content analysis. They therefore proposed a blog analysis engine to remove spam from blog datasets. Yu-Ru Lin and Sundaram, H pointed out that the high regularity of the content, posting times, and linking patterns of spam blogs can be used to detect them. In 2008 Kazunari Ishida proposed using co-citation clusters to extract spam blogs. The experimental results showed that large out-degree and spam keywords are effective in extracting spam blogs.

More on spam: in 2007 Hiroo Saito, Masashi Toyoda, and colleagues studied how spammers deceive the PageRank algorithm of search engines by linking to each other. They studied 580 million Japanese websites and 2.83 billion links, and detected more than 60 million using three kinds of graph algorithms. Emma Rooksby pointed out that spam from political parties, social organizations, and non-commercial governmental associations may be more destructive than commercial spam, because it cannot be supervised and is more widespread.

From the analysis above, we conclude that research on spam at home and abroad is fairly comprehensive. Because personal privacy is protected, collecting microblog data is quite difficult. Reports on microblog spam are therefore very rare, let alone studies from a content mining perspective. This paper addresses the following questions: (1) microblog data collection and the structuring of spam information; (2) spam content and social network analysis. We need natural language processing, social network analysis, and content mining techniques to discover and extract spam and to understand the dynamics of harmful information in informal communication.

3. The Idea Of Analyzing

The premise of microblog research is a clear definition of what is to be studied. Current descriptions of its nature focus on two points: the brevity of messages and the variety of posting tools. Unlike chat websites, microblogs provide timeline webpages, which publish chat records at random subject to an admission protocol, so that these pages can be monitored at fixed intervals, data can be collected, and further research can be carried out on an organized corpus. Although every timeline webpage offers RSS subscription, RSS provides limited information and no social relationship data, so we can only crawl the HTML pages. Based on the Alexa website ranking, we chose Fanfou, Jiwai, and Twitter as data sources. Using a web crawler, we monitor the timeline webpages of these microblogs, parse information items with ROST MicroBlog InfoExtractor, and build the ROST MicroBlog DataSET (1,760,314 messages, 188 MB) for our later research. We then use ROST Content Mining (Fig. 2) to measure the macro topology and carry out content mining. The ROST series was developed by the author and can be found through Google and downloaded from thousands of websites.

Figure 2. The interface of ROST Content Mining

Tokenization is required when analyzing the content, so we use our own ROST compound parsing method. We first apply the maximum matching method, locate the divergent parts by partial backtracking, and then use a probabilistic method to resolve ambiguities. Probabilistic name recognition is included in the algorithm. After tokenization, we compute word frequency statistics based on a hash-trie structure. For the sake of word frequency accuracy, we use the ROST word root merging algorithm, which improves accuracy by building a table of the corresponding forms of irregular words, even though the open source PortStem word root merging algorithm exists. We cut the dataset into time slices and treat each slice as a document; this lets us compute the frequency of uninformative common words and build a filter word list with a variant of the traditional TF-IDF formula.
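As a rough illustration of the time-slice filtering idea, the following Python sketch treats each time slice as a document and collects the words whose inverse document frequency is close to zero, that is, words that occur in nearly every slice. The paper's variant of the TF-IDF formula is not given, so the standard IDF, the function name `build_filter_words`, and the threshold `max_idf` are assumptions made only for this example.

```python
import math
from collections import Counter


def build_filter_words(slices, max_idf=0.2):
    """Return words that occur in (nearly) every time slice.

    slices  -- list of token lists, one list per time slice (hypothetical layout)
    max_idf -- assumed cut-off; words with IDF at or below it are filtered
    """
    n_slices = len(slices)
    doc_freq = Counter()
    for tokens in slices:
        # Count each word once per slice (document frequency).
        doc_freq.update(set(tokens))
    # Standard IDF; the paper uses an unspecified variant.
    idf = {word: math.log(n_slices / df) for word, df in doc_freq.items()}
    return {word for word, value in idf.items() if value <= max_idf}


slices = [
    ["the", "match", "tonight", "is", "on"],
    ["the", "rain", "is", "back"],
    ["buy", "the", "new", "phone", "is", "cheap"],
]
filter_words = build_filter_words(slices)
```

In this toy corpus only "the" and "is" appear in all three slices, so they are the only words placed on the filter list. A real corpus would need a tuned threshold.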

The steps of content mining for microblog information items are as follows:
1) Converting between simplified and traditional characters and between upper and lower case.
2) Measuring the word frequency of uninformative words in the corpus and building a filter word list with the improved TF-IDF formula.
3) Constructing a compound spam detection algorithm and extracting spam chat records using credit-level-based and feature-word Bayes learning.
4) Counting the word frequency of spam chat records per time slice, using Chinese and English tokenization and the word frequency statistics algorithm.
5) Building the timeline word frequency table of spam chat records.
6) Discovering spam subgroups with the subgroup discovery algorithm and content filtering based on information fingerprints.
7) Using the adjacency matrix method to compute the social network relationships among spam users.
8) Using UCINET, SPSS, MATLAB, and other software for further analysis.

For the extracted information items, we take users as nodes and communication relationships as edges. A link between two users is created as soon as a message is sent between them. The whole network is a weighted, directed graph.

4. Content Mining And Social Network Analyses

To bring out the characteristics of the microblog network thoroughly, we divide the experiment into overall topology statistics, news item analysis, advertisement item analysis, and comparative analysis.

4.1 Macro Characteristic Analysis of Three Biggest Microblogs

Table 1. The topology feature

Performance measure     Twitter    Fanfou    Jiwai
Number of posts         1354562    175047    230705
Number of users         170489     21504     24405
Frequency of friends    340542     19877     24386
Friends relations       272101     6886      7389
Self response           51         31        747
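The kind of quantities listed in Table 1 can be illustrated with a small Python sketch that builds a weighted, directed graph from (sender, recipient) records, with users as nodes and one edge per message. The record layout, the function names, and the reading of the table rows (total directed messages versus distinct sender-recipient links) are assumptions for illustration, not the definitions used by ROST Content Mining.

```python
from collections import Counter


def build_graph(messages):
    """Build a weighted, directed graph from (sender, recipient) records.

    A recipient of None means the post is not addressed to anyone.
    Returns (number of posts, set of users, Counter of edge weights).
    """
    edges = Counter()
    users = set()
    posts = 0
    for sender, recipient in messages:
        posts += 1
        users.add(sender)
        if recipient is not None:
            users.add(recipient)
            # The edge weight counts messages sent from sender to recipient.
            edges[(sender, recipient)] += 1
    return posts, users, edges


def summarize(messages):
    """Compute Table-1-style statistics (row meanings are assumed)."""
    posts, users, edges = build_graph(messages)
    return {
        "posts": posts,
        "users": len(users),
        "directed_messages": sum(edges.values()),
        "distinct_links": len(edges),
        "self_responses": sum(w for (u, v), w in edges.items() if u == v),
    }


messages = [
    ("ann", "bob"),
    ("ann", "bob"),
    ("bob", "ann"),
    ("cat", None),
    ("cat", "cat"),
]
stats = summarize(messages)
```

Keeping the edge weights in a `Counter` keyed by (sender, recipient) makes it easy to export an adjacency matrix later for tools such as UCINET.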

Fig. 3 visualizes the chat statistics of the top users of Fanfou and Jiwai.

Figure 3. The relationship map of the first 150 users of Fanfou (left) and Jiwai (right)

These maps show that Jiwai users tend to respond to themselves far more often, whereas this behavior is absent among the Fanfou users shown. The reason is that Jiwai lacks an information filtering mechanism, whereas Fanfou blocks such messages. As a result, the top-ranked Fanfou users are essentially genuine users who do not reply to their own messages.

4.2 Analyses of News Spam Microblogs on Jiwai

The difficulty in deciding whether a piece of information has been spread two or more times is that it is hard to tell whether an item is being spread for the first or the second time. We therefore use an approximate method. First, we filter out some spammers (who use spam keywords, post frequently, and have high out-degree and low in-degree) and obtain the first-spread recipients from them. Then we check whether those recipients send out similar information again, aggregate the related data, and obtain the rate of second spread. Repeating this gives an approximate rate for every round of spread. This is what we call the recursive filtering method. The closer the number of identified spammers is to the real number, the smaller the deviation.

First we extract the user field, count its frequency, and obtain the 14 users who post the most. Table 2 shows the result.

Table 2. The statistic of speaking frequency of Fanfou users

Ranking    User                     Times of speaking
1          Xinhua                   4465
2          mininova                 4465
3          NBCN                     4069
4          gold                     3733
5          hnnyzxh                  3202
6          NewsRSS                  3153
7          wht30003000              3033
8          Baidu news               2124
9          Tencent                  2095
10         fater                    1912
11         Normman                  1884
12         verycd                   1860
13         Zhonghao                 1756
14         Douban movie comments    1703

Content mining shows that these users' personal profiles are very sparse: only the user name is filled in, and most of their content is updated and published frequently via Feedlr or RSS. Searching the dataset for these 14 users and extracting the related user information items, we found that only 9 users responded. The probability that a news publisher gains followers is therefore only 9/39454 ≈ 0.0002, where 39454 is the total number of conversations of the top 14 users. Compared with the vast number of conversations, they have few followers. We extract the user relationships from the information items, compute statistics on the undirected relations, perform network analysis on the result, and obtain Fig. 4.

Figure 4. Second-spread user relations of Jiwai news (left) and simplified weighted graph (right)

The simplified graph shows that several nodes are very important: once they are removed, the whole network breaks into many disconnected sub-networks. We call this phenomenon the information bridge. It is quite similar to the role of the key terrorists in the “911” event. We continue by tracing the user groups that these news microblogs influence. Starting from these 14 Jiwai users, we obtain 154 users and then extract the subgroup network of those 154 users, which involves 54956 conversations and 692 users. We extract the information items and perform content mining on the user relationship pairs. We find 396 undirected and 426 directed communication pairs that mention these news items, so the directed pairs are about 1.08 times as many as the undirected ones (426/396). This indicates that most second-spread subgroups are one-way. The rate of second spread is 426/39454 = 0.011.

Moreover, of the Jiwai users ranked in the top 100, only 13 are also in the top 100 by number of posts. Of the top 1000 relationship pairs, 411 are in the top 1000 by number of posts. The user ranked first in friend interaction is only 17th in number of posts. Thus the users who interact most actively do not necessarily post the most. We can put users who post a great deal but have few friends into a spam candidate list and use it in the spam extraction algorithm.
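The recursive filtering method can be sketched in Python as follows. This is only one possible reading of the description in this section: the set of seed spammers is assumed to be known already, "similar information" is reduced to a caller-supplied predicate `is_spam`, and the rate of each round is taken as the number of new re-spreaders divided by the number of spam items posted by the seeds. All names are hypothetical.

```python
def recursive_filter(messages, seeds, is_spam, max_rounds=10):
    """Estimate the spread rate of each round after the first.

    messages -- list of (sender, recipient, text) records (assumed layout)
    seeds    -- set of users already identified as spammers
    is_spam  -- predicate deciding whether a text repeats the spam content
    Returns a list: rate of second spread, third spread, and so on.
    """
    spam = [(s, r) for s, r, text in messages if is_spam(text)]
    base = sum(1 for sender, _ in spam if sender in seeds)
    if base == 0:
        return []
    seen = set(seeds)
    frontier = set(seeds)
    rates = []
    for _ in range(max_rounds):
        # Users who received spam from the current spreaders for the first time.
        recipients = {r for s, r in spam if s in frontier and r is not None}
        recipients -= seen
        # Recipients who sent similar content on themselves.
        spreaders = {s for s, _ in spam if s in recipients}
        if not spreaders:
            break
        rates.append(len(spreaders) / base)
        seen |= recipients
        frontier = spreaders
    return rates


AD = "buy now: cheap phones"
messages = [
    ("s1", "u1", AD), ("s1", "u2", AD), ("s1", "u3", AD), ("s1", "u4", AD),
    ("u1", "u5", AD), ("u2", "u6", AD),
    ("u3", "u4", "see you tomorrow"),
    ("u5", "u7", AD),
]
rates = recursive_filter(messages, {"s1"}, lambda text: "buy now" in text)
```

In this toy data two of the four first recipients pass the item on, and one of their recipients passes it on again, so the rates shrink from round to round. The same arithmetic applied to the figures reported above gives 426/39454 ≈ 0.011 for the second spread of news.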

4.3 Analysis of Advertising Spam Microblogs

To advertise, publishers of spam microblogs put their mobile phone number, home phone number, QQ number, and email address in the advertisements. Normal users do not readily publish such information on the internet, nor do they release their personal information frequently. These two characteristics let us build the analysis algorithm. First, we use flexible matching and ordinary regular expressions to extract information items that contain contact information. At the same time, we count the frequency of information items and the number of users. Since Fanfou has already taken some anti-spam measures, it contains very little spam. We again apply the recursive filtering method and obtain 529 information items from 8 spammers; 53 people spread them a second time, 2 people a third time, and none a fourth time. We measure the time intervals between the three rounds of spreading and find that items are re-spread either immediately or on another day. The rate of second spread is 53/529 = 10%, which is higher than that of news. This suggests that some advertisements are skillfully designed while some news is not carefully selected. We therefore conclude that spam removal (used by Fanfou, not by Jiwai) can decrease the rate of first spread but does not necessarily decrease the rate of second spread. The narrowing trend shows that the spread of spam is convergent, which differs greatly from the spread of computer viruses.
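A minimal Python sketch of the contact-information matching step is given below. The patterns are assumptions made for this example (an 11-digit mobile number starting with 1, a QQ number introduced by the letters "QQ", and a simple email shape); the expressions actually used in the study are not given and would have to be more flexible.

```python
import re
from collections import Counter

# Illustrative patterns only; they are not the paper's expressions.
CONTACT_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "mobile": re.compile(r"(?<!\d)1\d{10}(?!\d)"),
    "qq": re.compile(r"\bqq\s*[:：]?\s*(\d{5,11})", re.IGNORECASE),
}


def find_contacts(text):
    """Return the contact details found in one message, keyed by kind."""
    found = {}
    for kind, pattern in CONTACT_PATTERNS.items():
        hits = pattern.findall(text)
        if hits:
            found[kind] = hits
    return found


def contact_posts_per_user(posts):
    """Count, per user, the messages that contain contact information."""
    counts = Counter()
    for user, text in posts:
        if find_contacts(text):
            counts[user] += 1
    return dict(counts)


sample = "Cheap tickets! Call 13800001111 or QQ: 123456, mail sales@example.com"
contacts = find_contacts(sample)

posts = [
    ("adbot", "Cheap tickets! Call 13800001111"),
    ("adbot", "QQ: 123456 for discounts"),
    ("ann", "See you at the match tonight"),
]
per_user = contact_posts_per_user(posts)
```

Users with a high count in `per_user` are candidates for the advertising spam list; a threshold would have to be chosen from the data.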

4.4 Comparative Analysis Of Advertising Spam Microblogs And News

Analyzing the user relationships of Jiwai news and advertisement items, we obtain the publisher network (users who have posted spam, though not always) and measure its network attributes. Table 3 shows the result.

Table 3. Information network feature list of Jiwai spammers

Item                             News      Advertisements
Transitivity: i-->j and j-->k    9.52%     7.65%
Hybrid Reciprocity               0.0710    0.1429
Average distance                 1.724     3.631
Cliques                          6         17

From the four groups of data above, the spread rate of news publishers is higher than that of advertisers, which indicates that news publishers also communicate with other users besides posting spam news. The interactivity of news is weaker than that of advertisements: the reciprocity (users' response rate) of advertisements is greater. The average distance of advertisements is much larger than that of news, which shows the sparse nature of advertisement publishers. Advertisements have more subgroups than news, which indicates that the advertisement publisher network is more varied.

We introduce the definition of a cross-relation set. Suppose we have the relations {(A,B),(A,C),(A,F),(B,G),(B,C)}. From the nodes B, C, F related to A, we get {(B,C,1),(B,F,1),(C,F,1)}. From the nodes G and C related to B, we get {(G,C,1)}. In sum, we obtain the relationship set {(B,F,1),(B,C,1),(C,F,1),(G,C,1)}, on which we can do network analysis. In the experiment, we build the relationship set between user fields and news information fields and the relationship set between user fields and advertisement information fields. Then we use the cross-relation method to obtain the cross-relation networks of the news and advertisement information fields. Fig. 5 illustrates the result.

Figure 5. Cross-relation map of messages of news publishers (left) and advertisement publishers (right)

We conclude that news information items are more concentrated while advertisements are scattered, which is consistent with the preceding analysis.

5. Conclusion

This paper completes the following tasks: reviewing the current state of research on spam at home and abroad, categorizing spam into two types, collecting 1,760,314 microblog messages, building a free microblog dataset, proposing a set of content mining methods with the dataset, developing ROST Content Mining for content mining, exploring the principles of spam spreading, and putting forward the recursive filtering method to evaluate the rate of repeated spread. We propose the cross-relation method to transform relationships that cannot easily be studied directly as a network so that they can be analyzed with social network methods. Our next research will concentrate on further exploration of the current data, on comparing the dynamic transmission principles of spam with data we collected from blogs, chat records, and emails, and on improving and promoting ROST Content Mining on the current basis.
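For concreteness, the cross-relation method can be sketched in Python using the example relation set from Section 4.4. The sketch groups the pairs by their first element and links every two second elements that share it, counting how often each link arises. The function name and the choice to store each undirected link with its endpoints in sorted order, so that (G,C) appears as (C,G), are assumptions of this example.

```python
from collections import Counter, defaultdict
from itertools import combinations


def cross_relation(pairs):
    """Turn (source, target) pairs into weighted target-target links."""
    targets_of = defaultdict(set)
    for source, target in pairs:
        targets_of[source].add(target)
    weights = Counter()
    for targets in targets_of.values():
        # Every two targets related to the same source become linked.
        for a, b in combinations(sorted(targets), 2):
            weights[(a, b)] += 1
    return dict(weights)


pairs = [("A", "B"), ("A", "C"), ("A", "F"), ("B", "G"), ("B", "C")]
links = cross_relation(pairs)
```

On the example set this yields the four links B-C, B-F, C-F, and C-G, each with weight 1, matching the relationship set derived in Section 4.4. When two targets share several sources, the weight grows accordingly.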

6. Acknowledgment

This paper is financially supported by the National Natural Science Foundation of China (No. 60803080), the Ministry of Education of the P.R.C. Humanities and Social Science Youth Project (08JC870010), and the “Internet Age Sci-tech paper for Fast share” Project (2009112).

7. References

[1] Carlos Castillo, Debora Donato, Aristides Gionis, Vanessa Murdock, Fabrizio Silvestri: Know your neighbors: web spam detection using the web topology. 2007. Page(s): 423-430
[2] Balakumar, M.; Vaidehi, V.: Ontology based classification and categorization of email. 4-6 Jan. 2008. Page(s): 199-202
[3] Masahiro Uemura, Toshihiro Tabata: Design and Evaluation of a Bayesian-filter-based Image Spam Filtering Method. 2008. Page(s): 46-51
[4] Wu Jun, Li Yang: A new url-filtering based anti-spam technique (J). Study Of Computer Application, 2008(5)
[5] Gordon V. Cormack, Jose Maria Gomez Hidalgo, Enrique Puertas Sanz: Spam filtering for short messages. 2007. Page(s): 313-320
[6] Gordon V. Cormack, Jose Maria Gomez Hidalgo, Enrique Puertas Sanz: Feature engineering for mobile (SMS) spam filtering. 2007. Page(s): 871-872
[7] Akshay Java, Pranam Kolari, Tim Finin, Anupam Joshi, Justin Martineau: BlogVox: Separating Blog Wheat from Blog Chaff. 2007. Page(s): 115-121
[8] Yu-Ru Lin; Sundaram, H.; Yun Chi; Tatemura, J.; Tseng, B.: Splog Detection using Content, Time and Link Structures. 2-5 July 2007. Page(s): 2030-2033
[9] Kazunari Ishida: Extracting spam blogs with co-citation clusters. 2008. Page(s): 1043-1044
[10] Hiroo Saito, Masashi Toyoda, Masaru Kitsuregawa, Kazuyuki Aihara: A large-scale study of link spam detection by graph algorithms. 2007. Page(s): 45-48
[11] Emma Rooksby: The ethical status of non-commercial spam. 2007. Page(s): 141-152