Event Ranking Based on Social Network Services - IEEE Xplore

0 downloads 0 Views 2MB Size Report
PageRank-Based Approach on Ranking Social. Events: a Case Study with Flickr. Tuong Tri Nguyen, Hoang Long Nguyen, Dosam Hwang. Department of ...
2015 2nd National Foundation for Science and Technology Development Conference on Information and Computer Science

PageRank-Based Approach on Ranking Social Events: a Case Study with Flickr Tuong Tri Nguyen, Hoang Long Nguyen, Dosam Hwang

Jason J. Jung*

Department of Computer Engineering Yeungnam University, Korea Email: {tuongtringuyen,longnh238,dosamhwang}@gmail.com

Department of Computer Engineering Chung-Ang University, Korea Email: [email protected]

Abstract—Exploring social events from Social Network Services (SNSs) (known as detecting events) has been studied in many researches because of its challenges. Most of researches focus on detecting events based on textual context. In this paper, we propose a novel framework using media data for not only systematically identifying events but also ranking these events. Firstly, we detect events from the photos textual annotations as well as visual features (e.g., timestamp, location); and then effectively identify events by considering the spreading effect of events in the spatio-temporal space. Secondly, we use these relationships among events (e.g., event spatial, temporal and content) for enhancing the precision of the algorithm. Finally, we rank events by analyzing relationships between them (e.g., locations, timestamps, tags) at different period of time. The experiments are conducted with two different approaches: i) using a collected dataset (offline approach), and ii) using a realtime dataset (online approach).

I.

I NTRODUCTION

Identifying and ranking events from SNSs (e.g., Facebook1 , Instagram2 and Flickr3 ) is one of the most interesting tasks recently [1], [2], [3]. In our approach, we especially focus on media data because above mentioned SNSs own billions of images and videos. This data has been shared between entities in SNSs (e.g., among friends, or among communities), and has covered certain interesting topics [4], [5], [6], [7]. In fact, user comments on the contents in forms of tags, ratings, preferences give these data sources an extremely dynamic nature that reflects these events [8]. There are a lot of studies using tags as well as other attributes for predicting system [9], recommendation system [10], [11], clustering algorithm, and classification algorithm, etc. Besides, the non-textual features such as the geotags, the time, and the visual content have been involved in the event identification recently [1], [2]. Focusing on the use of these geotagged resources from SNSs will find out many useful information for social network applications (e.g., for traveling problem [4], [9], [12]). Usually, events can be categorized into three different types: Regular event: this kind of event is organized many times in many places in a regular period (e.g., World Cup, Miss World, Grammy, Oscars). We make a list of ordered locations (these places where organized this event) following its famous Corresponding author 1 www.facebook.com 2 www.instagram.com 3 www.flickr.com

978-1-4673-6640-3/15/$31.00 ©2015 IEEE

Data Collecting

Event Identifying

Event Ranking Using geotagged information and keyword to collect from SNS

Fig. 1.

filter tags ( stop words, stemming), title and description attributes were combined

Ranking Results

The workflow of social event ranking.

to find out the place that people feel the most interesting with a given event. Regional event: this kind of event is organized in a particular place. By identifying Regional event, we can answer the question: “Which is the most interesting event that was happened at a location?” (e.g., the question: “Which is the biggest event that was organized at Seoul last month?”). Real-time event: this kind of event appears in a short period of time (e.g., one hour, one day, or one week). To solve this issue, we have to find out: “What did the most interesting event happen all over the world yesterday?”. With Real-time event, we have to face with huge online data on many SNSs (e.g., Twitter, FaceBook, Instagram, Tumbr, etc.) to extract the needed information. In this paper, we focus on the Regular event. Firstly, we collect a set of Flickr photos, with both user-supplied tags and other metadata, including time and locations (consisting of latitude-longitude coordinates) of an event. The purpose is to discover a sequence of time and locations corresponding to this event. Associated through photos, each tag usage occurrence can be attached with temporal and locational encodings. Secondly, we simultaneously analyze the temporal and locational distributions of tag usage occurrences to discover locationrelated tags. Following this step, we can built a feature vector (including a set of tags). Finally, we discover ordered periodic events which are extracted from a sequence of event by ranking using tags. The work-flow is shown in the Figure 1. Its components are described as follows. •

147

Data analyzing

Data Collecting: datasets are collected from SNSs by using Flickr API. Due to the privacy of Flickr, we

2015 2nd National Foundation for Science and Technology Development Conference on Information and Computer Science

have to add some tips for collecting more than 4000 photos with a specific tag; •







700

Data Analyzing: the tags are first pre-processed by removing stop-words, conducting stemmings and then the term frequency (TF) is calculated;

600 500 400

Event Identifying: this step aims to determine the occurrence events based on the tags and their relationships with locations and time;

300 200

Event Ranking: HITS algorithm is applied [13], [14] for ranking using the list of events that were identified from the above step;

100 0

Ranking Results: the final results are the list of ordered events.

The rest of this paper is organized as follows. In Section II, we refer to some studies related to event detection and event ranking. The Section III introduces backgrounds that relates to ranking issues. Section IV shows the algorithms for ranking events. In Section V, the experiments was conducted to evaluate the results. Section VI draws some conclusion and states some future works. II.

Fig. 2.

The data distribution of EW orldCup .

1200

1000

R ELATED W ORKS

800

In order to rank events on SNSs, the first task is to detect them. Event detection can be divided into two categories [15]: retrospective detection and online detection. The former refers to the detection of previously unidentified events from accumulated historical collection, while the latter entails the discovery of the onset of new events from live feeds in realtime. The study of ranking events from these documents is mainly present in automatic summarization [16]. In [16], they applied a new event ranking method to a single document based on event relation graph. In other issue, to consider the event relationships and the impact of events to the society, the authors [2] analyzed event relationships with an adapted self-exciting point process model and ranking events with random walk algorithm. They introduced to effectively identify events by considering the spreading effect of event in the spatio-temporal space. To capture the triggering relationships among events, the authors adapted self-exciting point process model by jointly considering event spatial, temporal and content similarities. They first defined the event impact and estimated it via the random walk algorithm based on the triggering relationships and then ranked events with different time stamps. Particularly, in [17], the authors focused on spatial ranking service, which can retrieve a set of relevant resources with these certain tags to find out a list of locations which are collected from geotagged resources on SNSs. We proposed a novel method (called LocHITS) that can analyze an undirected 2-mode graph composed with a set of tags and a set of locations. Thereby, meaningful relationships between the locations and a set of tags are discovered by integrating several weighting schemes and HITS algorithm. Our approach is related to identifying social events and then ranking them based on geotagged photos which are collected from SNSs with the case study is Flickr.

148

600

400

200

0

Fig. 3.

The data distribution of EBlossom .

III.

P ROBLEM F ORMALIZATION

We defined two types of event: •

Complex event (E) is a sequence of events which are focused on a particular topic;



Event (e) is a specific thing that occurs in a certain place at a certain time [18].

For example, “Miss World 2010” is an event and “Miss World” is a sequence of events that refers to the world beauty contest yearly. Therefore, for different Complex event, the number of events are different (e.g., EOscars is annual, while EW orldCup is organized every four years). Let P denote a set of geotagged photos that contains coordinates (latitude and longitude), timestamps, tags and other information. These photos are collected from Flickr with some criteria (e.g., keyword, upload date, privacy level). We use E to denote a Complex event which is extracted from P as follows E(P) = hS, L, Θi = {ei }

(1)

2015 2nd National Foundation for Science and Technology Development Conference on Information and Computer Science

where S is a set of event labels, L is a set of locations where events occurred, Θ is a set of timestamps when events occurred. They are extracted from a set of photos P.

300

250

An event e is a triple ei = hsi , li , θi | si ∈ S; li ∈ L; θi ∈ Θi

200

(2)

150

In our study, we focus on extracting information from certain Complex events to identify events and ranking them. The ranking results can be used for not only evaluating events in the present but also determining where is the best place for organizing these events in the future. IV.

100

50

0

S OCIAL E VENT R ANKING

A. Identifying social events Giving a set of geotagged photos (P), that is referred to the Section III, we first have to determine events that belong to a Complex event. However, this paper only focus on detecting events with a simple approach that has main steps as follows.

Fig. 4.

The data distribution of EW orldCup after smoothing.

Photo

First, we calculate the distribution of resource based on timestamps. For example, the results of EW orldCup and EBlossom are shown in Figure 2 and Figure 3. Then, to make our time-series smoothing, we use Moving Average Algorithm (MVA) [19] for capturing the important data and removing all the unnecessary signal by calculating the mean based on the numbers of previous data.

450

Smoothing is a way to remove short peaks in a series of measurements. Let’s assume that we have a time-series function Y = {yθ : θ ∈ Θ}, where Θ is the set of timestamps when events occurred (also called the index set), n is the sliding windows. The value at a point will be smoothed by using the following function:

100

n P

yθ =

n

350 300

250 200 150

50 0

Fig. 5.

yθ−i

i=1

400

The data distribution of EBlossom after smoothing.

(3)

The advantage of MVA is that it can work well with irregular fluctuation and easy to apply for different types of data. The output of data smoothing makes it easy to detect the events as shown in Figure 4 and Figure 5. We combine these results with the collected information from the web query to find out these events. B. Ranking social events The ranking approach allows to input a natural language query without boolean syntax to find out the list of ranked records that answer the query more oriented toward end-users. For discovering a Complex event (E), we rank a list of events ei ∈ E by using the relationships between them based on collected data from Flickr. The algorithm that is used in this issue is formalized according to [14]. The HITS algorithm is used for webpages ranking with the hyperlinks from these webpages form a directed web graph G = hV, Ei, where V is the set of nodes representing webpages, and E is the set of hyperlinks. The hyperlink topology of the web graph is contained in the asymmetric adjacency matrix L = {lij },

149

where lij = 1 if pagei → pagej and lij = 0 otherwise. And each webpage pi has both a hub score pHub and an authority i score pAut . i In this study, we use the relationship between tags and events same as hubs and authorities in [13]. However, we use an undirected graph G = hV, Ei, where V is the set of nodes representing tags or events, and E is the set of edges. We denote T = {ti } is a set of tags, E = {ej } is a set of events, E ti is a set of events which contain tag ti , T ej is a set of tags that belong to event ej . We use a function f in order to determine whether tag ti is contained in event ej as follows: ( 1, if tag ti occurs in event ej . f (ti , ej ) = (4) 0, otherwise. ej =

m X 1 ti ; |E ti | i=1

ti =

n X j=1

1 ej ; |T ej |

(5)

where |E ti | is the number of events which contain tag ti , |T ej | is the number of tags of event ej .

2015 2nd National Foundation for Science and Technology Development Conference on Information and Computer Science

TABLE I.

The formula to compute the value of nodes as follows: T = AE

and

E = AT T

(6)

where A is an adjacency matrix (m×n) and aij is determined by Equ. 4, T = {t1 , t2 , . . . , tm }T , E = {e1 , e2 , . . . , en }T . We can compute Equ. 6 by recursive computing as follows T

T = AA T

and

T

E = A AE

(7)

For each iteration step, the value of nodes are recomputed and normalized as follows. ei ti ei = P and ti = P (8) m n ej tj j=1

Event labels

DATASET

#Resource photos

#Collected photos

Blossom

139,066

134227

1/1/2004-1/5/2015

Grammy

2,117

2,117

1/1/2004-1/5/2015

Missworld

1,046

1,046

1/1/2004-1/5/2015

Oscars

4,901

4,200

1/1/2004-1/5/2015

Bundesliga

27,168

26934

1/1/2004-1/5/2015

Premierleague

10,270

7,263

1/1/2004-1/5/2015

5,157

5,143

1/1/2004-1/5/2015

348

348

1/1/2004-1/5/2015

FAcup

7,853

7,827

1/1/2004-1/5/2015

Eurocup

2,918

2,918

1/1/2004-1/5/2015

Worldcup

49,688

46,806

1/1/2004-1/5/2015

Seriea Seagames

#Data collecting time

j=1 0.43

From studying results of [14], we have proposed the EventRanking Algorithm (as shown in Figure 6). In which each node is represented by a tag or an event.

0.41 0.39

0.37

procedure E VENT R ANKING(E) Comment: The set of events Initialization(T hreshold) Determine matrix A(m × n) Comment: m number of events; n number of tags T ← {t1 , . . . , tn } denote the vector {1,. . . ,1} E ← {e1 , . . . , em } denote the vector {1,. . . ,1} Iterations ← 0 Cond ← maxi=2,..m−1 (kei − ei−1 k) while Iterations = 0 or Cond ≥ T hreshold do j←1 while j ≤ n do m P 13: tj ← (aij .ei )

1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12:

0.35 0.33 0.31 0.29 0.27 0.25 1

2

3

4

Germany2006

Fig. 7.

5

6

Africa2010

7

8

9

10

Brazil2014

Ranking result with EW orldCup using HITS offline algorithm.

i=1

j ←j+1 end while i←1 while i ≤ m do n P ei = (aij .tj )

14: 15: 16: 17: 18:

[20]. Secondly, we use the term frequency of tags to create a set of popular common tags. Finally, we filter tags by considering the relationships between these tags and events. We assume that a tag is meaningless in case a tag only appears once in a Complex event and these kind of tags are removed also.

j=1

19: 20: 21: 22: 23: 24: 25: 26:

i←i+1 end while Normalize(T ) Normalize(E) Iteration ← Iteration + 1 end while return E . The list of ranked events end procedure

Fig. 6.

High results that we get belong to EW orldCup , EM issW orld ,

0.43 0.41 0.39

0.37

The procedure code of EventRanking algorithm.

0.35

V.

0.33

E XPERIMENTAL R ESULTS

0.31

A. Social event identification result

0.29

By collecting a set of geotagged photos from Flickr, we identified some Complex events and their resources as shown in Table I.

0.27 0.25 1

61

121 181 241 301 361 421 481 541 601 661 721 781 841 901 961 Germany2006

The number of tags are used to represent a Complex event may grow in large scale. So, we need to employ pre-processing step to find the most valuable feature tags. Firstly, all stopwords are removed from tags and tags are conducted stemming

150

Fig. 8.

Africa2010

Brazil2014

Ranking result with EW orldCup using HITS online algorithm.

2015 2nd National Foundation for Science and Technology Development Conference on Information and Computer Science

TABLE II.

T HE RANKING RESULTS OF EW orldCup ( USING HITS OFFLINE ALGORITHM ) 0.3

e1 (Germany2006)

e2 (Africa2010)

e3 (Brazil2014)

1

0.312490018

0.416307299

0.271202683

2

0.315654510

0.406034405

0.278311085

3

0.315424661

0.405311140

0.279264198

4

0.315308777

0.405253552

0.279437671

5

0.315278055

0.405247529

0.279474415

6

0.315270745

0.405246625

0.279482630

7

0.315269053

0.405246450

0.279484497

8

0.315268665

0.405246412

0.279484923

9

0.315268576

0.405246403

0.279485021

10

0.315268555

0.405246401

0.279485043

Iterations

0.28

0.26

0.24

0.22

0.2 1

2 Quarter-finals

and EBlossom . With EW orldCup , we detected the day of event that is the final match, with EM issW orld is the gala day, and EBlossom is weeks that blossom grows.

Fig. 9.

3 Semi-finals

4 Match for third place

5

6 Final

Ranking result with EW orldCup2010 using HITS offline algorithm.

ESeriea is not well detect because this Complex event is organized for an long time and there are no highlight day during the event. 0.29

In case of the ESeaGames , the collecting data is small and they are distributed in many days distinguished. Hence, our system could not detect events exactly.

0.28 0.27

0.26 0.25

B. Social event ranking result The result of EW orldCup with the tag search “Worldcup” (collected from 2004 to 2015) is shown in Table II. By using the event identification method in section IV-A, we got three events: World Cup 2006 (Germany2006), World Cup 2010 (Africa2010), and World Cup 2014 (Brazil2014). After determining events, we try to rank them with both HITS offline and HITS online algorithm: i) With online approach we add 50 pictures for enriching the data at every step. The iteration will finish in case the order of events become stable; ii) The offline method is applied for Complex event that has completed dataset at first iteration. The algorithm is stopped when the ranking value has changed less than the threshold (we choose the value of threshold equal to 10−8 in our study). To test the algorithm again, we conduct the experiment on four particular days of World Cup 2010: Quarter-finals, Semifinals, Match for third place and Final match. The ranking result is ordered as follow: highest ranking is the Final match, next is Match for third place, next is Quarter-finals, and lowest ranking is Semi-finals. This result is quite reasonable and can somehow prove the precision of our approach. C. Result discussion The Complex event that happen in the short time will be easily detected due to the concentration of data. For example, blossom only grows in two, or three weeks of April (as shown in Figure 5). So the data of EBlossom will increase quickly in these days. The number of data getting from Flickr is imbalance due to the kind of Complex event. For example, people join EBlossom for taking the photo. On the opposite, people will focus on

151

0.24 0.23 0.22 0.21 0.2 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 Quarter-finals

Semi-finals

Match for third place

Final

Fig. 10. Ranking result with EW orldCup2010 using HITS online algorithm.

the match and less spend time for taking the pictures with EW orldCup . Events on Flickr are not real time. Some detected events will have the difference about two, or three days compared to the real events due to the habit of Flickr’s users. With Instagram, people often take the pictures by smart phone and then they upload immediately to the SNSs. With Flickr, the pictures are almost taken by the camera. Because of this reason, people need some days for processing and then public to Flickr later (we can see that most of the data have the upload date different from the taken date). Comparing the results of two empirical methods (online, offline) we found that ranking offline method is usually complete earlier than online with the EventRanking algorithm (as shown in Figure 7, Figure 8, Figure 9, Figure 10). This reason are explained by the value of each node (tag or event) in offline method do not increase in each iteration but the number of tags in online method are increased at each steps, it depends on the collected data and also the given threshold.

2015 2nd National Foundation for Science and Technology Development Conference on Information and Computer Science

VI.

C ONCLUSION AND F UTURE W ORKS

[10]

Ranking is one of the best ways to find many things out within valuable. We propose an algorithm for not only ranking events but also identifying them in a novel framework using media data from social network. We detect events from the photos textual annotations as well as include visual features and effectively identify events by considering the spreading effect of events in the spatio-temporal space. We discover the relationships among events (e.g., event spatial, temporal and content) for enhancing the precision of detecting events. Using the similarity measurement between tags for computing the value of events, our approach calculates the term frequency of tags that occur in each event to modify the value of tags for ranking. We implement and show the experimental results with a set of events from the geotagged resources which are collected from Flickr only for Regular event. Therefore, Regional event and Real-time event will be considered in our next study. Also, we will use the data from different SNSs using matching algorithm [21] for enriching the data. We believe that with the feature properties of each SNSs, we can rank events completely. Other issue, we will build a recommendation system based on the ranking results using user’s searching keywords.

[11]

[12]

[13]

[14] [15]

[16]

[17]

[18]

[19]

ACKNOWLEDGMENT [20]

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (NRF-2014R1A2A2A05007154). Also, this work was supported by the BK21+ program of the National Research Foundation (NRF) of Korea. R EFERENCES [1] [2] [3]

[4]

[5]

[6]

[7]

[8]

[9]

H. Becker, M. Naaman, and L. Gravano, “Event identification in social media.” in WebDB, 2009. X. Li, H. Cai, Z. Huang, Y. Yang, and X. Zhou, “Social event identification and ranking on flickr,” World Wide Web, pp. 1–27, 2014. L. Chen and A. Roy, “Event detection from flickr data through waveletbased spatial analysis,” in Proceedings of the 18th ACM conference on Information and knowledge management. ACM, 2009, pp. 523–532. Clements and et al., “Using flickr geotags to predict user travel behaviour,” in Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2010, pp. 851–852. J. J. Jung, “Discovering community of lingual practice for matching multilingual tags from folksonomies,” The Computer Journal, vol. 55, no. 3, pp. 337–346, 2012. P. Morrison, “Tagging and searching: Search retrieval effectiveness of folksonomies on the world wide web,” Information Processing and Management, vol. 44, no. 4, pp. 1562–1579, 2008. J. J. Jung, “Cross-lingual query expansion in multilingual folksonomies: A case study on flickr,” Knowledge-Based Systems, vol. 42, no. 0, pp. 60–67, 2013. Diplaris and et al., “Emerging, collective intelligence for personal, organisational and social use,” in Next generation data technologies for collective computational intelligence. Springer, 2011, pp. 527–573. T. T. Nguyen, D. Hwang, and J. J. Jung, “Social tagging analytics for processing unlabeled resources: A case study on non-geotagged photos,” in Intelligent Distributed Computing VIII. Springer, 2015, pp. 357–367.

152

[21]

F. M. Bel´em, E. F. Martins, J. M. Almeida, and M. A. Gonc¸alves, “Personalized and object-centered tag recommendation methods for web 2.0 applications,” Information Processing and Management, vol. 50, no. 4, pp. 524–553, 2014. X. H. Pham, T. T. Nguyen, J. J. Jung, and N. T. Nguyen, “-spear: A new method for expert based recommendation systems,” Cybernetics and Systems, vol. 45, no. 2, pp. 165–179, 2014. I. Lee, G. Cai, and K. Lee, “Exploration of geo-tagged photos through data mining approaches,” Expert Systems with Applications, vol. 41, no. 2, pp. 397–405, 2014. C. Ding, X. He, P. Husbands, H. Zha, and H. D. Simon, “Pagerank, hits and a unified framework for link analysis,” in Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 2002, pp. 353–354. J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” Journal of the ACM (JACM), vol. 46, no. 5, pp. 604–632, 1999. Y. Yang, T. Pierce, and J. Carbonell, “A study of retrospective and online event detection,” in Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval. ACM, 1998, pp. 28–36. Z. Zhong and Z. Liu, “Ranking events based on event relation graph for a single document,” Information Technology Journal, vol. 9, no. 1, pp. 174–178, 2010. T. T. Nguyen and J. J. Jung, “Exploiting geotagged resources to spatial ranking by extending hits algorithm,” Computer Science and Information Systems, no. 00, pp. 91–91, 2014. H. Becker, M. Naaman, and L. Gravano, “Learning similarity metrics for event identification in social media,” in Proceedings of the third ACM international conference on Web search and data mining. ACM, 2010, pp. 291–300. S. W. Smith, The Scientist and Engineer’s Guide to Digital Signal Processing. San Diego, CA, USA: California Technical Publishing, 1997. H. L. Nguyen and et al., “Kelabteam: A statistical approach on figurative language sentiment analysis in twitter,” in Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015), 2015, pp. 679–683. H. L. Nguyen and J. J. Jung, “Privacy-aware framework for matching online social identities in multiple social networking services,” Cybernetics and Systems, vol. 46, no. 1-2, pp. 69–83, 2015.