Journal of Pattern Recognition Research 1 (2013) 59-65 Received July 3, 2013. Revised August 3, 2013. Accepted August 8, 2013.

Some Features of the Users’ Activities in the Mobile Telephone Network

Thomas Couronné

Orange Labs - France Telecom R&D, 38 40 Av. du General Leclerc, Issy les Moulineaux, Cedex Paris, 92794, France

Valery Kirzhner

Institute of Evolution, University of Haifa Haifa, 31905, Israel

Katerina Korenblat

Zeev Volkovich

Department of Software Engineering, ORT Braude College, P.O.B. 78 Karmiel 21982, Israel

www.jprr.org

Abstract

Daily activity of the users of a mobile-phone network is represented as a sequence of input and output calls and of input and output text messages. Each such sequence corresponds to its spectrum, the distribution of short two-letter sequences of the same type. It is shown that the spectra of any user’s sequences are stable, i.e. reproduced daily. Based on this, the notion of a user’s strategy is introduced. The number of different strategies appears to be limited, in the sense that the number of user groups with the same strategy is sufficiently small.

Keywords: Mobile phone data, Human behavior learning, Sparsity

1. Introduction

With the development of information technologies and ubiquitous communication, records of human activities are becoming more and more available. This allows studying human behavior in a wide range of fields such as online consumption, mobility, social network activity, and the use of IPTV services. However, the yearly exponential growth of datasets makes it challenging to extract relevant patterns. Thus, new methods should be developed in order to reveal the “hidden” laws or mechanisms that underlie these huge amounts of data. Mobile phone logs have been used intensively not only for better understanding human communication patterns, but also for explaining human mobility [1, 2, 3, 4, 5], the dynamics of viral phenomena [6, 7], social networks [8], and opinion diffusion [9]. All of these can be considered time-dependent complex systems, whose understanding requires identifying their major patterns.

The basic types of activity of mobile-phone network users are input and output calls and input and output text messages. We code them by the symbols 1, 2, 3, and 4, respectively, and treat each basic activity type as a symbol (“letter”) of a 4-letter alphabet (1, 2, 3, 4). A user’s daily activity thus forms a sequence of these symbols, which we refer to as the user’s activity sequence (UAS). The goal of this work is to compare such sequences generated by the same user on different days or by different users on the same day.

© 2013 JPRR. All rights reserved. Permissions to make digital or hard copies of all or part of this work for personal or classroom use may be granted by JPRR provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or to republish, requires a fee and/or special permission from JPRR.

Similar symbolic, but not linguistic, sequences arise in the analysis of various systems. In particular, the comparison of genomic sequences is one of the fastest growing fields. There are two main groups of methods for such comparison: sequence alignment and linguistic methods. Alignment methods seek a substitution of letters and/or a placement of blank positions (gaps) in one sequence that makes it most similar to the other sequence. This approach is based on the view of genomic evolution as successive substitutions, insertions, or deletions of units (symbols) in the genetic material. Consequently, this method is relevant only for closely related genomes. Alternatively, linguistic methods are based on the distributions (compositional spectra, CS) of the frequencies of the same-length “words” that the sequences comprise. The CS method allows estimating sequence similarity without referring to any evolution model. Moreover, if the transition to the next symbol in a sequence is governed by a Markov process, then two realizations of this process are, generally speaking, completely different when compared letter by letter. However, they have similar distributions of word frequencies (at least if the sequences are long enough). This feature fully corresponds to a UAS model that presets the probabilities of transitions between activities of different types, so that the sequence of symbols itself is a realization of the corresponding random process. In this case, the similarity of activities can be assessed on the basis of spectra similarity even though the UASs may be quite different symbol-wise.

In this paper, we analyze the activity of mobile-phone users employing the linguistic approach to sequence comparison developed earlier for comparing genomes (see, e.g., [10]). This method can be used for comparing sequences written over any finite alphabet. It should be pointed out, however, that, like genomic sequences, our UASs are also “written” over a 4-letter alphabet. This allows a detailed comparison of the results obtained for the activity of telephone-network users with those for genetic sequences.
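The remark about Markov-type sequences can be illustrated with a small simulation. The sketch below is not part of the original study: the transition probabilities in `TRANSITIONS`, the sequence length, and the random seeds are hypothetical values chosen only for illustration. It generates two realizations of the same process and compares them letter by letter and through their 2-letter-word frequencies.

```python
import random
from itertools import product
from math import sqrt

ALPHABET = (1, 2, 3, 4)

# Hypothetical transition probabilities (row = current activity, columns = next activity).
TRANSITIONS = {
    1: [0.6, 0.2, 0.1, 0.1],
    2: [0.3, 0.4, 0.2, 0.1],
    3: [0.1, 0.1, 0.5, 0.3],
    4: [0.1, 0.1, 0.4, 0.4],
}

def simulate(length, rng):
    """Generate one realization of the first-order Markov chain defined by TRANSITIONS."""
    seq = [rng.choice(ALPHABET)]
    while len(seq) < length:
        seq.append(rng.choices(ALPHABET, weights=TRANSITIONS[seq[-1]])[0])
    return seq

def word_frequencies(seq):
    """Relative frequencies of the 16 overlapping 2-letter words, in a fixed order."""
    words = list(product(ALPHABET, repeat=2))
    counts = dict.fromkeys(words, 0)
    for pair in zip(seq, seq[1:]):
        counts[pair] += 1
    return [counts[w] / (len(seq) - 1) for w in words]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

first = simulate(5000, random.Random(1))
second = simulate(5000, random.Random(2))

# Share of positions at which the two realizations carry the same letter.
agreement = sum(a == b for a, b in zip(first, second)) / len(first)
# Correlation between the two 2-letter-word frequency distributions.
similarity = pearson(word_frequencies(first), word_frequencies(second))
print(f"letter-wise agreement: {agreement:.2f}, spectrum correlation: {similarity:.3f}")
```

Under these assumptions the two sequences should agree at only a minority of positions, while their word-frequency distributions should be strongly correlated.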

2. Linguistic Analysis of Users Activity

The initial dataset contains mobile-phone call data records (CDR) of a few million users in a European country, collected over four consecutive days. The analysis was performed for those users whose UAS has a length of at least 100. The number of such UASs was about 25,000, and it varied slightly from day to day because not all users appeared in the records each day. The Pajek package was used for clustering (Kamada-Kawai Free option).

The first attempts at linguistic analysis of genomic sequences were based on 2-letter words [11]. This was largely due to the relatively small length of the genomic fragments available at that time. The sequences of user activities considered in the present research are also rather short. Therefore, we also use 2-letter words in our analysis.

2.1 Algorithm of Analysis

Assume that there exists a certain sequence S over a 4-letter alphabet (1, 2, 3, 4). Let p11 and p12 denote the numbers of occurrences of the words (1, 1) and (1, 2) in sequence S, respectively. We use similar notation for all other words. There are 16 different 2-letter words (i, j) over a 4-letter alphabet: i, j = 1, 2, 3, 4. The set {p11, p12, ..., p44} is referred to as the spectrum of the sequence S. The spectra are normalized in two different ways, reflecting two different models of UAS generation.

Model 1. The spectrum is normalized in such a way that pi1 + pi2 + pi3 + pi4 = 1 for any i = 1, 2, 3, 4. This means that the normalized values of pij are empirical probabilities of transitions between neighboring symbols in the sequence S, which corresponds to the Markov model of the UAS generation process.

Model 2. Alternatively, occurrences can be normalized by the numbers of symbols in the whole sequence [12]. For example, if w(1) and w(2) are the quantities of the symbols 1 and 2 in the sequence S, respectively, then the standardized frequency of the word (1, 2) equals $p_{12} / \sqrt{w(1)\,w(2)}$. Such a standardization eliminates the dependence of a word frequency on the quantities of the corresponding symbols in the sequence.
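The two normalizations can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code. The function names are hypothetical, words are assumed to be counted as overlapping pairs of neighboring symbols, and the Model 2 denominator is taken to be the square root of w(i)w(j), as read from the formula above. Rows or symbols that never occur are set to zero, which is a further assumption.

```python
from collections import Counter
from math import sqrt

ALPHABET = (1, 2, 3, 4)

def word_counts(seq):
    """Count overlapping 2-letter words (i, j) in the sequence."""
    return Counter(zip(seq, seq[1:]))

def spectrum_model1(seq):
    """Model 1: normalize each row i so that p_i1 + p_i2 + p_i3 + p_i4 = 1."""
    counts = word_counts(seq)
    spectrum = {}
    for i in ALPHABET:
        row_total = sum(counts[(i, j)] for j in ALPHABET)
        for j in ALPHABET:
            # Assumption: a row with no observations is left at zero.
            spectrum[(i, j)] = counts[(i, j)] / row_total if row_total else 0.0
    return spectrum

def spectrum_model2(seq):
    """Model 2: divide each count p_ij by sqrt(w(i) * w(j))."""
    counts = word_counts(seq)
    w = Counter(seq)  # w[i] is the number of occurrences of symbol i
    return {
        (i, j): counts[(i, j)] / sqrt(w[i] * w[j]) if w[i] and w[j] else 0.0
        for i in ALPHABET
        for j in ALPHABET
    }

# A short made-up UAS: mostly text messages with one incoming call.
uas = [3, 4, 3, 4, 4, 3, 1, 3, 4]
model1 = spectrum_model1(uas)
model2 = spectrum_model2(uas)
print(model1[(3, 4)], model2[(3, 4)], model2[(1, 3)])
```

In this toy sequence the word (3, 4) occurs three times among the four words that start with 3, so its Model 1 value is 0.75. Symbols 3 and 4 occur four times each, so its Model 2 value is 3/4 = 0.75 as well.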

Fig. 1: Distribution of the distances between UASs calculated using Model 1: (a) the same user’s UASs for days 1-2, 1-3, and 1-4. X-axis – correlation distances (see text), Y-axis – the number of users; (b) UASs for all possible pairs of different users. X-axis – correlation distances (see text), Y-axis – the number of pairs, scale 1:1000.

Fig. 2: The same as in Fig. 1, but the calculations were performed using Model 2.

2.2 UAS and Strategies

In this section, we assume that each user employs some unique set of rules while generating his UAS. This set is referred to as the user’s strategy, and we examine the users’ behavior from this point of view. We calculated the distribution of the correlation distances between each user’s UAS recorded on the first day and the UASs of the same user recorded on the next three days. For comparison, the distribution of the distances between all possible pairs of different users was also calculated. The distance between spectra is defined as d = 0.5(1 - C), where C is the Pearson correlation coefficient. Therefore, d reaches its maximum (equal to 1) for perfectly inversely correlated spectra and equals zero for perfectly correlated spectra. The distributions calculated using Models 1 and 2 of UAS generation are shown in Figs. 1 and 2.

From the data presented in Figs. 1 and 2, it is clear that each user can be characterized by his specific behavior, which is manifested in the high degree of correlation between his UASs and in the absence of pairs with large distances between UASs recorded on different days. There is also a significant number of close UASs for different users but, more importantly, there are also pairs of different users whose UASs are far apart. Short distances observed in some pairs may result from similar behavior of the users, and this is a nontrivial fact, as demonstrated by the existence of highly dissimilar behaviors. These results support the above assumption of the existence of a user’s strategy, which determines the user’s daily activities with close UAS spectra. The fact that the UASs generated on different days virtually merge shows that these strategies are stable.

2.3 Cluster Analysis

Let us consider the question of how user-specific the strategies are. In other words, are there users with similar strategies, and how many different strategies exist? We answer these questions using cluster analysis.

The matrix of distances between the users’ UASs recorded during 24 hours is calculated using the standard Euclidean distance. Although non-normalized pij values depend on the length of the sequence S, the spectra of both types (Models 1 and 2) are actually normalized by length, so the use of the Euclidean distance is justified. In the clustering process, only relatively short distances are considered, namely, those that do not exceed one quarter of the average of all pairwise distances. The partition of the users obtained on the basis of the Euclidean distances between the spectra in the framework of Model 1 is shown in Fig. 3.
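The two distances used in Sections 2.2 and 2.3 can be sketched as follows. This is an illustrative sketch rather than the procedure actually run in Pajek. The function names are hypothetical, the toy spectra have two components instead of 16, and grouping users by connected components of the short-distance graph is an assumption standing in for the partition read off the graph layout.

```python
from itertools import combinations
from math import dist, sqrt

def correlation_distance(x, y):
    """d = 0.5 * (1 - C), where C is the Pearson correlation of two spectra."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    norm = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    if norm == 0:
        raise ValueError("correlation is undefined for a constant spectrum")
    return 0.5 * (1.0 - cov / norm)

def short_edges(spectra, fraction=0.25):
    """Pairs whose Euclidean distance is at most `fraction` of the mean pairwise distance."""
    distances = {
        (a, b): dist(spectra[a], spectra[b])
        for a, b in combinations(range(len(spectra)), 2)
    }
    threshold = fraction * sum(distances.values()) / len(distances)
    return [pair for pair, d in distances.items() if d <= threshold]

def groups(n_users, edges):
    """Connected components of the short-distance graph, found with union-find."""
    parent = list(range(n_users))

    def find(node):
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path compression
            node = parent[node]
        return node

    for a, b in edges:
        parent[find(a)] = find(b)
    components = {}
    for node in range(n_users):
        components.setdefault(find(node), []).append(node)
    return sorted(components.values())

# Toy spectra: users 0 and 1 behave alike, as do users 2 and 3.
toy = [(1.0, 0.0), (0.99, 0.01), (0.0, 1.0), (0.01, 0.99)]
partition = groups(len(toy), short_edges(toy))
print(partition)
print(correlation_distance(toy[0], toy[1]), correlation_distance(toy[0], toy[2]))
```

With the one-quarter threshold only the two close pairs are connected in this toy example, so the four users fall into two groups.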

Fig. 3: Cluster partition of the users based on the similarity of their UASs (Model 1). Cluster block A corresponds to the users having a relatively simple strategy. The clusters of block B correspond to a more diversified strategy.

Fig. 4: Clustering of the users with regard to their strategies based on Model 2 of spectrum generation.

In Fig. 3A, one can observe four clusters corresponding to almost pure strategies. Cluster (A): the users’ strategy is almost always Activity 1 (occurrence of word (1,1): 0.999); cluster (B): the users’ strategy is almost always Activity 2 (occurrence of word (2,2): 0.999); cluster (C): mixed activity, words (1,2) and (2,2) are the most common; cluster (D): the strategy is characterized by p11 = 0.35; p12 = 0.35; p22 = 0.64; p41 = 0.36; p42 = 0.64. Smaller clusters are shown in Fig. 3B. As an example, the strategies that make up the cluster indicated by the arrow are specified in Table 1.

Table 1: The number of all possible words in the UASs of a few users belonging to the cluster indicated by the arrow in Fig. 3B.

Words   (1,1) (1,2) (1,3) (1,4) (2,1) (2,2) (2,3) (2,4) (3,1) (3,2) (3,3) (3,4) (4,1) (4,2) (4,3) (4,4)
User 1      1     2     0     1     0     0     1     1     2     1    13    61     0     0    63    33
User 2      1     3     1     0     1     2     3     3     2     0     5    32     1     3    31    24
User 3      0     0     2     0     0     0     2     0     9     1    15    55     2     1    52    41
User 4      0     0     1     0     0     1     1     3     0     0    20    71     1     4    68    60

From the data presented in Table 1, it can be readily seen that the characteristic feature of the strategy in this cluster is the dominating exchange of text messages. It should be noted that the patterns of receiving two consecutive text messages (word (3,3)) and of sending two consecutive text messages (word (4,4)) are quantitatively quite close. Thus, our results demonstrate that the number of different strategies is relatively small. The results of the clustering procedure for the same users as above, based on Model 2, are shown in Fig. 4.
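The dominance of text messages in Table 1 can be checked with a short calculation. The sketch below assumes that the word columns of Table 1 are ordered (1,1), (1,2), ..., (4,4) and that each column lists the counts for Users 1 to 4 in that order. The helper name `text_only_share` is hypothetical. It computes, for each user, the share of 2-letter words consisting only of text-message symbols (3 and 4).

```python
# Word order assumed for the columns of Table 1: (1,1), (1,2), ..., (4,4).
WORDS = [(i, j) for i in (1, 2, 3, 4) for j in (1, 2, 3, 4)]

# Counts transcribed from Table 1 under the column-order assumption above.
TABLE_1 = {
    "User 1": [1, 2, 0, 1, 0, 0, 1, 1, 2, 1, 13, 61, 0, 0, 63, 33],
    "User 2": [1, 3, 1, 0, 1, 2, 3, 3, 2, 0, 5, 32, 1, 3, 31, 24],
    "User 3": [0, 0, 2, 0, 0, 0, 2, 0, 9, 1, 15, 55, 2, 1, 52, 41],
    "User 4": [0, 0, 1, 0, 0, 1, 1, 3, 0, 0, 20, 71, 1, 4, 68, 60],
}

def text_only_share(counts):
    """Fraction of words (i, j) in which both symbols are text messages (3 or 4)."""
    text_words = sum(
        count for (i, j), count in zip(WORDS, counts) if i >= 3 and j >= 3
    )
    return text_words / sum(counts)

shares = {user: text_only_share(counts) for user, counts in TABLE_1.items()}
for user, share in shares.items():
    print(f"{user}: {sum(TABLE_1[user])} words, text-only share {share:.2f}")
```

Under this reading of the table, more than 80% of each user's words are made up of text-message symbols only.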

3. Conclusion and Future Work

It has been demonstrated in this work that each user possesses his own stable strategy, which is reproduced daily. This result is in accord with previously obtained data on human behavior in other communication networks [13, 14, 15, 16, 17]. Using cluster analysis, we have also found that the number of different strategies is limited (compare with [18]). The number of clusters depends on the partitioning parameters and thus may vary. However, the fact that the number of strategies is limited does not depend on a particular clustering implementation. This result, as well as the very existence of strategies, is not trivial. It could be supposed, a priori, that the same user’s UASs recorded on different days, or different users’ UASs recorded on the same day, would be quite dissimilar.

The method of linguistic analysis of the genome applied in this work proved to be quite satisfactory, but it still has to be refined for the area of mobile-phone communications. It should be emphasized that two UASs can appear quite different when compared “letter-by-letter” but, according to our definition, belong to the same strategy if their CS are very close. In this regard, it is important that our analysis proved to be most effective when the spectra were based on transition probabilities (which is quite uncommon in the analysis of genomes). For this case, the next step of research can be to study the peculiarities of these probabilities. For example, if each time the user chooses one of the four possible types of activity uniformly at random, then all of the user’s transition probabilities will be equal to 0.25. If the values of the transition probabilities differ from this, the user’s preferences should be described by more complicated models.
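The uniform baseline mentioned above can be sketched as a simple check. This is a hypothetical illustration, not a procedure from the paper. It estimates the transition probabilities of a simulated user who picks each activity uniformly at random and measures how far the estimates lie from 0.25. The sequence length, the seed, and the choice to ignore rows without observations are all assumptions.

```python
import random
from collections import Counter

ALPHABET = (1, 2, 3, 4)

def transition_probabilities(seq):
    """Empirical transition probabilities, computed only for symbols that start a word."""
    counts = Counter(zip(seq, seq[1:]))
    probs = {}
    for i in ALPHABET:
        row_total = sum(counts[(i, j)] for j in ALPHABET)
        if row_total == 0:
            continue  # assumption: rows without observations are ignored
        for j in ALPHABET:
            probs[(i, j)] = counts[(i, j)] / row_total
    return probs

def max_deviation_from_uniform(probs):
    """Largest absolute difference between an estimated probability and 0.25."""
    return max(abs(p - 0.25) for p in probs.values())

# A simulated user who chooses each activity uniformly at random.
rng = random.Random(0)
uniform_user = [rng.choice(ALPHABET) for _ in range(20000)]
uniform_deviation = max_deviation_from_uniform(transition_probabilities(uniform_user))

# A made-up user who strictly alternates incoming and outgoing text messages.
texting_user = [3, 4] * 100
texting_deviation = max_deviation_from_uniform(transition_probabilities(texting_user))

print(f"uniform user: {uniform_deviation:.3f}, texting user: {texting_deviation:.2f}")
```

For the simulated uniform user the deviation should stay small, whereas the strictly alternating user has a deviation of 0.75, since the transitions from 3 to 4 and from 4 to 3 have probability 1.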

References

[1] A. Sevtsuk, C. Ratti. Does urban mobility have a daily routine? Learning from the aggregate data of mobile networks. Journal of Urban Technology 17, 41–60, 2010.
[2] T. Couronné, A.M. Olteanu-Raimond, Z. Smoreda. Looking at spatiotemporal city dynamics through mobile phone lenses. In: Proceedings of the IEEE International Conference “Network of the Future” (NOF), Paris, pp. 128–134, 2011.
[3] Z. Smoreda, A.M. Olteanu-Raimond, T. Couronné. Spatiotemporal data from mobile phones for personal mobility assessment. In: Transport Survey Methods: Best Practice for Decision Making, pp. 745–767, 2013.
[4] O. Järv, R. Ahas, E. Saluveer, B. Derudder, F. Witlox. Mobile phones in a Traffic Flow: A Geographical Perspective to Evening Rush Hour Traffic Analysis Using Call Detail Records. PLoS ONE, 7, e49171, 2012.
[5] M.C. González, C.A. Hidalgo, A.L. Barabási. Understanding individual human mobility patterns. Nature, 453, 779–782, 2008.
[6] L. Hossain, K. Chung, S. Murshed. Exploring temporal communication through social networks. In: Human-Computer Interaction–INTERACT, pp. 19–30, 2007.
[7] Y.Y. Liu, J.J. Slotine, A.L. Barabási. Controllability of complex networks. Nature, 473, 167–173, 2012.
[8] C. Ratti, S. Sobolevsky, F. Calabrese, C. Andris, J. Reades, M. Martino, R. Claxton, S.H. Strogatz. Redrawing the Map of Great Britain from a Network of Human Interactions. PLoS ONE, 5, e14248, 2010.
[9] D. Cardon, G. Fouetillou, C. Roth. Two paths of glory-structural positions and trajectories of websites within their topical territory. In: Intern. Conf. on Weblogs and Social Media (ICWSM), 2011.
[10] A. Bolshoy, Z. Volkovich, V. Kirzhner, Z. Barzily. Genome Clustering: From Linguistic Models to Classification of Genetic Texts (Studies in Computational Intelligence). Springer-Verlag New York Inc, 2010.
[11] V. Brendel, J.S. Beckmann, E.N. Trifonov. Linguistics of nucleotide sequences: morphology and comparison of vocabularies. J. Biomolec. Str. and Dynamics 4, 11–21, 1986.
[12] S. Karlin, C. Burge. Dinucleotide relative abundance extremes: a genomic signature. Trends Genet., 11, 283–290, 1995.
[13] C. Krumme, A. Llorente, M. Cebrian, A.S. Pentland, E. Moro. The predictability of consumer visitation patterns. Scientific Reports, 3, article number: 1645, 2013.
[14] G. Miritello, R. Lara, E. Moro. Time allocation in social networks: correlation between social structure and human communication dynamics. arXiv:1305.3865v1, May 16, 2013.
[15] G. Miritello, R. Lara, E. Moro. Time allocation in social networks: correlation between social structure and human communication dynamics. In (Holme, P., Saramäki, J. editors): Temporal Networks. Springer, 2013.
[16] J. P. Eckmann, E. Moses, D. Sergi. Entropy of dialogues creates coherent structures in e-mail traffic. Proc. Natl. Acad. Sci. USA 40, 14333, 2004.
[17] K.I. Goh, A.L. Barabási. Burstiness and memory in complex systems. EPL (Europhysics Letters) 81, 48002, 2008.
[18] T. Wu, C. Zhou, J. Xiao, J. Kurths, H. Schellnhuber. Evidence for a bimodal distribution in human communication. Proc. Natl. Acad. Sci. USA 107, 18803, 2010.
