Tourist behavior analysis through geotagged photographies: a method to identify the country of origin

Jérôme Da Rugna∗†, Gaël Chareyron∗† and Bérengère Branchet∗

∗ Computer Science Department, ESILV, Pôle Universitaire Léonard de Vinci, Paris La Défense, France
† International Research Team on Tourism, University of Paris 1 Panthéon-Sorbonne, Paris, France
[email protected], jerome.da [email protected], [email protected]

Abstract—Much information can be extracted from geotagged photographs posted on online image databases such as Flickr or Panoramio. Recent works have demonstrated that suitable processing of this data can provide a good estimate of tourist behavior. Tourism has been, for several years, an important factor in regional economies, and institutions have a significant need to understand and analyze tourist behavior. Many studies have been launched for this purpose. Many tourism specialists need to separate tourists according to their place of residence. In the context of two projects supported by territorial authorities, this paper introduces a new paradigm to estimate a photographer’s country of residence. Each user is described by his photographic timeline. This timeline allows us to compute intermediate properties: travel time at a destination, number of trips, number of visited countries, and so on. This generation of symbolic data is essential: it condenses the richness of the timeline into a form suited to the recognition task. Classification algorithms are then introduced, some designed with experts in tourism science, others using data clustering and supervised learning techniques. We compare these methods on two distinct questions: first, we classify photographers into two categories (French/non-French, for example); second, we find the country of residence of each user. The results demonstrate that both learning algorithms and expert-defined rules identify users’ residence efficiently. We are thus able to meet the request of tourism experts and further refine the analysis of tourist behavior.

I. INTRODUCTION

For some years, governments and relevant institutions in particular have recognized that tourism is a key factor in regional economies and, in the years to come, a major challenge. To support all stakeholders in the field, it is necessary to understand and analyze behavior at a tourist destination: a country, a city, a park, and so on. To address this need, many studies have been launched, especially using the Internet as the source of information. Moreover, several image hosting websites allow users to upload their photos and to locate each of them on a map. Images and photographs have always conveyed a touristic and sociological view [1], [6], [7], [13], [17], [3]. A recent work [2] has demonstrated that these websites are representative of tourist practices and constitute a proxy for analyzing tourism flows. Using this information, it is possible to follow a tourist along his journey in a city, in a country or across several countries.

However, one of the first expectations of tourism specialists is to separate tourists according to their place of residence, so the country of residence of each user is required. Using anonymous and public websites, we can easily extract photographs, but profiling the photographers is not trivial, and the issue of legality has to be taken into account. On some websites, the user may define his location in his profile. On Flickr, about 40% of users do so and, for most of them, the information is not standardized and does not indicate the country but a location belonging to a community (city, state, country of birth, ...). One of the first ideas is to use the semantic tags of the photos to determine the country of origin of the photographer. Methods have been proposed by [8], [10] to analyze the tags associated with a city. They show that the language of the tags and comments helps to collect information about the country of residence. Unfortunately, in our case, the granularity of this analysis is not sufficient. For example, many comments are in English even if the users are not native speakers, because they want to be understood and read by their followers. This paper therefore introduces several methods to extract the country of residence using the geolocated photographs of each user. This work takes place in the context of two territorial projects. Let us first introduce the data acquired from the websites and describe the information collected about 4 million photographers.

II. A NEW TOURISM ANALYSIS TOOL

The increasing use of smartphones, digital cameras, Global Positioning Systems (GPS) and web-based services in our personal and professional activities is changing the way we communicate and interact with each other, as well as how we perceive our environment. Now, when we add photos to the web, we often include geographical information.

A. The data

In this paper, we consider two photo-sharing websites: Flickr and Panoramio.
These image hosting websites read image metadata for localization. People using these websites also have the possibility to add geographical attributes to their photographs. Metadata embedded by the camera into the image are processed as well. This information completes the geographical information. All this metadata is saved using the Exchangeable Image File Format (EXIF). Table I shows the data available. Using the Application Programming Interface (API) provided by Flickr and Panoramio, all data from January 2005 to December 2011 is considered. Table II gives an overview of the data.

TABLE I. EXIF DATA RECORDED IN EACH IMAGE.

latitude               longitude        focal length
taken date             shutter speed    aperture value
camera serial number   camera model     camera make

Fig. 1. Spatial presence of photographers in USA. The strong opacity of blue indicates many photographers in this place.

TABLE II. DATA OVERVIEW.

                          Panoramio             Flickr
Photographs               47 M                  135 M
Photographers             2.5 M                 1.4 M
Advantages & drawbacks    Highlight site        Heterogeneous site
                          Few photos by user    Pictures of people
                          No other data         Place of residence

B. Privacy

When using data provided by websites like Flickr or Panoramio, we must protect user privacy. To protect the identity of each user, we replace the user ID and the photo ID by an incremental number. It is thus not possible to trace a user or a photo back from a timeline. This allows us to analyze the data by country of origin while preserving the users’ anonymity.

C. Space scale and time scale

Depending on the working area, different scales may be used. For example, in Figure 1, a grid with a cell size of 1° by 1° is used, while in Figure 2 the cell size is 0.01° by 0.01°. The value represented in each cell is the number of photographers. The most visited places can be recognized in these figures (for example the Golden Gate or the Alcatraz bay cruise in Figure 2(a)). Note that this space scale is made possible by the huge amount of data: even for a small zone, there is enough data to extract relevant information.

Another property of these data is the time scale. The temporal information provided by the EXIF data (the date the photograph was taken) allows us to analyze an area by year, by season, by month, by time of day, and so on. Figure 3 shows some results of a monthly distribution analysis. As expected, the Roland Garros stadium (Figure 3(b)) is visited only during the French Open (end of May, beginning of June). The seasons of the ski resort are clearly identified (Figure 3(c)), including the small summer peak. In contrast, the city of Nara has a more uniform distribution (Figure 3(a)).
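The spatial aggregation behind Figures 1 and 2 can be illustrated with a minimal sketch. It assumes, hypothetically, that each photo is available as a (user ID, latitude, longitude) tuple; the function name and the data layout are illustrative and are not taken from the original implementation.

```python
import math
from collections import defaultdict


def photographers_per_cell(photos, cell_size_deg):
    """Count distinct photographers in each cell of a regular lat/lon grid.

    photos: iterable of (user_id, latitude, longitude) tuples (assumed layout).
    cell_size_deg: cell edge length in degrees, e.g. 1.0 or 0.01.
    Returns a dict mapping (row, col) cell indices to a photographer count.
    """
    users_in_cell = defaultdict(set)
    for user_id, lat, lon in photos:
        # Integer cell indices: floor handles negative coordinates correctly.
        cell = (math.floor(lat / cell_size_deg), math.floor(lon / cell_size_deg))
        users_in_cell[cell].add(user_id)
    return {cell: len(users) for cell, users in users_in_cell.items()}


# Made-up sample: three photos near the Golden Gate, one in Nara.
photos = [
    (1, 37.8199, -122.4783),
    (1, 37.8201, -122.4790),  # same user, counted once per cell
    (2, 37.8190, -122.4785),
    (3, 34.6851, 135.8048),
]
coarse = photographers_per_cell(photos, 1.0)
```

Counting distinct users rather than photos keeps a single prolific photographer from dominating a cell, which matches the quantity described in the text (number of photographers per cell).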


Fig. 2. Spatial presence of photographers in San Francisco (a) and Nara (b). The heat scale is used to represent the number of photographers per place.

D. Which country do tourists come from?

Internet photograph websites can be a good proxy for tourism practices. Using this public information, it is possible to help tourism experts describe where tourists are and when they arrive. However, one of the primary expectations is to separate tourists according to their country of residence, and more especially to distinguish local from foreign tourists. These two categories do not have the same practices, so the two sets should be analyzed separately rather than together. The country of residence is therefore required for most photographers. On Flickr, about 40% of users define a location and, for most of them, the information is not standardized and does not indicate the country but a location belonging to a community. Using GeoNames [5], a geographical database of countries, cities and place names, together with a word matching process, we are able to recognize the country of about 30% of users. Table III illustrates the results of this simple matching on the database. For specific destinations, such as the French region Île-de-France or the country Japan, the ratio of known countries may be greater than 30%. For others, it may be smaller and seriously reduce the analytical capacity.

TABLE III. USER COUNTRY OF RESIDENCE DIRECTLY EXTRACTED ONLY FROM PUBLIC DATA. THE TOTAL NUMBER OF PHOTOGRAPHERS IS SHOWN FOR THE FRENCH REGION ÎLE-DE-FRANCE, THE COUNTRY JAPAN AND THE WORLD. THE RATIO “COUNTRY KNOWN” INDICATES THE RATIO OF PHOTOGRAPHERS WHOSE COUNTRY OF RESIDENCE IS KNOWN.

                                       Île-de-F.    Japan    World
Panoramio   Number of photographers    33375        28839    2.5 M
            Country known              0%           0%       0%
Flickr      Number of photographers    64770        41445    1.4 M
            Country known              44%          37%      30%
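The word matching of free-text profile locations against a gazetteer can be sketched as follows. This is only an illustration under stated assumptions: the small `GAZETTEER` dictionary is a hypothetical stand-in for the GeoNames database, and the longest-match rule is our own choice, not necessarily the matching process used for Table III.

```python
import re

# Hypothetical miniature gazetteer (place name -> country code).
# The real GeoNames database is far larger and handles ambiguous names.
GAZETTEER = {
    "france": "FR", "paris": "FR", "lyon": "FR",
    "japan": "JP", "tokyo": "JP", "nara": "JP",
    "united states": "US", "usa": "US", "san francisco": "US",
}


def match_country(profile_location):
    """Return a country code for a free-text profile location, or None."""
    # Normalize: lowercase, keep ASCII letters only, single spaces.
    text = " ".join(re.findall(r"[a-z]+", profile_location.lower()))
    best = None
    for name, code in GAZETTEER.items():
        if re.search(r"\b" + re.escape(name) + r"\b", text):
            # Assumption: the longest matching name is the most reliable.
            if best is None or len(name) > len(best[0]):
                best = (name, code)
    return best[1] if best else None
```

For example, `match_country("Paris, France")` and `match_country("San Francisco, CA")` resolve to a country, while a location that names no known place returns `None`. Such unresolved or ambiguous profiles are one reason the ratio of known countries stays well below 100%.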


Fig. 3. Monthly distribution of photographers, from January to December for the area of: (a) City of Nara, Japan; (b) Tennis stadium Roland Garros, France; (c) ski resort Les Arcs, France
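A monthly distribution such as those of Figure 3 could be computed along the following lines. The sketch assumes that the photos of the area of interest are given as (user ID, date taken) pairs, and that years are pooled so that a user is counted once per calendar month; both are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import date


def monthly_photographers(photos):
    """Count distinct photographers per calendar month, years pooled.

    photos: iterable of (user_id, date_taken) pairs for one area (assumed layout).
    Returns a list of 12 counts, January first.
    """
    users_by_month = defaultdict(set)
    for user_id, taken in photos:
        users_by_month[taken.month].add(user_id)
    return [len(users_by_month[month]) for month in range(1, 13)]


# Made-up sample concentrated around late May and early June.
sample = [
    (1, date(2010, 5, 28)),
    (1, date(2010, 5, 29)),  # same user, same month: counted once
    (2, date(2011, 6, 2)),
    (3, date(2011, 5, 30)),
]
counts = monthly_photographers(sample)
```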

III. COUNTRY OF RESIDENCE ESTIMATION

A. Photographer description

A user is described by his timeline, which traces his route from the first photograph he posted on the site to the last one. Figure 4 illustrates this timeline for a user during the years 2010 and 2011. It represents a discrete view of “where the user was and when he was there”. This discrete representation of the user’s activity is split into periods, each representing a presence in the same place: this is the unity of time and space. A threshold on the gap between two consecutive photos is used: if the gap is longer than 3 days, the unity of time is considered broken. We can extract more information from this timeline, which helps describe the photographer. The data extracted are listed below.

• number of photographs - The total number of images gives an indication of the user’s use of social networks. When the number of photos is low, it is difficult to identify the user’s profile.
• number of visited countries - The number of countries visited by a photographer shows whether he is a frequent traveler or, conversely, rather sedentary.
• For each visited country:
  – number of photographs - The number of photographs in the same country gives an indication of the time spent in this country. Unfortunately, this information alone does not determine the photographer’s country of origin.
  – number of places visited (city) - The number of places visited by a user in the same country gives information on his knowledge of the country.
  – number of visits - A stay can be delimited in two ways. First, if a user visits another country between two photos, two visits are counted. Second, if there is a significant time gap between two pictures, two visits are also counted. To determine the average length of stay, we used data provided by the UNWTO [14]. Thanks to this information, we calculated for each user the number of stays in each country. To define the threshold for a significant gap, experts consider that the average residence time in a place is 7 days and that tourists take pictures every day. We fixed this threshold at 8 days.
  – number of days in the country - This variable gives the number of days spent by a photographer in a country. It indicates the presence of the user in the country.
  – number of days between 2 visits - This value is the number of days between two stays in the same country. We can assume that if the value is low, the user can easily visit that country (business purposes, geographical proximity).

B. Data sample and problem statement

We first want to classify the users into two categories: domestic tourism or foreign tourism. As explained in [15], this separation is essential because it is the most discriminating for behavior analysis. For instance, for the city of Paris, we want to distinguish French visitors from non-French visitors. To validate our approach, among all the data at our disposal, we selected only the users whose country of origin is clearly identified. It turns out that the distributions of the various parameters, whether computed on all the data or only on the selected sample, are perfectly similar: our sample presents no bias and is therefore representative of all the data. The parameters used are the features presented in Section III-A. The size and distribution of the data per country are presented in Table IV.

TABLE IV. TOP 10 TRAINING DATASET OVERVIEW.

Origin country    Nb.       %
united states     130788    29.54
united kingdom    47576     10.74
italy             23527     5.31
canada            21488     4.85
spain             20228     4.56
brazil            17421     3.93
germany           16387     3.70
france            15086     3.40
australia         12114     2.73
netherlands       9324      2.10
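To make the features of Section III-A concrete, the sketch below splits a timeline into stays and derives two per-country features (number of visits and number of days in the country). It is an illustration under assumptions: the timeline is taken to be a list of (day, country) pairs, a new stay starts when the gap strictly exceeds the 8-day threshold, and a stay’s length is counted inclusively from first to last photo. None of these details is confirmed by the text beyond the 8-day value.

```python
from collections import defaultdict
from datetime import date

GAP_DAYS = 8  # threshold for a significant gap, as fixed in Section III-A


def split_into_stays(timeline, gap_days=GAP_DAYS):
    """Split a photo timeline into stays.

    timeline: list of (day, country) pairs, one per photo (assumed layout).
    A new stay starts when the country changes or when the gap between two
    consecutive photos exceeds gap_days.
    Returns a list of (country, first_day, last_day) tuples.
    """
    stays = []
    for day, country in sorted(timeline):
        if stays:
            last_country, first_day, last_day = stays[-1]
            if country == last_country and (day - last_day).days <= gap_days:
                # Same country, small gap: extend the current stay.
                stays[-1] = (last_country, first_day, day)
                continue
        stays.append((country, day, day))
    return stays


def country_features(stays):
    """Number of visits and number of days spent, per country."""
    features = defaultdict(lambda: {"visits": 0, "days": 0})
    for country, first_day, last_day in stays:
        features[country]["visits"] += 1
        features[country]["days"] += (last_day - first_day).days + 1
    return dict(features)


# Made-up timeline: two photos in France, a trip to the US, then France again.
timeline = [
    (date(2010, 1, 1), "FR"), (date(2010, 1, 3), "FR"),
    (date(2010, 2, 10), "US"), (date(2010, 2, 14), "US"),
    (date(2010, 3, 1), "FR"), (date(2010, 6, 1), "FR"),
]
stays = split_into_stays(timeline)
features = country_features(stays)
```

In this made-up example, the last two French photos are three months apart, so they form two separate stays even though no other country was visited in between.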

To summarize, we want to answer two questions:

• are they French or not?
• where do they come from (country of origin)?

For this purpose, we used two methods: the first is based on an expert approach, the second on learning processes.

1) Domain expert rules: One of the first methods we used to extract the country of origin is based on the assessment of a domain expert. Indeed, one can consider that the country of origin of a photographer is the one where he stayed the most. In Section III-A, we computed for each user the number of countries visited and the total length of stay in each country. We then sorted the countries by total length of stay. This first approach is the closest to the expert’s definition. Unfortunately, nearly 8% of users do not take pictures in their home country and are not properly detected. In addition, a threshold is defined to decide “no country”: if a user does not post enough photographs on the website, it is not possible to decide his country of residence.

Fig. 4. A photographer timeline. All photos of a user can be represented along the time axis. 8 periods may be identified, including 4 in the user’s country (France), 2 in the United States, 1 in Belgium and 1 in Germany.

2) The learning approach: Another way of finding the photographer’s country of origin is to use a learning algorithm [4], [12] (such as C4.5 or SVM) [16], [18]. We have only two classes for the first question, whereas for the second we have as many classes as countries. We provide details on the implementation used in the next section, in which we show some results obtained in each case.

IV. RESULTS

First of all, we present the results obtained by classifying the users into two categories: French/non-French, then USA-resident/non-USA-resident. Table V gives the results calculated with expert rules, while Table VI presents the results obtained with the C4.5 learning method. We used 10-fold cross-validation [11] to estimate the final efficiency of the learning method. The error rate of the C4.5 method is about 8% for both calculations (French/non-French and USA-resident/non-USA-resident). The main request is often to analyze the behavior of foreigners. If we focus on the non-French or non-USA-resident classes, the results are similar with both methods (expert or C4.5). The efficiency is good enough to use these results directly in tourism studies.

TABLE V. PRECISION-RECALL FOR FRENCH/NON-FRENCH AND USA-RESIDENT/NON-USA-RESIDENT OBTAINED WITH EXPERT RULES.

      Non-French    French    Non-USA-res.    USA-res.
R     0.9018        0.9092    0.9757          0.8172
P     0.9704        0.7371    0.9633          0.8724

TABLE VI. PRECISION-RECALL FOR FRENCH/NON-FRENCH AND USA-RESIDENT/NON-USA-RESIDENT OBTAINED WITH C4.5 METHOD AND CROSS-VALIDATION.

      Non-French    French    Non-USA-res.    USA-res.
R     0.9523        0.7128    0.9544          0.7193
P     0.9500        0.7231    0.9369          0.7831

Figures 5(a) and 5(b) respectively represent the precision-recall maps obtained by classifying users according to their country of origin. We keep only about forty of the most visited countries, which are the sufficiently represented ones. Globally, and as expected, the results are worse: the task is harder than the first one. The C4.5 method depends mainly on the number of users per country. Its slightly worse results could be explained by the bias created by the least represented countries. Other methods such as SVM or KNN [9] are even more sensitive. For the most represented countries, both methods give equivalent results. A next step would be to refine the learning process in order to apply it to the entire database. We would probably have to define thresholds on the various parameters to determine whether or not a data item stays in the learning sample.
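The expert rule of Section III-B (the country of longest total stay, with a “no country” decision for users with too few photographs) can be sketched as follows. The threshold value `min_photos=10` and the function name are hypothetical; the text does not give the value actually used.

```python
def expert_country(days_per_country, n_photos, min_photos=10):
    """Estimate the country of residence with the expert rule.

    days_per_country: dict mapping country -> total days of stay.
    n_photos: total number of photographs posted by the user.
    min_photos: hypothetical threshold below which no decision is made.
    Returns a country, or None for the "no country" decision.
    """
    if n_photos < min_photos or not days_per_country:
        return None
    # Country with the longest total length of stay.
    return max(days_per_country, key=days_per_country.get)


def is_french(days_per_country, n_photos):
    """Binary question: True/False, or None when no country is decided."""
    country = expert_country(days_per_country, n_photos)
    return None if country is None else country == "FR"
```

With this rule, a user who never photographs his home country is necessarily misclassified, which is consistent with the limitation noted for nearly 8% of users.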


Fig. 5. Precision-Recall obtained with expert rules (a) and learning method C4.5 with cross-validation (b) for about forty most visited countries.
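For reference, the per-class precision (P) and recall (R) reported in Tables V and VI and in Figure 5 follow the standard definitions, which can be computed as in the sketch below. The label lists are made-up examples, not data from the study.

```python
def precision_recall(true_labels, predicted_labels, target):
    """Precision and recall of one class.

    precision = TP / (TP + FP): share of users predicted as `target`
                that really are `target`.
    recall    = TP / (TP + FN): share of real `target` users that were found.
    """
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == target and p == target)
    fp = sum(1 for t, p in pairs if t != target and p == target)
    fn = sum(1 for t, p in pairs if t == target and p != target)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Made-up labels for six users.
true_labels = ["FR", "FR", "FR", "US", "US", "IT"]
predicted = ["FR", "FR", "US", "US", "FR", "IT"]
p_us, r_us = precision_recall(true_labels, predicted, "US")
```

For a minority class such as French users among all Flickr users, precision and recall can be noticeably lower than for the majority class even when the overall error rate is small, which is the pattern visible in Tables V and VI.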

V. CONCLUSION

In this paper, we have shown that it is possible to determine a photographer’s country of origin by analyzing his photographic practices. Whether with an expert approach or with a learning approach, the results obtained on data from Flickr are coherent and present a low error rate. It is therefore possible to enrich our data. The rules of photographer behavior validated on the Flickr data are certainly applicable to other image databases such as Panoramio. This work will allow a better analysis of the behavior of each user group according to its country of origin. These methods must be used in addition to statistical data on tourism, to complement it and to contribute to the discovery of new tourist routes and new practices. As a first application, Figure 6 shows the spatial distribution of photographers, as in Figure 2(a), but differentiating USA residents from foreign tourists.


Fig. 6. Spatial presence of photographers in San Francisco, foreign tourists (a) and USA-residents (b). The heat scale is used to represent the number of photographers per place.

REFERENCES

[1] R. M. Chalfen, “Photograph’s role in tourism: Some unexplored relationships,” Annals of Tourism Research, vol. 6, no. 4, pp. 435–447, 1979. [Online]. Available: http://www.sciencedirect.com/science/article/B6V7Y-46BHYJDJX/2/e0b3f137339874655e54be4722c8457f
[2] G. Chareyron, S. Cousin, J. Da-Rugna, and D. Gabay, “Touriscope: map the world using geolocated photographies,” in IGU meeting, Geography of Tourism, Leisure and Global Change, 2009.
[3] D. J. Crandall, L. Backstrom, D. P. Huttenlocher, and J. M. Kleinberg, “Mapping the world’s photos,” in WWW, J. Quemada, G. León, Y. S. Maarek, and W. Nejdl, Eds. ACM, 2009, pp. 761–770.
[4] R. Duda, P. Hart, and D. Stork, Pattern Classification (Second Edition). Wiley-Interscience, 2001.
[5] Geonames, “Geonames - http://www.geonames.org/.”
[6] M. F. Goodchild, “Citizens as sensors: the world of volunteered geography,” GeoJournal, vol. 69, no. 4, pp. 211–221, 2007. [Online]. Available: http://www.springerlink.com/content/h013jk125081j628/
[7] K. Grunfeld, “Integrating spatio-temporal information in environmental monitoring data–a visualization approach applied to moss data,” The Science of the Total Environment, vol. 347, no. 13, pp. 1–20, Jul. 2005, PMID: 16084963. [Online]. Available: http://www.ncbi.nlm.nih.gov/pubmed/16084963
[8] A. Jaffe, M. Naaman, T. Tassa, and M. Davis, “Generating summaries and visualization for large collections of geo-referenced photographs,” in Proceedings of the 8th ACM international workshop on Multimedia information retrieval, ser. MIR ’06. New York, NY, USA: ACM, 2006, pp. 89–98. [Online]. Available: http://doi.acm.org/10.1145/1178677.1178692
[9] T. Joachims, “Making large-scale svm learning practical. advances in kernel methods - support vector learning,” 1999.
[10] L. Kennedy, M. Naaman, S. Ahern, R. Nair, and T. Rattenbury, “How flickr helps us make sense of the world: context and content in community-contributed media collections,” in Proceedings of the 15th international conference on Multimedia, ser. MULTIMEDIA ’07. New York, NY, USA: ACM, 2007, pp. 631–640. [Online]. Available: http://doi.acm.org/10.1145/1291233.1291384
[11] R. Kohavi, “A study of cross-validation and bootstrap for accuracy estimation and model selection,” in IJCAI, 1995, pp. 1137–1145. [Online]. Available: citeseer.nj.nec.com/kohavi95study.html
[12] T. M. Mitchell, Machine Learning. New York: McGraw-Hill, 1997.
[13] E. O’Neill, V. Kostakos, T. Kindberg, A. F. gen. Schieck, A. Penn, D. S. Fraser, and T. Jones, “Instrumenting the city: Developing methods for observing and understanding the digital cityscape,” in Ubicomp, ser. Lecture Notes in Computer Science, P. Dourish and A. Friday, Eds., vol. 4206. Springer, 2006, pp. 315–332.
[14] World Tourism Organization, Compendium of Tourism Statistics, Data 2006-2010, 2012 Edition, ser. Compendium of tourism statistics. World Tourism Organization, 2012.
[15] A. Pizam, “Does nationality affect tourist behavior?” Annals of Tourism Research, vol. 22, pp. 901–917, 1995.
[16] J. R. Quinlan, “Improved use of continuous attributes in c4.5,” Journal of Artificial Intelligence Research, vol. 4, pp. 77–90, 1996.
[17] N. B. Salazar, “Imaged or imagined? cultural representations and the tourismification of peoples and places,” Cahiers d’études africaines, vol. 49, no. 193-194, pp. 49–72, 2009.
[18] V. Vapnik, The Nature of Statistical Learning Theory. NY, Springer-Verlag, 1995.