Redeem with Privacy (RwP): Privacy Protecting Framework for Geo-social Commerce

Md Moniruzzaman and Ken Barker
Computer Science, University of Calgary
2500 University Dr, NW, Calgary, Alberta T2N 1N4, Canada

{mmoniruz,kbarker}@ucalgary.ca

ABSTRACT

GSNs encourage users to check in to commercial places by offering discounts on purchases. These promotions are commonly known as deals. When a user checks in, GSNs share the check-in record with the merchant. However, these applications, in most cases, neither explain how the merchants handle check-in histories nor take liability for any information misuse in this type of service. In practice, a dishonest merchant may share check-in histories with third parties or use them to track users' locations. This can lead to privacy breaches such as robbery, discovery of sensitive information by combining check-in histories with other data, and disclosure of visits to sensitive places. In this work, we investigate privacy issues arising from deal redemptions in GSNs. We propose a privacy framework, called Redeem with Privacy (RwP), to address the risks. RwP works by releasing to the merchants only the minimum information necessary to carry out the commerce. The framework is also equipped with a recommendation engine that helps users redeem deals in such a way that their next visit will be less predictable to the merchants. Experimental results show that inference attacks have low accuracy when users check in using the framework's recommendations.

Categories and Subject Descriptors C.2.0 [Computer-Communication Networks]: General— Security and protection; K.4.1 [Computers and Society]: Public Policy Issues—Privacy; K.4.4 [Computers and Society]: Electronic Commerce—Cybercash, digital cash

Keywords Privacy; Geo-social networks; Time series analysis;

1. INTRODUCTION

In GSNs, users interact based on their relationships (e.g., friendship) as well as their location. People check in to places, signaling their friends about their location, and also get notified when their friends are nearby. foursquare, Path and shopkick are some of the popular GSN applications (add .com at the end to get the URL). foursquare is the biggest platform of all, with around 35 million registered users. People can earn badges, mayorships and discounts by checking in using this platform. shopkick attempts to gamify people's shopping experience by posing different challenges like product exploration and purchase at a specific merchant. By completing the tasks, users earn free products and/or discounts. GSNs are increasingly being used by business organizations to attract more customers through paid advertising, deals and promoting their venues in search results. Organizations create virtual shops in GSNs. To check in to these shops, users need to be present in person at the real-world shops, too. One of the incentives to check into these places is deals such as "30% off your purchase on your 5th check-in" or "check in with 4 friends to get 20% off your purchase". GSN service providers allow merchants to access the information of the users who check in to their stores.

Previous work [19] shows that human mobility is predictable. This property can be exploited to locate GSN users. By analyzing the check-in history, a merchant can narrow down a user's location to a certain area on a given day. A dishonest merchant may sell the check-in history to third parties, which may then combine check-in histories from multiple stores and track the users' location [8]. People consider their location information highly private [5]. The misuse of location information can lead to serious privacy breaches. GSN users are prone to this risk when they redeem deals. At the same time, deal redemption is also an attractive service to users because of the cash benefits. In this work, we investigate methods to minimize the privacy risk in this service. As a solution, we propose to release to the merchants only the minimum information that is necessary to carry out the commerce. When users redeem a deal, the check-in dates required to redeem the deal will be revealed to the merchant and the rest of the dates will be suppressed. Merchants must rely on the GSN service provider to verify that users have made sufficient check-ins to redeem a deal. We assume that merchants are interested in multiple check-ins by the users and not in their check-in data. Requiring less check-in history from customers will also increase their confidence in GSNs, which will eventually result in more check-ins to their stores.

Releasing only the minimum information may not provide enough privacy protection, as the merchants, acting as adversaries, may still be able to infer the suppressed dates by applying data mining. We investigate the amount of information released to the merchants when users redeem a deal and how much of the suppressed information is recoverable through inference. We have implemented inference engines using popular data mining algorithms. We found that an adversary can recover the suppressed check-ins with over 59% accuracy. Therefore, we propose a recommendation engine that helps users choose check-in dates in such a way that the inference accuracy is reduced. Our contributions can be summarized as follows. (1) We design and release the minimum amount of a user's check-in information that a merchant needs to know in order to provide the deals, and suppress the rest of the information. (2) We design an adversarial scenario where two data mining algorithms are applied to see how much of the suppressed information can be recovered. We show that suppressing the check-in events is not enough, as the merchants can still recover the information. (3) We propose a recommendation engine that helps users choose check-in dates for redeeming deals. We show that an adversary would not be able to tell with good accuracy when a user visited their store or will visit their store next when users follow the recommendation engine. Contributions 1 and 3 form the privacy framework RwP. The framework is designed to work as a core functionality of a GSN; it will not work as a third-party add-on.

The rest of the paper is organized as follows. Section 2 presents how the minimum information disclosure is enforced by the framework. Section 3 presents inference methodologies that an adversary can perform and the experimental results. Section 4 presents the recommendation engine. Section 5 describes related research. Section 6 draws conclusions and provides directions for future research.

2. MINIMUM INFORMATION DISCLOSURE

Although a variety of deals are found on GSNs, in this work we focus on the most common type of deal that requires a user to check in multiple times before they can redeem it. We define a deal by (M, dealDeadline), where M is the number of check-ins that a user is required to make at a store and dealDeadline states the time period within which all the check-ins should be made. The deadline is often expressed in a number of days. As an example, consider the deal stating "Earn 15% off your purchase on your 3rd check-in. All the check-ins must be done within a week." Here, M is 3 and dealDeadline is 7 days. Setting up standard values for M and dealDeadline depends on various factors like the type of products, the amount of discount, consumer psychology, etc. Most of the deals we found on foursquare were asking for 2 to 6 check-ins and allowed consumers several weeks to complete the check-ins. So, we set dealDeadline = 31 days and M ∈ [2, 6] check-ins in this work. Merchants do not usually accept more than one check-in per day and want users to check in on different days when they provide a promotion. We also apply this restriction in this work. In addition, we make the assumption that users redeem a deal on the Mth check-in. Suppose a deal requires 5 check-ins; then, under our assumption, a user will always purchase the deal on their 5th check-in.

The minimum information disclosure principle [1] states that a data collector should not collect more data than the amount necessary to serve the data collection purpose. To enforce the principle, check-ins that are not necessary for a merchant to know in order to carry out the commerce are suppressed by our RwP framework. It is not possible to hide the purchase day from the merchant, as a user needs to tell the merchant in person what deal they want to redeem. But it is possible to hide all the earlier check-ins before the purchase day that are counted towards the deal. The framework does not reveal any check-in date to merchants, but it confirms to the merchants whether a user has sufficient check-ins to redeem a deal. When a user redeems a deal, the merchant gets the purchase day information and also the number of check-ins made before the purchase day. Formally, if the required number of check-ins for a deal is expressed as M = N + 1, then N check-ins are suppressed and only 1 check-in is disclosed to the merchant. The merchant also knows the value of N.
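To make the disclosure rule concrete, the following sketch shows how a GSN back end could reduce a redemption to the pair (purchaseDay, N) before anything is shared with the merchant. It is a minimal Python illustration under our own assumptions; the names Deal and disclose_redemption and the validation details are hypothetical and not part of the framework's specification.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Deal:
    m: int              # M: required number of check-ins
    deal_deadline: int  # window (in days) within which all M check-ins must fall

def disclose_redemption(checkins: list[date], deal: Deal) -> dict:
    """Return only what the merchant needs: the purchase day and N = M - 1.

    `checkins` holds the user's check-in dates at this store, one per day.
    All earlier dates counted toward the deal remain suppressed.
    """
    qualifying = sorted(checkins)[-deal.m:]  # the M check-ins counted toward the deal
    if len(qualifying) < deal.m:
        raise ValueError("not enough check-ins to redeem the deal")
    if (qualifying[-1] - qualifying[0]).days > deal.deal_deadline:
        raise ValueError("check-ins fall outside the deal deadline")
    # The user redeems on the M-th check-in; that date alone is released.
    return {"purchaseDay": qualifying[-1], "N": deal.m - 1}

# Example: a "15% off on your 3rd check-in within a week" deal
deal = Deal(m=3, deal_deadline=7)
history = [date(2010, 8, 9), date(2010, 8, 12), date(2010, 8, 14)]
print(disclose_redemption(history, deal))  # {'purchaseDay': datetime.date(2010, 8, 14), 'N': 2}
```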

3. INFERENCE

A merchant, acting as an adversary, may attempt to infer the suppressed dates by analyzing the information available to them. If the merchants can achieve good inference accuracy, they may also be able to predict future check-in dates by using both the inferred information and the existing information. Therefore, we investigate how much of the suppressed information is recoverable through inference attacks. A merchant has access only to the check-ins a user has made at its own store. So, we design personalized inference that involves a single store and a single user. An inference engine attempts to infer when a user visited a given store. We have implemented three inference engines. The first two engines are based on the Markov model (MM) [20] and the Viterbi algorithm [11]; we refer to them as the MM engine and the cMM engine. The third is based on the pattern mining algorithm Apriori [2], which we call the Apriori engine. An inference engine is first trained using a part of the user's check-in history. During training, the engine learns about the check-in behavior of the user. Later, the engine attempts to guess the hidden check-ins on a test dataset, which is prepared from the remaining check-in history left after training.

3.1 MM engine

A Markov chain is a class of Markov model containing a finite set of states, where the process starts in one of the states [20]. The probability with which the process moves to another state at the next step depends only on the current state. We design the MM engine using Markov chains where the states are dates from the calendar. The state transition probability is calculated by considering mobility features that influence users' check-in behavior. In the next section, we introduce the mobility features used in this work and how the engine quantifies their influence.

3.1.1 Training

a. Dataset. We have used datasets that were collected on Brightkite and Gowalla [8]. Both were popular Geo-social networks. Gowalla has recently been acquired by Facebook [14]. Each dataset is a single file containing all the check-ins made by all the users. We rearrange the dataset by putting the records into new files where each file contains one user's check-in records from one venue. We also truncate additional fields that are not used in our analysis. We use 50% of the check-ins for training and the remaining for testing. Suppression is not applied on the training dataset. The format of the training dataset is given in Table 1, where the first field of each record contains the user ID and location ID separated by an underscore (_), and the second and third fields are the check-in date and the day-of-week information, respectively.

Table 1: Sample check-in history
  ...
  5739_9241   2010-08-09   Monday
  5739_9241   2010-08-12   Thursday
  5739_9241   2010-08-13   Friday
  5739_9241   2010-08-14   Saturday
  5739_9241   2010-08-15   Sunday
  5739_9241   2010-08-17   Tuesday
  5739_9241   2010-08-18   Wednesday
  5739_9241   2010-08-19   Thursday
  5739_9241   2010-08-21   Saturday
  ...

Like other GSN datasets [3], our datasets are sparse. For many users, there are not sufficient check-ins at a single venue. If we considered the time of the day, there would be too many states in the MM engine with very few transitions to or from them. Besides, merchants typically follow the strategy of making a user check in on distinct days rather than multiple times on the same day when they offer deals. If a user checks in twice on the same day, the second check-in is not counted towards the redemption. So, users will not typically check in more than once a day for deals. Due to this nature of deal redemption, and also to find enough temporal patterns to run the inference on this type of dataset, we ignore the time of day when users check in. It also means that the inference engine may infer the date when a user checked in but cannot say at what time on that date the user checked in.
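The regrouping step can be sketched in a few lines. The snippet below is our own illustration and assumes the tab-separated layout of the public Brightkite/Gowalla dumps (user, timestamp, latitude, longitude, venue id); the field order, timestamp format and output naming are assumptions rather than details given in the paper.

```python
import csv
from collections import defaultdict
from datetime import datetime

def split_by_user_and_venue(raw_path: str, out_dir: str) -> None:
    """Regroup the single check-in dump into one file per (user, venue) pair,
    keeping only the fields used later: check-in date and day of week."""
    groups = defaultdict(set)
    with open(raw_path, newline="") as f:
        for user, ts, _lat, _lon, venue in csv.reader(f, delimiter="\t"):
            day = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").date()
            groups[(user, venue)].add(day)  # at most one check-in per day is kept

    for (user, venue), days in groups.items():
        with open(f"{out_dir}/{user}_{venue}.txt", "w") as out:
            for d in sorted(days):
                out.write(f"{user}_{venue}\t{d.isoformat()}\t{d.strftime('%A')}\n")
```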

b. Mobility features. As stated earlier, inference is performed on a single user's check-in history made at a single venue. Between a user's two successive check-ins, only the temporal information changes. We define mobility features based on this temporal information. Urban people's mobility habits are usually driven by the days of the week [18]. For example, users may have the habit of visiting a place on select days of the week. We can also find a relation between two days of the week in the check-in behavior. An example of such a relation is when a user shows up at a given place on a Monday and visits that place again on the next Wednesday. To capture the check-in behavior influenced by days of the week, we define the feature weekday transition. The weekday transition probability, denoted by A, is the probability that a user will check in on a day of the week given that their current check-in is on another day of the week. This feature has also been suggested by other researchers [19]. The weekday transition probability from a day Y to another day Z is calculated by dividing the number of Y's appearances immediately followed by Z (denoted by frequency(Y → Z)) by the total number of Y's appearances (denoted by frequency(Y)) in the check-in history.

A(Z|Y) = (frequency(Y → Z) + q) / (frequency(Y) + 7 · q)

The addition of q and (7 · q) to the numerator and denominator, respectively, ensures that the transition probability from one day to another is never zero even when there is no transition between these days in the check-in histories. The process is called Laplace smoothing [9]. Another important mobility feature is the time period between consecutive check-in events [18]. Some users may check in regularly while others may not. The time period probability, denoted by B, specifies the probability of a time distance between two consecutive check-ins of a user. We do not consider multiple check-ins on the same day, so the unit of time period is one day. Given the set of all time periods TP, the probability of a single time period tp is computed in the following way:

B(tp) = (frequency(tp) + r) / (Σ_{∀tp' ∈ TP} frequency(tp') + (|TP| + 1) · r)

When calculating B(tp), we reserve some probability using r for potential time periods that may appear in the future inference. This probability is then equally distributed to the new time periods that appear for the first time. The inference engine calculates the mobility features’ probabilities A and B based on the training dataset. Using the mobility features’ probabilities, we can calculate the probability of a user’s check-in on a specific date. Suppose the latest check-in of the user at a given store was on the date s, then the probability that the user would check in to the store again on the date t is the following: P (t|s) = A(tw |sw ) ∗ B(t − s) Here, tw ,sw represent days of the week for the dates t and s respectively. P (t|s) is also called the state transition probability. In the calculation above, we treat both mobility features independent of each other. However, it is not exactly the case since the set of time periods depends on a given pair of days of the week. Later, we combine the two features into one in the cMM engine (Section 3.2). Researchers have extracted other mobility features [18] that are known to influence urban people’s check-in behavior. They combine all these features into a single score and use it to predict future check-ins. However, many of these mobility features cannot be learned from the check-in histories (such as Table 1) that is available in personalized inference.

3.1.2 Testing

We process the remaining check-in history left after training and turn it into a test dataset that represents the information revealed to the merchants once users redeem a deal under the minimum disclosure principle. For each deal, we randomly generate M between 2 and 6 as the number of required check-ins for that deal. We take M check-ins from the history, suppress N of the dates and leave the last date, denoted by purchaseDay, as it is. We also put the value of N in the dataset. The purchase day of the most recently redeemed deal by the user before the current one, denoted by lastSeenDay, is also known to the adversary. So an inference input consists of lastSeenDay, purchaseDay and N, indicating that N dates should be inferred between the lastSeenDay and the purchaseDay.

The inference engine first lists all the days between the lastSeenDay and the purchaseDay. Out of these days, it constructs a directed graph containing the candidate date sequences. One of these sequences is the actual sequence following which the user redeemed the deal. Each path, termed a candidate pattern, has length (N + 2), starting from the lastSeenDay and ending at the purchaseDay. As an example, all the sequences are given in Fig. 1 when the lastSeenDay is August 5, 2010, the purchaseDay is August 10, 2010 and N is 2. Each node in the graph is assigned a number i according to the potential check-in order.

Figure 1: All possible date sequences that may contain the suppressed check-ins

The graph in Fig. 1 is formally described by G : (S, E), where S is the set of dates and E is the set of connections between the dates such that E ⊆ S × S. The function W provides the day of the week for a given date and is defined as W : S → Weekdays. Given that s is a date, s_w is a shorthand notation for W(s). The probability of each edge of the graph is given by the state transition probability and is calculated using the probabilities of the mobility features as described in Section 3.1.1(b).

Each inference input can be considered as an observation by the inference engine. During training, the engine calculated its belief, which consists of the mobility features' probabilities. Upon receiving an input, the engine changes its belief according to the observation and then attempts to infer the suppressed dates of the input. The process through which the inference engine adapts its belief is called learning. We divide learning into short term learning and long term learning. In short term learning, the inference engine changes the probabilities based only on the current observation. By the current observation, we mean the information that is available from the current inference input. Short term learning is sufficient for the engine to infer the suppressed check-ins of an inference input. In long term learning, the inference engine updates the probabilities based on all the previous observations including the current one. At the beginning of an inference task, the engine performs short term learning and then infers the suppressed check-ins. After that, the engine performs long term learning and updates the probabilities, which are then used in the next inference task.

Figure 1 is an example of the information available to the engine from an inference input. The input may introduce a completely new transition between two days of the week, or a time period that did not appear before. A weekday transition (or time period) that appeared in the past may not be present in the current inference. Short term learning assigns probabilities to the features present in the current inference. Long term learning strengthens the probabilities of the features present in the past as well as in the current observation, and reduces the probabilities of those features that were present in the past but are absent in the current observation.

a. Short term learning.

Learning starts by computing an α value for each node of the graph in Figure 1 using the Forward algorithm and a β value using the Backward algorithm [22]. Once these values are computed, we use the Baum-Welch algorithm [21] to recalculate the weekday transition and time period probabilities. The Forward algorithm generates α values as it progresses from left to right in the graph. The α value of the lastSeenDay, denoted by α(lastSeenDay)(pos=0), is 1, as the user indeed checked in on that day and there is no uncertainty about it. The α value of any other date s ∈ S from pos = 1 to pos = N (excluding the purchaseDay) is computed using the following equations:

α̂(s)(pos) = Σ_{∀j ∈ incoming(s)} α(j)(pos−1) · P(s|j)

c_pos = 1 / Σ_{∀s ∈ S} α̂(s)(pos)

α(s)(pos) = c_pos · α̂(s)(pos)

The function incoming(s) returns all the nodes with which s is connected through incoming edges. Under the given formula, α̂(s)(pos) leads to 0 very quickly as N increases, so a direct implementation will result in an underflow problem. As a solution, we normalize the α̂ value of a node at a given position by the scaling coefficient c, which is computed by taking 1 over the summation of the α̂ values of all the nodes at that position. Unlike the Forward algorithm, the Backward algorithm starts computing β values from the right side of the graph. The β value of the purchaseDay, denoted by β(purchaseDay)(N+1), is 1, as it is an actual check-in. The algorithm then progresses by computing β values of the nodes from pos = N to pos = 1. The same scaling coefficients used in the Forward algorithm are also used here.

β̂(s)(pos) = Σ_{∀j ∈ outgoing(s)} P(j|s) · β(j)(pos+1)

β(s)(pos) = c_pos · β̂(s)(pos)

The function outgoing(s) returns all the nodes with which s is connected through outgoing edges. Next, we apply the Baum-Welch algorithm to calculate the new probabilities of the features. First, we calculate the weekday transition probability using the following equations:

a(Z|Y) = Σ_{∀pos ∈ [0,N]} Σ_{∀(s,t) ∈ R(pos,Y,Z)} ( α(s)(pos) · P(t|s) · β(t)(pos+1) ) / α̂(purchaseDay)(N+1)

A(Z|Y) = a(Z|Y) / Σ_{∀X ∈ Weekdays} a(X|Y)

An intermediate value, denoted by a, for the transition between any two given days {Y, Z} of the week is calculated, where the numerator represents the probabilistic value for one path in the graph that contains the transition. It is then normalized by α̂(purchaseDay)(N+1) and thus the value becomes relative to the value of all the paths in the graph. Finally, a(Z|Y) is the summation of the relative values of all the paths that contain the weekday transition. Here, the polymorphic function R returns all the edges (s, t) that contain the weekday transition (Y → Z) in the graph at a given position. The probability A(Z|Y) of the transition (Y → Z) is calculated by normalizing a(Z|Y) with the summation of the a values of all the weekday transitions starting with the day Y. Learning the time period probability is similar to the learning of the weekday transition probability. We first compute the intermediate value b of a time period tp by adding up the relative probabilistic values of all paths where this time period appears:

b(tp) = Σ_{∀pos ∈ [0,N]} Σ_{∀(s,t) ∈ R(pos,tp)} ( α(s)(pos) · P(t|s) · β(t)(pos+1) ) / α̂(purchaseDay)(N+1)

B(tp) = b(tp) / Σ_{∀tp' ∈ TP'} b(tp')

The probability of the time period tp, B(tp), is calculated by normalizing b(tp) with the summation of the b values of all the time periods TP' that appear in the graph. The inference engine calculates the mobility features' probabilities over several rounds and in each round, a log likelihood parameter [20] is computed. If the difference of the probabilities between two consecutive rounds is less than a predefined threshold, the engine stops the calculation and stores the calculated probabilities.
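A compact sketch of the candidate graph and the scaled forward-backward pass is given below. It is our own illustration of the equations above; candidate_layers, forward_backward and the trans_prob argument are hypothetical names, and the Baum-Welch re-estimation that consumes the resulting α and β values is omitted.

```python
from datetime import date, timedelta

def candidate_layers(last_seen: date, purchase: date, n: int) -> list[list[date]]:
    """Layer 0 holds the lastSeenDay, layer N+1 the purchaseDay; layers 1..N hold
    the dates that can occupy that position in a strictly increasing sequence."""
    gap = (purchase - last_seen).days
    layers = [[last_seen]]
    for pos in range(1, n + 1):
        # a date at position pos needs pos earlier days and (n - pos) later days
        layers.append([last_seen + timedelta(days=d)
                       for d in range(pos, gap - (n - pos))])
    layers.append([purchase])
    return layers

def forward_backward(layers, trans_prob):
    """Scaled alpha/beta values over the graph; trans_prob(s, t) returns P(t|s)."""
    alphas, scales = [{layers[0][0]: 1.0}], [1.0]
    for pos in range(1, len(layers)):
        raw = {t: sum(alphas[-1][s] * trans_prob(s, t)
                      for s in layers[pos - 1] if s < t)
               for t in layers[pos]}
        c = 1.0 / sum(raw.values())              # c_pos, used to avoid underflow
        scales.append(c)
        alphas.append({t: c * v for t, v in raw.items()})

    betas = [dict() for _ in layers]
    betas[-1] = {layers[-1][0]: 1.0}
    for pos in range(len(layers) - 2, -1, -1):
        betas[pos] = {s: scales[pos] * sum(trans_prob(s, t) * betas[pos + 1].get(t, 0.0)
                                           for t in layers[pos + 1] if t > s)
                      for s in layers[pos]}
    return alphas, betas, scales
```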

b. Inference of the suppressed check-ins. The inference engine uses the Viterbi algorithm to find the most probable path in the graph. The algorithm calculates the Viterbi value, V, for each node except the lastSeenDay, which is assigned V = 1. The Viterbi value of any other node t ∈ S is computed by the following:

V(t)(pos) = max_{∀s ∈ S} ( V(s)(pos−1) · P(t|s) )

In other words, the Viterbi value of a node t at a given position is the maximum of the values computed by taking the Viterbi value of any node at the previous position times the probability of the edge from that node to node t. The node associated with the maximum value at the previous position is stored as the source of the current node t. The Viterbi value of the purchaseDay is also the Viterbi value of the most probable path. We get the path information by backtracking from the purchaseDay to its immediate source node. Control then moves to the source of the source node. The process continues until the node lastSeenDay is reached. The traversed path is output as the most probable path containing the suppressed check-ins.
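A sketch of this decoding step is given below, assuming the layered candidate-date graph from the previous sketch and a trans_prob(s, t) function that returns P(t|s); the function name and return convention are our own.

```python
def most_probable_path(layers, trans_prob):
    """Viterbi decoding over the candidate-date graph.

    layers[0] = [lastSeenDay], layers[-1] = [purchaseDay]; the returned list is
    the highest-probability date sequence, including both known endpoints.
    """
    v = [{layers[0][0]: 1.0}]   # V(lastSeenDay) = 1
    back = [dict()]
    for pos in range(1, len(layers)):
        v.append({})
        back.append({})
        for t in layers[pos]:
            best = max(((v[pos - 1][s] * trans_prob(s, t), s)
                        for s in layers[pos - 1] if s < t), default=None)
            if best is not None:
                v[pos][t], back[pos][t] = best   # value and source node
    # Backtrack from the purchaseDay through the stored source nodes
    path = [layers[-1][0]]
    for pos in range(len(layers) - 1, 0, -1):
        path.append(back[pos][path[-1]])
    return list(reversed(path))
```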

c. Long term learning. Short term learning does not assign any probability to a weekday transition (or a time period) that is absent in the current inference. However, it is possible that features absent in the current inference appeared in previous inferences and/or may appear in future inferences. In long term learning, the probabilities are calculated based on the training dataset as well as all the inference inputs including the most recent one. Thus, a mobility feature's influence from all the previous and the current observations is carried over to future inferences. First, weekday transition learning is described:

la(Z|Y) = a_0(Z|Y) + Σ_{∀k ∈ [1,T]} a_k(Z|Y)

A(Z|Y) = (la(Z|Y) + q) / (Σ_{∀X ∈ Weekdays} la(X|Y) + 7 · q)

An intermediate value la of the transition (Y → Z) is calculated using a_0(Z|Y), which represents the transition counts from the training dataset, and a_k(Z|Y), which represents the intermediate value from the kth inference event. Finally, we compute the transition probability A(Z|Y) between any two days of the week. Long term learning of a time period's probability is performed by using its frequency from the training dataset as well as its intermediate values from all the inference events in the following way:

lb(tp) = b_0(tp) + Σ_{∀k ∈ [1,T]} b_k(tp)

B(tp) = (lb(tp) + r) / (Σ_{∀tp' ∈ TP} lb(tp') + (|TP| + 1) · r)

3.2 cMM engine

In the MM engine, it was assumed that the transition between two days of the week is independent of the time period. In reality, the time period between two given days of the week is limited to a fixed set of values. For example, possible time periods between Sunday and Monday are {1, 8, 15, 22, ...}. In the cMM engine, we combine the two mobility features into one. The maximum time period that we consider between any two days is limited to 5 weeks. The weekday transition probability for a given time period is the following:

A(Z|Y)^tp = (frequency(Y --tp--> Z) + h) / (Σ_{∀tp' ∈ T_YZ} frequency(Y --tp'--> Z) + |T_YZ| · h)

Here, h is the smoothing parameter and T_YZ is the set of time periods between the days Y and Z. The rest of the details are similar to those of the MM engine.
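A small sketch of the combined feature, following the formula above; the helper name and the weekday arithmetic are our own, and h = 1 is an assumed smoothing value.

```python
from collections import Counter
from datetime import date

def train_combined_feature(checkins: list[date], h: float = 1.0, max_weeks: int = 5):
    """Estimate A(Z|Y)^tp: the probability that consecutive check-ins on weekdays
    Y and Z are separated by exactly tp days (cMM engine)."""
    counts = Counter()
    days = sorted(checkins)
    for prev, curr in zip(days, days[1:]):
        counts[(prev.weekday(), curr.weekday(), (curr - prev).days)] += 1

    def A(z: int, y: int, tp: int) -> float:
        # T_YZ: admissible time periods between weekdays Y and Z, capped at 5 weeks
        periods = [p for p in range(1, max_weeks * 7 + 1) if (y + p) % 7 == z]
        total = sum(counts[(y, z, p)] for p in periods)
        return (counts[(y, z, tp)] + h) / (total + len(periods) * h)

    return A
```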

3.3 Apriori engine

We implement the Apriori algorithm as our second inference method. We wanted to validate the MM engine’s result against another data mining algorithm. The Apriori algorithm is a well-known technique for finding patterns in datasets. We use it to find patterns and then use those patterns to infer the suppressed information.

3.3.1 Training

We use 50% of a user's check-in history as the training dataset. We change the format of the dataset to make it suitable for the Apriori algorithm. Check-in dates are converted to days of the week. Check-in times are removed, as we do not attempt to infer the time but the date. The time period between two check-in dates is calculated and put in between the two days of the week. The time period is measured at the granularity of a day and is rounded up to the nearest integer. The new format of the check-in history looks like (Sunday, 1, Monday, 2, Wednesday, ..., Friday). Considering the training dataset as one big transaction, the Apriori algorithm is applied to find the frequent patterns within that transaction. The resultant patterns consist of days of the week; if there is more than one day, then they are separated by the time period. <Sunday, 1, Monday> is an example pattern. The number of times a pattern appears in the history is called its support. A pattern is frequent when its support is at least the minimum support. The value of the minimum support is empirically determined.
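The conversion and a simplified pattern count can be sketched as follows. This is our own illustration: it scans contiguous sub-patterns of the converted sequence rather than running a full Apriori candidate generation, which is enough to show the data format and the notion of support.

```python
from collections import Counter
from datetime import date

def to_weekday_sequence(checkins: list[date]) -> list:
    """(Sunday, 1, Monday, 2, Wednesday, ...): days of the week interleaved with
    the time period (in days) separating consecutive check-ins."""
    days = sorted(checkins)
    seq = [days[0].strftime("%A")]
    for prev, curr in zip(days, days[1:]):
        seq += [(curr - prev).days, curr.strftime("%A")]
    return seq

def frequent_patterns(seq: list, min_support: int, max_days: int = 3) -> dict:
    """Support counts of contiguous sub-patterns such as <Sunday, 1, Monday>,
    keeping only those whose support reaches the minimum support."""
    support = Counter()
    for size in range(1, max_days + 1):
        width = 2 * size - 1                      # k days span 2k - 1 sequence items
        for i in range(0, len(seq) - width + 1, 2):
            support[tuple(seq[i:i + width])] += 1
    return {p: s for p, s in support.items() if s >= min_support}
```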

3.3.2 Testing

The format of the test dataset is the same as before, where the inference input consists of the lastSeenDay, the purchaseDay and the number of suppressed check-ins, N, in between the two days. The frequent patterns found in the training phase are used to infer the suppressed check-ins. This is done in two steps. First, a frequent pattern that matches the inference input is sought; we propose the PatternFinder algorithm to do this task. In the second step, the matched pattern, if any, is used to infer the suppressed check-ins; we propose the Smart inference algorithm to do that task. In the following, we first describe the PatternFinder and then the Smart inference algorithm.

a. PatternFinder algorithm. An inference input may appear multiple times in a frequent pattern. Consider (N = 1, lastSeenDay = Sunday, purchaseDay = Tuesday) as the inference input and the following as a pattern:

<Sunday, 1, Monday, 1, Tuesday, 5, Sunday, 2, Tuesday>

Note that the above pattern contains the input twice. The PatternFinder algorithm first finds the subsequence within each frequent pattern that best matches the inference input. To do that, all the appearances of the input within a pattern are given scores based on their similarity to the input. The appearance with the highest score is saved as the representation of that pattern. This step is called intra-pattern ranking. Next, the pattern's representation is used to rank it against other patterns. This step is called inter-pattern ranking. Both ranking algorithms are quite similar to each other. Due to space limitations, we describe only the inter-pattern ranking here. Given a set of frequent patterns, each is assigned a score based on how close its representation is to the inference input. Scores range from negative integers to 0. The formula to calculate the score is:

Score = (x & y)(|N − N'| · NPenalty + g · TPPenalty) + (x ⊕ y)(InputPenalty + (1 − f) · N + f · N') + ¬(x || y)(InputPenalty)

In the formula, the symbol N' denotes the number of qualified check-ins found in a pattern. By qualified check-ins, we mean the check-ins that are between the lastSeenDay and the purchaseDay. When the pattern contains only one of the input days, then N' is the number of check-ins that appear after the lastSeenDay or before the purchaseDay. Boolean values x and y are true when a pattern contains the lastSeenDay and the purchaseDay, respectively. The score calculation has three parts. The first part (x & y) contributes when the pattern contains both input days. The score of this type of pattern is penalized by adding NPenalty (= −2) for each unit of difference between N' and N. If the total time period TP' of a pattern differs from that of the input, then it is also penalized by adding TPPenalty (= −1) to its score. The function g returns true when the two time periods are different. Therefore, between two frequent patterns that have exactly N check-ins, the one that matches the input with regard to the time period will have a higher score. Any pattern that contains the input days but has N' = 0 is ignored. The next part (x ⊕ y) of the equation contributes when the pattern contains either of the input days, but not both at the same time. A value equal to InputPenalty (= −15) is added to its score as it does not have both days. The pattern is then assigned a score based on how many check-ins it has after the lastSeenDay or before the purchaseDay. The function f returns true when N' ≤ N. So, when the pattern has more dates than the required N, it is assigned a score for only N dates. The maximum score of a pattern containing either of the input days is −10 (= −15 + 5), which is lower than the lowest score −9 (= −2 · 4 − 1) of a pattern containing both input days. The third part ¬(x || y) of the equation contributes when a pattern contains neither of the input days. We ignore these patterns and do not use them to infer the suppressed check-ins. These patterns are penalized in such a way that their highest score is lower than the lowest score of a pattern containing one input day. Finally, the PatternFinder algorithm returns the pattern with the highest score. When there are multiple patterns with the highest score, the pattern with higher support than the rest is returned.
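The inter-pattern score can be transcribed almost literally. The sketch below uses our own function signature and assumes the caller has already extracted, for each candidate pattern, whether it contains the two input days, its number of qualified check-ins N', and its total time period; those extraction steps are not shown.

```python
N_PENALTY, TP_PENALTY, INPUT_PENALTY = -2, -1, -15

def score_pattern(has_last: bool, has_purchase: bool,
                  n: int, n_prime: int,
                  input_tp: int, pattern_tp: int) -> int:
    """Inter-pattern ranking score for one candidate frequent pattern.

    has_last / has_purchase correspond to x and y; n is the number of suppressed
    check-ins N; n_prime is the number of qualified check-ins N' in the pattern;
    input_tp / pattern_tp are the total time periods of the input and the pattern.
    """
    if has_last and has_purchase:               # (x & y): pattern contains both days
        g = 1 if input_tp != pattern_tp else 0
        return abs(n - n_prime) * N_PENALTY + g * TP_PENALTY
    if has_last != has_purchase:                # (x XOR y): only one input day present
        return INPUT_PENALTY + min(n, n_prime)  # f picks N' when N' <= N, else N
    return INPUT_PENALTY                        # neither day: effectively ignored
```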

b. Smart inference. Given that a pattern matching the input has been found, a typical inference will use the check-ins between the input days in the pattern to infer the suppressed check-ins. Such a simple approach will not work when the pattern contains only one of the input days. It will also quit after the first attempt even when there are still uninferred check-ins (referred to as empty spots). Unlike the simpler approach, the Smart inference takes only one day from a pattern if it partially matches the input. The algorithm then prepares another inference input by updating the value of either the lastSeenDay or the purchaseDay with the recently inferred day. The algorithm continues until there are no more empty spots or the number of inferred check-ins remains unchanged after a given pass of the algorithm. If the latter is true, then there is no frequent pattern that contains the input partially or completely. In such a case, the Smart inference attempts to use the frequent patterns of size 1 to infer the remaining suppressed check-ins.

The Smart inference algorithm is given in Table 2, where all the variables are initialized at lines 1-3. A list data structure checkins is used to store the inferred dates. It is initialized with [lastSeenDay, ?, ..., ?, purchaseDay], where the symbol ? represents an empty spot that must be inferred. Two pointers beg and rear are maintained to keep track of the empty spots in the list as the check-ins are inserted at both ends of the list. The initial values of lday and rday are the lastSeenDay and the purchaseDay, respectively. The loop at line 5 repeats the subsequent instructions from line 6 to 17 until the given condition is false. At line 6, the function PatternFinder is called, which returns the frequent pattern with the highest similarity to the inference input. The condition at line 7 checks whether the returned pattern contains both of the input days. The suppressed check-ins are inferred by calling the function Insert, which takes the intermediate check-ins between the lday and the rday of the pattern and inserts them into the list checkins starting at the leftmost empty spot. The pointer beg is moved to the right by the number of insertions. When the pattern contains only one of the input dates, we take only 1 check-in from the pattern using the function Insert1. If it contains the lday (line 9), then the first check-in that appears after lday is inserted and the function moves beg to the right by 1 position. Otherwise, the first check-in that appears before the rday is inserted into the list and the pointer rear is moved to the left by 1 position. The instructions at lines 14-17 reinitialize the variables so that they are up to date for the next iteration of the loop. This involves the creation of a new inference input. The date that appears before the first empty spot in checkins becomes the new value of lday and the date that appears after the last empty spot is assigned to rday. leSpots takes the value of eSpots and eSpots is updated to the number of suppressed check-ins that are yet to be inferred. When the loop terminates and there are still empty spots, the instructions at lines 19-21 attempt to use the frequent patterns of size 1 (stored in the list freq1). All possible date sequences of size eSpots are enumerated according to the calendar between lday and rday using the function genSequences. The sequence containing the highest number of days that are in the list freq1 is chosen using the function getBestSequence. In case there are multiple such sequences, the function adds the support of all the days in a sequence and chooses the sequence with the highest summation, which is then used to infer the remaining suppressed check-ins.

Table 2: Smart inference algorithm
  Input: lastSeenDay, purchaseDay, N, fPatterns, freq1
  Output: checkins
  1   checkins = [lastSeenDay, ?, ..., ?, purchaseDay]
  2   beg = 1, rear = N, eSpots = N, lday = lastSeenDay
  3   rday = purchaseDay, leSpots = -1
  4   Repeat steps 6-17
  5   while (beg <= rear and eSpots != leSpots):
  6       pattern, lday_i, rday_i = PatternFinder(lday, rday, eSpots, fPatterns)
  7       If (lday_i >= 0 and rday_i >= 2):
  8           Insert(checkins, beg, rear, pattern, lday_i, rday_i)
  9       Elseif (lday_i >= 0 and (lday_i + 1) < pattern.length):
  10          Insert1(checkins, beg, rear, pattern, lday_i + 1)
  11      Elseif (rday_i >= 1):
  12          Insert1(checkins, beg, rear, pattern, rday_i - 1)
  13      Endif
  14      lday = checkins[beg - 1]
  15      rday = checkins[rear + 1]
  16      leSpots = eSpots
  17      eSpots = rear - beg + 1
  18  If (eSpots > 0):
  19      Sequences = genSequences(lday, rday, eSpots)
  20      seq = getBestSequence(Sequences, freq1)
  21      Insert(checkins, beg, rear, seq, -1, seq.length)

Table 3: Overall accuracies of the inference engines
                    Gowalla dataset   Brightkite dataset
  Apriori engine        58.11%             59.02%
  MM engine             52.44%             56.25%
  Baseline              41.87%             48.16%
  cMM engine            32.65%             38.7%

3.4 Accuracy of the inference engines

Figures 2 and 3 show the accuracy of the different inference engines on the Brightkite and Gowalla datasets, respectively. The dates between the lastSeenDay and the purchaseDay are referred to as available days. The baseline performance is calculated by randomly choosing N days out of the available days. The uncertainty in an inference input depends on the number of available days. The uncertainty also depends on N, which is the number of the available days an engine should infer. Therefore, we present the inference results by plotting available days along the X axis and accuracy along the Y axis, where each line in the graph represents inference with a given value of N. In other words, each line shows an inference engine's accuracy in inferring a given number of suppressed check-ins over different numbers of available days. The inference engines' overall accuracies are given in Table 3. The Apriori engine equipped with our proposed Smart inference algorithm shows the best accuracy among all the engines. When the Apriori engine uses a basic inference instead, the accuracy goes down to 33.73% and 38.88% on the Gowalla and Brightkite datasets, respectively. Thus, the proposed Smart inference improves the performance by 72% and 52% on the two datasets. Among all the engines, the Apriori and the MM engine surpass the baseline accuracy. However, the performance of the cMM engine with the combined feature lags below the baseline. The reason is that a weekday transition's probability is shared across all possible time periods between the two days of the week in the cMM engine. Suppose a user shows the trait Sunday --2--> Tuesday in their check-in history. The cMM engine spreads the probability over all the variations Sunday --2--> Tuesday, Sunday --9--> Tuesday, and so on. As a result, the actual trait does not remain as strong as it should be. Inference input is prepared by taking check-ins from the datasets, which consist of general-purpose check-ins made by GSN users. As a result, we did not have complete control over the uncertainty associated with an inference input. There were some inputs with available days = N where there is no uncertainty for the adversary. We ignore these inputs while calculating the experimental results.

4. RECOMMENDATION ENGINE

Since the suppressed information can be recovered through inference, suppression alone is not enough to protect a user's privacy. If users choose check-in dates by maintaining a balance among all the mobility features, then an adversary will have poor inference accuracy. However, such a strategy will require a user to check in on all types of days and will not support their life's normal routine. We propose a recommendation algorithm that works in combination with suppression. It helps users choose dates that cannot be predicted in advance or inferred (when suppressed) with good accuracy. At the same time, it allows users to build a check-in routine and follow it.

Figure 2: Inference results on Brightkite dataset

Figure 3: Inference results on Gowalla dataset

4.1 Algorithm's description

For each deal offered by a merchant, the recommendation engine will suggest M dates to users within a given deadline. Given M = N + 1, the N dates and the purchaseDay are recommended based on two different strategies that will be described shortly. Once a user redeems the deal based on the recommendations, only the purchaseDay among the M check-ins is revealed to the merchant and the rest of the dates are suppressed. We run the recommendation algorithm simulating that a user redeems 20 deals at a given store. We repeat the simulation for 3669 user-location pairs and generate a test dataset containing only the purchaseDays and Ns. To test whether the recommended dates can be inferred with good accuracy, we apply the inference engines, introduced in Section 3, to recover the suppressed dates of the test dataset. Recall that our inference engines are constrained to take inputs of M = [2, 6] and dealDeadline = 31. We use the same limiting values for M and dealDeadline in the recommendation engine. For simplicity of implementation, we assume that users will perform all the check-ins of a deal within the next 31 days from the lastSeenDay, which is the purchase day of their most recently redeemed deal.

4.1.1 Recommending N check-ins

Using the next 30 days after the lastSeenDay, the recommendation engine generates all possible date sequences of length N. Note that the Nth check-in can never be the 31st day. If it were, there would not be any day left to recommend as the purchaseDay. To get a sense of how the sequences look, consider that a user recently purchased a deal on Oct 9, 2010. The user is now planning to redeem another deal which requires 3 (the value of M) check-ins within the next 5 days. We reduce the dealDeadline from 31 days to 5 days to simplify the example. Given that the lastSeenDay is the starting node, all possible date sequences of length 2 (the value of N) are shown in Figure 4. The engine chooses one of the sequences and recommends it to the user.

Figure 4: All possible date sequences of length N

Due to suppression, an adversary will not have access to all the check-ins made by the users. An inference engine's estimation of the mobility features' probabilities, therefore, will differ from the actual check-in behavior. The recommendation engine works by utilizing this difference. It calculates the mobility features' probabilities from the complete check-in history. It also knows the amount of history exposed to a merchant and thus calculates an inference engine's estimate of the probabilities. Based on the difference between the two sets of values, a weight is calculated for each transition in Figure 4. Given a transition from date s to date t, its weight is given by the following:

weight(t|s) = |P^r(t|s) − P(t|s)|

Here, P(t|s) is the transition probability according to the inference engine's belief while P^r(t|s) is the transition probability calculated from the complete check-in history. The weight of a path is the summation of the weights of all the edges in the path. The path with the highest weight is recommended to the user. We find the path with the highest weight by adapting the Viterbi algorithm. A value, similar to the Viterbi value, is calculated for each node t ∈ S' and is denoted by V'. Here, S' is the set containing the next 30 days.

V'(t)(pos) = max_{∀s ∈ S'} ( V'(s)(pos−1) + weight(t|s) )

The above equation expresses that, among all the paths that sink into a node, the path with the highest weight determines the V' value of the node. Since an edge's weight is not a probabilistic value, we use addition instead of multiplication to compute the V' values. The algorithm proceeds by computing V' for each node in Figure 4. The path with the highest weight is found by backtracking from the node with the highest V' at position N until position 1 is reached. Before a user starts to use the RwP framework, their check-in history is exposed in plain to the inference engine. When they use the recommendation engine for the first time, both the inference and the recommendation engines will have identical probability values for the mobility features and there will be no difference between the two sets of probabilities. In such a case, the recommendation engine randomly chooses N dates out of the next 30 days for the first deal. From the second deal onwards, the recommendation is based on the difference between the two sets of probabilities.
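The weight-driven selection of the N hidden dates can be sketched with an additive Viterbi pass over the 30-day window. The code below is our own illustration; p_user and p_adv stand for P^r and the adversary-visible estimate P, and the exact window bookkeeping is an assumption.

```python
from datetime import date, timedelta

def recommend_checkins(last_seen: date, n: int, p_user, p_adv):
    """Pick the N hidden check-in dates from the 30 days after lastSeenDay that
    maximize the summed weight |P^r(t|s) - P(t|s)| (additive Viterbi).

    p_user(s, t): transition probability from the complete history (P^r);
    p_adv(s, t):  the inference engine's estimate from the disclosed history (P).
    """
    weight = lambda s, t: abs(p_user(s, t) - p_adv(s, t))
    window = [last_seen + timedelta(days=d) for d in range(1, 31)]

    v = [{last_seen: 0.0}]          # v[pos][date] = best accumulated weight
    back = [dict()]
    for pos in range(1, n + 1):
        v.append({})
        back.append({})
        # the pos-th check-in must leave (n - pos) later days for the remaining dates
        for t in window[:len(window) - (n - pos)]:
            cands = [(v[pos - 1][s] + weight(s, t), s) for s in v[pos - 1] if s < t]
            if cands:
                v[pos][t], back[pos][t] = max(cands)

    best = max(v[n], key=v[n].get)  # node with the highest V' at position N
    path = [best]
    for pos in range(n, 1, -1):
        path.append(back[pos][path[-1]])
    return list(reversed(path))     # the N recommended (to-be-suppressed) dates
```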

4.1.2 Recommending the purchase day

The purchaseDay is randomly chosen from one of the dates left after selecting the N check-ins. The purchaseDay is released to the merchant once a user redeems the deal. If it were chosen based on the recommendation strategy, it would leak information about which features yield a higher difference between the adversary's set of probability values and the recommendation engine's values. In addition, the recommendation engine does not use the purchaseDay when it calculates the mobility features' probabilities. If it did, then the adversary could rule out the mobility features associated with this day for the next deal, as those features would have similar probability values in both sets, resulting in smaller differences or weights. Since the purchaseDay does not affect the recommendation strategy, it does not reveal any information to the adversary. To give an example of how the purchaseDay is recommended, consider that the path (Oct 10, Oct 11) in Figure 4 was recommended as the N check-ins in the previous step. The remaining dates including the deadline are {Oct 12, 13, 14}. Here, Oct 12 is not eligible to be the purchaseDay since it would result in an inference input where available days = N. Therefore, one of the remaining dates {Oct 13, Oct 14} is randomly chosen as the purchaseDay.

4.2 Why users would prefer the recommendation engine

4.2.1 Privacy

The recommendation engine chooses check-in dates whose probabilities are very different from the adversary's belief. Knowing just one set of the probability values, the adversary will not be able to predict which mobility features will yield the highest difference and thus cannot tell which dates will be recommended. To analyze an adversary's prediction chance, recall that the recommendation engine chooses a sequence of M dates out of the next 31 days after the lastSeenDay. The total number of sequences of size M is C(31, M). The probability that an adversary will correctly predict all the recommended check-ins is only 1/C(31, M). It is a very small probability, as the adversary will have to pick one out of 465 sequences when M = 2. The probability is even smaller when M > 2. Once a user redeems the deal, the adversary will know the purchaseDay of the deal. The uncertainty in the inference of the recommended dates prior to the purchaseDay is then reduced, as the inference is now limited to the available days, denoted by avdays, which is the number of days between the lastSeenDay and the purchaseDay. avdays may have any value from 2 to 30. The adversary's chance to recover all N dates is 1/C(avdays, N). The best inference chance is 50% when avdays = 2 and N = 1, while the worst chance is 5.89 × 10^−4 % when avdays = 30 and N = 5. The average chance over all combinations of avdays and N is 3.7%.

We test the effectiveness of the recommendation engine using the inference engines introduced earlier. Using the recommended dates, we prepare a series of inference inputs. The MM engine correctly inferred 24.45% of the hidden recommended dates while the Apriori engine inferred only 9.63% of the dates. If the definition of weight in the recommendation engine is slightly changed to weight(t|s) = |P^r(t|s) − P(t|s)| / (P^r(t|s) + P(t|s)), then the MM engine's accuracy drops to 12.05% while the Apriori engine's accuracy stays about the same at 10.51%. These experimental results show that when a user redeems deals based on the recommended dates, an adversary who attempts to infer the dates will have lower accuracy and therefore users will have better privacy.
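The counting argument is easy to reproduce; the snippet below is only a sanity check of the quoted combinatorial quantities and is not part of the framework.

```python
from math import comb

def predict_chance(avdays: int, n: int) -> float:
    """Adversary's chance of guessing all N suppressed dates out of avdays."""
    return 1 / comb(avdays, n)

print(comb(31, 2))           # 465 candidate sequences when M = 2
print(predict_chance(2, 1))  # 0.5, the adversary's best case (avdays = 2, N = 1)
```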

4.2.2 Check-in routine

The mobility features' probabilities that are associated with the recommended dates increase by a greater magnitude in the recommendation engine's calculation than their values increase or decrease in the inference engine's learning. As a result, these features will yield a higher difference than others. It means that a user can expect that the days of the week recommended for one deal will be recommended for the next deal as well. The same is true for time periods. As a result, users will be able to form a routine for check-ins. To support the claim, we ran the Apriori algorithm on the recommended dates to find patterns. On average, for each user-location pair we found 15 frequent patterns of size 2 or more. If the modified definition of weight is used, we found 8 frequent patterns of size 2 or more. It is a fair number of patterns given that the simulation was run for only 20 deals. The results show that users can form a check-in routine which the RwP framework keeps hidden from the merchants through the suppression and the random selection of the purchase day.

5. RELATED WORKS

Privacy in Geo-social commerce: To the best of our knowledge, our work is the first attempt to address the privacy issues of deal redemption in GSNs. The closest to our work is the privacy framework called Shy Mayor [7]. It allows users to secretly check in and earn badges in GSNs using various cryptographic techniques. The framework allows a user to hide among k users checked in at a store. It also allows a user to prove their presence at a commercial location without disclosing their identity to the merchants. However, the authors do not consider any event like deal redemption where users need to interact with the merchant in person. The possibility that every time a user redeems a deal, the merchant may keep a record and use that information during inference is not investigated by the authors. Other than commerce, there are a handful of works that address privacy issues arising from different services offered in GSNs, such as private interest matching for dating [6], private proximity computation between two users [17] and decentralized architectures for GSNs [12]. Freni et al. [10] propose to enforce a user's preference about who can access their location at what precision level by generalizing the spatial and temporal information of a resource (e.g., check-ins, photos, status updates, etc.) and also delaying publication of the resource.

Inference on GSN datasets: In this section, we compare our inference results with the results from other researchers who have run inference on the same or different GSN datasets. The datasets that we have used were originally collected by Cho et al. [8], who applied an Expectation-Maximization based parameter estimation method and obtained 40% accuracy in predicting users' location at a given time of the day. They also report temporal periodicity in users' check-in behavior. Malmi et al. study a foursquare dataset and report 19.5% accuracy using a 1st-order Markov model where the check-in to a place depends only on the previous check-in [13]. The nature of the inference performed by both works is different from ours. Along with the check-in time, these works predict the user's location. In these works, when the engine infers a user's check-in at a place, it has access to the same user's check-ins from all the other places as well. In contrast, we have performed personalized inference where the engine has access to a user's check-in history from only one place. It is not possible to use many features in personalized inference. In addition, users tend to visit different venues of the same category. Instead of the exact venue, the researchers [13] make a prediction about the category of the venue. We could not do that as the datasets did not include venue type information. Cho et al. also report noise in the datasets, which makes it further difficult to find the correlations.

Privacy in the literature: Our work falls into the category of computational privacy. Another notable area is policy-based privacy [16], where data providers' preferences about how their data should be handled are stored with the data. Privacy preferences may include a variety of restrictions like purpose, visibility, granularity, and retention [4]. Data users are obligated to follow the preferences either through agreement or an access control mechanism. In the recent past, we have explored how to enforce data providers' preferences when their data are shared within or between organizations [15].

6. CONCLUSION

People commonly use GSNs to get discounts from merchants by checking in to their stores. But they end up revealing their location information to the merchants. Dishonest merchants may use the information to track users' locations. Existing GSNs do not provide adequate controls to users to mitigate the privacy concerns. In this work, we propose the privacy framework RwP to protect users' privacy when they redeem deals in GSNs. The framework discloses only the necessary check-in information to the merchants. Even after the minimum disclosure, a merchant may run inference and recover the information that was not originally released. We propose a recommendation engine to help users reduce the success rate of the inference attacks. Currently, our recommendation engine assumes that a user accepts all the recommended check-ins for a given deal. In the future, we want to extend the recommendation engine so that it adapts to the situation where users accept only some of the recommendations and can still maintain their privacy. We also plan to extend our work to general check-in events where the user's purpose is not solely redeeming deals. Some of the deals in GSNs require a group of people to check in. For such deals, we plan to enforce the minimum disclosure principle by revealing only the necessary check-in dates of the user who redeems the deal and the number of their accompanying friends to the merchants, while the friends' identities will be suppressed. Even such minimum disclosure may help recover the suppressed check-ins of one of the friends. Furthermore, aggregation of all check-ins on a given day may help to infer the suppressed identities. In the future, we want to investigate these issues in detail.

7. REFERENCES

[1] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu. Hippocratic databases. In Proc. of 28th Int. Conf. on VLDB, pages 143-154, 2002.
[2] R. Agrawal, R. Srikant, et al. Fast algorithms for mining association rules. In Proc. of 20th Int. Conf. VLDB, volume 1215, pages 487-499, 1994.
[3] J. Bao, Y. Zheng, and M. F. Mokbel. Location-based and preference-aware recommendation using sparse geo-social networking data. In Proc. of 20th SIGSPATIAL, pages 199-208, 2012.
[4] K. Barker, M. Askari, M. Banerjee, K. Ghazinour, B. Mackas, M. Majedi, S. Pun, and A. Williams. A data privacy taxonomy. In Dataspace: The Final Frontier, volume 5588 of Lecture Notes in Computer Science, pages 42-54. Springer Berlin Heidelberg, 2009.
[5] A. Brush, J. Krumm, and J. Scott. Exploring end user preferences for location obfuscation, location-based services, and the value of location. In Proc. of the 12th ACM International Conference on Ubiquitous Computing, pages 95-104. ACM, 2010.
[6] J. Camenisch and G. M. Zaverucha. Private intersection of certified sets. In Financial Cryptography and Data Security, pages 108-127. 2009.
[7] B. Carbunar, R. Sion, R. Potharaju, and M. Ehsan. The shy mayor: Private badges in geosocial networks. In Applied Cryptography and Network Security, volume 7341, pages 436-454. 2012.
[8] E. Cho, S. A. Myers, and J. Leskovec. Friendship and mobility: user movement in location-based social networks. In Proc. of 17th ACM SIGKDD, pages 1082-1090. ACM, 2011.
[9] D. A. Field. Laplacian smoothing and Delaunay triangulations. Communications in Applied Numerical Methods, 4(6):709-712, 1988.
[10] D. Freni, C. Ruiz Vicente, S. Mascetti, C. Bettini, and C. S. Jensen. Preserving location and absence privacy in geo-social networks. In Proc. of 19th CIKM, pages 309-318, 2010.
[11] J. Hagenauer and P. Hoeher. A Viterbi algorithm with soft-decision outputs and its applications. In IEEE GLOBECOM, pages 1680-1686 vol. 3, 1989.
[12] S. Jahid, S. Nilizadeh, P. Mittal, N. Borisov, and A. Kapadia. DECENT: A decentralized architecture for enforcing privacy in online social networks. In PERCOM Workshops, pages 326-332, 2012.
[13] E. Malmi, T. M. T. Do, and D. Gatica-Perez. Checking in or checked in: comparing large-scale manual and automatic location disclosure patterns. In Proc. of 11th MUM, pages 26:1-26:10, 2012.
[14] Mashable. Facebook confirms Gowalla acquisition. http://mashable.com/2011/12/05/facebooks-acquiresgowalla.
[15] M. Moniruzzaman and K. Barker. Delegation of access rights in a privacy preserving access control model. In Privacy, Security and Trust (PST), 2011 Ninth Annual International Conference on, pages 124-133, 2011.
[16] M. Moniruzzaman, M. Ferdous, and R. Hossain. A study of privacy policy enforcement in access control models. In Computer and Information Technology (ICCIT), 2010 13th International Conference on, pages 352-357, 2010.
[17] A. Narayanan, N. Thiagarajan, M. Lakhani, M. Hamburg, and D. Boneh. Location privacy via private proximity testing. In NDSS, 2011.
[18] A. Noulas, S. Scellato, N. Lathia, and C. Mascolo. Mining user mobility features for next place prediction in location-based services. In Proc. of 12th IEEE ICDM, pages 1038-1043, 2012.
[19] S.-M. Qin, H. Verkasalo, M. Mohtaschemi, T. Hartonen, and M. Alava. Patterns, entropy, and predictability of human mobility and life. PLoS ONE, 7(12):e51353, 2012.
[20] L. Rabiner and B.-H. Juang. An introduction to hidden Markov models. IEEE ASSP Magazine, 3(1):4-16, 1986.
[21] K. Seymore, A. McCallum, and R. Rosenfeld. Learning hidden Markov model structure for information extraction. In AAAI Workshop on Machine Learning for Information Extraction, pages 37-42, 1999.
[22] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura. Speech parameter generation algorithms for HMM-based speech synthesis. In Proc. of IEEE ICASSP, volume 3, pages 1315-1318, 2000.