Privacy-Preserving Social Network Publication Against Friendship Attacks

Chih-Hua Tai
Dept. of Electrical Engineering, National Taiwan University

Philip S. Yu
Dept. of Computer Science, University of Illinois at Chicago

De-Nian Yang
Inst. of Information Science, Academia Sinica

Ming-Syan Chen
Research Ctr. for Information Tech. Innovation, Academia Sinica

ABSTRACT

Due to the rich information in graph data, the technique for privacy protection in published social networks is still in its infancy, as compared to the protection in relational databases. In this paper we identify a new type of attack called a friendship attack. In a friendship attack, an adversary utilizes the degrees of two vertices connected by an edge to re-identify related victims in a published social network data set. To protect against such attacks, we introduce the concept of k²-degree anonymity, which limits the probability of a vertex being re-identified to 1/k. For the k²-degree anonymization problem, we propose an Integer Programming formulation to find optimal solutions in small-scale networks. We also present an efficient heuristic approach for anonymizing large-scale social networks against friendship attacks. The experimental results demonstrate that the proposed approaches can preserve much of the characteristics of social networks.

Categories and Subject Descriptors
H.2.8 [Database Management]: Database Applications - Data Mining

General Terms
Algorithms

Keywords
privacy, anonymization, social network publication

1. INTRODUCTION

The social relationships and activities shared by individuals in a social network can be modeled as a graph in which each vertex represents an individual, and the social relationships and activities are summarized by the edges. Due to the rapid growth in the number of services and applications that leverage social networks, there is increasing concern about privacy issues in published social networks [8, 15, 20]. The prevention of vertex re-identification is one of the critical privacy issues that have been addressed. The complexity of graph data has motivated various background knowledge attacks [5, 6, 9, 19, 21]. In this paper, we identify a new type of attack, called a friendship attack, based on the vertex degree pair of an edge. Note that in a social networking website, such as Facebook, MySpace or Friendster, an adversary can acquire the number of friends of an individual¹. Moreover, the adversary can also extract the friendship relation between two individuals from the interaction information publicly available on the website. Therefore, using the vertex degrees of two individuals and their friendship relation, the adversary can issue a friendship attack on the published social network to re-identify the vertices corresponding to an individual and his friend as well as associated vertex information, such as hobbies, activities and religious beliefs.

Consider Figure 1 as an example. Suppose that a user's friend count is made publicly available on a social networking website. With the vertex degree information, an adversary cannot uniquely re-identify anyone from the naïve anonymized social network in Figure 1(b). However, if the adversary also knows that Bob and Carl are friends, both Bob and Carl are uniquely identified by the vertex degree pair (2, 4). This example illustrates that it is possible to launch an effective attack and identify individuals as long as the friendship information on the social networking websites can be obtained.

Figure 1: An example of the friendship attack. (a) Original social network G; (b) naïve anonymized G.

To prevent friendship attacks, we introduce the novel concept of k²-degree anonymity, which ensures that the probability of a vertex identity being revealed is not greater than 1/k even if an adversary knows a certain degree pair (d_A, d_B), where A and B are friends and d_A and d_B are their vertex degrees, i.e., the number of friends, respectively. For the k²-degree anonymization problem, we propose an Integer Programming formulation to find optimal solutions and an efficient heuristic approach for anonymizing large-scale social networks. After that, we also consider a stronger attack based on the degrees of a sequence of friends with length l, and extend the proposed algorithm to achieve the corresponding k^l-degree anonymity. The experimental results show that the social networks anonymized by our approaches preserve the characteristics of the original social networks.

¹ Nowadays the default setting in Facebook is to display all friends of an individual in his/her corresponding profile, in order to encourage users to find common friends.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'11, August 21–24, 2011, San Diego, California, USA. Copyright 2011 ACM 978-1-4503-0813-7/11/08 ...$10.00.

2. RELATED WORK

A number of recent studies have been proposed to protect vertex identities in published social networks against various attacks [5, 6, 9, 13, 19, 21]. In [9], Liu and Terzi consider the attack of vertex degree, and propose k-degree anonymity to protect each individual in a group consisting of at least k vertices of the same degree. For the same attack model, Tai et al. [13] propose k-structural diversity to resist vertex re-identification as well as community re-identification. While the above approaches aim to prevent attacks based on vertex degrees, the objective of this paper is to prevent friendship attacks based on degree pairs. Zhou and Pei [19] identify the knowledge of the neighborhood configuration and propose a method to prevent 1-neighborhood attacks. Since such attacks focus on the connectivity of a subgraph within the range of one hop to a vertex, the method in [19] does not consider the vertex degrees of neighbors; thus, it cannot protect against friendship attacks. For example, the graph in Figure 1(b) satisfies 1-neighborhood anonymity for k set as 2, but the vertex pairs 2 and 3, 6 and 7, 4 and 5, and 8 and 9 can be uniquely re-identified by the degree pairs (2, 4), (2, 3), (3, 1) and (4, 1), respectively. Zou et al. [21] propose k-automorphism against subgraph attacks with an arbitrary range to any vertex. However, since any arbitrarily large subgraph must be protected, their approaches need to add many new vertices and adjust many edges of the original graph to make a graph symmetric for anonymization. This decreases the utility of the anonymized graph for some applications. Therefore, unlike [21], we focus on only degree pairs or a limited range of degree sequences that can be specified by users for the corresponding applications. Moreover, in contrast to [19, 21], we preserve the vertex set, i.e., we do not add or delete vertices, to better maintain the data utility of the anonymized graph.

The approaches proposed in [5, 6] also protect against subgraph attacks. However, unlike the above works, [5] groups vertices and then collapses each group of vertices into a single super-vertex for privacy protection, while [6] partitions the graph into local substructures and treats each substructure as a single unit to be anonymized.

Other works focus on the problem of link disclosure [3, 16, 17, 18]. Ying and Wu [16] investigate edge re-identification without considering the background knowledge of an adversary and perform random edge addition, edge deletion and edge swap for anonymization. Zhang et al. [17] assume that an adversary knows some vertex descriptions such as degrees, and propose reducing the probability of the existence of an edge linking two individuals by edge swap and edge deletion. The approach does not consider the resistance to vertex re-identification. Therefore, it is possible that a victim can be uniquely re-identified, even if the probability that an edge links the victim and his/her friend is small. For example, in the social network in Figure 1, an attacker cannot determine whether Bob and Carl are friends from the anonymized graph because the probability of an edge linking a vertex of degree 2 and a vertex of degree 4 is 1/6. However, if the friendship information is known, both Bob and Carl can be uniquely identified by the degree pair (2, 4). Cheng et al. [3] propose k-isomorphism to protect links as well as vertices against subgraph attacks. Similar to [21], they do not impose restrictions on the size of a subgraph, and they need to add vertices to anonymize a graph such that it becomes symmetric. In addition, the work in [18] considers multiple edge types, and proposes five strategies to protect sensitive edges.

3. PROBLEM FORMULATION

In this paper, we model a social network as a simple graph G = (V, E), where V is the set of vertices corresponding to the individuals, and E ⊆ V × V is the set of edges representing the relationships between the individuals. Let d_v denote the degree of a vertex v. For a published social network Ḡ of G, we define a friendship attack as follows.

Definition 1. Friendship Attack. Given a target individual A and the degree pair information D_2 = (d_1, d_2), a friendship attack (D_2, A) exploits D_2 to identify a vertex v_1 corresponding to A in Ḡ, such that v_1 connects to another vertex v_2 in Ḡ with the degree pair (d_{v_1}, d_{v_2}) = (d_1, d_2).

The published social network Ḡ may have multiple candidate vertices satisfying the above degree pair requirement. However, it is easier for the adversary to identify A from the candidate vertices when the number of candidate vertices is small. Therefore, to achieve privacy preservation, we define k²-degree anonymity as follows.

Definition 2. k²-Degree Anonymity. A graph Ḡ is k²-degree anonymous if, for every vertex with an incident edge of degree pair (d_1, d_2) in Ḡ, there exist at least k − 1 other vertices, such that each of the k − 1 vertices also has an incident edge of the same degree pair.

If Ḡ has an isolated vertex, i.e., a vertex with degree 0, k²-degree anonymity requires that at least k − 1 other vertices with degree 0 must exist to ensure anonymity.

Figure 2 shows examples of different anonymous graphs. The graph in Figure 2(a) has four candidate vertices {1, 4, 5, 6} with D_2 = (1, 3), and two candidate vertices {2, 3} with D_2 = (3, 3) and D_2 = (3, 1). Therefore, the graph is 2²-degree anonymous. Similarly, the graph in Figure 2(b) is 3²-degree anonymous because all three vertices are candidate vertices with D_2 = (2, 2).
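Definition 2 can be checked mechanically by grouping vertices according to the degree pairs of their incident edges. The following is a minimal Python sketch of that check, not code from the paper. The function names `candidate_groups` and `k2_anonymity_level` are illustrative assumptions, and the two edge lists are hypothetical graphs built only to match the textual description of Figures 2(a) and 2(b) (the exact drawings are not available here).

```python
from collections import defaultdict


def candidate_groups(vertices, edges):
    """Group vertices by the degree pairs of their incident edges.

    Returns {(d1, d2): set of vertices of degree d1 that have a neighbor of
    degree d2}. Isolated vertices are collected under the key (0, None).
    """
    deg = {v: 0 for v in vertices}
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    groups = defaultdict(set)
    for u, v in edges:
        groups[(deg[u], deg[v])].add(u)  # u seen from its own side of the edge
        groups[(deg[v], deg[u])].add(v)  # v seen from its own side of the edge
    for v in vertices:
        if deg[v] == 0:
            groups[(0, None)].add(v)
    return dict(groups)


def k2_anonymity_level(vertices, edges):
    """Largest k for which the graph is k^2-degree anonymous (0 if empty)."""
    groups = candidate_groups(vertices, edges)
    return min(len(g) for g in groups.values()) if groups else 0


# Hypothetical graph matching the description of Figure 2(a):
# vertices 2 and 3 have degree 3, vertices 1, 4, 5, 6 have degree 1.
fig2a_vertices = [1, 2, 3, 4, 5, 6]
fig2a_edges = [(1, 2), (5, 2), (2, 3), (3, 4), (3, 6)]

# Hypothetical graph matching the description of Figure 2(b): a triangle.
fig2b_vertices = [1, 2, 3]
fig2b_edges = [(1, 2), (2, 3), (1, 3)]

if __name__ == "__main__":
    print(candidate_groups(fig2a_vertices, fig2a_edges))
    print(k2_anonymity_level(fig2a_vertices, fig2a_edges))
    print(k2_anonymity_level(fig2b_vertices, fig2b_edges))
```

Under these assumed edge lists the smallest candidate group has size 2 for the first graph and 3 for the triangle, which agrees with the 2²-degree and 3²-degree labels in the text. Taking the minimum group size as the anonymity level also makes the isolated-vertex rule fall out naturally: a single isolated vertex forms a group of size 1.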
The graph in Figure 2(c) has candidate vertices {1, 5} with D_2 = (1, 2), {2, 7} with D_2 = (2, 3), {6, 7} with D_2 = (2, 2), {2, 6} with D_2 = (2, 1), {4, 9} with D_2 = (1, 3) and {3, 8} with D_2 = (3, 1), (3, 2) and (3, 3). Every degree pair therefore has two candidate vertices, and the graph is 2²-degree anonymous.

Note that k²-degree anonymity does not simply require k edges with degree pair (d_1, d_2), because the k edges in this case are allowed to share some common vertices. Thus, such an edge-based approach does not guarantee that at least k vertices will be provided for each degree in (d_1, d_2). An adversary with the information D_2 = (d_1, d_2) can launch a friendship attack as well as a vertex degree attack, i.e., D = (d_1), to identify a vertex corresponding to A.
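The difference between a vertex degree attack D = (d_1) and a friendship attack D_2 = (d_1, d_2) can be illustrated with a short sketch. The code below is an illustration under stated assumptions rather than the authors' implementation: the graph `edges` is a small hypothetical example (it is not the graph of Figure 1), and the helper names are invented for this example. It returns the candidate vertices that each attack leaves the adversary with.

```python
from collections import defaultdict


def degrees(edges):
    """Return {vertex: degree} for an undirected simple graph given as edge pairs."""
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def degree_attack_candidates(edges, d1):
    """Vertex degree attack D = (d1): every vertex of degree d1 is a candidate."""
    return {v for v, d in degrees(edges).items() if d == d1}


def friendship_attack_candidates(edges, d1, d2):
    """Friendship attack D2 = (d1, d2): vertices of degree d1 with a neighbor of degree d2."""
    deg = degrees(edges)
    candidates = set()
    for u, v in edges:
        if deg[u] == d1 and deg[v] == d2:
            candidates.add(u)
        if deg[v] == d1 and deg[u] == d2:
            candidates.add(v)
    return candidates


# Hypothetical published graph: vertex 3 has degree 4, vertices 2 and 8 have degree 2.
edges = [(1, 2), (2, 3), (3, 4), (3, 5), (3, 6), (7, 8), (8, 9)]

if __name__ == "__main__":
    print(degree_attack_candidates(edges, 2))         # two candidates: 2 and 8
    print(friendship_attack_candidates(edges, 2, 4))  # only vertex 2 remains
```

In this assumed graph the degree alone leaves two candidates, whereas the degree pair (2, 4) leaves exactly one. The friendship candidates are always a subset of the degree candidates, which is the intuition behind the observation that the friendship attack subsumes the vertex degree attack.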

However, k²-degree anonymity has the following properties of downward closure protection.

Proposition 1. If a graph Ḡ is k²-degree anonymous, Ḡ is also k-degree anonymous.

Proposition 2. If a graph Ḡ is (k_1)²-degree anonymous, Ḡ is also (k_2)²-degree anonymous for every k_2 ≤ k_1.

Figure 2: Examples of anonymous graphs: (a) 2²-degree, (b) 3²-degree and (c) 2²-degree.

With the above properties, the problem involves anonymizing a graph G = (V, E) to a published version Ḡ = (V̄, Ē) such that Ḡ is k²-degree anonymous. To limit the distortion in Ḡ, we preserve the vertex set, i.e., V̄ = V, and allow only edge addition and deletion operations. Preserving the vertex set prevents the addition or removal of any individual. Since no potential candidates are removed, all possible leaders, influential vertices, and bridge vertices remain. This allows all corresponding behavior patterns to be more easily identified. In contrast, when vertex addition and deletion operations are allowed [3, 21], the information about vertices is likely to be distorted. A similar argument has been pointed out in record anonymization of relational databases, where the deletion or addition of records is considered less desirable than attribute value perturbation or generalization [4]. Based on the above observation, we define the anonymization cost incurred by edge addition and deletion, which are also considered in the anonymization cost of previous works² [15, 20].

² However, this paper slightly generalizes the cost model and allows users to specify different weights for edge addition and deletion.

Definition 3. Anonymization Cost. Given a weight ω, 0 < ω < 1, the cost of anonymizing G = (V, E) to Ḡ = (V, Ē) is Cost(G, Ḡ) = ω|Ē \ E| + (1 − ω)|E \ Ē|.

When ω approaches zero, we prefer to preserve the existing edges and generate a super graph Ḡ of G. In contrast, a larger ω leads to more edge deletions. We compare the solutions with different ω settings in Section 6.4. Next, we define the anonymization problem considered in this paper.

Definition 4. Problem k²-Degree Anonymization. Given G = (V, E), a positive integer k, and a weight ω, 0 < ω < 1, the problem is to minimize Cost(G, Ḡ) for anonymizing G to Ḡ = (V, Ē) such that Ḡ is k²-degree anonymous.

To solve the problem, we propose an Integer Programming formulation in Section 4 to find optimal solutions for small instances. Then, in Section 5, we present an efficient heuristic algorithm for anonymizing large-scale social networks.

4. INTEGER PROGRAMMING

In this section, we propose an Integer Programming formulation for k²-degree anonymization. The formulation contains the following binary variables. Variable α_{u,v} represents if a new edge (u, v) is added to G, and variable δ_{u,v} denotes if an existing edge is deleted from G. Variable γ_{u,m} indicates if the degree of vertex u is m, where Δ denotes the set of degrees. Variable φ_{m,n} indicates if D_2 = (m, n) is an anonymous group that needs to be protected from friendship attacks, where Δ⁺ denotes the set of positive degrees, i.e., Δ = Δ⁺ ∪ {0}. Variable θ_{m,n,u} represents whether vertex u is with degree m and is in an anonymous group D_2 = (m, n), and ε_{u,v,n} denotes if an edge (u, v) exists in the solution graph Ḡ, with the degree of vertex v as n. Specifically, the objective function is as follows.

\[
\min \; \omega \sum_{(u,v) \notin E} \alpha_{u,v} + (1 - \omega) \sum_{(u,v) \in E} \delta_{u,v}.
\]

The formulation includes the following three categories of constraints to ensure that α_{u,v} and δ_{u,v} correspond to the solution to k²-degree anonymization.

1. Degree Constraints

\[
\sum_{(u,v) \notin E} \alpha_{u,v} + \sum_{(u,v) \in E} (1 - \delta_{u,v}) = \sum_{m \in \Delta} m \times \gamma_{u,m}, \quad \forall u \in V \tag{1}
\]

\[
\sum_{m \in \Delta} \gamma_{u,m} = 1, \quad \forall u \in V \tag{2}
\]

The two constraints find the degree of each vertex. The left-hand-side (LHS) of constraint (1) is the sum of the number of edges added to G and the number of existing edges that are not deleted. Constraint (2) ensures that only a unique degree is selected for each vertex, and the right-hand-side (RHS) of constraint (1) thereby guarantees that only γ_{u,m} with the correct degree is 1 for every vertex u.

2. Anonymization Constraints

\[
\gamma_{u,m} + \gamma_{v,n} + \alpha_{u,v} \le 2 + \varphi_{m,n}, \quad \forall (u,v) \notin E, \; \forall m, n \in \Delta^{+} \tag{3}
\]

\[
\gamma_{u,m} + \gamma_{v,n} - \delta_{u,v} \le 1 + \varphi_{m,n}, \quad \forall (u,v) \in E, \; \forall m, n \in \Delta^{+} \tag{4}
\]

\[
k \times \varphi_{m,n} \le \sum_{u \in V} \theta_{m,n,u}, \quad \forall m, n \in \Delta^{+} \tag{5}
\]

\[
\theta_{m,n,u} \le \gamma_{u,m}, \quad \forall m, n \in \Delta^{+}, \; \forall u \in V \tag{6}
\]

Constraints (3) and (4) identify the anonymous groups that must be protected. Specifically, an anonymous group D_2 = (m, n) needs to be protected if there are two vertices u and v with degrees m and n respectively, and u and v are connected in Ḡ. We consider two cases. First, constraint (3) considers that (u, v) does not exist in G, but is added to the solution. In this case, the LHS of constraint (3) is 3, forcing that φ_{m,n} on the RHS must be 1. In the second case, if the existing (u, v) is not removed from G, i.e., δ_{u,v} = 0, the LHS of constraint (4) is 2, and φ_{m,n} on the RHS must be 1. Therefore, the above two constraints guarantee that φ_{m,n} is 1 for each anonymous group that needs to be protected from friendship attacks.

Constraint (5) ensures k²-degree anonymity by enforcing that there are at least k vertices for each of the two degrees in an anonymous group. Specifically, for each anonymous group D_2 = (m, n) with φ_{m,n} as 1, the RHS of (5) chooses at least k different vertices in V for m, and θ_{m,n,u} of each selected vertex u is 1. Meanwhile, constraint (6) guarantees that the degree of each selected vertex u must be m. Note that because edge (u, v) is identical to (v, u), constraints (3) and (4) will implicitly enforce φ_{n,m} to be 1 if φ_{m,n} is also set as 1 by these two constraints. Therefore, another k vertices in V are chosen for n in constraint (5), and θ_{n,m,v} of each chosen vertex v is guaranteed to be 1

in this case. The degree of each chosen vertex v will be set as n by constraint (6).

3. Enforcement Constraints

\[
\theta_{m,n,u} \le \sum_{v \in V : v \ne u} \varepsilon_{u,v,n}, \quad \forall m, n \in \Delta^{+}, \; \forall u \in V \tag{7}
\]

\[
\varepsilon_{u,v,n} \le \gamma_{v,n}, \quad \forall u, v \in V, \; \forall n \in \Delta^{+} \tag{8}
\]

\[
\varepsilon_{u,v,n} \le \alpha_{u,v}, \quad \forall (u,v) \notin E, \; \forall n \in \Delta^{+} \tag{9}
\]

\[
\varepsilon_{u,v,n} \le 1 - \delta_{u,v}, \quad \forall (u,v) \in E, \; \forall n \in \Delta^{+} \tag{10}
\]

After the Anonymization Constraints choose the vertices for each anonymous group D_2 = (m, n), the Enforcement Constraints ensure that each chosen vertex u with degree m connects to another vertex with degree n. Specifically, when θ_{m,n,u} is 1, constraint (7) assigns at least one vertex v with ε_{u,v,n} as 1, and constraint (8) enforces that the degree of v must be n. Moreover, constraints (9) and (10) require that u and v must be connected by an edge in Ḡ. If the edge does not appear in the input graph G, constraint (9) will add this edge to the solution. On the other hand, constraint (10) enforces that this edge cannot be deleted, i.e., δ_{u,v} = 0, when (u, v) is in G. Therefore, u and v must belong to the same anonymous group with D_2 = (m, n). Moreover, to consider the isolated vertices with degree 0, we include the following constraint.

\[
k \times \gamma_{u,0} \le \sum_{v \in V} \gamma_{v,0}, \quad \forall u \in V \tag{11}
\]

The above constraint states that if any vertex u has degree 0, the solution must have at least k vertices with the same degree to ensure the anonymity of u.

5. SCALABLE APPROACH

In this section, we propose a scalable algorithm, called DEgree SEquence ANonymization (DESEAN), for k²-degree anonymization of large-scale social networks. Algorithm DESEAN consists of three steps. The first step clusters vertices with similar degrees, selects a target degree for each cluster, and ensures that each cluster contains at least k vertices. In order to achieve the required level of anonymity protection between two clusters, the second step adds or removes edges as necessary. The last step adjusts the edges in the graph such that all the vertices in each cluster meet the target degree selected in step 1. We explain each step in detail below.

Step 1. Degree Sequence Anonymization. This step chooses at least k vertices with similar degrees for each cluster. Without loss of generality, we assume that the vertices are sorted in decreasing order of the degrees, i.e., d_{v_i} ≥ d_{v_j} for all i ≤ j. The step cuts the sequence into multiple subsequences, each of which will share the same degree with steps 2 and 3 explained later. Specifically, for each v_i and v_j, Formula (12) first evaluates the cost CluCost(v_i, v_j, d_x) of grouping vertices v_i, ..., v_j in the same subsequence with a target degree d_x.

\[
\mathrm{CluCost}(v_i, v_j, d_x) = (1 - \omega) \sum_{d_{v_i} \ge d_u > d_x} (d_u - d_x) + \omega \sum_{d_x > d_u \ge d_{v_j}} (d_x - d_u), \tag{12}
\]

where the first and second terms are the anonymization costs for edge deletion and addition, respectively. Let DegCost(v_1, v_j) denote the minimum cost of dividing the overall vertex sequence v_1, ..., v_j such that each subsequence contains at least k vertices. We derive DegCost(v_1, v_j) with dynamic programming as follows.

\[
\mathrm{DegCost}(v_1, v_j) = \min \Big\{ \min_{d_x} \mathrm{CluCost}(v_1, v_j, d_x), \; \min_{k \le i \le j-k} \big( \mathrm{DegCost}(v_1, v_i) + \min_{d_x} \mathrm{CluCost}(v_{i+1}, v_j, d_x) \big) \Big\}, \tag{13}
\]

where min_{d_x} CluCost(v_1, v_j, d_x) corresponds to the case that v_1, ..., v_j belong to the same subsequence. To divide v_1, ..., v_j into multiple subsequences, our dynamic programming scheme leverages the existing result DegCost(v_1, v_i) and finds the minimum anonymization cost to move v_{i+1}, ..., v_j to the same subsequence. In other words, given the existing subsequences for v_1, ..., v_i, we attach a new subsequence v_{i+1}, ..., v_j to the result. Therefore, i ≥ k must hold to ensure that we have at least k vertices in each subsequence from v_1, ..., v_i. Similarly, i ≤ j − k must hold to ensure that there exist at least k vertices in the new attached subsequence. If no vertex v_i satisfies the above two conditions, we set DegCost(v_1, v_j) as min_{d_x} CluCost(v_1, v_j, d_x).

The above dynamic programming scheme requires O(|V|^4) time. We devise the following strategies to reduce the complexity to O(k|V| log |V|) time. The first strategy reduces the time required to find CluCost(v_i, v_j, d_x). We observe that CluCost(v_i, v_j, d_x) first decreases monotonically and then increases monotonically as the target degree d_x varies from d_{v_i} to d_{v_j}. In other words, the function of d_x is convex. Intuitively, as d_x approaches d_{v_i}, the number of vertices required to add edges is significantly more than the number of vertices that are required to remove edges. Therefore, the cost increases accordingly. Similarly, as d_x approaches d_{v_j}, we have many more vertices required to remove edges. Based on the above observation, we propose the following Three-Point Binary Search (TPBS) scheme to find the optimal d_x.

1. Set d_L as d_{v_i} and d_S as d_{v_j}. The optimal d_x must be within the range [d_{v_i}, d_{v_j}], i.e., d_{v_i} ≥ d_x ≥ d_{v_j}.
2. Set d_Avg as (d_L + d_S)/2. Let C⁺ = CluCost(v_i, v_j, d_Avg + 1), C = CluCost(v_i, v_j, d_Avg), and C⁻ = CluCost(v_i, v_j, d_Avg − 1).
3. Consider the three cases in Figure 3: (a) if C⁺ ≥ C and C ≤ C⁻, return C and d_Avg, and terminate the search process; (b) if C⁺ ≤ C ≤ C⁻, set d_S as d_Avg because the optimal d_x ∈ [d_L, d_Avg]; or (c) if C⁺ ≥ C ≥ C⁻, set d_L as d_Avg because the optimal d_x ∈ [d_Avg, d_S].
4. Go to 2 and repeat.

Figure 3: Illustrations of TPBS, cases (a), (b) and (c).

The TPBS scheme can reduce the complexity of finding min_{d_x} CluCost(v_i, v_j, d_x) for each (v_i, v_j) to O(|V| log |V|) time. Next, we further reduce the complexity to O(log |V|) with the proposed Pre-Cumulation (PC) strategy, which determines the cost of decreasing or increasing the degrees in a subsequence to a proper degree d in advance. Specifically, let LV[d] denote the number of vertices with degrees larger than d. First, we determine the following cost of reducing the degrees to ensure that the degrees of all vertices are at most d,

\[
\mathrm{Dec2Deg}[d] = \sum_{d_u > d} d_u - d \times LV[d]. \tag{14}
\]

With LV[d] and Dec2Deg[d], we can derive the cost of decreasing the degrees such that the degrees in a subsequence v_i, ..., v_j are at most d_x,

\[
\begin{aligned}
\sum_{d_{v_i} \ge d_u > d_x} (d_u - d_x)
&= \Big( \sum_{d_u > d_x} d_u - \sum_{d_u > d_{v_i}} d_u \Big) - d_x \big( LV[d_x] - LV[d_{v_i}] \big) \\
&= \mathrm{Dec2Deg}[d_x] - \mathrm{Dec2Deg}[d_{v_i}] - (d_{v_i} - d_x) \, LV[d_{v_i}].
\end{aligned} \tag{15}
\]

Therefore, the first part of Formula (12) can be obtained in O(1) time with LV[d] and Dec2Deg[d] derived in O(|V|) time in advance. Similarly, let SV[d] denote the number of vertices with degrees smaller than d. Then, we derive the cost of increasing the degrees such that the degrees of all vertices are at least d,

\[
\mathrm{Inc2Deg}[d] = d \times SV[d] - \sum_{d_u < d} d_u. \tag{16}
\]
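Step 1 as described above (Formula (12), the recurrence (13), and the idea of precomputing cumulative sums) can be sketched in Python as follows. This is a simplified illustration, not the authors' implementation, and it departs from the text in three labeled ways: it uses position-indexed prefix sums over the sorted degree sequence instead of the degree-indexed tables LV, SV, Dec2Deg and Inc2Deg; it finds the best target degree with a plain binary search on the discrete slope of the convex cost instead of the three-point procedure; and it does not bound the cluster length, so it does not reach the stated O(k|V| log |V|) bound. All function names are assumptions made for this example.

```python
from itertools import accumulate


def make_clu_cost(deg, omega):
    """Build clu_cost(i, j, dx) for a degree list sorted in decreasing order.

    clu_cost evaluates Formula (12) on the 0-based inclusive slice deg[i..j].
    """
    prefix = [0] + list(accumulate(deg))  # prefix[t] = deg[0] + ... + deg[t-1]

    def clu_cost(i, j, dx):
        # Binary search for the first position p in [i, j + 1] with deg[p] <= dx.
        lo, hi = i, j + 1
        while lo < hi:
            mid = (lo + hi) // 2
            if deg[mid] > dx:
                lo = mid + 1
            else:
                hi = mid
        p = lo
        # Degrees above dx must be lowered (edge deletions).
        deletion = (prefix[p] - prefix[i]) - dx * (p - i)
        # Degrees at or below dx must be raised (edge additions).
        addition = dx * (j + 1 - p) - (prefix[j + 1] - prefix[p])
        return (1 - omega) * deletion + omega * addition

    return clu_cost


def best_target(clu_cost, deg, i, j):
    """Minimize the convex function dx -> clu_cost(i, j, dx) over [deg[j], deg[i]]."""
    lo, hi = deg[j], deg[i]
    while lo < hi:
        mid = (lo + hi) // 2
        if clu_cost(i, j, mid + 1) >= clu_cost(i, j, mid):
            hi = mid       # cost no longer decreases: a minimizer is at mid or below
        else:
            lo = mid + 1   # cost still decreases: the minimizer is above mid
    return lo, clu_cost(i, j, lo)


def deg_cost(degrees, k, omega=0.5):
    """Dynamic program of Formula (13); returns (total cost, clusters).

    Each cluster is (first index, last index, target degree) in the sorted order.
    """
    deg = sorted(degrees, reverse=True)
    n = len(deg)
    if n < k:
        raise ValueError("need at least k vertices")
    clu_cost = make_clu_cost(deg, omega)
    inf = float("inf")
    best = [inf] * (n + 1)  # best[j] = cost of clustering the first j vertices
    cut = [0] * (n + 1)     # cut[j] = start index of the last cluster
    best[0] = 0.0
    for j in range(k, n + 1):
        for i in range(0, j - k + 1):  # last cluster is deg[i..j-1], size >= k
            if best[i] == inf:
                continue
            _, cost = best_target(clu_cost, deg, i, j - 1)
            if best[i] + cost < best[j]:
                best[j], cut[j] = best[i] + cost, i
    clusters, j = [], n
    while j > 0:
        i = cut[j]
        target, _ = best_target(clu_cost, deg, i, j - 1)
        clusters.append((i, j - 1, target))
        j = i
    clusters.reverse()
    return best[n], clusters


if __name__ == "__main__":
    print(deg_cost([5, 5, 3, 3, 1, 1], k=2, omega=0.5))
```

For the degree sequence 5, 5, 3, 3, 1, 1 with k = 2 the sketch returns cost 0.0 and the three clusters (0, 1, 5), (2, 3, 3) and (4, 5, 1), since the sequence already splits into pairs of equal degree. Indexing the prefix sums by position rather than by degree value is a deliberate simplification: it keeps vertices with tied degrees that lie outside the subsequence from being counted.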

... k holds, or v has more than one edge incident to the vertices in x with SatTbl[y][x] = k.

Once the target degree of a vertex u is achieved, condition 1 ensures that the degree of u would not be changed in the subsequent process. Note that SatTbl[x][y] counts the number of vertices in x with edges connecting to vertices in y. Conditions 2 and 3 guarantee that SatTbl[x][y] ≥ k and SatTbl[y][x] ≥ k still hold after the deletion of (u, v). Therefore, we can remove the edge (u, v) to reduce the degree of u without violating k²-degree anonymity for protecting the pairs of connected vertices in x and y. This process repeats until the degree of u is d_x. After all the vertices have been processed to achieve their target degrees, DESEAN outputs the anonymized graph.

Example 3. Following Example 2, DESEAN scans the vertices in the order of D, E, G, I, B, F, H, J, A, C. First, for vertex D with d_D > 4, DESEAN removes the edge (D, G) to reduce the degree of D. The edge (D, G) is chosen because G is a subsequent vertex with the largest degree difference among vertices A, B, F and G. In addition, the deletion of (D, G) will not violate 2²-degree anonymity for protecting the vertices in the clusters {D, E} and {G, I}. Similarly, the edge (I, J) is then deleted to reduce the degree of vertex I. Figure 4(d) shows the resulting 2²-degree anonymous graph.
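The bookkeeping named in the text, SatTbl[x][y] as the number of vertices in cluster x with edges to vertices in cluster y, can be sketched as follows. This is only an illustration of that counter and of the invariant that SatTbl[x][y] ≥ k and SatTbl[y][x] ≥ k must still hold after an edge (u, v) is deleted. The exact deletion conditions of DESEAN are not fully legible in this copy, so the check below should not be read as the authors' procedure. The graph, the cluster labels and the function names are hypothetical.

```python
from collections import defaultdict


def sat_table(adj, cluster_of):
    """SatTbl[x][y]: number of vertices in cluster x with an edge into cluster y."""
    tbl = defaultdict(lambda: defaultdict(int))
    for u, neighbors in adj.items():
        for y in {cluster_of[v] for v in neighbors}:
            tbl[cluster_of[u]][y] += 1
    return tbl


def deletion_keeps_k(adj, cluster_of, u, v, k):
    """True if SatTbl[x][y] >= k and SatTbl[y][x] >= k still hold without edge (u, v)."""
    x, y = cluster_of[u], cluster_of[v]
    trial = {w: set(neighbors) for w, neighbors in adj.items()}  # copy, then delete
    trial[u].discard(v)
    trial[v].discard(u)
    tbl = sat_table(trial, cluster_of)
    return tbl[x][y] >= k and tbl[y][x] >= k


# Hypothetical example: clusters x = {a, b} and y = {c, d}.
cluster_of = {"a": "x", "b": "x", "c": "y", "d": "y"}
dense = {"a": {"c", "d"}, "b": {"c", "d"}, "c": {"a", "b"}, "d": {"a", "b"}}
sparse = {"a": {"c"}, "b": {"d"}, "c": {"a"}, "d": {"b"}}

if __name__ == "__main__":
    print(deletion_keeps_k(dense, cluster_of, "a", "c", k=2))   # a and c keep other links
    print(deletion_keeps_k(sparse, cluster_of, "a", "c", k=2))  # a loses its only link to y
```

In the dense example every vertex keeps a second edge into the other cluster, so both counters stay at 2. In the sparse example the deletion drops both counters to 1, and the check rejects it for k = 2.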

6.

Original G Degree Friendship 0.28% 5.37% 0.53% 10.69% 0.73% 14.71% 0.93% 18.44%

k-degree G Friendship 2.89% 4.65% 5.82% 7.23%

k2 -degree G Friendship 0.38% 1.10% 1.43% 2.27%

at the PODS 2009 conference. An edge is added to connect two authors (vertices) if they are co-authors of a paper. There are 107 edges in the PODS09 data set. The other data set is the 20TopConf data set containing the authors who have ever published their papers at 20 top conferences such as KDD, VLDB, ICDM, to name a few. 20TopConf consists of 30,749 vertices representing the paper authors and 78,539 edges representing the co-author relationships. Synthetic data sets: We use the R-MAT graph model [2] to generate synthetic graphs. The model takes four parameters a, b, c and d as inputs, where a + b + c + d = 1. The generated graphs have the power-law degree distributions and small-world properties, which are observed in many realworld social networks. In this work, we use the suggested settings (0.45, 0.15, 0.15 and 0.25) for the four corresponding parameters, and generate two data sets, SD-SG (small dense synthetic graph) and LS-SG (large sparse synthetic graph). SD-SG consists of 20,000 vertices and 150,000 edges, while LS-SG contains 100,000 vertices and 260,000 edges.

6.2 Privacy Breaches under Friendship Attacks First, we show that the friendship attack is an important privacy issue in published social networks. We launch both the vertex degree attack and the friendship attack on the original 20TopConf. We also launch the friendship attack on the k-degree anonymous 20TopConf. Table 1 reports the percentages of vertices that could be revealed with a probability greater than 1/k. We observe that a friendship attack can cause a much more serious privacy breach than a vertex degree attack, and that k-degree anonymity does not provide effective protection against friendship attacks. Second, the Integer Programming method guarantees to generate k2 -degree anonymous graphs. Similar to other twophase graph anonymization methods [9], DESEAN may not be able to achieve the target anonymous degrees for a few vertices. Therefore, we also test the power of friendship attacks on 20TopConf anonymized by DESEAN with ω set as 0.5. As shown in Table 1, DESEAN can provide far more effective protection against friendship attacks than k-degree anonymity when an attacker has the precise knowledge of a degree pair (d1 , d2 ). If an attacker only knows the degree pair with an uncertainty range of 1, i.e., (d1 ± 1, d2 ± 1), the percentages of vertices that the attacker can reveal with a probability greater than 1/k are reduced to 0.01%, 0.09%, 0.23% and 0.32% for k equal to 5, 10, 15 and 20, respectively.

PERFORMANCE STUDIES

In this section, we evaluate the power of friendship attacks and the performance of our approaches on both real and synthetic data sets. The utility of the anonymized graphs is demonstrated from the degree distributions, degree centralities, clustering coefficients, average path lengths and the numbers of edge changes. The programs are implemented in C++. All experiments were performed on a Debian GNU/Linux server with double dual-core 2.4 GHz Opteron processors and 4GB RAM.

6.3 Data Utility In this subsection, we demonstrate that our approaches preserve much of the original graphs from the degree distributions, degree centralities, clustering coefficients, average path lengths and the numbers of edge changes. The test data include two real data sets, PODS09 and 20TopConf,

6.1 Data Sets Real data sets: From the DBLP database, we extract two social networks with different scales. One is the PODS09 data set containing 68 authors whose papers were published

1267

Degree Freq.

Degree Freq.

15

200 original k-degree DESEAN

160

10 5 0

120 80 40 0

0

2

4

6

8

160 original k-degree DESEAN

160 120 80 40

20

40

60

80

100

120

40 0

20

40

Vertex Degree

(a) PODS09, k = 4

80

0

10

Vertex Degree

original k-degree DESEAN

120

Degree Freq.

200 original optimal DESEAN

20

Degree Freq.

25

60

80

100

120

30 40 50 60 70 80 90 100 110

Vertex Degree

(b) 20TopConf, k = 20

Vertex Degree

(c) SD-SG, k = 20

(d) LS-SG, k = 25

Figure 5: Degree distributions. 0.1

0.015

0.006

original k-degree

0.09

0.002

original k-degree

0.06 0.05 4

5

original k-degree

0.005

6

5

10

15

20

k

25

0.002

0.001

50

5

10

15

20

k

(a) PODS09

DESEAN

DESEAN

0 3

0.003

0.01

DC

0.07

2

DESEAN

0.004

DC

DC

0.08

DESEAN

DC

original optimal

25

50

5

10

15

20

k

(b) 20TopConf

25

50

25

50

k

(c) SD-SG

(d) LS-SG

Figure 6: Degree Centralities. 1

0.8

0.0035 original k-degree DESEAN

0.01 0.8

0.75

0.008

original k-degree DESEAN

0.0028

original optimal

0.4

DESEAN

0.7 original k-degree

CC

CC

CC

CC

0.0021 0.6

0.006

0.0014

DESEAN 0.004

0.0007

0.65 2

3

4

5

6

5

10

15

k

20

25

50

5

10

15

20

k

(a) PODS09

25

50

5

10

15

k

(b) 20TopConf

20

k

(c) SD-SG

(d) LS-SG

Figure 7: Clustering coefficients. 6

7.5 original optimal

5

3.72 original k-degree

DESEAN 7

original k-degree

3.7

DESEAN

5.09

6.5

APL

3

6

2 1

5.5 2

3

4

5

6

5

10

15

k

20

25

5.08

APL

3.68

APL

4

APL

5.1

DESEAN

3.66

5.06

3.62

5.05

50

5

10

15

k

(a) PODS09

5.07

3.64 20

25

50

original k-degree 5

DESEAN

10

15

20

k

(b) 20TopConf

25

50

25

50

k

(c) SD-SG

(d) LS-SG

8%

9% 6% edge added edge deleted

3% 0% 2

3

4

k

(a) PODS09

5

6

15% edge added edge deleted

6% 4% 2% 0% 5

10

15

20

25

50

4% edge added edge deleted

12% 9% 6% 3% 0% 5

k

Edge Changes (%)

10%

12%

Edge Changes (%)

15%

Edge Changes (%)

Edge Changes (%)

Figure 8: Average path lengths.

10

15

20

k

(b) 20TopConf

(c) SD-SG

25

50

edge added edge deleted

3% 2% 1% 0% 5

10

15

20

k

(d) LS-SG

Figure 9: Number of edge changes. Edge changes (%) as a function of k, shown separately for edges added and edges deleted. Panels: (a) PODS09, (b) 20TopConf, (c) SD-SG, (d) LS-SG.

and two synthetic graphs, SD-SG and LS-SG. All the graphs are anonymized with ω set to 0.5, indicating an equal preference for edge addition and edge deletion. Note that our methods preserve the vertex set. Therefore, in the evaluations, we compare our approaches with a scheme such as k-degree anonymity [9], which does not perform vertex addition or deletion. Related works such as [3, 21] are not directly comparable because they change the vertex set. Specifically, for PODS09, we obtain the optimal solutions with the proposed formulations using CPLEX [7]. We also compare the characteristics of the anonymized graphs produced by Algorithm DESEAN with those of the original graph and the optimal solutions. For the other three data sets, finding the optimal solutions is computationally infeasible because the graphs are large. We therefore compare the characteristics of the k^2-degree anonymous graphs generated by DESEAN with those of k-degree anonymous graphs and the original graphs.

Degree Distribution: Figure 5 shows the degree distributions of the anonymized graphs and the original graphs. DESEAN is comparable to the optimal solution and to k-degree anonymity in preserving the degree distribution, because DESEAN preserves the degree sequences by clustering the vertices with similar degrees in step 1.

Degree Centrality (DC): Figure 6 shows the degree centralities [12, 14] of the anonymized and original graphs as a function of k. For a graph, a large degree centrality indicates the existence of strong leaders and influential vertices. After k^2-degree anonymization, our methods decrease the degree centralities by about the same amount as k-degree anonymization, while providing better privacy protection against friendship attacks.

Clustering Coefficient (CC): Figure 7 shows the clustering coefficients of the anonymized graphs as a function of k. The horizontal lines represent the CC values of the original graphs, which do not vary with k. In Figure 7(a), the CC values of PODS09 anonymized by the optimal solution approach deviate little from the original CC value because the optimal solution method minimizes the number of edge edits. DESEAN decreases the CC values of PODS09 a bit more when k is small, and reduces the CC value from 0.94 to 0.62 when k is larger than 4. This is because PODS09 has many cliques, and DESEAN may connect distant vertices and disconnect vertices in the same clique for anonymization. As k increases, DESEAN decreases the CC value further because more edge edits are performed. For the other three data sets, Figures 7(b)-(d) show that the CC values derived by DESEAN do not deviate much from the original values. On 20TopConf, DESEAN even achieves results comparable to those of k-degree anonymity.

Average Path Length (APL): Figure 8 shows the average path lengths between vertex pairs of the anonymized graphs as a function of k. The horizontal lines represent the APLs of the original graphs, which do not vary with k. Recall that PODS09 has many cliques. Therefore, in Figure 8(a), the optimal solution method increases the APL of PODS09 by about 1.5 hops, because a few vertices in different cliques are connected and a few vertices in the same clique are disconnected. For the same reason, under Algorithm DESEAN, the APL of PODS09 deviates a bit more when k is small, and increases from 1.43 to 4.61 as more edge edits are performed when k is larger than 4. For the other three data sets, Figures 8(b)-(d) show that, under DESEAN, the APL values deviate only slightly from the original values, and the results are comparable to those of k-degree anonymity.

Number of Edge Changes: Figure 9 reports the ratio of the edge edits made by DESEAN to the original number of edges in each graph. With ω set to 0.5, DESEAN balances edge additions and edge deletions, and the number of edge changes increases linearly with k.

All the evaluations on the degree distributions, degree centralities, clustering coefficients, average path lengths and the numbers of edge changes show that both the optimal solution approach and the DESEAN algorithm can preserve much of the characteristics of the original graphs.

6.4 Influence of the Parameters

In this subsection, we evaluate the influence of the preference weight ω between edge addition and edge deletion operations. Figure 10 shows the clustering coefficients (CCs) and average path lengths (APLs) of 20TopConf anonymized by DESEAN as a function of ω. For any ω, both the CC and APL values deviate more from the original values when k is larger. In addition, for each k, the CC and APL values are best preserved at ω = 0.5, which indicates an equal preference for edge addition and edge deletion.
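The CC and APL values compared in these experiments can be computed for any edge list with a few lines of code. The following is a minimal sketch, not the authors' implementation. It assumes that CC means the average local clustering coefficient and that APL means the mean shortest-path length over connected vertex pairs; both choices are assumptions, as are all function names. It also includes a helper for the edge-change ratios of the kind reported in Figure 9.

```python
from collections import deque
from itertools import combinations


def build_graph(edges):
    """Undirected simple graph as {vertex: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def clustering_coefficient(adj):
    """Average local clustering coefficient (assumed definition of CC)."""
    total = 0.0
    for nbrs in adj.values():
        d = len(nbrs)
        if d < 2:
            continue  # vertices of degree < 2 contribute 0
        # Count edges among the neighbors of this vertex.
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2.0 * links / (d * (d - 1))
    return total / len(adj) if adj else 0.0


def average_path_length(adj):
    """Mean BFS distance over connected pairs of distinct vertices (assumed APL)."""
    dist_sum, pairs = 0, 0
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        dist_sum += sum(dist.values())
        pairs += len(dist) - 1  # reachable vertices other than s
    return dist_sum / pairs if pairs else 0.0


def edge_change_ratios(original_edges, anonymized_edges):
    """(added, deleted) edge counts divided by the original edge count."""
    orig = {frozenset(e) for e in original_edges}
    anon = {frozenset(e) for e in anonymized_edges}
    return len(anon - orig) / len(orig), len(orig - anon) / len(orig)
```

Calling `clustering_coefficient(build_graph(edges))` on the original and on the anonymized edge lists, for each k or ω, gives the kind of comparison plotted in the figures. Other definitions of CC (for example, global transitivity) would give different absolute values.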

Figure 10: Data utility of 20TopConf w.r.t. ω. Panels (a) and (b) plot CC and APL against ω (0.1 to 0.9) for the original graph and for k = 5, 10, 15.

6.5 Algorithm Scalability

Note that the optimal solution method has exponential execution time. In the experiment, we observed that it takes several hours to find optimal solutions on PODS09 when k is larger than 3. Therefore, in this section, we focus on demonstrating the good scalability of Algorithm DESEAN. To examine the effects of graph size and graph density, we report the execution time of DESEAN as a function of |V| and |E|/|V| in Figures 11(a) and 11(b), respectively. In Figure 11(a), all the test graphs have an |E|/|V| ratio of 2.5, while in Figure 11(b) all the graphs have 100,000 vertices. These results show that the execution time grows linearly with the number of vertices |V| and with the graph density.
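A sweep of this kind can be set up as sketched below. This is only an illustration under stated assumptions: the graphs are sampled uniformly at random (not necessarily the generator used for the test graphs in the experiments), `degree_sequence` is a hypothetical stand-in for the anonymizer being timed, and all function names are invented for this sketch.

```python
import random
import time


def random_graph(num_vertices, density, seed=0):
    """Uniform random simple graph with round(density * num_vertices) edges.

    Assumption: uniform sampling is for illustration only.
    """
    rng = random.Random(seed)
    target = int(round(density * num_vertices))
    max_edges = num_vertices * (num_vertices - 1) // 2
    if target > max_edges:
        raise ValueError("density too high for a simple graph")
    edges = set()
    while len(edges) < target:
        u, v = rng.sample(range(num_vertices), 2)  # two distinct vertices
        edges.add((min(u, v), max(u, v)))  # store each edge once
    return sorted(edges)


def time_call(func, *args):
    """Wall-clock seconds taken by one call to func(*args)."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


def degree_sequence(edges):
    """Hypothetical stand-in workload; replace with the algorithm under test."""
    deg = {}
    for u, v in edges:
        deg[u] = deg.get(u, 0) + 1
        deg[v] = deg.get(v, 0) + 1
    return deg
```

To mirror the setting of Figure 11(a), one would call `random_graph(n, 2.5)` for n from 5,000 to 25,000 and record `time_call(...)` for each n. For Figure 11(b), the vertex count is held fixed and the density is varied instead.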

Figure 11: Execution efficiency of DESEAN. Panels: (a) execution time (sec) versus number of vertices; (b) execution time (sec) versus |E|/|V|. Curves are shown for k = 5, 10, 15, 20.

7. GENERAL FRIENDSHIP ATTACKS

In this section, we discuss vertex re-identification under a general friendship attack, where an adversary knows a degree sequence of length l, i.e., D_l = (d_v1, d_v2, ..., d_vl). Table 2 shows the percentages of vertices in the original and anonymized 20TopConf that could be re-identified with a probability greater than 1/k by D_3, i.e., D_3 = ((d_A, d_B, d_C), A), D_3 = ((d_A, d_B, d_C), B) and D_3 = ((d_A, d_B, d_C), C). Together with those in Table 1, these results demonstrate that the general friendship attack can effectively reveal vertex identities, and that the power of a friendship attack increases dramatically with the length of the degree sequence.

Table 2: Percentages of vertices that violate k^3-degree anonymity.

k     Original G   k-degree G   k^2-degree G
5     40.67%       38.20%       35.65%
10    56.82%       54.26%       51.01%
15    63.46%       61.35%       59.74%
20    66.80%       66.28%       64.51%

For privacy protection against a general friendship attack, the definitions of the friendship attack and k^2-degree anonymity can be easily modified to accommodate the general friendship attack and k^l-degree anonymity. To solve the problem, we propose the extended DESEAN (E-DESEAN) algorithm. In the following, we describe the concept of E-DESEAN.

E-DESEAN consists of three steps. The first step clusters the vertices with similar degrees such that each cluster contains at least K vertices, where K ≥ k. We use a separate parameter K instead of k because, as l grows, it becomes more difficult to satisfy k^l-degree anonymity if many vertices have only k − 1 other vertices with the same degree in their cluster. The second step adds and removes edges to ensure that there are sufficient edges between any two clusters to achieve k^l-degree anonymity. However, when a degree subsequence of length l is considered, adding an edge may create new degree subsequences that need to be processed later. Therefore, the anonymization cost of adding an edge is not fixed; it varies with the number of additional edges needed to protect against the new subsequences. Finally, the third step adjusts the graph such that vertices in the same cluster share the same target degree, which ensures that k^l-degree anonymity is achieved.

8. CONCLUSION

In this paper, we have identified the privacy risk in published social networks in terms of a new type of attack, called a friendship attack, and proposed the concept of k^2-degree anonymity to protect against such attacks. For k^2-degree anonymization, we developed an Integer Programming formulation to find optimal solutions. We also designed an efficient heuristic approach for anonymizing large-scale social networks. In addition, we discussed the extension of the heuristic approach to handle the general friendship attack with a degree sequence of length l. The experimental results demonstrate that our approaches can preserve much of the characteristics of social networks.

9. ACKNOWLEDGEMENTS

The work is supported in part by the National Science Council of Taiwan under Contract No. NSC 97-2221-E-002-172-MY3, and by US NSF through grants IIS-0914934, OISE-0968341 and OIA-0963278.

10. REFERENCES

[1] L. Backstrom, C. Dwork, and J. M. Kleinberg. Wherefore art thou r3579x?: Anonymized social networks, hidden patterns, and structural steganography. In Proc. of WWW, 2007.
[2] D. Chakrabarti, Y. Zhan, and C. Faloutsos. R-MAT: A recursive model for graph mining. In Proc. of SDM, 2004.
[3] J. Cheng, A. W. Fu, and J. Liu. K-isomorphism: privacy preserving network publication against structural attacks. In Proc. of SIGMOD, 2010.
[4] B. C. M. Fung, K. Wang, R. Chen, and P. S. Yu. Privacy-preserving data publishing: A survey on recent developments. ACM Computing Surveys, in press.
[5] M. Hay, G. Miklau, D. Jensen, D. F. Towsley, and P. Weis. Resisting structural re-identification in anonymized social networks. In Proc. of VLDB, 2008.
[6] X. He, J. Vaidya, B. Shafiq, N. Adam, and V. Atluri. Preserving privacy in social networks: A structure-aware approach. In Proc. of WI-IAT, 2009.
[7] IBM. ILOG CPLEX 12.2, 2010. http://www-01.ibm.com/software/integration/optimization/cplex-optimizer/.
[8] K. Liu, K. Das, T. Grandison, and H. Kargupta. Privacy-preserving data analysis on graphs and social networks. In H. Kargupta, J. Han, P. Yu, R. Motwani, and V. Kumar, editors, Next Generation Data Mining. CRC Press, 2008.
[9] K. Liu and E. Terzi. Towards identity anonymization on graphs. In Proc. of ACM SIGMOD, 2008.
[10] P. Samarati and L. Sweeney. Generalizing data to provide anonymity when disclosing information. In Proc. of ACM PODS, 1998.
[11] L. Sweeney. k-anonymity: A model for protecting privacy. IJUFKS, 10(5), 2002.
[12] J. P. Scott. Social network analysis: a handbook. Sage Publications, 2nd edition, 2000.
[13] C.-H. Tai, D.-N. Yang, P. S. Yu, and M.-S. Chen. Structural diversity for privacy in publishing social networks. In Proc. of SDM, 2011.
[14] S. Wasserman, K. Faust, D. Iacobucci, and M. Granovetter. Social network analysis: methods and applications. Cambridge University Press, 1994.
[15] X. Wu, X. Ying, K. Liu, and L. Chen. A survey of privacy-preservation of graphs and social networks. Springer US, 2010.
[16] X. Ying and X. Wu. Randomizing social networks: a spectrum preserving approach. In Proc. of SDM, 2008.
[17] L. Zhang and W. Zhang. Edge anonymity in social network graphs. In Proc. of CSE, 2009.
[18] E. Zheleva and L. Getoor. Preserving the privacy of sensitive relationships in graph data. In Proc. of PinKDD, 2007.
[19] B. Zhou and J. Pei. Preserving privacy in social networks against neighborhood attacks. In Proc. of ICDE, 2008.
[20] B. Zhou, J. Pei, and W. Luk. A brief survey on anonymization techniques for privacy preserving publishing of social network data. SIGKDD Explorations, 10(2), 2008.
[21] L. Zou, L. Chen, and M. T. Özsu. K-automorphism: a general framework for privacy preserving network publication. In Proc. of VLDB, 2009.
