FAANST: Fast Anonymizing Algorithm for Numerical Streaming DaTa

Hessam Zakerzadeh and Sylvia Osborn
The University of Western Ontario, London, Ontario, Canada
DPM 2010, Sept. 23, 2010


Outline

1. Background
2. Our Approach
3. Experiments
4. Conclusions

1. Background

- Traditional data is finite and persistent.
- Streaming data is continuous, potentially infinite, and time-varying.

Background, cont'd

Assumptions:
- we have streaming data
- we wish to analyze it in a timely manner
- we wish to preserve the privacy of individuals whose data is in the stream
- attributes can be considered to be:
  - identifiers, e.g. name, ID number
  - quasi-identifiers, e.g. the combination (date of birth, zip code)
  - other data, which might be summarized by statistical operations and is considered sensitive

(a) original data, with identifying attributes (Firstname, Lastname, Customer Number), quasi-identifiers (Gender, Zipcode, DOB), and sensitive data (Company name, Quantity):

| Firstname | Lastname | Customer Number | Gender | Zipcode | DOB         | Company name     | Quantity |
|-----------|----------|-----------------|--------|---------|-------------|------------------|----------|
| Anna      | Mackay   | 111             | Female | 20433   | 21 Aug 1984 | ECU Mining corp. | 1200     |
| Bob       | Miller   | 222             | Male   | 20436   | 1 Feb 1982  | M split corp.    | 180      |
| Carol     | King     | 333             | Female | 20456   | 9 Feb 1975  | MAG silver corp. | 2500     |
| Daisy     | Herrera  | 444             | Female | 20489   | 24 Oct 1971 | Warnex Inc.      | 10500    |
| Alicia    | Bartlett | 555             | female | 20411   | 3 Sep 1974  | M Split corp.    | 300      |

(b) identifying attributes removed:

| Gender | Zipcode | DOB         | Company name     | Quantity |
|--------|---------|-------------|------------------|----------|
| Female | 20433   | 21 Aug 1984 | ECU Mining corp. | 1200     |
| Male   | 20436   | 1 Feb 1982  | M split corp.    | 180      |
| Female | 20456   | 9 Feb 1975  | MAG silver corp. | 2500     |
| Female | 20489   | 24 Oct 1971 | Warnex Inc.      | 10500    |
| female | 20411   | 3 Sep 1974  | M Split corp.    | 300      |

(c) publicly available census data:

| Name       | Address      | City     | Zip   | DOB      | Gender | Party    |
|------------|--------------|----------|-------|----------|--------|----------|
| …          | …            | …        | …     | …        | …      | …        |
| Carol King | 900 Adelaide | Sometown | 20456 | 9/2/1975 | female | democrat |
| …          | …            | …        | …     | …        | …      | …        |

Linking the quasi-identifiers in (b) with the census data in (c) re-identifies, for example, Carol King's record.

Definitions

- Quasi-identifiers: let DS(A1, …, An) be a dataset. A quasi-identifier of DS is a set of attributes {Ai, …, Aj} ⊂ {A1, …, An} whose release must be controlled, because it can be linked with publicly available data to reveal individuals [sam01, ss98].
- k-anonymity: let DS(A1, …, An) be a dataset and QI be the quasi-identifiers associated with it. DS is said to satisfy k-anonymity with respect to QI if and only if each sequence of values in DS[QI] appears with at least k occurrences in DS[QI] [sam01, ss98]. (A small check of this definition in code follows.)
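As a concrete illustration of the definition (a minimal sketch, not from the paper), a Python check of k-anonymity with respect to a set of quasi-identifiers:

```python
from collections import Counter

def is_k_anonymous(rows, qi_columns, k):
    """Every combination of quasi-identifier values must occur
    at least k times for the table to be k-anonymous."""
    counts = Counter(tuple(row[c] for c in qi_columns) for row in rows)
    return all(n >= k for n in counts.values())

# The de-identified table above is only 1-anonymous, because
# every (Gender, Zipcode, DOB) combination in it is unique.
rows = [
    {"Gender": "Female", "Zipcode": "20433", "DOB": "21 Aug 1984"},
    {"Gender": "Male",   "Zipcode": "20436", "DOB": "1 Feb 1982"},
    {"Gender": "Female", "Zipcode": "20456", "DOB": "9 Feb 1975"},
]
print(is_k_anonymous(rows, ["Gender", "Zipcode", "DOB"], 2))  # False
```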

De-identified data from before:

| Gender | Zipcode | DOB         | Company name     | Quantity |
|--------|---------|-------------|------------------|----------|
| Female | 20433   | 21 Aug 1984 | ECU Mining corp. | 1200     |
| Male   | 20436   | 1 Feb 1982  | M split corp.    | 180      |
| Female | 20456   | 9 Feb 1975  | MAG silver corp. | 2500     |
| Female | 20489   | 24 Oct 1971 | Warnex Inc.      | 10500    |
| Female | 20411   | 3 Sep 1974  | M Split corp.    | 300      |

2-anonymized version of the above; the quasi-identifiers have been generalized:

| Gender | Zipcode | DOB         | Company name     | Quantity |
|--------|---------|-------------|------------------|----------|
| NA     | 2043*   | [1980-1985] | ECU Mining corp. | 1200     |
| NA     | 2043*   | [1980-1985] | M split corp.    | 180      |
| Female | 204**   | [1970-1975] | MAG silver corp. | 2500     |
| Female | 204**   | [1970-1975] | Warnex Inc.      | 10500    |
| Female | 204**   | [1970-1975] | M Split corp.    | 300      |
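The Zipcode column above illustrates generalization by masking trailing digits. A minimal sketch of how such a mask could be computed for a group of values (illustrative only, not the paper's method):

```python
def mask_zip(zips):
    """Generalize a group of zip codes to their longest common
    prefix, padded with '*' (e.g. 20433, 20436 -> 2043*)."""
    common = ""
    for chars in zip(*zips):
        if len(set(chars)) > 1:   # first position where codes disagree
            break
        common += chars[0]
    return common + "*" * (len(zips[0]) - len(common))

print(mask_zip(["20433", "20436"]))           # 2043*
print(mask_zip(["20456", "20489", "20411"]))  # 204**
```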

Challenges with streaming data

- Information loss:
  - if too much data is generalized, we do not get good analysis
  - this speaks to the quality of the results
- Run time:
  - computing optimal k-anonymity on static data is NP-hard
  - in a streaming application, we want anonymized sets of data to analyze in almost real time

Generalizing data values

- For a numeric column like age, a generalization might put ages into ranges like [10, 20], etc., where the overall column has values, say, in the range [0, 110]. (See the sketch after the hierarchy below.)
- For categorical values, we can have a value generalization hierarchy, which must be constructed for every attribute.
- Something like this can also be used to calculate distances for clustering.

Example hierarchy for marital status:

not-released
├── once-married
│   ├── married
│   ├── widow
│   └── divorced
└── never-married
    └── single
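A minimal sketch of the numeric case mentioned in the first bullet (the bucket width and column range are the slide's example values, not fixed by the algorithm):

```python
def generalize_numeric(value, width=10, lo=0, hi=110):
    """Replace a numeric value by a fixed-width range,
    e.g. age 34 -> (30, 40) within the column range [0, 110]."""
    value = max(lo, min(value, hi))               # clamp into the column range
    start = lo + ((value - lo) // width) * width  # left edge of the bucket
    return (start, min(start + width, hi))

print(generalize_numeric(34))   # (30, 40)
print(generalize_numeric(108))  # (100, 110)
```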

Calculating Information Loss

- For numerical data in a quasi-identifier with overall range $[v_{min}, v_{max}]$, when the data is clustered, an individual value $v_i$ is replaced by the cluster's range $[c_{min}, c_{max}]$ on that attribute:

  $$\mathit{infoloss}(v_i) = \frac{c_{max} - c_{min}}{v_{max} - v_{min}}$$

- Given a generalized tuple $g = (v_1, v_2, \ldots, v_n)$:

  $$\mathit{infoloss}(g) = \frac{1}{n} \sum_{i=1}^{n} \mathit{infoloss}(v_i)$$

- The information loss of a set of tuples is the average of their information loss.
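These formulas translate directly into code; a minimal sketch (the example ranges below are invented for illustration):

```python
def infoloss_value(c_min, c_max, v_min, v_max):
    """infoloss(v_i): normalized width of the generalized range."""
    return (c_max - c_min) / (v_max - v_min)

def infoloss_tuple(ranges, domains):
    """infoloss(g): average information loss over the n
    quasi-identifier attributes of a generalized tuple."""
    losses = [infoloss_value(c0, c1, v0, v1)
              for (c0, c1), (v0, v1) in zip(ranges, domains)]
    return sum(losses) / len(losses)

# age generalized to [30, 40] in domain [0, 110];
# zipcode generalized to [20400, 20500] in domain [20000, 21000]
print(infoloss_tuple([(30, 40), (20400, 20500)],
                     [(0, 110), (20000, 21000)]))  # ~0.0955
```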

2. Our Approach

- Start by filling up a window of size MU with tuples.
- We need to create clusters, which will be the k-anonymized groups.
- First round: we use the k-means clustering algorithm to cluster the data into k' clusters according to their quasi-identifier values; this k' is different from the k of k-anonymity.
- If a cluster has ≥ k tuples, it can be output as a data set that satisfies k-anonymity, with its quasi-identifiers replaced by a generalized value.
- If it also has information loss < delta, a predefined limit, it is a good cluster, to which we want to assign tuples that arrive later. These are called accepted clusters. (A sketch of this round follows.)
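A minimal sketch of this first round, assuming numeric quasi-identifier tuples, the `infoloss_tuple` helper above, and a `kmeans(points, k_prime)` helper returning a list of clusters (sketched after the k-means aside below). The helper names are mine, not the authors':

```python
def bounding_ranges(cluster):
    """The generalized value for a cluster: per-attribute [min, max]."""
    return [(min(col), max(col)) for col in zip(*cluster)]

def first_round(window, k, delta, domains):
    """Clusters with >= k tuples are released as k-anonymous groups;
    those that also have information loss < delta become accepted."""
    k_prime = max(1, len(window) // k)        # k' = (tuples in window) / k
    released, accepted, leftover = [], [], []
    for cluster in kmeans(window, k_prime):
        if len(cluster) >= k:
            ranges = bounding_ranges(cluster)
            released.append((ranges, cluster))
            if infoloss_tuple(ranges, domains) < delta:
                accepted.append(ranges)       # reusable for later arrivals
        else:
            leftover.extend(cluster)          # waits for the next round
    return released, accepted, leftover
```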

Our Approach, cont'd

- Subsequent rounds: the window is filled up; any tuple whose values fall into the range of an accepted cluster is output immediately to that cluster, and k-means is run again on the remaining tuples.
- On the last pass, first output any tuples to accepted clusters; then run k-means again and output any clusters with at least k tuples, as above.
- When fewer than k tuples remain, they are generalized to the maximum range.
- Choose k' for k-means to be (number of current tuples in the window) / k, so that, by the pigeonhole principle, there is at least one cluster of size ≥ k to be output at each pass. (A sketch of a subsequent round follows.)
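Continuing the sketch above, a subsequent round might look like this (`publish` is an illustrative callback standing in for whatever the application does with a released, generalized tuple):

```python
def covers(ranges, tup):
    """True if the tuple falls inside a cluster's stored ranges."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(tup, ranges))

def process_window(window, accepted, k, delta, domains, publish):
    """Tuples covered by an accepted cluster are published immediately
    with that cluster's generalization; k-means runs on the rest."""
    remaining = []
    for t in window:
        ranges = next((r for r in accepted if covers(r, t)), None)
        if ranges is not None:
            publish(ranges, t)                # released right away
        else:
            remaining.append(t)
    released, new_accepted, leftover = first_round(remaining, k, delta, domains)
    for ranges, cluster in released:
        for t in cluster:
            publish(ranges, t)
    accepted.extend(new_accepted)
    return leftover                           # carried into the next window
```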

A couple of points

- We keep track of the accepted clusters by storing, for each quasi-identifier, the range of values covered by the cluster.

Aside on k-means

- Initial centroids are chosen at random.
- The distance of each point to the k' centroids is calculated, and each point is assigned to the closest centroid.
- The mean point of each cluster is calculated, replacing the old centroid, and points are reassigned to the centroid now closest to them.
- The algorithm repeats until there is no change in the point-to-cluster assignment. (A sketch follows.)
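A minimal, generic implementation of the steps above (plain k-means, not the authors' code); this is the `kmeans` helper assumed by the `first_round` sketch earlier:

```python
import random

def kmeans(points, k_prime, max_iters=100):
    """Cluster tuples of numbers: random initial centroids, assign each
    point to its nearest centroid, recompute means, and stop when the
    point-to-cluster assignment no longer changes."""
    centroids = random.sample(points, k_prime)
    assignment = None
    for _ in range(max_iters):
        new_assignment = [
            min(range(k_prime),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        if new_assignment == assignment:      # converged
            break
        assignment = new_assignment
        for j in range(k_prime):
            members = [p for p, c in zip(points, assignment) if c == j]
            if members:                       # keep the old centroid if emptied
                centroids[j] = tuple(sum(col) / len(col) for col in zip(*members))
    clusters = [[] for _ in range(k_prime)]
    for p, c in zip(points, assignment):
        clusters[c].append(p)
    return [c for c in clusters if c]
```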

3. Experiments

- We used numerical data from two standard datasets from the UCI Machine Learning Repository [an07].

First Experiment

Note: "data loss" on all of these diagrams is a synonym for information loss.

[figures omitted]

(c) Average number of suppressed tuples for MU = 500, 1000, 1500, 2000, and 2500 vs. k on Dataset 1, numeric data.

[figures omitted]

Decided to proceed with just delta = 1.0.

Experiment 2

[figures omitted]

Comparison with other work on streams

- SKY [low08] and SWAF [wlal07] need a specialization tree, even for numerical data.
- CASTLE [ccft08] is slower.
- [zhp*09] by Bin Zhou et al. only supports numerical data.

4. Conclusions

- FAANST outperforms CASTLE on numerical data in terms of run time.
- The performance of the two algorithms is similar with respect to information loss.
- The same idea can be used for character (categorical) data by using medoids.
- Occasional spikes are caused by the random selection of centroids in k-means.
- One problem is that a tuple could stay in the processing window for a long time; in our current work we are looking at two ways of reducing the maximum delay.

Acknowledgements

- We express our gratitude to the authors of [ccft08] (CASTLE) for kindly providing us with their source code.
- The research of H.Z. was supported by a grant from the Natural Sciences and Engineering Research Council of Canada.

References

[an07] A. Asuncion and D.J. Newman. UCI machine learning repository. http://www.ics.uci.edu/~mlearn/, 2007.
[ccft08] Jianneng Cao, Barbara Carminati, Elena Ferrari, and Kian Lee Tan. CASTLE: A delay-constrained scheme for ks-anonymizing data streams. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pages 1376–1378, 2008.
[low08] Jianzhong Li, Beng Chin Ooi, and Weiping Wang. Anonymizing streaming data for privacy protection. In Proceedings of the 2008 IEEE 24th International Conference on Data Engineering, pages 1367–1369, USA, 2008.
[sam01] P. Samarati. Protecting respondents' identities in microdata release. IEEE Trans. on Knowl. and Data Eng., 13(6):1010–1027, 2001.
[ss98] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, 1998.
[wlal07] Weiping Wang, Jianzhong Li, Chunyu Ai, and Yingshu Li. Privacy protection on sliding window of data streams. In Proceedings of the 2007 International Conference on Collaborative Computing: Networking, Applications and Worksharing, pages 213–221, Washington, DC, USA, 2007.
[zhp*09] Bin Zhou, Yi Han, Jian Pei, Bin Jiang, Yufei Tao, and Yan Jia. Continuous privacy-preserving publishing of data streams. In EDBT, pages 648–659, 2009.

Example: MU = 2000, k = 50, and DELTA = 0.8

Partition the tuples in the window into 2000/50 = 40 clusters. Output tuples falling into clusters which have more than k tuples, and store clusters whose data loss is less than DELTA.

Round 1:

| Cluster name | # tuples | Info loss  | Action           |
|--------------|----------|------------|------------------|
| Cluster 1-1  | 72 > 50  | 0.76 < 0.8 | output and store |
| Cluster 1-2  | 12 < 50  | 0.34       | stays in window  |
| Cluster 1-3  | 110 > 50 | 0.89 > 0.8 | output only      |
| …            | …        | …          | …                |
| Cluster 1-40 | 44 < 50  | …          | stays in window  |

List of accepted clusters: Cluster 1-1 (0.76).

Round 2:

| Cluster name | # tuples | Info loss  | Action           |
|--------------|----------|------------|------------------|
| Cluster 2-1  | 83 > 50  | 0.71 < 0.8 | output and store |
| Cluster 2-2  | 51 > 50  | 0.39 < 0.8 | output and store |
| Cluster 2-3  | 11 < 50  | 0.89       | stays in window  |
| …            | …        | …          | …                |

List of accepted clusters: Cluster 1-1 (0.76), Cluster 2-1 (0.71), Cluster 2-2 (0.39).

Repeat the same procedure until no more tuples come. At the end, the remaining tuples which do not fit into stored clusters should be suppressed.

Final list of accepted clusters: Cluster 2-2 (0.39), Cluster 2-1 (0.71), Cluster 1-1 (0.76).
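The output/store decision applied to each cluster in this walkthrough is just two comparisons; in Python (numbers taken from the tables above):

```python
def classify(size, loss, k=50, delta=0.8):
    """Return (output, store) for one cluster in the worked example."""
    output = size >= k                 # enough tuples: release the cluster
    store = output and loss < delta    # low loss too: keep as accepted
    return output, store

print(classify(72, 0.76))   # (True, True)   Cluster 1-1: output and store
print(classify(110, 0.89))  # (True, False)  Cluster 1-3: output only
print(classify(12, 0.34))   # (False, False) Cluster 1-2: stays in the window
```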

Time Complexity

- Time complexity: $O\!\left(\dfrac{i \cdot MU^{3}}{k^{2}}\right)$, where i is the number of iterations of the algorithm.
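This bound is consistent with the following rough count (my reconstruction; the slides do not spell it out): one k-means pass over a window of at most MU tuples with k' = MU/k centroids costs on the order of i · MU · (MU/k) distance computations, and by the pigeonhole argument each pass releases at least one cluster of at least k tuples, so at most MU/k passes are needed per window:

```latex
\[
  \underbrace{i \cdot MU \cdot \tfrac{MU}{k}}_{\text{one k-means pass}}
  \times
  \underbrace{\tfrac{MU}{k}}_{\text{passes per window}}
  = O\!\left(\frac{i \cdot MU^{3}}{k^{2}}\right)
\]
```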