FAANST: Fast Anonymizing Algorithm for Numerical Streaming DaTa Hessam Zakerzadeh and Sylvia Osborn The University of Western Ontario London, Ontario, Canada DPM 2010, Sept. 23, 2010
Outline
1. Background
2. Our Approach
3. Experiments
4. Conclusions
1. Background
Traditional data is:
  finite
  persistent
Streaming data is:
  continuous
  potentially infinite
  time varying
Background, cont’d
Assumptions:
  we have streaming data
  we wish to analyze it in a timely manner
  we wish to preserve the privacy of individuals whose data is in the stream
Attributes can be considered to be:
  identifiers, e.g. name, id number
  quasi-identifiers, e.g. the combination (date of birth, zip code)
  other data, which might be summarized by the statistical operations, considered sensitive
(a) original data, with identifying attributes (Firstname, Lastname, Customer Number), quasi-identifiers (Gender, Zipcode, DOB), and sensitive data (Company name, Quantity):

Firstname | Lastname | Customer Number | Gender | Zipcode | DOB | Company name | Quantity
Anna | Mackay | 111 | Female | 20433 | 21 Aug 1984 | ECU Mining corp. | 1200
Bob | Miller | 222 | Male | 20436 | 1 Feb 1982 | M split corp. | 180
Carol | King | 333 | Female | 20456 | 9 Feb 1975 | MAG silver corp. | 2500
Daisy | Herrera | 444 | Female | 20489 | 24 Oct 1971 | Warnex Inc. | 10500
Alicia | Bartlett | 555 | Female | 20411 | 3 Sep 1974 | M Split corp. | 300

(b) identifying attributes removed:

Gender | Zipcode | DOB | Company name | Quantity
Female | 20433 | 21 Aug 1984 | ECU Mining corp. | 1200
Male | 20436 | 1 Feb 1982 | M split corp. | 180
Female | 20456 | 9 Feb 1975 | MAG silver corp. | 2500
Female | 20489 | 24 Oct 1971 | Warnex Inc. | 10500
Female | 20411 | 3 Sep 1974 | M Split corp. | 300

(c) publicly available census data:

Name | Address | City | Zip | DOB | Gender | Party | …
… | … | … | … | … | … | … | …
Carol King | 900 Adelaide | Sometown | 20456 | 9/2/1975 | female | democrat | …
… | … | … | … | … | … | … | …
Definitions
Quasi-identifiers: let DS(A1, ..., An) be a dataset. A quasi-identifier of DS is a set of attributes {Ai, ..., Aj} ⊂ {A1, ..., An} whose release must be controlled (because it can be linked with publicly available data to reveal individuals) [sam01, ss98].
k-anonymity: let DS(A1, ..., An) be a dataset and QI be the quasi-identifiers associated with it. DS is said to satisfy k-anonymity with respect to QI if and only if each sequence of values in DS[QI] appears with at least k occurrences in DS[QI] [sam01, ss98].
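The definition can be checked mechanically: group the rows by their quasi-identifier values and verify that every distinct combination occurs at least k times. A minimal sketch in Python (the row layout and column names are illustrative, not from the slides):

```python
from collections import Counter

def satisfies_k_anonymity(rows, qi_columns, k):
    """Return True iff every combination of quasi-identifier
    values occurs at least k times in the dataset."""
    counts = Counter(tuple(row[c] for c in qi_columns) for row in rows)
    return all(n >= k for n in counts.values())

# De-identified rows keyed by quasi-identifiers (gender, zipcode).
rows = [
    {"gender": "NA", "zipcode": "2043*"},
    {"gender": "NA", "zipcode": "2043*"},
    {"gender": "Female", "zipcode": "204**"},
    {"gender": "Female", "zipcode": "204**"},
    {"gender": "Female", "zipcode": "204**"},
]
print(satisfies_k_anonymity(rows, ["gender", "zipcode"], k=2))  # True
```

With k=3 the same table fails, because the NA/2043* combination appears only twice.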
de-identified data from before:

Gender | Zipcode | DOB | Company name | Quantity
Female | 20433 | 21 Aug 1984 | ECU Mining corp. | 1200
Male | 20436 | 1 Feb 1982 | M split corp. | 180
Female | 20456 | 9 Feb 1975 | MAG silver corp. | 2500
Female | 20489 | 24 Oct 1971 | Warnex Inc. | 10500
Female | 20411 | 3 Sep 1974 | M Split corp. | 300

2-anonymized version of the above; quasi-identifiers have been generalized:

Gender | Zipcode | DOB | Company name | Quantity
NA | 2043* | [1980-1985] | ECU Mining corp. | 1200
NA | 2043* | [1980-1985] | M split corp. | 180
Female | 204** | [1970-1975] | MAG silver corp. | 2500
Female | 204** | [1970-1975] | Warnex Inc. | 10500
Female | 204** | [1970-1975] | M Split corp. | 300
Challenges with streaming data
information loss:
  if too much data is generalized, we do not get a good analysis
  this speaks to the quality of the results
run time:
  computing optimal k-anonymity on static data is NP-hard
  in a streaming application we want anonymized sets of data to analyze in almost real time
Generalizing data values
For a numeric column like age, a generalization might put ages into ranges like [10, 20], etc., where the overall column has values, say, in the range [0, 110].
For categorical values, we can have a value generalization hierarchy, which must be constructed for every attribute. Something like this can also be used to calculate distances for clustering:

not-released
  once-married
    married
    widow
    divorced
  never-married
    single
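One way such a hierarchy is used in code is to generalize two categorical values to their lowest common ancestor; the shallower that ancestor sits in the tree, the more distant the two values are. A sketch under the assumption that the tree above is stored as a child-to-parent map (the representation is ours, not from the slides):

```python
# Value generalization hierarchy as a child -> parent map.
PARENT = {
    "married": "once-married",
    "widow": "once-married",
    "divorced": "once-married",
    "single": "never-married",
    "once-married": "not-released",
    "never-married": "not-released",
}

def ancestors(value):
    """Path from a value up to the root of the hierarchy."""
    path = [value]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path

def generalize(a, b):
    """Lowest common ancestor: the most specific value
    that covers both a and b."""
    seen = set(ancestors(a))
    for v in ancestors(b):
        if v in seen:
            return v
    return "not-released"

print(generalize("widow", "divorced"))  # once-married
print(generalize("widow", "single"))    # not-released
```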
Calculating Information Loss
For numerical data in a quasi-identifier with the range [vmin, vmax], when the data is clustered, an individual value vi will be replaced by a range [cmin, cmax]:

    infoloss(vi) = (cmax − cmin) / (vmax − vmin)

Given a generalized tuple g = (v1, v2, ..., vn):

    infoloss(g) = (1/n) Σ(i=1..n) infoloss(vi)
The information loss of a set of tuples is the average of their information losses.
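The formulas translate directly into code. A sketch, assuming each quasi-identifier column's overall range [vmin, vmax] is known and each generalized value carries the cluster range [cmin, cmax] it was replaced by:

```python
def infoloss_value(c_min, c_max, v_min, v_max):
    """Information loss of one generalized value: the width of
    its range over the width of the whole column's range."""
    return (c_max - c_min) / (v_max - v_min)

def infoloss_tuple(ranges, column_ranges):
    """Average information loss over the n quasi-identifier
    values of a generalized tuple g = (v1, ..., vn)."""
    return sum(
        infoloss_value(c_lo, c_hi, v_lo, v_hi)
        for (c_lo, c_hi), (v_lo, v_hi) in zip(ranges, column_ranges)
    ) / len(ranges)

def infoloss_set(tuples, column_ranges):
    """Information loss of a set of tuples: the average of theirs."""
    return sum(infoloss_tuple(t, column_ranges) for t in tuples) / len(tuples)

# Two quasi-identifiers: age in [0, 110], zipcode in [20000, 21000].
column_ranges = [(0, 110), (20000, 21000)]
g = [(20, 31), (20400, 20500)]           # one generalized tuple
print(infoloss_tuple(g, column_ranges))  # (11/110 + 100/1000) / 2 = 0.1
```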
2. Our Approach
start by filling up a window of size mu with tuples
need to create clusters which will be the k-anonymized groups
first round: we use the k-means clustering algorithm to cluster the data into k' clusters according to their quasi-identifier values – this k' is a different k from the k of k-anonymity
if a cluster has ≥ k tuples, it can be output as a data set that satisfies k-anonymity, with its quasi-identifiers replaced by a generalized value
if it also has information loss < delta, a predefined limit, then it is a good cluster, to which we want to assign tuples which arrive later; these are called accepted clusters
Our Approach, cont’d
subsequent rounds: the window is filled up; any tuple whose values fall into the range of an accepted cluster is output immediately to that cluster, and k-means is run again on the remaining tuples
on the last pass, first output any tuples to accepted clusters, then run k-means again and output any clusters with at least k tuples, as above
when fewer than k tuples remain, they are generalized to the maximum range
choose k' for k-means to be (number of current tuples in the window)/k, so that by the pigeonhole principle there is at least one cluster of size ≥ k to output at each pass
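One round of the scheme described above can be sketched as follows. This is a deliberate simplification, not the FAANST implementation: it uses a single numeric quasi-identifier, a sort-and-chunk stand-in for k-means, and function names of our own choosing.

```python
def cluster_1d(values, n_clusters):
    """Stand-in for k-means on a single numeric quasi-identifier:
    sort the values and cut them into contiguous chunks."""
    vals = sorted(values)
    size = max(1, len(vals) // n_clusters)
    return [vals[i:i + size] for i in range(0, len(vals), size)]

def infoloss(cluster, v_min, v_max):
    """Information loss of a cluster in a column with range [v_min, v_max]."""
    return (max(cluster) - min(cluster)) / (v_max - v_min)

def process_window(window, k, delta, accepted, v_min, v_max):
    """One round, simplified to one quasi-identifier. Tuples covered
    by an accepted cluster are output immediately; the rest are
    clustered into k' = |window| / k clusters."""
    output, rest = [], []
    for v in window:
        rng = next((r for r in accepted if r[0] <= v <= r[1]), None)
        (output if rng else rest).append((v, rng))
    rest_vals = [v for v, _ in rest]
    for cl in cluster_1d(rest_vals, max(1, len(rest_vals) // k)):
        if len(cl) >= k:                        # satisfies k-anonymity
            rng = (min(cl), max(cl))            # generalized value
            output += [(v, rng) for v in cl]
            if infoloss(cl, v_min, v_max) < delta:
                accepted.append(rng)            # reuse for later tuples
        # undersized clusters would stay in the window for the next round
    return output

accepted = []
out = process_window(list(range(100)), k=10, delta=0.5,
                     accepted=accepted, v_min=0, v_max=100)
print(len(out), len(accepted))  # 100 tuples output, 10 accepted clusters
```

Later windows then route any tuple falling inside one of the ten stored ranges straight to its accepted cluster before clustering the remainder.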
a couple of points
we keep track of the accepted clusters by keeping the ranges covered by the cluster for the quasi-identifiers
aside on k-means
1. initial centroids are chosen at random
2. the distance of each point to the k' centroid points is calculated
3. each point is assigned to the closest centroid
4. the mean point of each cluster is calculated, replacing the old centroid
5. points are reassigned to the centroid closest to them; the algorithm repeats until there is no change in the point-to-cluster assignment
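The loop above is short enough to write out in full. A minimal k-means for points in the plane (pure Python; the data and parameter names are illustrative):

```python
import random

def kmeans(points, k_prime, seed=0):
    """Lloyd's algorithm: repeat assignment + mean-update until
    the point-to-cluster assignment stops changing."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k_prime)   # random initial centroids

    def nearest(p):
        return min(range(k_prime),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(p, centroids[i])))

    assignment = None
    while True:
        new_assignment = [nearest(p) for p in points]
        if new_assignment == assignment:      # no change: done
            return centroids, assignment
        assignment = new_assignment
        for i in range(k_prime):              # move centroid to cluster mean
            members = [p for p, a in zip(points, assignment) if a == i]
            if members:
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k_prime=2)
print(sorted(centroids))  # two centroids, one per blob of three points
```

The random initialization is exactly what causes the occasional information-loss spikes mentioned in the conclusions: a bad draw of initial centroids yields a worse clustering.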
3. Experiments
used numerical data from two standard datasets from the UCI Machine Learning Repository [an07]
First Experiment
“Data loss” on all of these diagrams is a synonym for information loss.
(c) Average number of suppressed tuples for MU = 500, 1000, 1500, 2000, and 2500 vs. k on Dataset 1, numeric data
Decided to proceed with just delta=1.0
Experiment 2
Comparison with other work on streams
SKY [low08] and SWAF [wlal07] need a specialization tree even for numerical data
CASTLE [ccft08] is slower
[zhp*09] by Bin Zhou et al. only supports numerical data
4. Conclusions
FAANST outperforms CASTLE on numerical data in terms of run time
performance of the two algorithms is similar wrt information loss
the same idea can be used for character (categorical) data by using medoids
occasional spikes are caused by the random selection of centroids in k-means
one problem is that a tuple could stay in the processing window for a long time
in our current work we are looking at 2 ways of reducing the maximum delay
Acknowledgements We express our gratitude to the authors of
[ccft08] (Castle) for kindly providing us with their source code. The research of H.Z. was supported by a grant from the Natural Sciences and Engineering Research Council of Canada.
References
[an07] A. Asuncion and D.J. Newman. UCI Machine Learning Repository. http://www.ics.uci.edu/~mlearn/, 2007.
[ccft08] Jianneng Cao, Barbara Carminati, Elena Ferrari, and Kian Lee Tan. CASTLE: A delay-constrained scheme for ks-anonymizing data streams. In Proc. of the 2008 IEEE 24th ICDE, pages 1376–1378, 2008.
[low08] Jianzhong Li, Beng Chin Ooi, and Weiping Wang. Anonymizing streaming data for privacy protection. In Proc. of the 2008 IEEE 24th ICDE, pages 1367–1369, USA, 2008.
[sam01] P. Samarati. Protecting respondents’ identities in microdata release. IEEE Trans. on Knowl. and Data Eng., 13(6):1010–1027, 2001.
[ss98] P. Samarati and L. Sweeney. Protecting privacy when disclosing information: k-anonymity and its enforcement through generalization and suppression. Technical report, 1998.
[wlal07] Weiping Wang, Jianzhong Li, Chunyu Ai, and Yingshu Li. Privacy protection on sliding window of data streams. In Proc. of the 2007 International Conference on Collaborative Computing: Networking, Applications and Worksharing, pages 213–221, Washington, DC, USA, 2007.
[zhp*09] Bin Zhou, Yi Han, Jian Pei, Bin Jiang, Yufei Tao, and Yan Jia. Continuous privacy-preserving publishing of data streams. In EDBT, pages 648–659, 2009.
MU=2000, k=50, and DELTA=0.8. First round: partition the tuples into (2000/50) = 40 clusters. Output tuples falling into clusters which have more than k tuples, and store clusters whose data loss is less than DELTA.

Cluster name | # tuples | Info loss
Cluster 1-1 | 72 > 50 | 0.76 < 0.8
Cluster 1-2 | 12 < 50 | 0.34
Cluster 1-3 | 110 > 50 | 0.89 > 0.8
… | … | …
Cluster 1-40 | 44 < 50 | …

Output: Clusters 1-1 and 1-3 (more than k tuples). Store: Cluster 1-1 (info loss 0.76 < 0.8).
List of accepted clusters: Cluster 1-1, 0.76
MU=2000, k=50, and DELTA=0.8. Second round: output tuples falling into clusters which have more than k tuples, and store clusters whose data loss is less than DELTA.

Cluster name | # tuples | Info loss
Cluster 2-1 | 83 > 50 | 0.71 < 0.8
Cluster 2-2 | 51 > 50 | 0.39 < 0.8
Cluster 2-3 | 11 < 50 | 0.89
… | … | …

Output: Clusters 2-1 and 2-2. Store: Clusters 2-1 and 2-2.
List of accepted clusters: Cluster 1-1, 0.76; Cluster 2-1, 0.71; Cluster 2-2, 0.39
MU=2000, k=50, and DELTA=0.8. Repeat the same procedure until no more tuples come. At the end, remaining tuples which do not fit into the stored clusters should be suppressed.
List of accepted clusters: Cluster 1-1, 0.76; Cluster 2-1, 0.71; Cluster 2-2, 0.39
Time Complexity
Time complexity: O(i · MU³ / k²), where i is the number of iterations of the algorithm.