Privacy-Preserving Genetic Algorithms for Rule Discovery*

Shuguo Han and Wee Keong Ng

Center for Advanced Information Systems, School of Computer Engineering,
Nanyang Technological University, Singapore 639798
{hans0004,awkng}@ntu.edu.sg

Abstract. Decision tree induction algorithms generally adopt a greedy approach to select attributes in order to optimize some criterion at each iteration of the tree induction process. When a decision tree has been constructed, a set of decision rules may be correspondingly derived. Univariate decision tree induction algorithms generally yield the same tree regardless of how many times they are run on the same training data set. Genetic algorithms have been shown to discover a better set of rules, albeit at the expense of efficiency. In this paper, we propose a protocol for secure genetic algorithms for the following scenario: two parties, each holding an arbitrarily partitioned data set, seek to perform genetic algorithms to discover a better set of rules without disclosing their own private data. The challenge for privacy-preserving genetic algorithms is to allow the two parties to securely and jointly evaluate the fitness value of each chromosome using each party's private data, but without compromising their data privacy. We propose a new protocol that addresses this challenge and is both correct and secure. The proposed protocol is not only privacy-preserving at each iteration of the genetic algorithm; the intermediate results generated at each iteration also do not compromise the data privacy of the participating parties.

1 Introduction

In the modern world of business competition, collaboration between industries or companies is one form of alliance to maintain overall competitiveness. Two industries or companies may find it beneficial to collaborate in order to discover more useful and interesting patterns, rules, or knowledge from their joint data collection than they would be able to derive from their own data alone. Due to privacy concerns, however, the parties cannot share their private data with one another unless the data mining algorithms used are secure. Privacy-preserving data mining (PPDM) has been proposed to resolve these data privacy concerns [1, 8]. Conventional PPDM makes use of Secure Multi-party Computation [15] or randomization techniques to allow the participating parties to preserve their data privacy during the mining process.

* This work is supported in part by grant P0520095 from the Agency for Science, Technology and Research (A*STAR), Singapore.

In recent years, PPDM has emerged as an active area of research in the data mining community. Several traditional data mining algorithms have been adapted to become privacy-preserving: decision trees, association rule mining, k-means clustering, SVM, and Naïve Bayes. These algorithms generally assume that the original data set has been horizontally and/or vertically partitioned, with each partition privately held by a party. Jagannathan and Wright introduced the concept of arbitrarily partitioned data [6], which is a generalization of horizontally and vertically partitioned data: different disjoint portions of the data set are held by different parties.

Decision tree induction algorithms iteratively perform greedy selection on attributes when constructing the decision tree. At each iteration, the attribute that optimizes some criterion, such as entropy, information gain, or the Gini index, is chosen to split the current tree node. Univariate decision tree induction is known to produce the same decision tree regardless of the number of times it is induced from the same data set. As identified by Freitas [4], genetic algorithms are able to discover a better set of rules than decision trees, albeit at the expense of efficiency. Hence, the genetic algorithm approach is a better alternative than decision tree induction for generating and exploring a more diverse set of decision rules.

In this paper, we propose a protocol that allows two parties, each holding a private data partition, to jointly and securely apply genetic algorithms to discover a set of decision rules from their private data partitions without compromising individual data privacy. As genetic algorithms are iterative, the challenge is not only to preserve the privacy of the data at each iteration, but also to ensure that the intermediate results produced at each iteration do not compromise the data privacy of the participating parties. We shall show that our proposed protocol satisfies these two requirements.

This paper is organized as follows. In the following section, we review various privacy-preserving data mining algorithms. In Section 3, we propose a protocol for performing privacy-preserving genetic algorithms on arbitrarily partitioned data involving two parties. Section 4 analyzes the correctness, security, and complexity of the protocol. We conclude our work in the final section.

2 Related Work

In this section, we review current work on privacy-preserving data mining algorithms that are based on Secure Multi-party Computation [15]. Lindell and Pinkas [8] proposed a privacy-preserving ID3 algorithm based on cryptographic techniques for horizontally partitioned data involving two parties. Vaidya and Clifton [10] presented privacy-preserving association rule mining for vertically partitioned data based on the secure scalar product protocol involving two parties. The secure scalar product protocol makes use of linear algebra techniques to mask private vectors with random numbers. Solutions based on linear algebra techniques are believed to scale better and perform faster than those based on cryptographic techniques.

Vaidya and Clifton [11] presented a method for privacy-preserving k-means clustering over vertically partitioned data involving multiple parties. Given a sample whose attributes are held partially by different parties, determining the cluster to which the sample is closest must be done jointly and securely by all the parties involved. This is accomplished by the secure permutation algorithm [3] and a secure comparison algorithm based on the circuit evaluation protocol [15]. Jagannathan and Wright [6] proposed the concept of arbitrarily partitioned data, which is a generalization of horizontally and vertically partitioned data, and provided an efficient privacy-preserving protocol for k-means clustering in the arbitrarily partitioned data setting. To compute the closest cluster for a given point securely, their protocol also makes use of secure scalar product protocols. Yu et al. [17] proposed a privacy-preserving SVM classification algorithm for vertically partitioned data; to achieve complete security, the generic circuit evaluation technique developed for secure multi-party computation is applied. In another paper, Yu et al. [16] securely constructed a global SVM classification model using nonlinear kernels for horizontally partitioned data based on the secure set intersection cardinality protocol [12]. Laur et al. [7] proposed secure protocols that implement the Kernel Adatron and Kernel Perceptron learning algorithms using cryptographic techniques without revealing the kernel and Gram matrix of the data.

To the best of our knowledge, there is no work to date on a privacy-preserving version of genetic algorithms. In this paper, we propose a protocol for two parties, each holding a private data partition, to jointly and securely apply genetic algorithms to discover a set of decision rules from their private data partitions. We propose protocols to securely compute a fitness function for rule discovery and ensure that the intermediate results of the protocols do not compromise data privacy.

3 Genetic Algorithms for Rule Discovery

In this section, we present a protocol for privacy-preserving genetic algorithms for rule discovery. When applying genetic algorithms to a problem domain, candidate solutions are mapped to chromosomes (individuals), which are crossed over and mutated to evolve new and better chromosomes as the genetic process iterates. When using genetic algorithms for rule discovery, each rule, representing a possible solution, is mapped to a chromosome (individual). An example of a rule is (X = x_i) → (C = c_i), where x_i is a value of attribute X and c_i is a class label. As rules evolve in the genetic iteration process, the antecedents of rules may include more attributes, thereby becoming more complex and more accurate in predicting the class labels of the data set.

To evaluate the goodness of each rule, a fitness function is defined. In a two-party setting, the fitness function is jointly applied by the two parties to evaluate each rule with respect to the private data partitions held by each party. The challenge here is to perform this joint evaluation without compromising each party's data privacy. In Section 3.1, we describe a general protocol for privacy-preserving genetic algorithms. We propose protocols to securely compute the fitness function in Section 3.2.
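To make the rule-to-chromosome mapping concrete, here is a minimal Python sketch (ours, not from the paper); the attribute names, class labels, and the Rule structure are illustrative assumptions only.

```python
# Hypothetical sketch of representing classification rules as GA individuals.
# Attribute names and values below are illustrative, not from the paper.
from dataclasses import dataclass
from typing import Dict

@dataclass
class Rule:
    """A chromosome encoding a rule: antecedent --> (class attribute = class_label)."""
    antecedent: Dict[str, str]   # e.g. {"Outlook": "sunny", "Wind": "strong"}
    class_label: str             # the predicted class c_i

    def matches(self, tuple_: Dict[str, str]) -> bool:
        """True if the tuple satisfies every condition (X = x_i) in the antecedent."""
        return all(tuple_.get(attr) == val for attr, val in self.antecedent.items())

# Individual for the example rule (Outlook = sunny) --> (Play = no)
rule = Rule(antecedent={"Outlook": "sunny"}, class_label="no")
print(rule.matches({"Outlook": "sunny", "Wind": "weak", "Play": "no"}))   # True
```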

Protocol 1 PPGA Protocol
1: The two parties jointly initialize the population randomly.
2: Evaluate the fitness of each individual in the population using the Secure Fitness Evaluation Protocol.
3: repeat
4:   The two parties jointly select the best-ranking individuals for reproduction.
5:   The two parties jointly breed the new generation through crossover and mutation (genetic operations) to produce offspring.
6:   Evaluate the fitness of each offspring using the Secure Fitness Evaluation Protocol.
7:   The two parties jointly replace the worst-ranked part of the population with the offspring.
8: until ⟨terminating condition⟩

                          Predicted Class
                          c_i        not c_i
Actual Class  c_i         N_TP       N_FN
              not c_i     N_FP       N_TN

Fig. 1. Confusion matrix for the rule (X = x_i) → (C = c_i).

3.1 General Protocol for Privacy-Preserving Genetic Algorithms

Party A and Party B each hold one portion of an arbitrarily partitioned data set. They wish to jointly and securely use genetic algorithms to discover a set of decision rules from their private data partitions so that these rules can be used to predict the class label of future data samples. The general protocol for Privacy-Preserving Genetic Algorithms (PPGA) in the setting of arbitrarily partitioned data involving two parties is shown in Protocol 1. In this paper, the steps that we use for population initialization and genetic operations (i.e., crossover and mutation) in genetic algorithms for rule discovery are the same as those described by Freitas [4].
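As a rough illustration of Protocol 1's structure, the following Python sketch shows the loop only; secure_fitness, crossover, and mutate are hypothetical caller-supplied placeholders (the secure fitness evaluation itself is the subject of Section 3.2), so this is a sketch of the control flow rather than an implementation of the paper's protocol.

```python
# Minimal sketch of the loop structure of Protocol 1 (illustration only).
# secure_fitness, crossover and mutate are caller-supplied placeholders.
import random

def ppga(population, secure_fitness, crossover, mutate,
         generations=50, offspring_per_gen=10):
    """Jointly executed GA loop; all fitness values come from the secure protocol."""
    # Step 2: evaluate the initial population with the Secure Fitness Evaluation Protocol.
    scored = [(secure_fitness(ind), ind) for ind in population]
    for _ in range(generations):                                    # Steps 3-8: main loop
        scored.sort(key=lambda pair: pair[0], reverse=True)
        parents = [ind for _, ind in scored[:offspring_per_gen]]    # Step 4: selection
        offspring = [mutate(crossover(*random.sample(parents, 2)))  # Step 5: breeding
                     for _ in range(offspring_per_gen)]
        # Step 6: secure fitness evaluation of the offspring.
        scored_offspring = [(secure_fitness(ind), ind) for ind in offspring]
        # Step 7: replace the worst-ranked part of the population with the offspring.
        scored = scored[:-offspring_per_gen] + scored_offspring
    return [ind for _, ind in sorted(scored, key=lambda pair: pair[0], reverse=True)]
```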

3.2 Secure Fitness Function Evaluation for Rule Discovery

In this section, we describe how a fitness function for rule discovery can be securely computed. The performance of a classification rule with respect to its predictive accuracy can be summarized by a confusion matrix [14]. For example, for the rule (X = x_i) → (C = c_i), its confusion matrix is shown in Fig. 1. Each quadrant of the matrix refers to the following:

– N_TP (True Positives): Number of tuples belonging to class c_i whose class labels are correctly predicted by the rule; i.e., these tuples satisfy the TP condition (X = x_i) ∧ (C = c_i).

– N_FP (False Positives): Number of tuples not belonging to class c_i whose class labels are incorrectly predicted by the rule; i.e., these tuples satisfy the FP condition (X = x_i) ∧ (C ≠ c_i).
– N_FN (False Negatives): Number of tuples belonging to class c_i whose class labels are incorrectly predicted by the rule; i.e., these tuples satisfy the FN condition (X ≠ x_i) ∧ (C = c_i).
– N_TN (True Negatives): Number of tuples not belonging to class c_i whose class labels are correctly predicted by the rule; i.e., these tuples satisfy the TN condition (X ≠ x_i) ∧ (C ≠ c_i).

Note that there is a confusion matrix for each classification rule. The goodness (fitness) of each rule can be evaluated using its confusion matrix:

$$\text{Fitness} = \text{True Positive Rate} \times \text{True Negative Rate} \tag{1}$$

where True Positive Rate = N_TP / (N_TP + N_FN) and True Negative Rate = N_TN / (N_TN + N_FP). Other forms of fitness functions may also be defined using combinations of Precision, True Positive Rate, True Negative Rate, and/or Accuracy Rate, where Precision = N_TP / (N_TP + N_FP) and Accuracy Rate = (N_TP + N_TN) / (N_TP + N_FP + N_FN + N_TN). The fitness function defined in Eq. 1 has been used for classification rule discovery using genetic algorithms [2]. It has also been used as a rule quality measure in an ant colony algorithm for classification rule discovery [9].

Clearly, the fitness function in Eq. 1 is defined in terms of the four values N_TP, N_FP, N_FN, and N_TN. If these values can be computed in a secure manner, the overall fitness function can also be securely computed. In the remainder of this section, we show how each of the four values can be securely computed as the scalar product of two binary vectors.

Binary Vector Notation: Define a binary vector V_β^α, where α ∈ {a, b} (a ≡ Party A and b ≡ Party B) and β ∈ {TP, FP, FN, TN}. For a given classification rule r, vector V_β^α encodes the TP, FP, FN, or TN condition of r with respect to the private data partition of party α. More precisely, for a particular rule r: (X = x_i) → (C = c_i), party α generates the binary vector V_β^α = [V_1, V_2, ..., V_m]^T (m is the number of tuples in data set S), where

– V_i = 1 if (i) attribute X's value in the i-th tuple of S is held by the other party and not by α, or (ii) X's value in the i-th tuple of S is held by α and the i-th tuple satisfies the condition β;
– V_i = 0 otherwise.

Vector V_β^α contains partial information on how the tuples in α's partition satisfy β. For instance, V_TP^a captures partial information on those tuples in Party A's data partition that satisfy the TP condition with respect to the rule r. In this way, the value N_β (β ∈ {TP, FP, FN, TN}) for rule r can be computed by applying a secure scalar product protocol to vector V_β^a from Party A and vector V_β^b from Party B:

$$N_\beta = V_\beta^a \bullet V_\beta^b \tag{2}$$
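To illustrate the binary-vector encoding of Eq. 2 and the fitness function of Eq. 1 without any cryptography, here is a small Python sketch on a made-up arbitrarily partitioned data set. The toy data, the ownership map, and our reading of the vector definition (a party sets bit i to 1 unless a cell it holds for tuple i falsifies the corresponding literal of condition β) are assumptions for illustration; the plain dot products stand in for a secure scalar product protocol.

```python
# Non-cryptographic illustration of Eq. 1 and Eq. 2 on a toy data set.
# The data, its partitioning, and the helper names are made up; a real deployment
# would replace the plain dot products with a secure scalar product protocol.
import numpy as np

# Full (logical) data set S: m = 6 tuples, antecedent attribute X and class C.
X = ["x1", "x1", "x2", "x1", "x2", "x2"]
C = ["c1", "c2", "c1", "c1", "c2", "c2"]
m = len(X)

# Arbitrary partitioning: owner[attr][i] records which party holds that cell.
owner = {"X": ["A", "A", "B", "B", "A", "B"],
         "C": ["B", "A", "A", "B", "B", "A"]}

# Per-attribute literals of each condition for the rule (X = x1) -> (C = c1).
LITERALS = {"TP": {"X": lambda v: v == "x1", "C": lambda v: v == "c1"},
            "FP": {"X": lambda v: v == "x1", "C": lambda v: v != "c1"},
            "FN": {"X": lambda v: v != "x1", "C": lambda v: v == "c1"},
            "TN": {"X": lambda v: v != "x1", "C": lambda v: v != "c1"}}

def party_vector(party, beta):
    """V_beta^party: bit i is 1 unless this party holds a cell of tuple i that
    falsifies the corresponding literal of condition beta."""
    vec = np.ones(m, dtype=int)
    for attr, values in (("X", X), ("C", C)):
        for i in range(m):
            if owner[attr][i] == party and not LITERALS[beta][attr](values[i]):
                vec[i] = 0
    return vec

# Eq. 2: N_beta = V_beta^a . V_beta^b (computed in the clear here, for illustration).
N = {b: int(party_vector("A", b) @ party_vector("B", b)) for b in ("TP", "FP", "FN", "TN")}
tpr = N["TP"] / (N["TP"] + N["FN"])     # True Positive Rate
tnr = N["TN"] / (N["TN"] + N["FP"])     # True Negative Rate
print(N, "Fitness =", tpr * tnr)        # Eq. 1: {'TP': 2, 'FP': 1, 'FN': 1, 'TN': 2}, 0.444...
```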

Therefore, the four values N_TP, N_FP, N_FN, and N_TN may be securely computed using secure scalar products. When there is more than one attribute in the rule antecedent, we use a logical conjunction of their corresponding binary vectors to obtain a single binary vector.

Although the fitness value of a rule may now be securely computed, there is one weakness with the use of secure scalar product protocols. Vaidya and Clifton [12] have pointed out that existing secure scalar product protocols are subject to probing attacks. Party A may probe Party B's binary vector by generating a vector such as [0, 0, 1, 0, 0] containing only one element whose value is 1; Party A then uses this probing vector to learn Party B's third element value when computing the scalar product. Party A could likewise generate all such probing vectors to probe Party B's entire vector. To prevent probing attacks, we require a secure way to express the scalar product of two binary vectors as the sum of two component values, each held privately by one party. In this way, each party does not know the other party's component value, but the two parties are still able to jointly and securely determine the scalar product. In the following, we describe a protocol to compute the fitness function (Eq. 1) in this manner.

Secure Fitness Evaluation Protocol: There exist known protocols that express the scalar product of two vectors as the sum of two component values [3, 5]. Parties A and B may use any of these protocols to derive component values when computing N_TP, N_FP, N_FN, and N_TN. At the end of these protocols, each party knows only its own component values, but not those of the other party. This prevents probing attacks. The protocol to securely compute the fitness value of a rule is shown as Protocol 2. Steps 1 to 4 of the protocol yield the component values corresponding to N_TP, N_FP, N_FN, and N_TN. With these component values, Parties A and B may now jointly and securely compute the fitness value (Eq. 1) of a rule r in the genetic iteration process as shown:

$$
\begin{aligned}
\text{Fitness} &= \frac{V_{TP}^a \bullet V_{TP}^b}{V_{TP}^a \bullet V_{TP}^b + V_{FN}^a \bullet V_{FN}^b} \times \frac{V_{TN}^a \bullet V_{TN}^b}{V_{TN}^a \bullet V_{TN}^b + V_{FP}^a \bullet V_{FP}^b} \\
 &= \frac{v_{TP}^a + v_{TP}^b}{(v_{TP}^a + v_{TP}^b) + (v_{FN}^a + v_{FN}^b)} \times \frac{v_{TN}^a + v_{TN}^b}{(v_{TN}^a + v_{TN}^b) + (v_{FP}^a + v_{FP}^b)} \\
 &= \frac{(v_{TP}^a + v_{TP}^b) \times (v_{TN}^a + v_{TN}^b)}{\big[(v_{TP}^a + v_{FN}^a) + (v_{TP}^b + v_{FN}^b)\big] \times \big[(v_{FP}^a + v_{TN}^a) + (v_{FP}^b + v_{TN}^b)\big]}
\end{aligned}
$$

The above shows that the fitness value of a rule is expressed in terms of the private component values corresponding to N_TP, N_FP, N_FN, and N_TN of the rule. Thus, what remains is to compute the numerator and denominator of the fraction. This is accomplished in Steps 5 and 6 of Protocol 2, which require another protocol; this is shown as Protocol 3.

Protocol 2 Secure Fitness Evaluation Protocol
Input: Party A has four input vectors V_TP^a, V_FP^a, V_FN^a, and V_TN^a with respect to rule r. Party B also has four input vectors V_TP^b, V_FP^b, V_FN^b, and V_TN^b for rule r.
Output: Fitness = True Positive Rate × True Negative Rate, the fitness value of rule r.
1: Party A and Party B compute V_TP^a • V_TP^b using an existing secure scalar product protocol. At the end of the protocol, Party A and Party B each holds a private component value v_TP^a and v_TP^b respectively, where v_TP^a + v_TP^b = V_TP^a • V_TP^b.
2: Similarly, Party A and Party B compute V_FP^a • V_FP^b; at the end, each holds a private component value v_FP^a and v_FP^b respectively, where v_FP^a + v_FP^b = V_FP^a • V_FP^b.
3: Similarly, Party A and Party B compute V_FN^a • V_FN^b; at the end, each holds a private component value v_FN^a and v_FN^b respectively, where v_FN^a + v_FN^b = V_FN^a • V_FN^b.
4: Similarly, Party A and Party B compute V_TN^a • V_TN^b; at the end, each holds a private component value v_TN^a and v_TN^b respectively, where v_TN^a + v_TN^b = V_TN^a • V_TN^b.
5: Party A uses v_TP^a and v_TN^a, and Party B uses v_TP^b and v_TN^b, as inputs to execute Protocol 3. The output is u_1 = (v_TP^a + v_TP^b) × (v_TN^a + v_TN^b).
6: Party A uses v_TP^a + v_FN^a and v_FP^a + v_TN^a, and Party B uses v_TP^b + v_FN^b and v_FP^b + v_TN^b, as inputs to execute Protocol 3. The output is u_2 = [(v_TP^a + v_FN^a) + (v_TP^b + v_FN^b)] × [(v_FP^a + v_TN^a) + (v_FP^b + v_TN^b)].
7: Fitness = u_1 / u_2.

Protocol 3
Input: Party A has input values w_1^a and w_2^a. Party B has input values w_1^b and w_2^b.
Output: o = (w_1^a + w_1^b) × (w_2^a + w_2^b).
1: Party A and Party B securely compute the scalar product of the vectors W^a = [w_1^a, w_2^a]^T and W^b = [w_2^b, w_1^b]^T. At the end of this step, Party A and Party B each holds a private component value w^a and w^b respectively, where w^a + w^b = W^a • W^b.
2: Party A sends the partial result o^a = w_1^a × w_2^a + w^a to Party B.
3: Party B computes o^b = w_1^b × w_2^b + w^b and the final result o = o^a + o^b, and sends o to Party A.
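Below is an in-process Python sketch (ours, for illustration) of how Protocols 2 and 3 fit together. The function shared_scalar_product simply splits the true scalar product into two random additive shares; it is a stand-in for a real secure scalar product protocol [3, 5] and provides no security by itself.

```python
# In-process sketch of Protocols 2 and 3 (illustration only, not a secure implementation).
import random
import numpy as np

def shared_scalar_product(vec_a, vec_b):
    """Return (share_a, share_b) with share_a + share_b = vec_a . vec_b.
    Stand-in for a real secure scalar product protocol."""
    total = float(np.dot(vec_a, vec_b))
    share_a = random.uniform(-1000, 1000)      # Party A's private component value
    return share_a, total - share_a            # Party B's private component value

def protocol3(w1a, w2a, w1b, w2b):
    """o = (w1a + w1b) * (w2a + w2b), computed from additive shares (Protocol 3)."""
    # Step 1: shares of [w1a, w2a] . [w2b, w1b] = w1a*w2b + w2a*w1b.
    wa, wb = shared_scalar_product([w1a, w2a], [w2b, w1b])
    oa = w1a * w2a + wa                        # Step 2: Party A's partial result
    ob = w1b * w2b + wb                        # Step 3: Party B's partial result ...
    return oa + ob                             # ... combined into the output o

def protocol2(vecs_a, vecs_b):
    """Secure Fitness Evaluation Protocol (Protocol 2) over the four count shares."""
    shares = {b: shared_scalar_product(vecs_a[b], vecs_b[b])            # Steps 1-4
              for b in ("TP", "FP", "FN", "TN")}
    (tp_a, tp_b), (fp_a, fp_b) = shares["TP"], shares["FP"]
    (fn_a, fn_b), (tn_a, tn_b) = shares["FN"], shares["TN"]
    u1 = protocol3(tp_a, tn_a, tp_b, tn_b)                              # Step 5
    u2 = protocol3(tp_a + fn_a, fp_a + tn_a, tp_b + fn_b, fp_b + tn_b)  # Step 6
    return u1 / u2                                                      # Step 7: fitness

# Tiny check with vectors whose counts are N_TP=2, N_FP=1, N_FN=1, N_TN=2.
va = {"TP": [1, 1, 0, 0], "FP": [1, 0, 0, 0], "FN": [0, 0, 1, 0], "TN": [0, 0, 1, 1]}
vb = {"TP": [1, 1, 0, 0], "FP": [1, 0, 0, 0], "FN": [0, 0, 1, 0], "TN": [0, 0, 1, 1]}
print(protocol2(va, vb))    # ~ (2/(2+1)) * (2/(2+1)) = 0.444...
```

In a real execution the two shares would never reside in the same process; they are co-located here only so the arithmetic of Steps 5 to 7 is visible.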

4 Protocol Analysis

In this section, we show that Protocols 2 and 3 are correct and privacy-preserving during the iteration process of genetic algorithms.

4.1 Correctness

Protocol 3: The correctness of the protocol follows from:

$$
\begin{aligned}
o &= o^a + o^b \\
  &= \{w_1^a \times w_2^a + w^a\} + \{w_1^b \times w_2^b + w^b\} \\
  &= w_1^a \times w_2^a + W^a \bullet W^b + w_1^b \times w_2^b \\
  &= w_1^a \times w_2^a + [w_1^a, w_2^a]^T \bullet [w_2^b, w_1^b]^T + w_1^b \times w_2^b \\
  &= (w_1^a + w_1^b) \times (w_2^a + w_2^b)
\end{aligned}
$$
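As a quick sanity check (not part of the paper), the identity behind this derivation can be verified numerically for random integer inputs:

```python
# Numeric check of the identity o = (w1a + w1b)(w2a + w2b) behind Protocol 3.
import random

for _ in range(1000):
    w1a, w2a, w1b, w2b = (random.randint(-50, 50) for _ in range(4))
    wa = random.randint(-1000, 1000)            # Party A's random share
    wb = (w1a * w2b + w2a * w1b) - wa           # so that wa + wb = W^a . W^b
    oa, ob = w1a * w2a + wa, w1b * w2b + wb
    assert oa + ob == (w1a + w1b) * (w2a + w2b)
```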

Protocol 2: Party A has four input vectors V_TP^a, V_FP^a, V_FN^a, and V_TN^a. Party B has four input vectors V_TP^b, V_FP^b, V_FN^b, and V_TN^b. The output is Fitness = True Positive Rate × True Negative Rate. The protocol is correct by the decomposition of the fitness function given in Section 3.2.

Protocol 1: It is clear that the protocol is correct as it uses Protocol 2, which has been shown above to be correct.

4.2 Privacy Preservation

Protocol 3: We show that it is impossible for a dishonest Party B to deduce the values of an honest Party A, regardless of the number of times Protocol 3 is invoked. If Protocol 3 is invoked once, Party B learns only the intermediate values o and o^a from Party A:

$$o = (w_1^a + w_1^b) \times (w_2^a + w_2^b) = w_1^a w_2^a + w_1^a w_2^b + w_1^b w_2^a + w_1^b w_2^b, \qquad o^a = w_1^a \times w_2^a + w^a$$

From Party B's point of view, there are three unknown values w^a, w_1^a, and w_2^a in only two nonlinear equations. Even if the equations were linear, with three unknowns and two equations there would be infinitely many possible solutions, as the number of unknowns exceeds the number of equations. If Protocol 3 is invoked more than once, the values of w^a, w_1^a, and w_2^a are different each time, as they are randomly generated and known only to Party A. The more times the protocol is executed, the more unknowns Party B faces. Therefore, it is impossible for Party B to deduce the values of w^a, w_1^a, and w_2^a from the intermediate results of two or more executions of the protocol. Hence, Protocol 3 is secure.

Protocol 2: As shown above, the value v_TP^a of Party A (which corresponds to w_1^a in Protocol 3), as used in Step 5 of Protocol 2, is not disclosed during Protocol 3. Without v_TP^a, Party B is not able to guess, even by probing, any element values of vector V_TP^a of Party A. Likewise, Party B will not learn any element values of vectors V_FP^a, V_FN^a, and V_TN^a of Party A. Therefore, if Protocol 2 is invoked once, the data privacy of Party A is not compromised. However, the numerator u_1 and denominator u_2 in Protocol 2 are revealed. To achieve complete security, the secure division protocol proposed by Vaidya et al. [13] can be applied; according to the authors, however, this would significantly increase the complexity.

If Protocol 2 is invoked more than once, the values of v_TP^a, v_FP^a, v_FN^a, and v_TN^a are different each time, as they are randomly generated and known only to Party A. It is impossible for Party B to deduce any element values of the private vectors of Party A using the intermediate results. Thus, Protocol 2 is secure. On the whole, no matter how many times Protocol 2 is invoked by Protocol 1, the dishonest party is not able to learn any element values of the honest party's private vectors.
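The underdetermination argument can be illustrated with a toy example (values made up, not from the paper): for any guess of w_1^a, Party B can always find a w_2^a and a w^a that reproduce the observed pair (o, o^a) exactly, so the observation pins down none of Party A's values.

```python
# Toy illustration: the pair (o, oa) seen by Party B is consistent with infinitely
# many choices of Party A's private values (w1a, w2a, wa).  All values are made up.
w1b, w2b = 3.0, 7.0          # Party B's own inputs
o, oa = 40.0, 10.0           # values Party B observes in one run of Protocol 3

for w1a in [-2.0, 0.5, 1.0, 4.25, 9.0]:        # any guess for w1a works ...
    w2a = o / (w1a + w1b) - w2b                # ... because w2a can always be chosen
    wa = oa - w1a * w2a                        # ... and so can the random share wa
    assert abs((w1a + w1b) * (w2a + w2b) - o) < 1e-9
    assert abs((w1a * w2a + wa) - oa) < 1e-9
    print(f"w1a={w1a}, w2a={w2a:.3f}, wa={wa:.3f} reproduces (o, oa) exactly")
```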

Protocol 1: The protocol is secure as it uses Protocol 2, which has been shown above to be secure.

4.3 Complexity Analysis

The computational complexity and communication cost of the secure scalar product protocol used in Protocols 2 and 3 are denoted by O(φ(z)) and O(φ′(z)) respectively, where (1) z is the number of elements in the vectors; and (2) φ(z) and φ′(z) are the computational complexity and communication cost, respectively, of the chosen secure scalar product protocol as functions of z.

As Protocol 3 invokes the secure scalar product protocol once for two vectors of length 2 (Step 1), the overall computational complexity of Protocol 3 is O(φ(2)) = O(1) and its communication cost is O(φ′(2)) = O(1).

As Protocol 2 invokes the secure scalar product protocol four times for two vectors of length m (Steps 1–4) and Protocol 3 twice (Steps 5 and 6), the computational complexity of Protocol 2 is O(4φ(m)) + O(2φ(2)) = O(φ(m)), where m is the number of tuples in the data set. The communication cost is O(4φ′(m)) + O(2φ′(2)) = O(φ′(m)).

Since Protocol 1 uses Protocol 2 once (Step 6) at each iteration, the total computational complexity of Protocol 1 is T × O(φ(m)) = O(T × φ(m)) and the communication cost is T × O(φ′(m)) = O(T × φ′(m)), where T is the total number of iterations before the termination condition is satisfied.

5 Conclusions

Several traditional data mining algorithms have been adapted to become privacy-preserving: decision trees, association rule mining, k-means clustering, SVM, Naïve Bayes, and Bayesian networks. These algorithms generally assume that the original data set has been horizontally and/or vertically partitioned, with each partition privately held by a party. In this paper, we proposed protocols for securely using genetic algorithms for rule discovery on private arbitrarily partitioned data held by two parties without compromising their data privacy. As genetic algorithms are iterative, the challenge is not only to preserve the privacy of the data at each iteration, but also to ensure that the intermediate results produced at each iteration do not compromise the data privacy of the participating parties. We showed that the proposed protocols satisfy these two requirements. As the protocols currently work only for two parties, extending them to multiple parties is part of our future work.

References

1. R. Agrawal and R. Srikant. Privacy-preserving data mining. In Proceedings of the ACM International Conference on Management of Data, pages 439–450, Dallas, Texas, USA, 2000.
2. D. R. Carvalho and A. A. Freitas. A genetic algorithm with sequential niching for discovering small-disjunct rules. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 1035–1042, San Francisco, CA, USA, 2002. Morgan Kaufmann Publishers Inc.
3. W. Du and M. J. Atallah. Privacy-preserving cooperative statistical analysis. In Proceedings of the 17th Annual Computer Security Applications Conference, pages 102–110, New Orleans, Louisiana, USA, December 10–14, 2001.
4. A. A. Freitas. Data Mining and Knowledge Discovery with Evolutionary Algorithms. Springer-Verlag, Berlin, 2002.
5. B. Goethals, S. Laur, H. Lipmaa, and T. Mielikainen. On private scalar product computation for privacy-preserving data mining. In Proceedings of the 7th Annual International Conference on Information Security and Cryptology, pages 104–120, Seoul, Korea, December 2–3, 2004.
6. G. Jagannathan and R. N. Wright. Privacy-preserving distributed k-means clustering over arbitrarily partitioned data. In Proceedings of the ACM International Conference on Knowledge Discovery and Data Mining, pages 593–599, Chicago, Illinois, USA, 2005.
7. S. Laur, H. Lipmaa, and T. Mielikainen. Cryptographically private support vector machines. In Proceedings of the 12th ACM International Conference on Knowledge Discovery and Data Mining, pages 618–624, Philadelphia, PA, USA, 2006.
8. Y. Lindell and B. Pinkas. Privacy preserving data mining. In Advances in Cryptology, volume 1880 of Lecture Notes in Computer Science, pages 36–53. Springer-Verlag, 2000.
9. R. Parpinelli, H. Lopes, and A. Freitas. An ant colony based system for data mining: Applications to medical data. In L. Spector et al., editors, Proceedings of the Genetic and Evolutionary Computation Conference, pages 791–798, San Francisco, USA, July 2001. Morgan Kaufmann.
10. J. Vaidya and C. Clifton. Privacy preserving association rule mining in vertically partitioned data. In Proceedings of the 8th ACM International Conference on Knowledge Discovery and Data Mining, pages 639–644, Edmonton, Alberta, Canada, July 23–26, 2002.
11. J. Vaidya and C. Clifton. Privacy-preserving k-means clustering over vertically partitioned data. In Proceedings of the 9th ACM International Conference on Knowledge Discovery and Data Mining, pages 206–215, Washington, DC, 2003.
12. J. Vaidya and C. Clifton. Secure set intersection cardinality with application to association rule mining. Journal of Computer Security, 13(4), 2005.
13. J. Vaidya, M. Kantarcioglu, and C. Clifton. Privacy-preserving naive Bayes classification. The International Journal on Very Large Data Bases.
14. S. M. Weiss and C. A. Kulikowski. Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1991.
15. A. C. Yao. How to generate and exchange secrets. In Proceedings of the Annual IEEE Symposium on Foundations of Computer Science, pages 162–167, 1986.
16. H. Yu, X. Jiang, and J. Vaidya. Privacy-preserving SVM using nonlinear kernels on horizontally partitioned data. In Proceedings of the ACM Symposium on Applied Computing, pages 603–610, Dijon, France, 2006.
17. H. Yu, J. Vaidya, and X. Jiang. Privacy-preserving SVM classification on vertically partitioned data. In Proceedings of the 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining, volume 3918 of Lecture Notes in Computer Science, pages 647–656, Singapore, April 9–12, 2006.