Similarity Based Distributed Classification

Grigorios Tsoumakas, Lefteris Angelis, and Ioannis Vlahavas
Department of Informatics, Aristotle University of Thessaloniki
Thessaloniki 54006, Greece
{greg,lef,vlahavas}@csd.auth.gr

TECHNICAL REPORT 2002-LPIS-011

Abstract. Most distributed knowledge discovery approaches view data distribution as a technical issue and combine local models aiming at a single global model. This, however, is unsuitable for inherently distributed databases, which often produce models that differ semantically. In this paper we present an approach to distributed classification that uses the pairwise similarity of local models in order to produce a better model for each of the distributed databases. This is achieved by averaging the decisions of all local models, weighted by their similarity to the model induced from the origin of the unlabelled data.

1 Introduction

Nowadays, physically distributed databases are increasingly being used for knowledge discovery. Advances in network technology and the Internet, as well as the growing size of the data stored in today's information systems, have contributed to the proliferation of distributed database architectures. Globalization, business-to-business commerce and online collaboration between organizations have recently raised the need for inter-organizational data mining.

Most distributed knowledge discovery approaches view data distribution as a technical issue and treat distributed data sets as parts of a single global data set. This has been identified as a very narrow view of distributed knowledge discovery [10], [12]. Real-world, inherently distributed data sets have an intrinsic data skewness property: the data distributions in different partitions are not identical. For example, data related to heart diseases from hospitals around the world will have varying distributions due to differences in nutrition habits, climate and quality of life. The same is true for buying patterns identified in supermarkets at different regions of a big country, due to the different needs and habits of customers.

Distributed classification approaches in particular tend to combine the distributed predictive models in an attempt to derive a single global model. However, such a single model might not exist in the distributed data; there may instead be two or more groups of models. One should not blindly integrate local models, but should reason about the relations between local models before the integration.

This paper presents a new approach to distributed classification. It evaluates the pairwise similarities of local models and weights their decisions on an unlabelled data item by their similarity to the local model of the unlabelled data's origin. In this way it deals with the differences in model semantics and the skewness of data distributions present in inherently distributed databases.

2 Related Work

The way that multiple classifiers are combined is an important research issue that has been investigated in the past by the communities of statistics, pattern recognition, machine learning, knowledge discovery and data mining.

When only the label of the predicted class is available, the simplest combination method that can be used is Majority Voting [6], which does not require a training stage. In this case, the class that receives the most classifier predictions is the final result. Weighted Majority Voting [9] weights the decision of each classifier by its performance on the training data. When a measure of belief, confidence or certainty about the classification is available along with the class label, a number of different rules for combining these measures have been suggested, such as Sum, Min, Max, Prod and Median; [7] is an interesting study of these rules.

Stacked Generalization [13], also known as Stacking in the literature, is a method that combines multiple classifiers by learning the way that their output correlates with the true class on an independent set of instances. Chan and Stolfo [2] applied the concept of Stacked Generalization to distributed data mining via their Meta-Learning methodology. They focused on combining distributed data sets and investigated various schemes for structuring the meta-level training examples. Knowledge Probing [5] builds on Meta-Learning and in addition uses an independent data set in order to discover a comprehensible model. Davies and Edwards [3] follow a different approach to discover a single comprehensible model out of the distributed information sources. The main idea of their DAGGER algorithm is to selectively sample each distributed data set to form a new data set that is used for inducing the single model.
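For concreteness, the sketch below illustrates two of the fixed combination rules mentioned above: majority voting over predicted labels and the sum rule over posterior estimates. The function names and array shapes are our own illustrative choices, not code from the cited works.

```python
import numpy as np

def majority_vote(labels):
    """Majority voting [6]: the class predicted by most classifiers wins."""
    values, counts = np.unique(np.asarray(labels), return_counts=True)
    return values[np.argmax(counts)]

def sum_rule(posteriors):
    """Sum rule, one of the fixed rules studied in [7].
    posteriors: array of shape (n_classifiers, n_classes)."""
    return int(np.argmax(np.asarray(posteriors).sum(axis=0)))
```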

3 Similarity of Predictive Models

In this section, a general method for evaluating the similarity of two predictive models that participate in an ensemble is formulated. The desirable characteristics of such a method are:

– Independence of the predictive model type. The method should be able to evaluate the similarity of two models, whether they are decision trees, rules, neural networks, Bayesian classifiers or other. This is necessary in applications where different types of learning algorithms are used at each node.
– Independence of the predictive model opacity. The method should be able to evaluate the similarity of two models even if they are black boxes, providing just an output with respect to an input. This is necessary in applications where the evaluated models might come from different organizations that would not like to share their models' details.

Under these constraints, a solution is to measure model similarity based on their performance on the individual data items e of an independent evaluation data set E. The evaluation data set has to be independent of the training data set of each model to avoid biased results. The process of evaluation is then the following. For each data item e ∈ E, we calculate the output of each one of the N models M_x(e), x = 1..N, which we consider to be a posterior probability distribution PD_{xj}(e) with respect to each class C_j, j = 1..|C|, where C is the set of all classes.

Our next consideration is to define a pairwise similarity measure for the calculated output of the models on an evaluation item e. This measure should observe the following requirements for two models M_x and M_y:

– 0 ≤ s_e(M_x, M_y) ≤ 1
– s_e(M_x, M_x) = 1
– s_e(M_x, M_y) = s_e(M_y, M_x)

Such a similarity measure could be one of the following [8]:

1. Gower's similarity coefficient

   s_e(M_x, M_y) = 1 - \delta_e(M_x, M_y)   (1)

   where

   \delta_e(M_x, M_y) = \frac{1}{|C|} \sum_{j=1}^{|C|} \frac{|PD_{xj}(e) - PD_{yj}(e)|}{R_j(e)}

   is a dissimilarity coefficient and R_j(e) = \max\{PD_{1j}(e), ..., PD_{Nj}(e)\} - \min\{PD_{1j}(e), ..., PD_{Nj}(e)\} is the range of the probabilities for class C_j. Note that the division of each term of the above sum by the range weights the absolute differences, in the sense that when the range is large (i.e. there are very high or very low probabilities) the contribution of the absolute difference is small.

2. Similarity coefficient obtained by a distance measure

   We can construct a similarity measure from any distance metric d_e(M_x, M_y) by the relation:

   s_e(M_x, M_y) = \frac{1}{1 + d_e(M_x, M_y)}   (2)

   Several distance measures are available, such as:

   – Euclidean Distance

     d_e(M_x, M_y) = \sqrt{\sum_{j=1}^{|C|} (PD_{xj}(e) - PD_{yj}(e))^2}   (3)

   – Scaled Euclidean Distance

     d_e(M_x, M_y) = \sqrt{\sum_{j=1}^{|C|} w_j(e) (PD_{xj}(e) - PD_{yj}(e))^2}   (4)

     where

     w_j(e) = \frac{1}{R_j(e)}  or  w_j(e) = \frac{1}{stddev\{PD_{1j}(e), PD_{2j}(e), ..., PD_{Nj}(e)\}}

   – Canberra Metric

     d_e(M_x, M_y) = \sum_{j=1}^{|C|} \frac{|PD_{xj}(e) - PD_{yj}(e)|}{PD_{xj}(e) + PD_{yj}(e)}   (5)

   – Czekanowski Coefficient

     d_e(M_x, M_y) = 1 - \frac{2 \sum_{j=1}^{|C|} \min\{PD_{xj}(e), PD_{yj}(e)\}}{\sum_{j=1}^{|C|} (PD_{xj}(e) + PD_{yj}(e))}   (6)

We proceed by computing one of the similarity measures defined above for every e ∈ E. Finally, the overall similarity of two models is calculated as their average similarity with respect to each data item:

   s_E(M_x, M_y) = \frac{1}{|E|} \sum_{e \in E} s_e(M_x, M_y)   (7)
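As an illustration, the following NumPy sketch implements Gower's similarity coefficient (Eq. 1 with its range-based dissimilarity) and the Euclidean-distance-based similarity (Eqs. 2-3), averaged over an evaluation set as in Eq. 7. It assumes each model exposes a scikit-learn-style predict_proba method; the function names and the handling of zero ranges are our own assumptions, so this is a sketch rather than the authors' implementation.

```python
import numpy as np

def gower_similarity(probs_e, x, y):
    """Eq. (1): Gower's similarity of models x and y on one evaluation item.
    probs_e has shape (N, |C|): the posterior of every model on the item."""
    ranges = probs_e.max(axis=0) - probs_e.min(axis=0)   # R_j(e) per class
    ranges[ranges == 0.0] = 1.0   # assumption: classes on which all models agree contribute 0
    delta = np.mean(np.abs(probs_e[x] - probs_e[y]) / ranges)
    return 1.0 - delta

def euclidean_similarity(probs_e, x, y):
    """Eqs. (2)-(3): similarity derived from the Euclidean distance."""
    return 1.0 / (1.0 + np.linalg.norm(probs_e[x] - probs_e[y]))

def model_similarity(models, E, measure=gower_similarity):
    """Eq. (7): average pairwise similarity of N models over evaluation set E.
    Returns a symmetric (N, N) matrix with ones on the diagonal."""
    probs = np.stack([m.predict_proba(E) for m in models])   # (N, |E|, |C|)
    N, n_items, _ = probs.shape
    S = np.eye(N)
    for x in range(N):
        for y in range(x + 1, N):
            s = np.mean([measure(probs[:, e, :], x, y) for e in range(n_items)])
            S[x, y] = S[y, x] = s
    return S
```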

4 Similarity Based Distributed Classification

Our approach performs distributed classification based on the similarity of local models in the following two stages: (a) Similarity Evaluation and (b) Similarity Weighted Sum Rule.

4.1 Similarity Evaluation

In this stage, the pairwise similarity of the distributed classifiers has to be evaluated. To achieve this, our methodology requires the availability of independent evaluation data. For each pair of models we can use the data from all distributed databases apart from the ones that they were trained on. There is no need to gather all these data at one place, as the evaluation methodology is incremental. Only the local models are broadcast, so that they become available at each distributed node. For example, if we have 3 local models M_1, M_2 and M_3 trained on data sets D_1, D_2 and D_3, then we evaluate the similarity of M_1 and M_2 on D_3, that of M_1 and M_3 on D_2 and that of M_2 and M_3 on D_1.

In general, if there are N distributed models (N > 2), then each pair of models M_x, M_y is evaluated on N - 2 data sets and the mean is retained as their overall similarity:

   S_D(M_x, M_y) = \frac{1}{N-2} \sum_{i=1, i \neq x, i \neq y}^{N} s_{D_i}(M_x, M_y)   (8)

This means that the total number of evaluations is:

   \frac{N!}{2!(N-2)!} (N-2) = \frac{N(N-1)(N-2)}{2}

However, as the evaluations happen in parallel, the number of evaluations at each distributed node is:

   \frac{(N-1)(N-2)}{2}

and the computational complexity of the evaluation stage is O(N^2).

During the evaluation stage, all available data that were not used for training one of the two models are used for their evaluation. This is achieved without moving any of the raw data around, an important constraint for real-world distributed databases. The only network traffic is the models themselves, which have negligible size for standard bandwidth. The number of models that have to be exchanged through the network is N(N-1). Therefore the size of the required network traffic is also O(N^2).
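A minimal sketch of this evaluation stage is given below, reusing the model_similarity helper from the previous sketch. It assumes models[i] was trained on datasets[i] and simply runs all per-node evaluations in one process; in the distributed setting each s_{D_i} term would be computed locally at node i after the models have been broadcast.

```python
import numpy as np  # model_similarity is assumed to come from the previous sketch

def distributed_similarities(models, datasets):
    """Eq. (8): S_D(Mx, My) as the mean of s_{D_i}(Mx, My) over the N - 2
    data sets that neither model was trained on."""
    N = len(models)
    per_node = [model_similarity(models, D) for D in datasets]  # one s_{D_i} matrix per node
    S = np.eye(N)
    for x in range(N):
        for y in range(x + 1, N):
            vals = [per_node[i][x, y] for i in range(N) if i not in (x, y)]
            S[x, y] = S[y, x] = float(np.mean(vals))
    return S
```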

4.2 Similarity Weighted Sum Rule

At each node, the similarity of each pair of models is broadcast, so that all nodes can calculate the overall similarity of each pair of models. When a new unlabelled item u arrives at node x, the probability distribution of the final decision is calculated by averaging the probability distributions of all models, weighted by their similarity with the model of the node that the data item comes from:

   PD_{xj}(u) = \frac{1}{N} \sum_{y=1}^{N} PD_{yj}(u) \cdot S_D(M_y, M_x)   (9)

The final decision is the class with the maximum degree of certainty. Equation 9 shows that our approach does not actually produce a single model that applies to all distributed data, but rather a model for each of the distributed data sets. In real-world applications, unlabelled data will probably belong to one of the distributed databases. Therefore, taking into account the origin of the unlabelled data assists in making a better decision. Our approach gives more weight to models similar to the one that was trained at the origin of the unlabelled data.

In addition, there might not be a global model in the first place. As explained in the introduction, distributed models may differ semantically, and a single global model could only approximate the two or more groups of models that are hidden within the distributed data. A global model in this case would contain the average knowledge of the distributed models, which is not really useful for any of the distributed databases. In contrast, our approach produces models that separately describe each database according to the knowledge of similar local models.

There is also an expected gain in the predictive performance of each model due to the correction of the errors of individual classifiers through the sum rule. Model similarity plays an important role here. If all local models contributed equally, the performance could decrease, because local models induced from inherently distributed databases differ not randomly, which is essential for the success of ensemble learning [4], but systematically.
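The classification step of Eq. (9) can be sketched as follows, building on the earlier sketches. The predict_proba interface, the reshape of the single item and the explicit class list are assumptions of this illustration, not part of the original formulation.

```python
import numpy as np

def classify_at_node(u, origin, models, S, classes):
    """Eq. (9): similarity-weighted sum rule for an unlabelled item u that
    arrives at node `origin`; S is the matrix of overall similarities S_D."""
    probs = np.stack([m.predict_proba(u.reshape(1, -1))[0] for m in models])  # (N, |C|)
    weights = S[:, origin]                            # S_D(My, Mx) for every model y
    decision = (weights[:, None] * probs).sum(axis=0) / len(models)
    return classes[int(np.argmax(decision))]          # class with maximum certainty
```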

4.3 Punishing Very Small and Very Large Similarities

One could argue that a linear combination of similarities and decisions might not be the best way of weighting the decisions, for the following reasons:

– Theoretical and experimental work from the field of multiple classifier systems and learning classifier ensembles has shown that diversity in an ensemble of classifiers is a good factor for increased accuracy. We could therefore punish very large similarities instead of giving more weight to them.
– The classification results could be better if we gave even less weight to the decisions of the classifiers that are not very similar to the model of the unlabelled data's origin.

To deal with these issues we need a transformation of the original similarity in order to give less weight to very small and very large values. Since the similarities used are real numbers between 0 and 1, we propose the Beta Transformation, which transforms a similarity measure s in the following way:

   B(s) = \frac{s^{a-1} (1-s)^{b-1}}{B_{a,b}}   (10)

where a, b > 1 and

   B_{a,b} = \int_0^1 t^{a-1} (1-t)^{b-1} dt = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}

and

   \Gamma(a) = \int_0^{\infty} t^{a-1} e^{-t} dt

with \Gamma(n+1) = n! for integer n. The parameters a and b control the properties of the transformation. Figure 1 shows the shape of the function for various parameter combinations (a=4, b=2; a=8, b=2; a=9, b=1.5; a=3, b=3; a=2.5, b=5.5).


Fig. 1. Shape of Beta Transformation for various parameters
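Equation (10) can be computed directly from the Gamma function. The sketch below uses Python's math.gamma and defaults to the a=7, b=3 configuration reported for Gower's coefficient in Table 1; the default values are our choice for illustration only.

```python
from math import gamma

def beta_transform(s, a=7.0, b=3.0):
    """Eq. (10): Beta transformation of a similarity s in [0, 1], with a, b > 1."""
    beta_ab = gamma(a) * gamma(b) / gamma(a + b)   # B_{a,b} via the Gamma function
    return (s ** (a - 1)) * ((1.0 - s) ** (b - 1)) / beta_ab
```

The transformed similarities would then take the place of S_D(M_y, M_x) when weighting the decisions in Eq. (9).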

5 Experimental Results

In order to evaluate the performance, and thus the applicability, of our methodology for the task of distributed classification, we conducted a series of experiments concerned with the following issues:

– What are the optimal parameters of the beta transformation?
– Which similarity measure offers the best accuracy?
– How does similarity based distributed classification compare, in terms of predictive performance, to simple and state-of-the-art approaches, like majority voting, stacking and the average performance of local models?

We give answers to all of the above issues using the heart-disease data sets from the UCI Machine Learning repository [1]. Let us note here the lack of many inherently distributed data sets, which is a hindrance to experimental research in distributed knowledge discovery. One could perform experiments that simulate the distribution by splitting large ordinary data sets. However, this alternative is not very realistic, as it lacks the skewness of data distributions and the conceptual differences of local models that are present in inherently distributed data sets.

All of the results mentioned below are averages over 100 runs, where each data set is split into 70% training data and 30% test data, and C4.5 [11] is used to learn all local models and the global model in Stacking.
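One possible reading of this protocol is sketched below, with scikit-learn's DecisionTreeClassifier as a stand-in for C4.5; the helper name, the (X, y)-per-node data layout and the seed handling are our assumptions, not part of the original experimental setup.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier  # stand-in for C4.5

def run_splits(datasets, n_runs=100, test_size=0.30, seed=0):
    """Repeated 70/30 splits of every distributed data set; yields, per run,
    the trained local models and the held-out test sets."""
    rng = np.random.RandomState(seed)
    for _ in range(n_runs):
        models, tests = [], []
        for X, y in datasets:               # one (X, y) pair per distributed node
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=test_size, random_state=rng.randint(1 << 30))
            models.append(DecisionTreeClassifier().fit(X_tr, y_tr))
            tests.append((X_te, y_te))
        yield models, tests
```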

5.1 Beta-Transformation Parameters

Intuitively, the beta transformation should have parameters that favor similarities, but punish very large and very small similarities. The question is what the optimal configuration of such parameters is. To answer this question we measured the errors of similarity based distributed classification using all combinations of the parameters a and b, where a, b ∈ {2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12}. We use integer values for the parameters to simplify the computation. We must note that this space of parameter values is limited, but large enough to explore the use of the beta transformation.

Figure 2 depicts the mean classification error surface over the space of parameters a and b with respect to Gower's similarity coefficient and the test sets of the heart-disease data. The lowest error rates were found for large values of a and small values of b, which favor large, but not very large or very small, values of similarity; this verifies our expectation. The error surfaces for the rest of the measures are similar, apart from the Canberra metric. Table 1 presents the best parameter configurations for all similarity measures.
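The parameter search itself is a plain exhaustive grid, as sketched below; evaluate_error(a, b) stands for a run of similarity based distributed classification with Beta-transformed similarities and is assumed rather than defined here.

```python
def search_beta_parameters(evaluate_error, grid=range(2, 13)):
    """Exhaustive search over integer a, b in {2, ..., 12} (Section 5.1)."""
    best = None
    for a in grid:
        for b in grid:
            err = evaluate_error(a, b)       # mean classification error for (a, b)
            if best is None or err < best[0]:
                best = (err, a, b)
    return best                              # (lowest error, best a, best b)
```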

5.2 Similarity Measures

Now that we have found the optimal parameters for each of the similarity measures, we can proceed to compare the predictive performance of each measure. Table 1 shows the accuracy of similarity based distributed classification using each measure on each test set of the heart-disease data sets, as well as the average over all test sets. The maximum predictive accuracy for each test set is shown in bold. The conclusion is that all measures offer more or less the same predictive accuracy. The best two measures, Gower's similarity coefficient and the scaled Euclidean distance, both adjust their values with respect to the variance of the local models' decisions. This is probably a positive feature of a similarity measure.


Fig. 2. Gower’s similarity coefficient error surface

Table 1. Predictive accuracy of similarity measures

Measure            a    b    Clev.     Hung.      Switz.    Calif.    Aver.
Gower's            7    3    75.6190   100.0000   27.6341   30.0641   58.3293
Euclidean         11    3    75.6236   100.0000   26.9462   29.2661   57.9590
Scaled Euclidean  12    4    75.2279   100.0000   28.1032   28.9223   58.0633
Canberra           8    8    75.1363   100.0000   26.4729   29.9656   57.8937
Czekanowski       11    2    75.4980   100.0000   27.9958   28.6737   58.0418

5.3 Comparative Results

We now compare the Similarity Based Distributed Classification approach using Gower's similarity coefficient and the Beta Transformation (B-SBDC) to the same approach without the Beta Transformation (SBDC), Majority Voting (MV), Weighted Majority Voting (Weighted MV), Stacking and the average performance of local models. Table 2 presents the results.

First of all, we notice that for this task of distributed classification from inherently distributed databases, Majority Voting and Weighted Majority Voting exhibit rather poor performance. Secondly, we notice that the Beta Transformation did not really offer much in terms of accuracy. The results are more or less similar for B-SBDC and SBDC, with the former being better on average, but the latter being better in 2 out of 4 data sets. The reason behind this could be that for these data sets there are no models that are similar enough to be punished. Perhaps the Beta Transformation is more useful for distributed models that have small random differences.

Table 2. Comparative results of predictive accuracy

Approach       Cleveland  Hungarian  Switzerland  California  Average
B-SBDC          75.2563   100.0000    25.3106      29.3747    57.4854
SBDC            75.7942   100.0000    22.8832      29.4816    57.0397
MV              66.8010    99.0477    14.0163      27.8624    51.9318
Weighted MV     63.7493    99.6593    13.0253      27.1271    50.8902
Stacking           -          -          -            -       61.5873
Local Models    74.2341   100.0000    27.1685      22.7938    56.0491

SBDC and B-SBDC are on average the best approaches after Stacking, and SBDC is better than Stacking in half of the data sets. Stacking outputs a single model that behaves like an average model, trying to maximize the overall performance with respect to any future data it will receive. It does not use information about the origin of the unlabelled data. It also requires gathering data in a single place to train the final model, which is time-consuming and complex. SBDC and B-SBDC are fundamentally different from the rest of the approaches in that they take into account the relationships of local models, and thus are more suitable for distributed classification from inherently distributed databases.

6 Conclusions and Future Work

This paper has presented a new framework for distributed classification based on the pairwise similarity of local models. Its main strong points are that it takes into account the origin of the unlabelled data and the differences between the local models, and that it manages to produce an accurate model for each distributed database. Furthermore, it is incremental and does not require moving raw data around.

There are extensions of this work towards ensemble learning and multiple classifier systems that need to be investigated. The Beta Transformation could probably be more useful in environments of multiple similar classifiers. Perhaps the most interesting issue in terms of future work is similarity based clustering of the models. This would allow detecting groups of similar models and making decisions within each group to achieve better results.

References

1. Catherine L. Blake and Christopher J. Merz. UCI repository of machine learning databases. http://www.ics.uci.edu/~mlearn/MLRepository.html, 1998.
2. Philip Chan and Salvatore Stolfo. Meta-learning for multistrategy and parallel learning. In Proceedings of the Second International Workshop on Multistrategy Learning, 1993.
3. Winston Davies and Pete Edwards. DAGGER: A new approach to combining multiple models learned from disjoint subsets. Machine Learning, 2000.
4. Thomas G. Dietterich. Ensemble methods in machine learning. In Multiple Classifier Systems, pages 1-15, 2000.
5. Yike Guo and Janjao Sutiwaraphun. Probing knowledge in distributed data mining. In Proceedings of the PAKDD'99 Conference, Beijing, China, 1999.
6. Fumitaka Kimura and Malayappan Shridhar. Handwritten numerical recognition based on multiple algorithms. Pattern Recognition, 24(10):969-983, 1991.
7. Josef Kittler, Mohamad Hatef, Robert P. W. Duin, and Jiri Matas. On combining classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226-238, March 1998.
8. Wojtek J. Krzanowski. Principles of Multivariate Analysis: A User's Perspective. Oxford Science Publications, 1993.
9. Louisa Lam and Ching Y. Suen. Optimal combinations of pattern classifiers. Pattern Recognition Letters, 16:945-954, 1995.
10. Foster Provost. Distributed data mining: Scaling up and beyond. In Advances in Distributed and Parallel Knowledge Discovery. MIT Press, 2000.
11. J. Ross Quinlan. C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, 1993.
12. Ruediger Wirth, Michael Borth, and Jochen Hipp. When distribution is part of the semantics: A new problem class for distributed knowledge discovery. In Proceedings of the PKDD 2001 Workshop on Ubiquitous Data Mining for Mobile and Distributed Environments, pages 56-64, 2001.
13. David Wolpert. Stacked generalization. Neural Networks, 5:241-259, 1992.