Learning from Label Proportions by Optimizing Cluster ... - Eldorado

Learning from Label Proportions by Optimizing Cluster Model Selection Marco Stolpe1 and Katharina Morik1 Technical University of Dortmund, Artificial Intelligence Group Baroper Strasse 301, 44227 Dortmund, Germany [email protected] and [email protected] http://www-ai.cs.tu-dortmund.de/

Abstract. In a supervised learning scenario, we learn a mapping from input to output values, based on labeled examples. Can we learn such a mapping also from groups of unlabeled observations, only knowing, for each group, the proportion of observations with a particular label? Solutions have real world applications. Here, we consider groups of steel sticks as samples in quality control. Since the steel sticks cannot be marked individually, for each group of sticks it is only known how many sticks of high (low) quality it contains. We want to predict the achieved quality for each stick before it reaches the final production station and quality control, in order to save resources. We define the problem of learning from label proportions and present a solution based on clustering. Our method empirically shows a better prediction performance than recent approaches based on probabilistic SVMs, Kernel k-Means or conditional exponential models.

1

Introduction

Consider a steel factory where charges of steel sticks are processed sequentially at several production stations. The quality of sticks is assessed at the end of the process. For each stick though, we are given sensor measurements and other parameters during its being processed. Based on this information, we want to predict the quality of individual sticks as early as possible, before they reach the final production station and quality control. This saves resources, because sticks that can no longer reach the desired quality can be locked out. The steel sticks cannot be marked and tracked. Therefore, the available quality information is not related to single sticks, but charges of sticks. For each charge, we know how many sticks had a certain type of error (quality). We want to learn a prediction function from the sensor measurements of the process and the error type counts of charges. The learned model is used to predict the error type for individual sticks at intermediate production stations. We generalize this learning problem. It deviates from that of supervised learning, where we learn from individually labeled training examples. It is different from semi-supervised learning [5], where we are given at least some labeled examples. Since we have some label information, it is not strictly unsupervised

2

learning. Multiple instance learning [23] is a special case, because the bags of examples are either labeled positive or negative, where we have proportions of labels for each charge. In this paper, we contribute a clustering approach for the problem which has the following properties: – – – – – – –

It empirically shows good prediction performance. Learning has linear running time in the number of observations. Its prediction models are fast to apply. It can handle the case of multiple classes. It can handle additional labeled observations, if they exist. It can weight the relevance of features. It is independent of a certain clustering method.

To the best of our knowledge, no other method exists yet which shares all of these properties. The paper is structured as follows. Section 2 formally defines the learning task and accompanying error measures. We analyze best and worst case from a Bayesian perspective. Section 3 presents a new method for learning from label proportions, LLP. In Sect. 4, we compare the prediction performance and runtime of LLP with other existing methods. In Sect. 5, we shortly discuss related work and conclude.

2

Learning from Label Proportions

In the following, we will first define the task of learning from label proportions. Then, we introduce accompanying measures for evaluating the performance of learners and discuss the problem of model selection. In the last subsection, we explore best and worse case by analyzing the problem from a Bayesian perspective. 2.1

The Learning Task

The task of learning from label proportions can be defined as follows. Definition 1 (Learning from label proportions). Let X be an instance space composed of a set of features X1 × . . . × Xm and Y = {y1 , . . . , yl } be a set of categorical class labels. Let P (X, Y ) be an unknown joint distribution of observations and their class label. Given is a sample of unlabeled observations U = {x1 , . . . , xn } ⊂ X, drawn i.i.d. from P , partitioned into h disjunct groups G1 , . . . , Gh . Further given are the proportions πij ∈ [0, 1] of label yj in group Gi , for each group and label. Based on this information, we seek a function (model) g : X → Y that predicts y ∈ Y for observations x ∈ X drawn i.i.d. from P , such that the expected error ErrP = E[L(Y, g(X))]

(1)

3 Labeled examples (unknown)

Label proportions (known)

G1 = {(x1 , 1), (x3 , 1), (x7 , 0)} G2 = {(x2 , 0), (x4 , 0), (x5 , 1), (x6 , 1)} G3 = {(x8 , 0), (x9 , 0)}

Y = {0, 1}

Π=

Sample U (known)

G1 = {x1 , x3 , x7 } G2 = {x2 , x4 , x5 , x6 } G3 = {x8 , x9 }

n=9 h=3 l=2

η



y1

0.33 0.50 1.00

y2

 0.67 |G1 | = 3 0.50 |G2 | = 4 0.00 |G3 | = 2

0.56 0.44

Fig. 1: Example for a given label proportion matrix Π

for a loss function L(Y, g) is minimized. The loss penalizes the deviation between the known and predicted label value for an individual observation x. The main difference to a supervised learning scenario is that the labels of individual observations are unknown or hidden, i.e., there is no set of labeled training instances (xi , yi ) ∈ X × Y . The label proportions πij can be conveniently written as a h × l matrix Π = (πij ), where the values in a row Πi,· = (πi1 , . . . , πil ) sum up to one. (see Fig. 1). The proportion of label yj over sample U can be calculated from Π: h

η(Π, yj ) =

1X |Gi | · πij n i=1

(2)

By multiplication of πij with its respective group size |Gi |, one gets the frequency counts µij of observations with label yj ∈ Y in group Gi . 2.2

Training and Test Error

In a supervised learning scenario, a learner can adjust its current hypothesis based on the average loss on the training set. In contrast, when learning from label proportions, one can only measure how well the given proportions are matched. Applying the learned model g(X) to all xi ∈ U , the resulting label proportions can be calculated, i.e., in each group one counts the number of observations xi with g(xi ) = yj for each label yj ∈ Y and divides the counts by the size of their respective group. This leads to a new matrix Γg , containing the model-based label proportions:

Γg =

g (γij ),

g γij

1 X = I(g(x), yj ), |Gi | x∈Gi

I=

1 : g(x) = yj 0 : g(x) 6= yj

(3)

Similarly to defining a loss function for individual observations, it is now possible to define a loss function for individual matrix entries, for example by

4 g 2 ) . The total deviance between Π and Γg can then the squared error (πij − γij be defined as the average squared error over all matrix entries:

ErrM SE (Π, Γg ) =

h l 1 XX g 2 ) (πij − γij hl i=1 j=1

(4)

Calculating ErrM SE for a function g(X) on sample U might be seen as an analogon to the training error in supervised learning. However, the loss in ErrM SE uses aggregated label information, although by Definition 1, we really need to minimize the loss over individual observations. The mismatch between the two measures can lead to problems, because many labelings of U can minimize ErrM SE . For example, when randomly sampling µij many observations from Gi and assigning them label yj , the model-based label proportions will always match exactly the given proportions. Hence, labelings that minimize ErrM SE don’t necessarily minimize the average loss over individual observations, already on sample U . Therefore, obtaining a good estimate of ErrP is difficult. In supervised learning, one may select the model which has the lowest average loss over one or several test sets. But without labels for individual observations in the test set, only knowing its label proportions, a low ErrM SE on the test set is no reliable indicator for a good model. As for the training error, many different labelings can lead to the same label proportions, but only a few labelings will minimize ErrP . However, if given a labeled test set, it is possible to evaluate different models as in the supervised case. For the experiments in Sect. 4, the error between Π and Γg wasn’t measured by ErrM SE , but by ErrΠ , a combination of two different error measures: ErrΠ (Γg ) = Errweight (Π, Γg ) =

q Errweight (Π, Γg ) · Errprior (Π, Γg ) with

(5)

h l 1 XX |Gi | g 2 η(Π, yj ) (πij − γij ) hl i=1 j=1 n

(6)

and

l

Errprior (Π, Γg ) =

1X 2 (η(Π, yj ) − η(Γg , yj )) l j=1

(7)

Errweight weights the squared error of individual matrix entries by their relative group and class size. Errprior catches situations where two hypotheses g1 and g2 are indistinguishable from each other, because the total error sum over all matrix entries is the same. In such cases, they may be distinguished by their column differences, as calculated by η. If in addition to the label proportions, the labels c1 , . . . , ct of individual observations T = {a1 , . . . , at } ⊆ U are given, error criterion (5) can be easily extended by including the average loss ErrT over these training examples: t

ErrΠ =

p Errweight · Errprior · ErrT

with

ErrT =

1X L(ci , g(ai )) t i=1

(8)

5

2.3

Best and Worst Case from a Bayesian Perspective

From a Bayesian perspective, a good model can be obtained by estimating the conditional class density P (Y |X). Applying Bayes theorem, one recognizes that P (Y |X) may also be estimated from other unknown densities—the classconditional density P (X|Y ) and the class prior density P (Y ): P (Y |X) =

P (X|Y )P (Y ) P (X)

(9)

P (X) doesn’t need to be estimated, since it can be calculated from P (X|Y ) and P (Y ). When learning from label proportions, the class prior P (yj ) for label yj can be estimated as η(Π, yj ), the proportion of yj over sample U . However, finding a good estimate for P (X|Y ) depends on the distribution of observations over the given groups and the form of matrix Π. In the best case, each group Gi only contains observations from a single class and at least l groups contain observations from different classes. If πij = 1 appears in a row, all observations in the group must have the same label, which can be assigned to all group members. We then are in a familiar supervised learning scenario and can choose from many well-known classifiers for training. However, without further knowledge about the distribution of observations over the groups, the best we can assume is a uniform distribution. Here, in the worst case, all πij are equal. Then, if we interpret πij as an estimate for the class prior P (yj |Gi ) of group Gi , it equals the estimated class prior P (yj ). Since each observation has the same probability of being sampled into group Gi , and we assumed all priors to be equal, we can only guess the correct label with probability 1/l. In general, if the number of groups remains constant, P (yj |Gi ) will approximate P (yj ) for large sample sizes n. This can be seen if we imagine each group Gi to be an independent data set with observations sampled from the same distribution P (X, Y ). For cases where all P (yj ) are different, a better performance can be achieved than in the worst case. One can at least predict the majority class. The question is if one can get any better. Except for the best case, the estimation of P (X|Y ) is difficult, because observations with the same label are spread over all groups. The LLP algorithm, introduced in the next section, is based on the idea that observations sharing the same label might also have similar feature values.

3

Learning from Label Proportions by Clustering

The k-Nearest-Neighbour classifier predicts the majority label of k known observations closest to a given search point. It is presupposed that observations lying close together in local regions of the input space also share the same class label. If we could somehow identify these local groups of observations, which is the problem of cluster analysis, the only problem left was to assign the correct labels to the clusters.

6

Definition 2 (Cluster analysis). Given a sample U of n unlabeled observations x1 , . . . , xn and a measure d : X × X → R+ for the dissimilarity of observations, the aim of cluster analysis is to determine a set C = {C1 , . . . , Ck } (clustering) of subsets Ci ⊂ U (clusters), such that observations within the same cluster are more similar to each other than those in different clusters, as meaX sured by a quality function q : 22 → R+ . Many algorithms have been developed for solving this task. We focus on those returning disjunct clusters, like the well-known k-Means algorithm [16], which was also used for the experiments in Sect. 4. Given a clustering, it must be found out which cluster best represents which class. The problem is solved by assigning each cluster a label such that ErrΠ (5) is minimized (see Sect. 3.3). In how far similar observations share the same class label not only depends on P (X, Y ), but also on the chosen similarity measure. According to Hastie et al. [13], the relevance of features can have an enormous influence on the clustering results. Therefore, the similarity measure should respect weights wf ∈ [0, 1] for each feature, as given by a vector w = (w1 , . . . , wm ). In unsupervised learning, such weights are usually specified by a domain expert. Here, the relevance weights can be approximated automatically (see Sect. 3.2), based on criterion ErrΠ . In the next section, the accompanying optimization problem is stated. Then, the LLP algorithm for solving it is described. 3.1

Optimization Problem

Let the vector λC = (λ1 , . . . , λk ) with λj ∈ Y represent a labeling for a clustering C = {C1 , . . . , Ck }. Let mλC : U → Y be a mapping that returns for a given observation x ∈ Ci the label λi . Given a clustering C, we are searching for a labeling λC of the clusters that minimizes the error (5) between the model-based label proportions ΓmλC and the known label proportions: min ErrΠ (ΓmλC )

(10)

λC

Let qw be a function which is able to assess the quality of a clustering based on a similarity measure that respects feature weights. We are searching for a clustering which maximizes qw and whose labeling most minimizes ErrΠ , for all possible weight vectors w. This optimization problem can be stated as follows:

min ErrΠ (Γmλ∗ ), w

3.2

C

λ∗C = argmin ErrΠ (ΓmλC∗ ), λC ∗

C ∗ = argmax qw (C)

(11)

C

The LLP Algorithm

The LLP algorithm solves problem (11) by an evolutionary strategy. For each weight vector w, the sub-optimization problem of maximizing qw is solved by an inner clustering algorithm, where the particular qw depends on the algorithm.

7

Algorithm 1 The LLP algorithm. Input: Label proportion matrix Π, sample U , groups G = {G1 , . . . , Gh }, labels Y = {y1 , . . . , yl }, clustering algorithm clusterer, labeling algorithm labeler, parameters maxgen, psize, mutvar, crossprob, tournsize Output: Clustering C, labeling λC , weight vector w best_f it := −∞; generation := 0 Randomly initialize a population P of psize normalized weight vectors while generation < maxgen do for w ∈ P do C := clusterer( U , w ) (λC , ErrΠ ) := labeler( C, G, Π, Y ) if best_f it < −ErrΠ then best_f it := −ErrΠ ; best_C := C; best_λC := λC ; best_w := w end if end for generation := generation + 1 if generation < maxgen then P copy := P Gaussian mutation of all individual weights in P copy with variance mutvar P children := Uniform crossover on P ∪ P copy with probability crossprob P := Tournament selection with size tournsize on P ∪ P copy ∪ P children end if end while return best_C, best_λC , best_w

The only prerequisite for the clusterer is that it returns disjunct clusters and respects different feature weights. The sub-optimization problem (10) is independent from the clusterer and currently can be solved by two different labeling heuristics introduced in Sect. 3.3. Using an evolutionary strategy as a wrapper has the advantage that it is not necessary to integrate criterion ErrΠ into the optimization problem of the inner clustering algorithm. For example, we already have run LLP successfully with Kernel k-Means [10], DBSCAN [12] and PROCLUS [1], without modification. Moreover, LLP can be used with different error measures, for instance with criterion (8) that can respect individually labeled examples. LLP (see Alg. 1) takes a clustering algorithm clusterer and a labeling algorithm labeler as parameters, in addition to Π, U , G1 , . . . , Gh and Y , which are related to the label proportions learning task. LLP then approximates the optimal weight vector w and returns w, the related clustering C and labels λC for the clusters. The evolutionary strategy starts with a random population P of normalized weight vectors, wi ∈ [0, 1]. For each individual in P , the clustering algorithm clusterer is called. The clusters are labeled according to the given labeling algorithm labeler and the fitness is evaluated by criterion ErrΠ . If the fitness is higher than the best fitness seen so far, the newly found clustering, labeling and weight vector are memorized as the new best ones. In each generation, the

8

Algorithm 2 The greedy labeling algorithm. Input: Clustering C = {C1 , . . . , Ck }, groups G = {G1 , . . . , Gh }, label proportion matrix Π, labels Y = {y1 , . . . , yl } Output: Labeling λC = (λ1 , . . . , λk ) Initialize components of λC with y1 ; for i := 1 to k do lowest_error := 0; best_label := y1 ; for j := 1 to l do λC [i] := yj ; Γm := count_labels( G, C, λC ); current_error := ErrΠ (Γm ); if current_error < lowest_error then lowest_error := current_error; best_label := yj ; end if end for λC [i] := best_label; end for return λC

weight values in a copy of P are mutated by a Gaussian distribution and, with a certain probability, exchanged with P by a crossover operator. Then, the individuals take part in a tournament and only the best ones are kept in the next generation. This process is repeated until the maximum number of generations as specified by the user is reached. 3.3

Labeling Heuristics

The following two labeling algorithms solve the sub-optimization problem (10) heuristically. Greedy Labeling As shown in Alg.2, in the initial step, all clusters get label y1 . Then, consecutively for each cluster, we calculate Γm for all labels and memorize the label that most reduces ErrΠ (Γm ). The strategy has runtime k · l. Exhaustive Labeling Since k can be restricted to a small number and l = 2 for a binary classification problem, trying lk possible labelings for a clustering C is no problem. In our experiments (see section 4), good solutions often were found for k ≤ 6. For each labeling, we need to calculate ErrΠ , which takes linear time in the number of observations n. The calculations only involve basic operations like count, addition, multiplication and division (see (5)). 3.4

Run-Time Analysis

The user-specified parameters maxgen, psize and tournsize are constants. They do not depend on the number of observations n and limit the number of iterations

9

to be constant. As discussed in Sect. 3.3, the asymptotic run-time of the heuristic labeling strategies is linear in n, as k and l are constants and the evaluation of (5) takes linear time. The asymptotic run-time of LLP will otherwise depend on the used cluster algorithm. For example, if we allow for approximate solutions and limit the number of iteration steps, k-Means has linear run-time. Hence, LLP has linear run-time. 3.5

Generating a Prediction Model

The LLP algorithm returns labeled clusters of sample U . It is then possible to assign labels to individual observations xi ∈ U with mλC . To predict the labels of new observations, the clustering must be transformed into a prediction model. The way to do this depends on the used clustering algorithm. In the case of k-Means, one can simply assign new observations to their closest cluster mean and predict the corresponding cluster label. A big advantage of the cluster mean model is that it is usually very small, as k n, and therefore fast to apply. Another option for getting a prediction model is to train a classifier like Naïve Bayes [14] or a Support Vector Machine [22] in a subsequent step, based on the now labeled observations. However, this increases the training time.

4

Experiments

We have compared the LLP algorithm to three state-of-the-art methods for learning from label proportions: The Mean Map method [19], Inverse Calibration (Invcal) [21] and AOC Kernel k-Means (AOC-KK) [6]. For a further discussion of these methods, see Sect. 5. LLP has been implemented in Java. As inner clustering algorithm, we have used Fast k-Means [11], which is a variant of k-Means utilizing the triangle inequality for faster distance calculations. As distance measure, we have used the weighted Euclidean distance. We have implemented AOC-KK using a combination of Java and Matlab. For Mean Map and Invcal, we used R scripts provided by the author of Invcal [21]. 4.1

Prediction Performance Experiments

The prediction performance (accuracy) of LLP, AOC-KK, Invcal and Mean Map has been assessed on the eight UCI [3] data sets shown in Table 1. We have mapped each possible value of a nominal feature to a binary numerical feature with values 0 or 1. Numerical features were normalized to the [0, 1] interval. Table 1 shows the number of features m after this preprocessing step. In each single experiment, the accuracy has been assessed by a 10-fold crossvalidation. For learning from label proportions, we have partitioned the training set of a particular fold into groups of size σ, by uniform sampling of observations. We tried several group sizes σ: 2, 4, 8, 16, 32, 64 and 128 (with the last group smaller than σ, if necessary). The label proportions were calculated and the

10

Table 1: UCI data sets used for the experiments. Dataset credita vote colic ionosphere

n 690 435 368 351

m 42 16 60 34

Dataset sonar diabetes breast cancer heartc

n 208 768 286 303

m 60 8 38 22

individual labels removed. In each fold, the accuracy of the learned prediction model has been calculated on a labeled test set. The kernel methods Mean Map, Invcal and AOC-KK have been tested with the linear kernel, polynomial kernels of degree 2 and 3 and radial basis kernels (γ = 0.01, 0.1 and 1.0). LLP has been tested with both labeling heuristics (see Sect. 3.3), for cluster sizes k ∈ [2, 12]. As parameters for the evolutionary strategy, we used maxgen = 10, psize = 25, mutvar = 1.0, crossprob = 0.3 and tournsize = 0.25. By running LLP with k-Means, we get a prediction model consisting of cluster means. The same is true for AOC-KK. However, the cluster methods also assign labels to each observation in sample U , allowing for a subsequent training of other classifiers. Based on the clustering results, we have trained models for Naïve Bayes [14], kNN [2], Decision Trees [20], Random Forests [4], and the SVM [22] with linear and radial basis kernel. The model parameters have been optimized by a grid or evolutionary search. The combination of all datasets, group sizes, classifiers, their variants and parameters results in a total of 13.216 experiments: 672 for Mean Map and Invcal, 2.688 for AOC-KK and 9.856 for LLP. For group sizes 16, 32, 64 and 128 on the datasets colic and sonar, and for group size 128 on credita, we conducted additional experiments with LLP for maxgen = 5 and psize = 100. In some cases, we got a better prediction accuracy. All experiments took about three weeks. They were run in parallel on up to six machines with an AMD Dual-Core or Quad-Core Opteron 2220 processor and a maximum of 4 GB main memory.

4.2

Prediction Performance Results

Figure 2 contains plots of the highest achieved accuracies for all data sets and group sizes, based on the best performing models of LLP, AOC-KK, Invcal and MeanMap, over all conducted experiments. LLP shows a higher accuracy than Invcal for many group sizes on the data sets credita, vote, colic, sonar and breast cancer. On credita, vote, ionosphere, sonar and diabetes, the variance of accuracy between group sizes is smaller for LLP in comparison to the other methods. Mean Map performs worse than LLP and Invcal in many cases. The performance of AOC-KK varies, depending on the data set. It shows good performance on breast cancer and heartc, but not on the others. Except

0.90 0.85 0.80 0.75 0.70 0.65 0.60 0.55 0.502

credita

0.95

LLP Invcal Mean Map AOC-KK 4 8

16

group size

32

64

accuracy LLP Invcal Mean Map AOC-KK 4 8

16

group size

32

64


16

group size

32

64

128

32

64

128

32

64

128

0.75


16

group size

32

64

0.602

128

breastcancer

16

32

64

128

16

group size

heartc

0.80 0.75 0.70

group size


0.85

0.68 LLP Invcal Mean Map AOC-KK 4 8

0.70 0.65

accuracy

accuracy

0.75

diabetes

0.72

0.622

128

ionosphere

sonar

0.70

0.64

64

0.85

0.652

128

0.74

0.66

32

0.80

0.70

accuracy

accuracy accuracy

0.76

16

group size

0.90

0.70

0.90 0.85 0.80 0.75 0.70 0.65 0.60 0.55 0.502


0.95

0.75

0.552

0.85 0.80

0.702

128

0.80

0.65

0.90

0.75

colic

0.85

0.60

vote

1.00

accuracy

accuracy

11

2


16

group size

Fig. 2: Highest accuracies for all data sets and group sizes, over all 13.216 runs of LLP, AOC-KK, Invcal and MeanMap (plus the additional runs of LLP with maxgen = 5 and psize = 100). Some values for Mean Map and group size 128 are missing in the plots, due to an error in the R script.

12

Table 2: Average ranks of classifiers by group size, and their difference to LLP’s rank, based on the best models for each data set and group size. Positive difference values indicate a better performance of LLP. Highest ranks and significant differences (higher than CD) at the 10%-level are marked in bold. σ LLP AOC-KK Invcal Mean Map AOC-KK Invcal Mean Map

2

4

8 16 32 Average Ranks 2.500 1.875 1.500 1.875 1.625 2.000 2.750 3.000 2.875 2.625 2.000 1.875 2.375 2.125 2.125 3.500 3.500 3.125 3.125 3.625 Differences, CD 2. At the 10%-level, LLP is significantly better than AOC-KK for σ = 8, better than Invcal for σ = 128 and better than Mean Map for σ = 4, 8, 32 and 64. In all other cases, LLP performs equivalently. The ranks in Table 3 are based on different models than those in Table 2. For LLP and AOC-KK, we have only included the best performing cluster mean models. We have compared them to the best performing models of Invcal and Mean Map, i.e. to different kernels. The cluster mean models perform significantly better than Mean Map for σ = 64 and better than Invcal for σ = 128. In

13

Table 3: Average ranks of classifiers by group size, and their difference to LLP’s rank. Ranks are based on the best performing models of Invcal and Mean Map and the best performing cluster mean models of LLP and AOC-KK. Positive difference values indicate a better performance of LLP. Highest ranks and significant differences (higher than CD) at the 10%-level are marked in bold. σ LLP AOC-KK Invcal Mean Map AOC-KK Invcal Mean Map

2

4

8 16 32 Average Ranks 2.375 2.375 2.000 2.250 2.250 3.750 3.250 3.125 2.875 2.875 2.125 1.625 2.125 1.750 1.625 2.750 2.750 2.750 3.125 3.250 Differences, CD 2, LLP had a higher average rank than AOC-KK. LLP needs only linear training time, while in contrast, AOC-KK solves a quadratic optimization problem in each iteration step of Kernel k-Means.

6

Conclusions and Future Work

We have presented a new approach for learning from label proportions, the LLP algorithm. With k-Means as the clustering algorithm, LLP has only linear worst-case training time and its cluster mean models are small and fast to apply. In comparison to state-of-the-art methods, which need more training time, the cluster mean models have shown significantly better or equivalent prediction accuracy. By training other classifiers on the labeled clusters, the highest achieved accuracy of LLP was significantly different in even more cases, and LLP had the highest average rank for all σ > 2. In the future, we want to evaluate LLP’s performance on data from the steel factory, as mentioned in the introduction. Moreover, it would be interesting to assess LLP’s prediction performance with multi class problems and additional labeled observations. Another direction is to use different clustering algorithms with LLP and compare their performance. Acknowledgements This work has been supported by the DFG, Collaborative Research Center SFB876, project B3.

References 1. Aggarwal, C.C., Wolf, J.L., Yu, P.S., Procopiuc, C., Park, J.S.: Fast algorithms for projected clustering. In: Proc. of the Int. Conf. on Management of Data. pp. 61–72. SIGMOD ’99, ACM, New York, NY, USA (1999)

16 2. Aha, D.: Tolerating noisy, irrelevant, and novel attributes in instance-based learning algorithms. Int. J. of Man-Machine Studies 36(2), 267–287 (1992) 3. Asuncion, A., Newman, D.J.: UCI machine learning repository (2007) 4. Breimann, L.: Random forests. Machine Learning 45, 5–32 (2001) 5. Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning. MIT Press, Cambridge, MA (2006) 6. Chen, S., Liu, B., Qian, M., Zhang, C.: Kernel k-Means based framework for aggregate outputs classification. In: Proc. of the Int. Conf. on Data Mining Workshops (ICDMW). pp. 356–361 (2009) 7. Dara, R., Kremer, S., Stacey, D.: Clustering unlabeled data with SOMs improves classification of labeled real-world data. In: Proc. of the 2002 Int. Joint Conf. on Neural Networks (IJCNN). vol. 3, pp. 2237–2242 (2002) 8. Demiriz, A., Bennett, K., Bennett, K.P., Embrechts, M.J.: Semi-supervised clustering using genetic algorithms. In: Proc. of Artif. Neural Netw. in Eng. (ANNIE). pp. 809–814. ASME Press (1999) 9. Demsar, J.: Statistical comparisons of classifiers over multiple data sets. J. Mach. Learn. Res. 7, 1–30 (2006) 10. Dhillon, I., Guan, Y., Kulis, B.: Kernel k-Means: spectral clustering and normalized cuts. In: Proc. of the 10th Int. Conf. on Knowl. Discov. and Data Mining. pp. 551– 556. SIGKDD ’04, ACM, New York, NY, USA (2004) 11. Elkan, C.: Using the triangle inequality to accelerate k-means. In: Proc. of the 20th Int. Conf. on Machine Learning (ICML) (2003) 12. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proc. of the 2nd Int. Conf. on Knowl. Discov. and Data Mining. pp. 226–231. AAAI Press (1996) 13. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Statistics, Springer, 2nd edn. (2009) 14. John, G.H., Langley, P.: Estimating continuous distributions in bayesian classifiers. In: Proc. of the 11th Conf. on Uncertainty in Artif. Int. pp. 338–345. Morgan Kaufmann, San Francisco (1995) 15. Kueck, H., de Freitas, N.: Learning about individuals from group statistics. In: Uncertainty in Artif. Int. (UAI). pp. 332–339. AUAI Press, Arlington, Virginia (2005) 16. MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Symp. Math. stat. & prob. pp. 281–297 (1967) 17. Mitchell, T.M.: Machine Learning. McGraw-Hill (1997) 18. Musicant, D.R., Christensen, J.M., Olson, J.F.: Supervised learning by training on aggregate outputs. In: Proc. of the 7th Int. Conf. on Data Mining (ICDM). pp. 252–261. IEEE Computer Society, Washington, DC, USA (2007) 19. Quadrianto, N., Smola, A.J., Caetano, T.S., Le, Q.V.: Estimating labels from label proportions. J. Mach. Learn. Res. 10, 2349–2374 (December 2009) 20. Quinlan, J.R.: Induction of decision trees. Machine Learning 1(1), 81–106 (1986) 21. Rüping, S.: SVM classifier estimation from group probabilities. In: Proc. of the 27th Int. Conf. on Machine Learning (ICML) (2010) 22. Vapnik, V.: The nature of statistical learning theory. Springer, New York, 2nd edn. (1999) 23. Witten, I.H., Eibe, F., Hall, M.A.: Data Mining: Practical Machine Learning Tools and Techniques. Data Management Systems, Elsevier, Inc., Burlington, MA, 3rd edn. (2011)