Sample Size Issues in the Choice between the Best Classifier and Fusion by Trainable Combiners

Sarunas Raudys1, Giorgio Fumera2, Aistis Raudys1, and Ignazio Pillai2

1 Vilnius University, Faculty of Mathematics and Informatics, Naugarduko st. 24, LT-03225 Vilnius, Lithuania
{sarunas.raudys,aistis.raudys}@mif.vu.lt
2 University of Cagliari, Dept. of Electrical and Electronic Eng., Piazza d'Armi, 09123 Cagliari, Italy
{fumera,pillai}@diee.unica.it
http://pralab.diee.unica.it

Abstract. We consider an open issue in the design of pattern classifiers, i.e., choosing between selecting the best classifier of a given ensemble and combining all the available ones using a trainable fusion rule. While the latter choice can in principle outperform the former, the actual effectiveness of both is affected by small sample size problems. This raises the need of investigating under which conditions one choice is better than the other. We provide a first contribution by deriving an analytical expression of the expected error probability of best classifier selection, and by comparing it with that of a well-known linear fusion rule, implemented with the Fisher linear discriminant.

Keywords: data dimensionality, sample size, complexity, collective decision, expert fusion, accuracy, classification.

1 Introduction

Classifier ensembles have become a state-of-the-art approach for designing pattern classifiers (or “experts”), as an alternative to the traditional approach of using a single classification algorithm. One of the reasons is that identifying the best expert for a given task is often difficult, due to small sample size effects that arise when the data available for expert design are scarce. In this case, an expert's performance cannot be reliably estimated, whereas combining all the available experts (e.g., by averaging their outputs, or by majority voting) can prevent the choice of the worst one. In principle, trainable fusion rules (e.g., weighted averaging or voting) can even outperform the best expert; in practice, since their parameters have to be estimated from a data set, their performance is also affected by the small sample size. Sample size problems have been studied both theoretically (e.g., [10,5]), including the choice between fixed and trained fusion rules [1,6,7], and in applications like investment portfolio design [8]. Instead, the problem of choosing between best expert selection and experts' combination


with trainable fusion rules has been addressed so far only for regression tasks [3]. In this paper we address this problem in the context of classification tasks, and provide a first contribution toward the investigation of the conditions under which one choice is better than the other, in terms of the sample size and of factors like the ensemble size, the performance of the individual experts, and their correlations. Our main contribution is the derivation of an analytical expression of the expected error probability resulting from the selection of the best expert of a given ensemble (Sect. 2). This allows us to compare it, in Sect. 4, with the error probability of a well-known trainable fusion rule, the linear combination of the experts' outputs by the Fisher Linear Discriminant (summarized in Sect. 3), obtaining some preliminary insights and suggestions for future work.

2 Accuracy of Best Expert Selection

Problem formulation. We consider a common setting in classifier design, where m different experts C_1, ..., C_m are available for a given classification task (obtained, e.g., using different classification algorithms), and the designer has to choose between using only the best (most accurate) individual expert and combining all the available ones with a given trainable fusion rule. Each expert implements a decision function C_i : X → Y, where X is a given feature space and Y denotes the set of class labels. We assume that the experts have already been trained, and that a validation set V made up of n_V i.i.d. samples drawn from the (unknown) distribution P(x, y), x ∈ X, y ∈ Y, is available for estimating their performance and for training the fusion rule. We assume that V is different from the training set, to avoid optimistically biased estimates. We denote as e_Gi = P(C_i(x) ≠ y), i = 1, ..., m, the true ("genuine") but unknown error probability of C_i, and as e_Vi the corresponding estimate computed as the error rate on V (i.e., the fraction of misclassified validation samples), which is a random variable (r.v.). The goal of this section is to derive an analytical expression of the expected error probability incurred by selecting the expert exhibiting the lowest error rate min_i e_Vi (i.e., the apparent best expert), assuming that ties are randomly broken. This is very difficult when the r.v. e_Vi are statistically dependent, as happens if they are computed on the same validation set. Therefore, we start by considering the simplest, albeit less realistic, case in which a distinct and independent validation set of size n_V is used for each expert, which implies that the e_Vi's are independent too. Then we refine our results by developing a tractable model of their correlation.

Independent estimates of the error rate. Under this assumption, the joint probability of the e_Vi's conditioned on the e_Gi's is given by:

$$ P(e_{V_1}, \ldots, e_{V_m} \mid e_{G_1}, \ldots, e_{G_m}) = \prod_{i=1}^{m} P(e_{V_i} \mid e_{G_i}) . \qquad (1) $$

Let r_i ∈ {0, ..., n_V} be a r.v. denoting the number of validation samples misclassified by C_i. Since each e_Vi is estimated on n_V i.i.d. samples as r_i/n_V, each of them follows a Binomial distribution:

$$ P(r_i = s) = P\left(e_{V_i} = \tfrac{s}{n_V} \,\middle|\, e_{G_i}\right) = \frac{n_V!}{(n_V - s)!\, s!}\, (e_{G_i})^{s} (1 - e_{G_i})^{n_V - s} . \qquad (2) $$

To compute the error probability of the apparent best expert, we have to consider m disjoint events, denoted as S_1, ..., S_m, where S_k is the event that k different experts attain the same, smallest error rate, and thus, for k > 1, one of them is randomly selected as the best expert. Consider first the event S_1, and denote as C_i the expert that exhibits the smallest error rate. This means that C_i misclassifies s ∈ {0, ..., n_V − 1} validation samples (with probability given by Eq. 2), and all the other experts C_j, j ≠ i, misclassify more than s samples. The probability of the latter event, denoted as P_i^{S_1}(s), is given by:

$$ P_i^{S_1}(s) = \prod_{j=1,\, j \neq i}^{m} P(r_j > s) , \qquad (3) $$

where P(r_j > s) can be computed as:

$$ P(r_j > s) = P\left(e_{V_j} > \tfrac{s}{n_V} \,\middle|\, e_{G_j}\right) = \sum_{s' = s+1}^{n_V} P(r_j = s') . \qquad (4) $$

The probability that C_i exhibits the smallest error rate is thus given by:

$$ P_i^{S_1} = \sum_{s=0}^{n_V - 1} P_i^{S_1}(s)\, P(r_i = s) . \qquad (5) $$

Consider now the event S_2, and denote as C_{i_1} and C_{i_2} the experts that exhibit the smallest error rate. Similarly to Eq. (3), the probability that all the other experts misclassify more than s samples is:

$$ P_{i_1,i_2}^{S_2}(s) = \prod_{j=1,\, j \neq i_1, i_2}^{m} P(r_j > s) , \qquad (6) $$

and thus the probability that C_{i_1} and C_{i_2} exhibit the smallest error rate is:

$$ P_{i_1,i_2}^{S_2} = \sum_{s=0}^{n_V - 1} P_{i_1,i_2}^{S_2}(s)\, P(r_{i_1} = s)\, P(r_{i_2} = s) . \qquad (7) $$

One can similarly derive the probabilities of events S_3, ..., S_m. Eventually, the expected error probability of best expert selection, denoted as P_SEL^ind ("ind" denotes the underlying independence assumption), is given by:

$$ P_{SEL}^{ind} = \sum_{i=1}^{m} P_i^{S_1}\, e_{G_i} + \sum_{i_1=1}^{m} \sum_{\substack{i_2=1 \\ i_2 \neq i_1}}^{m} P_{i_1,i_2}^{S_2}\, \frac{e_{G_{i_1}} + e_{G_{i_2}}}{2} + \sum_{i_1=1}^{m} \sum_{\substack{i_2=1 \\ i_2 \neq i_1}}^{m} \sum_{\substack{i_3=1 \\ i_3 \neq i_1, i_2}}^{m} P_{i_1,i_2,i_3}^{S_3}\, \frac{e_{G_{i_1}} + e_{G_{i_2}} + e_{G_{i_3}}}{3} + \ldots \qquad (8) $$
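For concreteness, the following Python sketch (an illustration of ours, not part of the original derivation) evaluates the first two terms of Eq. 8 under the independence assumption, using the Binomial model of Eq. 2. The function name p_sel_ind and the truncation to the S_1 and S_2 terms are our own choices, and tied experts are handled by summing once over each unordered pair.

```python
import numpy as np
from itertools import combinations
from scipy.stats import binom

def p_sel_ind(e_g, n_v):
    """First two terms of Eq. 8: expected error of apparent-best-expert selection,
    assuming an independent validation set of n_v samples per expert."""
    e_g = np.asarray(e_g, dtype=float)
    m = len(e_g)
    s = np.arange(n_v)                                      # s = 0, ..., n_v - 1
    pmf = np.array([binom.pmf(s, n_v, e) for e in e_g])     # P(r_i = s), Eq. 2
    sf = np.array([binom.sf(s, n_v, e) for e in e_g])       # P(r_i > s), Eq. 4

    p_sel = 0.0
    # Event S1: a single expert C_i attains the smallest error rate (Eqs. 3 and 5).
    for i in range(m):
        others = [j for j in range(m) if j != i]
        p_s1 = float(np.sum(pmf[i] * np.prod(sf[others], axis=0)))
        p_sel += p_s1 * e_g[i]
    # Event S2: exactly two experts tie on the smallest error rate (Eqs. 6 and 7);
    # each unordered pair is counted once, and the tie is broken at random.
    for i1, i2 in combinations(range(m), 2):
        others = [j for j in range(m) if j not in (i1, i2)]
        p_s2 = float(np.sum(pmf[i1] * pmf[i2] * np.prod(sf[others], axis=0)))
        p_sel += p_s2 * (e_g[i1] + e_g[i2]) / 2.0
    return p_sel

# The 7-expert ensemble considered below, with n_V = 200 validation samples.
print(p_sel_ind([0.015, 0.02, 0.04, 0.07, 0.09, 0.11, 0.13], n_v=200))
```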

When the e_Gi's are assumed to be known for the purpose of a theoretical analysis, Eq. 8 can be computed exactly for moderate values of m. Nevertheless, a good approximation can be obtained by considering only the first few terms of Eq. 8, depending on the sample size n_V and on the values of e_G1, ..., e_Gm. As an example, we evaluated the accuracy of the approximation obtained using only the three terms explicitly shown in Eq. 8. To this end we considered a two-class problem, an ensemble of m = 7 experts with [e_G1, ..., e_G7] = [0.015, 0.02, 0.04, 0.07, 0.09, 0.11, 0.13], and different values of n_V, assuming that n_V/2 samples of each class are present in the validation sets. For each C_i we generated the number r_i of misclassified samples from a Binomial distribution with parameters e_Gi and n_V (see Eq. 2), and selected the apparent best expert. We then computed the average true error of the selected expert over 100,000 runs of the above procedure. We finally compared this value with the theoretical one of Eq. 8, considering only the first three terms. Fig. 1 (left) shows the empirical values of P_SEL and the approximated theoretical values, as functions of n_V. It can be seen that the approximation is very good, although the approximation error tends to increase as the sample size n_V decreases.

Fig. 1. Left: empirical error rate of best expert selection (squares), and theoretical value (circles) approximated with Eq. 8, as functions of n_V, for independent validation sets. Right: the same comparison for the case of dependent validation sets, for n_V = 200 and n_V = 400, as a function of n_VB; theoretical values are computed using Eq. 9.
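The simulation procedure just described can be sketched in a few lines of Python (again an illustration of ours, with hypothetical function names; the Binomial sampling follows Eq. 2 and ties are broken at random):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_p_sel(e_g, n_v, n_runs=100_000):
    """Empirical expected error of apparent-best-expert selection, with an
    independent validation set of n_v samples per expert (ties broken at random)."""
    e_g = np.asarray(e_g, dtype=float)
    # r[k, i] = number of validation samples misclassified by expert i in run k (Eq. 2).
    r = rng.binomial(n_v, e_g, size=(n_runs, len(e_g)))
    # Random tie-breaking: a small perturbation (< 1) cannot reorder distinct counts,
    # but picks uniformly at random among tied minima.
    noise = rng.random(r.shape) * 0.5
    best = np.argmin(r + noise, axis=1)
    return e_g[best].mean()

e_g = [0.015, 0.02, 0.04, 0.07, 0.09, 0.11, 0.13]
for n_v in (50, 100, 200, 400):
    print(n_v, simulate_p_sel(e_g, n_v))
```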

Dependent estimates of the error rate. If the validation sets used for the m experts are not independent, the r.v. e_Vi are not independent either. In practice, the same validation set is typically used for all experts, but the analytical derivation of P_SEL in this case is infeasible, since all m(m − 1)/2 pairwise correlations between the e_Vi's must be taken into account. We therefore resort in this paper to a simplifying assumption which allows us to model the correlations in a tractable way, such that an analytical approximation of P_SEL can be derived. The investigation of other correlation models that can lead to a better approximation is left as future work. Let us denote as I[C_i(x) = y] the classification outcome (either a correct classification or a misclassification) of C_i on a given sample (x, y), where I[A] denotes the indicator function (I[A] = 1 if A = True, and I[A] = 0 otherwise). We assume that the validation set of each expert is made up of two parts, V_A of size n_VA and V_B of size n_VB = n_V − n_VA, such that the correlation ρ_C between the classification outcomes of any pair of experts, I[C_i(x) = y] and I[C_j(x) = y], equals 1 for any (x, y) ∈ V_A, and equals 0 for any (x, y) ∈ V_B. In other words, any sample in V_A is either correctly classified or misclassified by all the experts, whereas the classification outcomes are independent for any sample in V_B (similarly to the case discussed above). Accordingly, the error rates of the experts on V_A are identical, and their ranking depends only on the samples in V_B. Let e_Gmin = min_i e_Gi, and let P_SEL^ind denote the error probability of the best expert selected on V_B (which can be computed as in Eq. 8, using n_VB instead of n_V). The expected error probability of best expert selection under the above correlation model is:

$$ P_{SEL}^{corr} = \frac{n_{V_A}}{n_V}\, e_{G_{min}} + \frac{n_{V_B}}{n_V}\, P_{SEL}^{ind} . \qquad (9) $$

To give an example of how accurately Eq. 9 approximates the error probability of best expert selection in a realistic scenario in which the same validation set is used for all experts, we consider again a two-class problem, an ensemble of m = 7 experts, and the same e_Gi values as in the example above. We first generated n_V artificial soft outputs for each expert, with identical m(m − 1)/2 pairwise correlations, computed the corresponding error rates by thresholding the outputs at zero, and selected the apparent best expert. The average true error of the selected expert was computed over 50,000 runs of the above procedure, and was compared with the theoretical approximation of Eq. 9. Fig. 1 (right) shows this comparison for two validation set sizes, n_V = 200 and n_V = 400, as a function of n_VB. In the considered case, when the m(m − 1)/2 pairwise correlations are identical, the approximation turns out to be very good.
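The following sketch illustrates one possible way to reproduce this kind of experiment (ours; the single-class Gaussian output model and all names are assumptions, not the exact setup used for Fig. 1, right): the experts' soft outputs are drawn from an equicorrelated Gaussian with unit variances and per-expert means chosen so that thresholding at zero yields the desired e_Gi.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def simulate_p_sel_corr(e_g, n_v, rho_s, n_runs=50_000):
    """Empirical expected error of best-expert selection when all experts share one
    validation set and their soft outputs have identical pairwise correlation rho_s.
    Outputs are Gaussian with unit variance; an expert errs when its output is < 0."""
    e_g = np.asarray(e_g, dtype=float)
    m = len(e_g)
    mu = norm.ppf(1.0 - e_g)                      # mean chosen so that P(output < 0) = e_Gi
    cov = np.full((m, m), rho_s) + (1.0 - rho_s) * np.eye(m)
    p_sel = 0.0
    for _ in range(n_runs):
        outputs = rng.multivariate_normal(mu, cov, size=n_v)        # n_v x m soft outputs
        errors = (outputs < 0).sum(axis=0)                          # validation error counts
        best = rng.choice(np.flatnonzero(errors == errors.min()))   # random tie-breaking
        p_sel += e_g[best]
    return p_sel / n_runs

e_g = [0.015, 0.02, 0.04, 0.07, 0.09, 0.11, 0.13]
print(simulate_p_sel_corr(e_g, n_v=200, rho_s=0.5))
```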

3 Accuracy of Linear Expert Fusion

To pursue our original goal, analytical expressions of the expected error probability of trainable fusion rules are needed, besides that of best expert selection. In the following we focus on the well-known and widely used linear combination of soft outputs, whose expected error probability has already been analytically approximated in previous works, for the case when it is implemented with the Fisher Linear Discriminant (FLD) for two-class classification problems. Denoting as y ∈ {−1, +1} the class labels, and as $y_i(\mathbf{x}) \in \mathbb{R}$ the soft output of C_i, this fusion rule is defined as $f(\mathbf{x}) = \sum_{i=1}^{m} w_i y_i(\mathbf{x}) + w_0$, where $w_i \in \mathbb{R}$, i = 0, ..., m, and the decision function is sign(f(x)). Let $w = (w_1, \ldots, w_m)^\top$; let the column vectors $\hat{\mu}_1$ and $\hat{\mu}_2$ denote the m-dimensional means of the experts' soft outputs on class 1 and class 2, estimated on validation samples, and let $\hat{\Sigma}_1$ and $\hat{\Sigma}_2$ denote the estimates of the corresponding covariance matrices; moreover, let $\hat{\Sigma} = \tfrac{1}{2}(\hat{\Sigma}_1 + \hat{\Sigma}_2)$, and let $\hat{\Sigma}_F = (1 - \lambda)\hat{\Sigma} + \lambda D$ denote the estimate of the regularized pooled sample covariance matrix, where D is a diagonal matrix obtained from the diagonal of $\hat{\Sigma}$, and 0 ≤ λ ≤ 1 is a regularization parameter. Using the FLD, w and w_0 are computed as $w = \hat{\Sigma}_F^{-1}(\hat{\mu}_1 - \hat{\mu}_2)$ and $w_0 = -\tfrac{1}{2}(\hat{\mu}_1 + \hat{\mu}_2)^\top w$. The expected error probability is then [4]:

$$ P_{FLD} = \Phi\left( -\frac{\delta}{2\sqrt{T_\mu T_\Sigma}} \right) , \qquad (10) $$

where δ denotes the Mahalanobis distance between the two classes in the space of the experts' outputs, and the scalars $T_\mu = 1 + \tfrac{2m}{n_V \delta^2}$ and $T_\Sigma = \tfrac{n_V}{n_V - m}$ account for the inexact, sample-based estimates of the mean vectors μ_1 and μ_2 and of the covariance matrix Σ, respectively. Expression (10) turns out to be the best approximation known so far of the expected error probability of the FLD combiner [11,9].
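As an illustration of the quantities introduced above, the following sketch (ours; the function names and the toy usage data are assumptions) estimates the FLD weights w and w_0 from validation outputs and evaluates the analytical approximation of Eq. 10:

```python
import numpy as np
from scipy.stats import norm

def fld_combiner(y1, y2, lam=0.2):
    """Regularized Fisher Linear Discriminant combiner trained on soft outputs.
    y1, y2: arrays of shape (n1, m) and (n2, m) with the validation outputs of the
    m experts on class 1 and class 2; lam: regularization parameter (0 <= lam <= 1).
    Returns the weight vector w and the bias w0 of f(x) = w . y(x) + w0."""
    mu1, mu2 = y1.mean(axis=0), y2.mean(axis=0)
    sigma = 0.5 * (np.cov(y1, rowvar=False) + np.cov(y2, rowvar=False))
    sigma_f = (1.0 - lam) * sigma + lam * np.diag(np.diag(sigma))   # regularized estimate
    w = np.linalg.solve(sigma_f, mu1 - mu2)
    w0 = -0.5 * (mu1 + mu2) @ w
    return w, w0

def p_fld(delta, m, n_v):
    """Analytical approximation of the expected error of the FLD combiner (Eq. 10)."""
    t_mu = 1.0 + 2.0 * m / (n_v * delta**2)
    t_sigma = n_v / (n_v - m)
    return norm.cdf(-delta / (2.0 * np.sqrt(t_mu * t_sigma)))

# Toy usage with synthetic validation outputs (illustrative values only).
rng = np.random.default_rng(0)
y1 = rng.normal(0.5, 1.0, size=(100, 9))    # class-1 validation outputs
y2 = rng.normal(-0.5, 1.0, size=(100, 9))   # class-2 validation outputs
w, w0 = fld_combiner(y1, y2, lam=0.2)
print(p_fld(delta=2.56, m=9, n_v=200))
```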

Table 1. Numerical comparison between P_FLD (Eq. 10) and P_SEL^corr (Eq. 9), for different classifier ensembles (see text for the details)

e_G1   e_Gi, i>1   m     n_V   ρ_S    ρ_C    P_FLD   P_SEL^corr
0.1    0.15        9     100   0.50   0.26   0.086   0.120
0.1    0.15        9     100   0.70   0.42   0.111   0.117
0.1    0.15        9     100   0.90   0.66   0.108   0.111
0.1    0.15        9     100   0.95   0.75   0.080   0.108
0.1    0.15        9     100   0.99   0.87   0.008   0.103
0.1    0.15        20    100   0.50   0.26   0.101   0.124
0.1    0.15        20    200   0.50   0.26   0.084   0.119
0.1    0.15        20    500   0.50   0.26   0.075   0.107
0.1    0.15        100   200   0.50   0.28   0.168   0.125
0.1    0.15        100   500   0.50   0.28   0.099   0.114
0.1    0.20        100   200   0.50   0.30   0.201   0.117
0.1    0.20        100   500   0.50   0.30   0.127   0.101

4 Analytical and Empirical Comparison: An Example

In the following we show an example of how, exploiting the above results, one can compare the performance attained by the two considered design choices (selecting the best expert, and combining the available ones with the FLD fusion rule), aimed at understanding the conditions under which one is preferable to the other. We remind the reader that the eventual, practical outcome of such an investigation is the derivation of guidelines for the design of pattern classifiers, analogous, e.g., to the guidelines derived in [2] for the choice between simple and weighted average fusion rules in ensemble design. We first carry out a numerical comparison of the analytical expressions of P_SEL^corr and P_FLD (Eqs. 9 and 10, respectively). We then carry out an empirical comparison using artificial data, which allows one to assess the validity of the conclusions that can be drawn from an analytical comparison, also when the underlying assumptions are not satisfied.

Numerical comparison. In this example we consider different ensembles made up of m = 9, 20, and 100 experts, in which the true best expert (say, C_1) has an error probability e_G1 = 0.1, and the remaining ones have identical error probabilities e_G2 = ... = e_Gm = 0.15 or 0.20. We also consider validation set sizes n_V = 200, 400, 1000, different pairwise correlations ρ_S between the experts' soft outputs, and the corresponding pairwise correlations ρ_C between their classification outcomes, as shown in the left-most columns of Table 1. The right-most columns show the values of P_FLD (Eq. 10) and of P_SEL^corr (Eq. 9).

The most evident fact from the example in Table 1 is that the FLD trainable combiner tends to outperform the best expert selection for smaller ensemble

sizes m, while the opposite happens for larger m. As one can expect, increasing the validation set size n_V is beneficial for both solutions; instead, for a fixed n_V, increasing the ensemble size is detrimental, especially for the FLD trainable combiner. On the other hand, no clear pattern related to the effect of the correlations emerges from this simple example: a wider investigation is required to this end.

Experiments on artificial data. In this example we consider an ensemble of m = 9 experts for a two-class problem. We artificially generated their soft outputs from Gaussian distributions with identical covariance matrices Σ_1 = Σ_2, such that [e_G1, ..., e_G9] = [0.139, 0.083, 0.021, 0.019, 0.021, 0.025, 0.026, 0.027, 0.038] and ρ_C = 0.3. We therefore set n_VA = 0.3 n_V. For the FLD combiner, the regularization parameter was set to λ = 0.2. In Fig. 2 we show the error probability of best expert selection and of the FLD combiner, as functions of the validation set size, assuming that the number of validation samples of the two classes is the same. The empirical values were computed as averages over 1,000 independent runs of the above procedure. The theoretical values were approximated using Eqs. 9 and 10. In this example the FLD combiner outperforms the best expert selection notably (note that in this case n_V ≫ m). The theoretical values of P_FLD (Eq. 10), not reported here, turned out to be an almost exact approximation of the empirical values (curve 3 in Fig. 2). It can also be seen that the approximation of P_SEL^corr by Eq. 9 exhibits a slightly lower, but still good, accuracy than in the example of Fig. 1 (right).

Fig. 2. Expected error probability of best expert selection (1: empirical, 2: theoretical approximation with Eq. 9) and of the FLD combiner (3: empirical), as functions of the validation set size n_V
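A possible sketch of this experimental protocol (ours; the distribution parameters below are placeholders rather than the exact values behind Fig. 2): Gaussian soft outputs with a common covariance matrix are generated for the m = 9 experts, the apparent best expert is selected on the shared validation set, the regularized FLD combiner of Sect. 3 is trained on the same outputs, and the true errors of both solutions are computed from the known Gaussian model.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
m, lam, n_runs = 9, 0.2, 1_000
mu = np.linspace(1.0, 2.2, m)                     # placeholder class-1 means of the soft outputs
cov = 0.3 * np.ones((m, m)) + 0.7 * np.eye(m)     # placeholder common covariance (correlation 0.3)
e_g = norm.cdf(-mu / np.sqrt(np.diag(cov)))       # true error of each individual expert

def run(n_v):
    err_sel = err_fld = 0.0
    for _ in range(n_runs):
        y1 = rng.multivariate_normal(mu, cov, size=n_v // 2)    # class-1 validation outputs
        y2 = rng.multivariate_normal(-mu, cov, size=n_v // 2)   # class-2 validation outputs
        # Best expert selection on the shared validation set (random tie-breaking).
        errors = (y1 < 0).sum(axis=0) + (y2 > 0).sum(axis=0)
        err_sel += e_g[rng.choice(np.flatnonzero(errors == errors.min()))]
        # Regularized FLD combiner (Sect. 3) trained on the same validation outputs.
        m1, m2 = y1.mean(axis=0), y2.mean(axis=0)
        sigma = 0.5 * (np.cov(y1, rowvar=False) + np.cov(y2, rowvar=False))
        sigma_f = (1 - lam) * sigma + lam * np.diag(np.diag(sigma))
        w = np.linalg.solve(sigma_f, m1 - m2)
        w0 = -0.5 * (m1 + m2) @ w
        sf = np.sqrt(w @ cov @ w)                 # true std of f(x) on either class
        err_fld += 0.5 * (norm.cdf(-(w @ mu + w0) / sf) + norm.cdf((w0 - w @ mu) / sf))
    return err_sel / n_runs, err_fld / n_runs

for n_v in (200, 400, 600, 800, 1000):
    print(n_v, run(n_v))
```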

5 Concluding Remarks

In this paper we started the investigation of a relevant open issue related to pattern classifier design, and to multiple classifier systems in particular: determining the conditions under which the selection of the apparent best expert, out of a given ensemble, is a better solution than combining all the available experts using a trainable fusion rule. The main contribution of this paper is the derivation of the first analytical expression known so far of the expected error probability of the best expert selection strategy, capable of taking into account small sample size effects due to the use of a finite set of hold-out samples for performance estimation. We also developed a simple model that accounts for the pairwise correlations between the experts' misclassifications, which in practical settings are unlikely to be statistically independent. Our results can be exploited in future work for a thorough analytical comparison with the expected error probability of trained combiners. In this paper we gave an example of such a comparison, involving the well-known and widely used FLD linear combiner. Our example pointed out the role of the validation set size, the ensemble size, and the correlations between the experts in determining which of the considered design choices is more effective in finite sample size situations. The development of more accurate models of the error probability of best expert selection, taking into account the experts' correlations, as well as the analytical derivation of the expected error probability of other trained combiners, are relevant issues for future work.

Acknowledgment. This research was funded by grant MIP 057/2013 from the Research Council of Lithuania.

References

1. Duin, R.P.W.: The Combining Classifier: To Train or Not to Train? In: Proc. 16th Int. Conf. Pattern Recognition, vol. II, pp. 765–770 (2002)
2. Fumera, G., Roli, F.: A Theoretical and Experimental Analysis of Linear Combiners for Multiple Classifier Systems. IEEE Trans. Pattern Analysis and Machine Intelligence 27(6), 942–956 (2005)
3. Rao, N.S.V.: On fusers that perform better than best sensor. IEEE Trans. Pattern Analysis and Machine Intelligence 23(8), 904–909 (2001)
4. Raudys, S.: On the amount of a priori information in designing the classification algorithm. Engineering Cybernetics N4, 168–174 (1972) (in Russian)
5. Raudys, S.: Statistical and Neural Classifiers. Springer, London (2001)
6. Raudys, S.: Experts' Boasting in Trainable Fusion Rules. IEEE Trans. Pattern Analysis and Machine Intelligence 25(9), 1178–1182 (2003)
7. Raudys, S.: Trainable fusion rules. I. Large sample size case. Neural Networks 19, 1506–1516 (2006); Trainable fusion rules. II. Small sample-size effects. Neural Networks 19, 1517–1527 (2006)
8. Raudys, S.: Portfolio of automated trading systems: Complexity and learning set size issues. IEEE Trans. Neural Networks and Learning Systems 24(3), 448–459 (2013)
9. Takeshita, T., Toriwaki, J.: Experimental study of performance of pattern classifiers and the size of design samples. Patt. Rec. Lett. 16, 307–312 (1995)
10. Vapnik, V.N.: The Nature of Statistical Learning Theory. Springer, Berlin (1995)
11. Wyman, F., Young, D., Turner, D.: A comparison of asymptotic error rate expansions for the sample linear discriminant function. Patt. Rec. 23, 775–783 (1990)