Automatic Bayes Carpentry Using Unlabeled Data in Semi-Supervised Classification

Hui Zou Department of Statistics Stanford University Stanford, CA 94305 [email protected]

Ji Zhu Department of Statistics University of Michigan Ann Arbor, MI 48109 [email protected]

Trevor Hastie Department of Statistics Stanford University Stanford, CA 94305 [email protected]

Abstract

In semi-supervised classification one important question is how to use the enormous amount of unlabeled data to improve the classifier built on the labeled data. We point out that in situations where the labeled data are collected by biased sampling rather than random sampling, supervised learning algorithms are no longer Bayes consistent, due to an inherent bias. However, we show that the unlabeled data can be used adaptively to infer the sampling scheme and estimate the bias. Using the inferred result, we propose a universal carpentry technique that ensures Bayes consistency for the most popular supervised classifiers, including boosting and kernel machines. A direct bias-correction formula is also presented for estimation-calibrated classifiers. We demonstrate the efficacy of the proposed carpentry methods by simulation.

1 Introduction

In the standard two-class classification problem, let y ∈ {1, −1} denote the class label and X = (X_j), j = 1, 2, . . . , p, denote the features. A classifier f is a measurable function f : 𝒳 → R, where 𝒳 is the feature space, and the classification rule assigns the label Sign(f(X)) to an observation with features X. The primary goal in classification is to achieve a minimum misclassification error on unseen future data. Under the 0-1 loss, the smallest error is achieved by the Bayes rule, which is usually unknown. Supervised learning builds a classifier from a given training data set. In order to approximate the Bayes rule well, supervised classification usually requires a fairly large training set, which can be very expensive to obtain in many applications. For example, in text document classification an enormous number of unlabeled documents can be obtained cheaply; some of them are then chosen and labeled by human experts to serve as the training data for a supervised classifier. Labeling a document is time-consuming and costly, so the size of the labeled set is limited. In this kind of scenario, it is natural to ask whether the huge number of unlabeled documents can help the classification. There is a rather optimistic view in the machine learning literature. For example, Nigam et al. [9] combined the Expectation-Maximization (EM) algorithm and a naive Bayes classifier to learn from both labeled and unlabeled documents in text classification.

They claimed that the use of unlabeled data reduced the classification error by up to 30%. Another popular algorithm is the transductive support vector machine [4]. However, the justification for combining labeled and unlabeled data is still unclear. Cozman et al. [2] showed that in semi-supervised learning of mixture models the unlabeled data can lead to an increase in classification error if the model assumptions are not exactly satisfied. Zhang & Oles [14] also pointed out that the transductive SVM is unreliable, since it may maximize the "wrong margin", and declared that "the success reported in the literature might be due to their specific experimental setups rather than the general advantage of a transductive SVM versus a supervised SVM". Considering these two opposite opinions, we feel it is safe to say that when and how to combine the unlabeled data with the labeled data is still an open problem in semi-supervised learning. In this work we study the use of unlabeled data from a different point of view. We consider the scenario where the labeled data are collected by biased sampling rather than random sampling. When biased sampling occurs, supervised learning algorithms produce inconsistent classifiers due to an inherent bias. However, we propose that the unlabeled data can be used in these algorithms to adaptively eliminate the bias, thus guaranteeing the Bayes consistency of the classifiers. Hence the unlabeled data are always helpful. We call this property automatic Bayes carpentry. In the next section, we expose the role of the random sampling assumption in the consistency of modern supervised classifiers, and then show the inconsistency caused by biased sampling. Section 3 describes a universal carpentry technique for the most popular supervised classifiers, including boosting and kernel machines. A direct bias-correction formula is also available for estimation-calibrated classifiers. In Section 4, we illustrate the proposed carpentry methods by simulation.

2 Random Sampling vs. Biased Sampling

Before delving into the details of our automatic Bayes carpentry techniques, we briefly review the consistency result of supervised classifiers and the issue of biased sampling. Let p(y, X) be a probability measure and

\delta(X) = \log\frac{p(y=1\mid X)}{p(y=-1\mid X)}.

Suppose p(y, X) is the distribution generating future data in the classification problem; then Sign(δ(X)) is the Bayes rule. The classification error of a classifier f, assuming the 0-1 loss, is

R_f = \int I\big(y \neq \mathrm{Sign}(f(X))\big)\, dp(y, X). \qquad (1)

The classification error of Sign(δ(X)) is denoted by R_Bayes.
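As a worked illustration of the Bayes rule δ(X) and the error (1) (not part of the original text), the snippet below evaluates both numerically for a hypothetical one-dimensional problem with Gaussian class-conditionals; the means, variance, and prior are made-up values.

```python
import numpy as np
from scipy.stats import norm

pi = 0.6                            # hypothetical prior p(y = 1)
p_pos = norm(loc=1.0, scale=1.0)    # hypothetical p(X | y = 1)
p_neg = norm(loc=-1.0, scale=1.0)   # hypothetical p(X | y = -1)

def delta(x):
    """Log posterior odds: log p(y=1|x) - log p(y=-1|x)."""
    return (np.log(pi) + p_pos.logpdf(x)) - (np.log(1 - pi) + p_neg.logpdf(x))

# The Bayes rule predicts Sign(delta(x)); its error is E[min(p(y=1|X), p(y=-1|X))],
# approximated here by numerical integration over a grid.
x = np.linspace(-8.0, 8.0, 20001)
p_x = pi * p_pos.pdf(x) + (1 - pi) * p_neg.pdf(x)
post_pos = pi * p_pos.pdf(x) / p_x
bayes_error = np.trapz(np.minimum(post_pos, 1.0 - post_pos) * p_x, x)
print(f"Bayes error approx {bayes_error:.4f}")
```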

2.1 Consistency and random sampling

A learning algorithm produces a classifier f̂_n based on training data (y_i, X_i), i = 1, 2, . . . , n. f̂_n is said to be Bayes consistent if R_{f̂_n} → R_Bayes in probability as n → ∞. Typically the training data (y_i, X_i) are assumed to be i.i.d. samples from p(y, X), which is referred to as random sampling in the literature. The Bayes consistency of f̂_n is well studied under the random sampling assumption. Although it is often assumed to hold without much care, the random sampling assumption is crucial for the consistency of f̂_n. To illustrate this point clearly, we examine the consistency of some popular classifiers used in practice: kernel machines and boosting. These classifiers are derived from the Loss + Penalty paradigm

\hat{f}_n = \arg\min_f \frac{1}{n}\sum_{i=1}^{n} \phi(y_i, f(X_i)) + \lambda_n J(f). \qquad (2)

Usually φ is a convex surrogate for the 0-1 loss, which ensures a computationally tractable algorithm. If the regularization term is J(f) = ||f||²_{H_K}, where H_K is a reproducing kernel Hilbert space (RKHS) with reproducing kernel K (Wahba [13]), then (2) is a kernel machine. For example, the support vector machine (SVM) uses the hinge loss and a reproducing kernel. Boosting minimizes the empirical φ risk using a different regularization method. Let H = {h_t, t = 1, 2, . . . , m} be a dictionary of basis functions and f(X) = Σ_{t=1}^{m} β_t h_t(X); then it is shown [3,7,10] that J(f) = Σ_{t=1}^{m} |β_t| in an idealized version of boosting. The exact form of regularization is not crucial in our arguments.
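To make the Loss + Penalty paradigm (2) concrete, here is a minimal sketch assuming the logistic loss φ(y, f) = log(1 + exp(−yf)) and a ridge penalty on a linear model; the function name, learning rate, and plain gradient descent are our illustrative choices, standing in for the specialized solvers behind kernel machines and boosting.

```python
import numpy as np

def fit_penalized_linear(X, y, lam=0.1, lr=0.1, n_iter=2000):
    """Minimize (1/n) * sum_i log(1 + exp(-y_i * x_i'beta)) + lam * ||beta||^2,
    a linear instance of the Loss + Penalty paradigm (2)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        margins = np.clip(y * (X @ beta), -30, 30)
        # gradient of the empirical logistic loss: mean of -y_i * x_i / (1 + exp(margin_i))
        grad_loss = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
        beta -= lr * (grad_loss + 2 * lam * beta)
    return beta  # classify with Sign(x'beta)
```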

Definition 1 Consider the infinite-sample φ risk E[φ(y, f(X))], where the expectation is taken with respect to some probability measure p(y, X). Let

f_\infty = \arg\min_f E[\phi(y, f(X))]. \qquad (3)

If Sign(f_∞) = Sign(δ(X)), φ is said to be classification-calibrated. The notion of a classification-calibrated loss is adopted from Bartlett [1]; it is also called Fisher consistency in Lin [5]. It is shown [11,15] that under the random sampling assumption, the classification-calibrated condition is sufficient for the consistency of f̂_n. The above arguments can be understood as follows. As n → ∞, the empirical φ risk converges to E_Tr[φ(y, f(X))], where the expectation is taken with respect to the distribution of the training data. The random sampling assumption implies that E_Tr[φ(y, f(X))] is the infinite-sample φ risk E[φ(y, f(X))]. Thus f_∞ is the target function that f̂_n tries to estimate. Some technical conditions (λ_n vanishes at the right rate and H_K (or H) is rich enough) ensure the convergence R_{f̂_n} → R_{f_∞}. Note that Sign(f_∞) is actually the Bayes rule of the target distribution p(y, X) by the classification-calibrated condition. Therefore f̂_n is Bayes consistent.

2.2 Problem of biased sampling

As shown in the previous subsection, random sampling determines the target function of f̂_n, and the classification-calibrated condition ensures that this target function is the Bayes rule. It turns out that all popular convex losses used in practice are classification-calibrated. However, the random sampling assumption is not necessarily true in many situations. Let n_+ and n_- be the sample sizes of classes {1} and {−1}, respectively. Often n_+/n and n_-/n are set by the experimenter and are not random variables faithfully representing p(y = 1) and p(y = −1). Let us demonstrate this argument by considering the retrospective sampling frequently used in medical studies [8]. Suppose class {−1} indicates some disease and class {1} means healthy. The features X are the medical measurements of each person. Since the disease is a rare event, random sampling is inefficient for gathering data to study the relation between the disease and X. Instead, a much more efficient retrospective sampling is often conducted, in which all available patients are included together with a similarly sized sample of healthy people. We refer to the sampling scheme described above generally as biased sampling: a number of samples are drawn at random from each class, while the weight of class {1} corresponds to a chosen prior π_c. Recall that the labeled data in semi-supervised learning are labeled by humans; it is quite likely that π_c is not the true prior p(y = 1), and thus the random sampling assumption is violated. Let p_c(y, X) be the distribution of the collected training samples. Given the label y, the features X follow the true conditional distribution p(X|y). Then p_c(y, X) is given by

p_c(y=1, X) = p(X\mid y=1)\,\pi_c \quad \text{and} \quad p_c(y=-1, X) = p(X\mid y=-1)(1-\pi_c). \qquad (4)

Thus we can define

\delta_c(X) = \log\frac{p_c(y=1\mid X)}{p_c(y=-1\mid X)} = \log\frac{p(X\mid y=1)\,\pi_c}{p(X\mid y=-1)(1-\pi_c)}. \qquad (5)

Now suppose we still use the same algorithm as in subsection 2.1 to build f̂_n. Since p_c(y, X) is the distribution of the training data, as n → ∞ the limit of the empirical φ risk becomes

\int \phi(y, f(X))\, dp_c(y, X). \qquad (6)

φ is classification-calibrated, so the sign of the minimizer of (6) equals Sign(δ_c(X)). Hence f̂_n approximates Sign(δ_c(X)). However, the true distribution is p(y, X) and the Bayes rule is Sign(δ(X)), which implies that f̂_n approaches the wrong target function. Moreover, we see that δ(X) = log[ p(X|y=1)π / (p(X|y=−1)(1−π)) ] and

\delta_c(X) - \delta(X) = \log\frac{\pi_c (1-\pi)}{\pi (1-\pi_c)}. \qquad (7)

δ_c(X) − δ(X) is the bias due to the biased sampling. Equation (7) shows that the bias is independent of p(X|y). When π_c ≠ π, R_{δ_c(X)} > R_{δ(X)} = R_Bayes, since the Bayes rule is unique. We conclude that under biased sampling, f̂_n is asymptotically sub-optimal. It is important to note that the bias is inherent for two reasons. First, the bias exists for every algorithm that is consistent under random sampling; second, the bias cannot be removed using only the labeled data unless the learner has extra knowledge of π.
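A quick numeric check of (7), using priors that mirror the simulation in Section 4 (π = 0.75 with a balanced labeled sample, so π_c = 0.5); the reading of the bias as a shifted threshold on p(y = 1|X) is our own rephrasing of the same fact.

```python
import numpy as np

pi_true = 0.75   # true prior p(y = 1), as in the Section 4 simulation
pi_c = 0.50      # prior implied by a balanced labeled sample

# The bias (7) is a constant, independent of X.
bias = np.log(pi_c * (1 - pi_true) / (pi_true * (1 - pi_c)))
print(f"delta_c(X) - delta(X) = {bias:.3f}")

# Thresholding delta_c(X) at 0 amounts to predicting +1 only when
# p(y=1|X) exceeds sigmoid(-bias) instead of the Bayes threshold 1/2.
threshold = 1.0 / (1.0 + np.exp(bias))
print(f"implicit threshold on p(y=1|X): {threshold:.2f}  (Bayes: 0.50)")
```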

3 Automatic Bayes Carpentry

In this section we show that the bias can be eliminated if one uses the unlabeled data appropriately. In this sense the unlabeled data are always helpful.

3.1 Inferring the correct prior using unlabeled data

We construct an estimate π̂ using the unlabeled data such that

\hat{\pi} \to \pi \quad \text{in probability as } n \to \infty. \qquad (8)

Let us consider a nonparametric estimate for π via moment equations. Consider the expectation of a function h(X): E_p[h] = ∫ h(X) p(X) dX. By Bayes' theorem,

p(X) = p(X\mid y=1)\,\pi + p(X\mid y=-1)(1-\pi), \qquad (9)

so we have

E_p[h] = \pi\, E_{p(X\mid y=1)}[h] + (1-\pi)\, E_{p(X\mid y=-1)}[h]. \qquad (10)

Hence if E_{p(X|y=1)}[h] ≠ E_{p(X|y=−1)}[h], we have the identity

\pi = \frac{E_p[h] - E_{p(X\mid y=-1)}[h]}{E_{p(X\mid y=1)}[h] - E_{p(X\mid y=-1)}[h]}. \qquad (11)

Using the empirical distribution (or the sample moments), the unlabeled data give a consistent estimate of E_p[h], while E_{p(X|y=1)}[h] and E_{p(X|y=−1)}[h] can be consistently estimated from the labeled data. Therefore, a simple consistent estimate of π is given by

\hat{\pi} = \frac{\hat{E}_p[h] - \hat{E}_{p(X\mid y=-1)}[h]}{\hat{E}_{p(X\mid y=1)}[h] - \hat{E}_{p(X\mid y=-1)}[h]}. \qquad (12)

The moment estimate for π is fully non-parametric, requiring no assumption on p(X) or p(X|y). It is also worth pointing out that we can use any subset of X in (12), e.g. h(X) = h(X1 ), thus avoiding the curse of dimensionality in estimating π. The quality of the estimate may depend on the choice of h. It is also possible to take the average of several estimates using different h functions to get a more stable estimate π ˆ.
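A sketch of the moment estimate (12); the function name is ours, and the example choice h(X) = X_1 − X_2 is one of the h functions used later in Section 4.

```python
import numpy as np

def estimate_prior(h, X_unlabeled, X_pos, X_neg):
    """Moment estimate (12) of pi = p(y = 1).

    h           : function mapping a feature matrix to a 1-D array of h(X) values
    X_unlabeled : unlabeled features, treated as draws from p(X)
    X_pos, X_neg: labeled features from class {1} and {-1}, i.e. from p(X|y = +-1)
    """
    e_p = h(X_unlabeled).mean()   # estimates E_p[h]
    e_pos = h(X_pos).mean()       # estimates E_{p(X|y=1)}[h]
    e_neg = h(X_neg).mean()       # estimates E_{p(X|y=-1)}[h]
    return (e_p - e_neg) / (e_pos - e_neg)

# Example with h(X) = X_1 - X_2 (as in Section 4); clipping into (0, 1) is a
# practical safeguard not discussed in the text:
# pi_hat = np.clip(estimate_prior(lambda X: X[:, 0] - X[:, 1],
#                                 X_unlabeled, X_pos, X_neg), 1e-3, 1 - 1e-3)
```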

3.2 Universal carpentry via weighted loss

Once a consistent π̂ is obtained, Bayes consistency can be ensured by a weighted loss technique, as long as the loss is classification-calibrated.

Theorem 1 Suppose π̂ is the estimate of π given in (8). Let n_+ and n_- denote the numbers of labeled data in classes {1} and {−1}, respectively, and let n = n_+ + n_-. Let φ be a classification-calibrated loss and define

\hat{f}_n^w = \arg\min_f \frac{1}{n}\sum_{i=1}^{n} w_{y_i}\, \phi(y_i, f(X_i)) + \lambda_n J(f), \qquad (13)

where w_1 = nπ̂/n_+ and w_{−1} = n(1−π̂)/n_-. Assuming the technical conditions stated in subsection 2.1, f̂_n^w is Bayes consistent.

Proof We can rewrite the weighted loss as

\frac{1}{n}\sum_{i=1}^{n} w_{y_i}\,\phi(y_i, f(X_i))
  = w_1\,\frac{n_+}{n}\cdot\frac{1}{n_+}\sum_{i:\,y_i=1}\phi(1, f(X_i)) \;+\; w_{-1}\,\frac{n_-}{n}\cdot\frac{1}{n_-}\sum_{i:\,y_i=-1}\phi(-1, f(X_i))
  = \hat{\pi}\,\frac{1}{n_+}\sum_{i:\,y_i=1}\phi(1, f(X_i)) \;+\; (1-\hat{\pi})\,\frac{1}{n_-}\sum_{i:\,y_i=-1}\phi(-1, f(X_i))
  \to \pi\,E_c[\phi(1, f(X))\mid y=1] \;+\; (1-\pi)\,E_c[\phi(-1, f(X))\mid y=-1],

where E_c denotes the expectation under p_c(X|y). Since p_c(X|y) = p(X|y) and π = p(y = 1), we have

\frac{1}{n}\sum_{i=1}^{n} w_{y_i}\,\phi(y_i, f(X_i)) \to E[\phi(y, f(X))], \qquad (14)

where E[·] is now taken under the target distribution p(y, X). Thus f̂_n^w has the right target function. The rest of the consistency proof is the same as in the random sampling case. □

The weighted loss technique has its roots in statistical theory. Note that the weights in (13) are actually estimates of the likelihood ratio p(y_i, X_i)/p_c(y_i, X_i); thus the technique works as importance sampling. It is straightforward to show that the weighted loss technique works not only for biased sampling but for any sampling scheme, as long as we can consistently estimate the likelihood ratio.

Remark 1 Vardi [12] considered the non-parametric maximum likelihood estimate of a CDF F on the basis of samples from a weighted version of F, with a known weight function. Vardi's method replaces the usual non-parametric maximum likelihood criterion with its weighted version.

Remark 2 While finishing this paper, we noticed that Lin et al. [6] proposed a similar technique for SVMs under nonstandard situations. In their work the weights are assumed known. Our paper makes two additional significant contributions: (1) the technique works for all classification-calibrated loss functions, not just the hinge loss of the SVM, and (2) the weights are adaptively provided by the unlabeled data.
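Returning to the weighted loss (13): a minimal sketch of the weighted SVM, assuming scikit-learn's SVC with an RBF kernel and per-observation sample weights as a stand-in for the authors' own fitting code; the same weights can be passed to any estimator that accepts observation weights, e.g. a kernel logistic regression solver.

```python
import numpy as np
from sklearn.svm import SVC

def fit_weighted_svm(X_labeled, y_labeled, pi_hat, **svc_kwargs):
    """Weighted SVM (WSVM): hinge loss reweighted as in (13) with
    w_{+1} = n * pi_hat / n_+ and w_{-1} = n * (1 - pi_hat) / n_-."""
    y = np.asarray(y_labeled)
    n = len(y)
    n_pos = int(np.sum(y == 1))
    n_neg = n - n_pos
    w = np.where(y == 1, n * pi_hat / n_pos, n * (1 - pi_hat) / n_neg)
    clf = SVC(kernel="rbf", **svc_kwargs)
    clf.fit(X_labeled, y, sample_weight=w)
    return clf
```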

3.3 Estimation-calibrated loss

In practice, for many loss functions φ, not only does Sign(f_∞) agree with the Bayes rule, but f_∞ itself is also related to p(y = 1|X) through a deterministic function.

Definition 2 A loss function φ is said to be estimation-calibrated with ψ if ψ(f_∞) = p(y = 1|X) for some continuous nondecreasing function ψ : R → [0, 1] such that ψ(0) = 1/2, ψ(a) < 1/2 ∀a < 0, and ψ(a) > 1/2 ∀a > 0.

It was shown that under the technical conditions in subsection 2.1, f̂_n converges to f_∞ in probability [1,11,15]. On the other hand, we know that

\frac{p_c(y=1\mid X)}{1-p_c(y=1\mid X)} = \frac{p(y=1\mid X)}{1-p(y=1\mid X)} \times \frac{\pi_c}{1-\pi_c} \times \frac{1-\pi}{\pi}.

These two facts suggest the following bias-correction method. Its proof is based on basic probability arguments and is thus omitted.

Lemma 1 Suppose φ is estimation-calibrated with ψ. Then Sign(ψ(f̂_n) − r̂/(1 + r̂)) is Bayes consistent, assuming the technical conditions in subsection 2.1, where r̂ = n_+(1 − π̂)/(n_− π̂).

It is easy to see that any estimation-calibrated loss function must be classification-calibrated, but the reverse is not always true. For example, the population minimizer of the hinge loss is exactly the Bayes rule, providing no information about p(y = 1|X) [5]. This is actually regarded as a major drawback of SVMs [3,15,16]. The next lemma describes a family of estimation-calibrated loss functions that includes almost all popular smooth loss functions used in practice, and gives the explicit expression of ψ for a given φ. Hence Lemma 1 is an applicable result.

Lemma 2 Let φ be a margin-based loss function, i.e., φ(y, f(X)) = φ(yf(X)). Suppose φ is convex and φ′ is continuous. Then φ is estimation-calibrated if and only if φ′(0) < 0.

Sketch of proof For the necessary condition, note that any estimation-calibrated φ must be classification-calibrated; the convexity of φ then requires φ′(0) < 0 by Theorem 6 in Bartlett [1]. For the sufficient condition, let a* = inf{a : φ′(a) = 0}, with a* = ∞ if the set is empty. φ′(0) < 0 implies a* > 0. Fix an X; f_∞(X) minimizes g(a) = p(y = 1|X)φ(a) + p(y = −1|X)φ(−a). Differentiating g, we have p(y = 1|X)φ′(f_∞(X)) = p(y = −1|X)φ′(−f_∞(X)), and thus f_∞(X) ∈ (−a*, a*) if 0 < p(y = 1|X) < 1. We show that φ′(−a)/φ′(a) is increasing in a on (−a*, a*). Let a_1 > a_2; then φ′(−a_1) < φ′(−a_2) and φ′(a_1) > φ′(a_2). Note that φ′(a) < 0 ∀a ∈ (−a*, a*), so φ′(−a_1)/φ′(a_1) > φ′(−a_2)/φ′(a_2). Define ψ as follows:

\psi(a) = \frac{\phi'(-a)}{\phi'(-a) + \phi'(a)}, \quad a \in (-a^*, a^*).

If a* < ∞, let ψ(a) = 1 ∀a ≥ a*, and ψ(a) = 0 ∀a ≤ −a*. Then ψ is the desired function. □
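For instance, for the logistic loss φ(v) = log(1 + e^{−v}) the formula of Lemma 2 simplifies to ψ(a) = 1/(1 + e^{−a}), so the correction of Lemma 1 (the KLR-EC classifier of Section 4) can be sketched as follows; the helper names are ours, and f_hat_values stands for the fitted classifier evaluated at the test points.

```python
import numpy as np

def psi_logistic(a):
    """psi(a) = phi'(-a) / (phi'(-a) + phi'(a)) for the logistic loss,
    which simplifies to the sigmoid of a."""
    return 1.0 / (1.0 + np.exp(-a))

def bias_corrected_predict(f_hat_values, n_pos, n_neg, pi_hat, psi=psi_logistic):
    """Lemma 1: predict Sign(psi(f_hat(X)) - r_hat / (1 + r_hat)),
    where r_hat = n_+ (1 - pi_hat) / (n_- pi_hat)."""
    r_hat = n_pos * (1 - pi_hat) / (n_neg * pi_hat)
    threshold = r_hat / (1 + r_hat)
    return np.where(psi(f_hat_values) > threshold, 1, -1)
```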

4 Experiment

In this section we use simulation to demonstrate the efficacy of our Bayes carpentry techniques. Following the setup in [3] (Chap. 2), we designed p(X|y = 1) and p(X|y = −1) as mixtures of 10 Gaussians such that the Bayes decision boundary is nonlinear. We set p(y = 1) = 0.75 and simulated 5000 unlabeled data points (X) from p(X) = 0.75 · p(X|y = 1) + 0.25 · p(X|y = −1). We generated 100 labeled data points (y, X) from each class. The SVM and kernel logistic regression (KLR) [16] classifiers were fitted on the labeled data. We then used the 5000 unlabeled data points to estimate π and constructed modified SVM and KLR classifiers using the carpentry methods. For each classifier, the test error rate was calculated under the true model p(y, X). The above process was repeated 20 times.
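A compact sketch of the data-generating step just described; the mixture centres and spread below are placeholders rather than the exact 10-Gaussian mixtures of [3] (Chap. 2), so the sketch only illustrates the sampling protocol (many unlabeled draws from p(X), a balanced labeled sample of 100 per class).

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder class-conditionals: mixtures of 10 Gaussians with made-up centres.
centers_pos = rng.standard_normal((10, 2)) + np.array([1.0, 0.0])
centers_neg = rng.standard_normal((10, 2)) + np.array([0.0, 1.0])

def sample_class(n, centers, sd=0.5):
    idx = rng.integers(len(centers), size=n)
    return centers[idx] + sd * rng.standard_normal((n, 2))

pi_true = 0.75
n_unlabeled = 5000

# Unlabeled data from p(X) = 0.75 p(X|y=1) + 0.25 p(X|y=-1).
n_pos_u = rng.binomial(n_unlabeled, pi_true)
X_unlabeled = np.vstack([sample_class(n_pos_u, centers_pos),
                         sample_class(n_unlabeled - n_pos_u, centers_neg)])

# Biased labeled sample: 100 points per class, so pi_c = 0.5 != pi_true.
X_pos, X_neg = sample_class(100, centers_pos), sample_class(100, centers_neg)
X_labeled = np.vstack([X_pos, X_neg])
y_labeled = np.concatenate([np.ones(100), -np.ones(100)])
```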

Table 1: Error rates of the SVM and KLR before carpentry.

Bayes   SVM             KLR
0.133   0.233 (0.006)   0.212 (0.004)

Table 2: Error rates of the SVM and KLR after carpentry.

h(X)            π̂               WSVM            WKLR            KLR-EC
(X1 − X2)       0.759 (0.010)   0.144 (0.002)   0.145 (0.002)   0.146 (0.002)
(X1 − X2)^3     0.754 (0.005)   0.142 (0.001)   0.142 (0.002)   0.142 (0.001)
(X1 − X2)^5     0.768 (0.008)   0.144 (0.002)   0.142 (0.001)   0.142 (0.001)
exp(X1 − X2)    0.736 (0.014)   0.143 (0.002)   0.145 (0.003)   0.145 (0.002)

As can be seen from Table 1, naive use of the SVM and KLR, i.e., ignoring the biased sampling, gives undesirable classification errors around 0.23, while the Bayes error is 0.133. Table 2 shows the results after carpentry. The weighted SVM (WSVM) and weighted KLR (WKLR) do a much better job, with misclassification errors around 0.144. Moreover, the KLR with the estimation-calibrated bias correction (KLR-EC) also achieves a low error rate close to the Bayes error. We tried four different h(X) functions in the carpentry methods; the results are very similar. Figure 1 displays the decision boundaries of the SVM and KLR classifiers before and after carpentry, clearly visualizing the effect of carpentry.

[Figure 1 consists of two panels, "SVM with Radial Kernel" and "KLR with Radial Kernel", plotting the svm/klr, wsvm/wklr, klr-ec, and Bayes decision boundaries.]

Figure 1: The decision boundaries before and after carpentry, compared with the Bayes decision boundary. The effect of carpentry is clearly visible. The small triangles and circles indicate classes {1} and {−1}, respectively.

5 Summary

We have emphasized the importance of the sampling scheme in learning, which is often treated as a "nuisance" in learning algorithms. We have proved the inconsistency of popular classifiers caused by sampling bias. Semi-supervised learning has the advantage that the unlabeled data can be used adaptively to infer the sampling bias non-parametrically. General carpentry techniques are proposed to guarantee the Bayes consistency of the classifier, and we have demonstrated their efficacy by simulation. We should be cautious when the sampling distribution is not the target distribution of learning; the importance sampling idea can then help to correct the sampling bias. If the weights of importance sampling are unknown, we may consider using consistent estimates of them. In semi-supervised learning, the enormous amount of unlabeled data can be useful for estimating these weights.

As for future work, we suggest an infinite-unlabeled-data framework for studying the value of the unlabeled data. Suppose Q is a learning algorithm based on n labeled data, and assume p(X) is known. The question is then whether we can construct a new algorithm Q_p, using both p(X) and the labeled data, such that Q_p has a lower (asymptotic) error rate than Q. We have obtained some tentative interesting results to be presented in a future paper.

References

[1] Bartlett, P., Jordan, M. & McAuliffe, J. (2003) Convexity, classification, and risk bounds. Technical Report, Division of Computer Science and Department of Statistics, U.C. Berkeley.
[2] Cozman, F. G., Cohen, I. & Cirelo, M. C. (2003) Semi-supervised learning of mixture models. Proceedings of the Twentieth International Conference on Machine Learning, 99-106.
[3] Hastie, T., Tibshirani, R. & Friedman, J. (2001) The Elements of Statistical Learning. Springer-Verlag, New York.
[4] Joachims, T. (1999) Transductive inference for text classification using support vector machines. Proceedings of the Sixteenth International Conference on Machine Learning, 350-358. San Francisco: Morgan Kaufmann.
[5] Lin, Y. (2002) A note on margin-based loss functions in classification. Technical Report, Department of Statistics, University of Wisconsin, Madison.
[6] Lin, Y., Lee, Y. & Wahba, G. (2000) Support vector machines for classification in nonstandard situations. Machine Learning, 46, 191-202.
[7] Lugosi, G. & Vayatis, N. (2004) On the Bayes risk consistency of regularized boosting methods. Annals of Statistics, 32, 30-55.
[8] McCullagh, P. & Nelder, J. A. (1989) Generalized Linear Models, 2nd edition. Chapman & Hall.
[9] Nigam, K., McCallum, A. K., Thrun, S. & Mitchell, T. (2000) Text classification from labeled and unlabeled documents using EM. Machine Learning, 39, 103-134.
[10] Rosset, S., Zhu, J. & Hastie, T. (2004) Boosting as a regularized path to a maximum margin classifier. Journal of Machine Learning Research, to appear.
[11] Steinwart, I. (2003) Consistency of support vector machines and other regularized kernel classifiers. Technical Report, Los Alamos National Laboratory, Los Alamos, NM.
[12] Vardi, Y. (1985) Empirical distributions in selection bias models. Annals of Statistics, 13, 178-203.
[13] Wahba, G. (1999) Support vector machines, reproducing kernel Hilbert spaces and the randomized GACV. Advances in Kernel Methods - Support Vector Learning, 69-88. MIT Press.
[14] Zhang, T. & Oles, F. (2000) A probability analysis on the value of unlabeled data for classification problems. Int. Joint Conf. on Machine Learning, 1191-1198.
[15] Zhang, T. (2004) Statistical behavior and consistency of classification methods based on convex risk minimization. Annals of Statistics, 32, 56-85.
[16] Zhu, J. & Hastie, T. (2002) Kernel logistic regression and the import vector machine. In Advances in Neural Information Processing Systems, 14. MIT Press, Cambridge, MA.