FAMER: Making Multi-Instance Learning Better and Faster

Wei Ping∗    Ye Xu†    Jianyong Wang‡    Xian-Sheng Hua§

∗ Tsinghua National Laboratory for Information Science and Technology (TNList), School of Software, Tsinghua University; Email: [email protected]
† Computer Science Department, Dartmouth College; Email: [email protected]
‡ Department of Computer Science and Technology, Tsinghua University; Email: [email protected]
§ Media Computing Group, Microsoft Research Asia; Email: [email protected]
Abstract

Kernel methods are a powerful tool in multi-instance learning. However, many typical kernel methods for multi-instance learning ignore the correspondence information of instances between two bags as well as the co-occurrence information, which results in poor performance. Additionally, most current multi-instance kernels unreasonably assign all instances in each bag an equal weight, which neglects the significance of some "key" instances in multi-instance learning. Last but not least, almost all multi-instance kernels incur a heavy computation load and may fail on large datasets. To cope with these shortcomings, we propose a FAst kernel for Multi-instancE leaRning named FAMER. FAMER constructs a Locality-Sensitive Hashing (LSH) based similarity measure for the multi-instance framework, and represents each bag as a histogram by embedding the instances within the bag into an auxiliary space, which captures the correspondence information between two bags. By designing a bin-dependent weighting scheme, we not only impose different weights on instances according to their discriminative powers, but also exploit co-occurrence relations according to the joint statistics of instances. Because FAMER does not compute over instances in a pairwise manner, its time complexity is much smaller than that of other typical multi-instance kernels. The experiments demonstrate the effectiveness and efficiency of the proposed method.

1 Introduction

Multi-instance learning originated from investigating drug activity prediction [1]. In multi-instance learning, training examples are bags containing many instances. A bag is positive if it contains at least one positive instance; otherwise it is labeled as a negative bag. The labels of the bags in the training set are known; however, we do not know the exact labels of the instances in the bags.



The multi-instance framework has attracted much attention in various application domains such as object detection [2], information retrieval [3], image classification [4], biomedical informatics [5], and so on. Because the labels of instances are implicit, traditional supervised or semi-supervised learning algorithms cannot be applied directly. On the other hand, unsupervised learning algorithms that simply ignore the bag labels are not suitable for this framework either.

During the past several years, many algorithms [6] [7] [8] [9] [10] [11] [12] [13] have been proposed for the multi-instance framework, among which kernel methods form a powerful class for dealing with the multi-instance learning problem. However, current kernel methods for multi-instance learning have several shortcomings.

Firstly, most of these methods neglect the significance of the correspondence relations or the co-occurrence relations among instances. The implicit exponent matching scheme explored in [9] and the weighting scheme based on the label consistency of instance pairs used in [12] indicate that a high correspondence of instances in two bags implies high similarity, which is important when computing kernel values. Fig. 1 gives a vivid example. Both image bags (an image is segmented into several regions/instances as in [2]) contain coast, sea, and sky. The correspondence relation implies that the two image bags have a high similarity. If the correspondence information between two images is taken advantage of, the computed kernel value will be more accurate. [13] demonstrates that the co-occurrence of instances is able to disclose the non-i.i.d. information conveyed by instances, which makes multi-instance learning tasks easier to cope with. (Here, non-i.i.d. means not independently and identically distributed.) Take Fig. 2 [14] for example. In this object detection problem, we aim at detecting monkeys. As indicated in Fig. 2, monkeys are very likely to live in trees. This means that in an image bag, the instances that contain monkeys are very likely to co-occur with the instances that contain trees. Such co-occurrence information among instances is helpful to object detection applications [2]. Therefore, simply ignoring the correspondence or co-occurrence information is very likely to undermine the accuracy of the computed kernel value.

Secondly, typical multi-instance kernels take it for granted that each instance in a bag plays an equal role when considering the similarity of two bags. However, as indicated in [9] and [15], some "key" instances are critical under the multi-instance learning framework. The LASSO procedure [16] employed by [11] suggests that a weighting scheme benefits multi-instance learning because instances within a bag usually have different discriminative abilities. Thus it is more desirable to impose higher weights on those "key" instances. In particular, [17] proposed an algorithm to obtain the salience of each instance in each bag with respect to the bag's label.

Last but not least, these multi-instance kernel algorithms suffer from a heavy computation load, especially when the number of instances in each bag is large.

To cope with the above problems of current multi-instance kernel algorithms, we propose a FAst kernel for Multi-instancE leaRning named FAMER. In our work, we reveal the importance of correspondence for multi-instance learning for the first time, and design a Locality-Sensitive Hashing (LSH) [18] based similarity measure to detect the correspondence relationship of instances between two bags. For efficient computing, we embed each bag into an auxiliary space and represent the bag as a histogram. Here, a histogram is a high-dimensional vector, and each dimension records the number of instances that have been mapped into the corresponding bin. This histogram scheme guarantees that similar instances are very likely to be mapped into the same bins. Another contribution of FAMER is a weighting scheme for each bin that reflects its discriminative ability. In this way, we implicitly weight the instances within each bag according to their discriminative powers, and no longer treat each instance in a bag equally when computing the kernel value. Moreover, the co-occurrence information represented by the instance joint statistics [19] can be captured by this weighting scheme. The third contribution of our work is the time complexity. By avoiding pairwise computations over instances, FAMER has a low computing overhead compared with most current multi-instance kernel algorithms, particularly when the number of instances in each bag is large.

We organize the rest of this paper as follows. In Section 2, we briefly introduce related work. In Section 3, we present the preliminaries and notations. FAMER is proposed in detail in Section 4. In Section 5, we report experimental results. Finally, Section 6 concludes the paper.

Figure 1: The correspondence information (i.e., the respective coast, sea, and sky regions) between the two images indicates that the two image bags have a high similarity.

Figure 2: In the image bag, instances (segments) containing monkeys are very likely to co-occur with instances containing trees.

2 Related Work

Multi-instance learning has been investigated for several years, and many algorithms have been proposed, such as Diverse Density [6], Bayesian-KNN [7], MI Kernel (NSK and STK) [8], MI SVMs [9], MI Ensembles [10], MI Instance Selection [11], Marginalized Kernels [12], MIL for Sparse Positive Bags [20], the PPMM kernel [21], and MIGraph/miGraph [13], among which kernel methods play an important role.

The Normalized Set Kernel (NSK) [8] is the first kernel method under the multi-instance learning framework; it applies a pairwise scheme to compute the kernel between two bags. On one hand, the pairwise computing scheme simply assumes that the instances within each bag are independent of each other, which ignores the co-occurrence information; it also implicitly assumes that all instances within a bag are equally important, which leads to unsatisfactory performance. On the other hand, the pairwise computing procedure results in time complexity that is quadratic in the average number of instances per bag, which is intolerable if this number is large. The Statistic Kernel (STK) is another kernel method proposed in [8], which only considers the minimum and maximum of each feature over the instances within a bag. To the best of our knowledge, STK is the only reported fast kernel whose time complexity is linear in the average number of instances per bag. However, due to the simple heuristics it uses, STK ignores correspondence and co-occurrence information and easily fails when the number of instances in each bag is large, which undermines its advantage. The Marginalized Kernel (MG-ACC Kernel) [12] considers the pairwise relations of instances between two bags and weights every pair of instances by the consistency of their probabilistic instance labels to acquire correspondence information. However, the co-occurrence information among instances is overlooked. Moreover, [12] suffers from a heavy computing load for both inferring the probabilistic instance labels and the pairwise computing. The P-Posterior Mixture-Model Kernel (PPMM Kernel) [21] aims at detecting adaptive mechanisms of how instances decide bag labels for various applications. It trains a Gaussian Mixture Model (GMM) for the instances in each bag and summarizes the frequencies when computing the kernel value between two bags. In this way, the correspondence information between two bags is captured. Nevertheless, the co-occurrence information is ignored, and training the GMM incurs a heavy computation load. The MIGraph Kernel and miGraph Kernel [13] are two recent works that employ an ε-graph scheme to capture the co-occurrence information conveyed by instances within each bag. However, the correspondence information is ignored, and the ε-graph scheme costs a lot of computing time as well. As far as we know, there are few if any multi-instance kernels that consider both correspondence and co-occurrence information. Therefore, it is desirable to propose a fast multi-instance kernel method which can effectively incorporate both correspondence and co-occurrence information.

Our work also relates to Locality-Sensitive Hashing (LSH), which was originally proposed to tackle approximate nearest neighbor search in high dimensionality [22].

As summarized in [23], several LSH families [24] [25] [18] have been proposed in the context of approximate nearest neighbor search. From another point of view, LSH is a powerful family of embedding techniques, which constructs short similarity-preserving sketches for the objects in a database [24] [18]. This technique has been widely used in information retrieval [26] [27] and computer vision [28]. For example, [26] adopts it to solve similarity search tasks, and [28] employs LSH to efficiently match sets of features. However, as far as we know, hash embedding techniques, including LSH, remain untouched in the multi-instance learning framework.

3 Preliminaries

In this section we give the formal description of multi-instance learning and of LSH, the technique used to implement the hashing embedding.

3.1 Multi-Instance Learning. The original description of multi-instance learning [1] is as follows. Denote X as the instance space. Given a data set S = {(X_1, L_1), ..., (X_i, L_i), ..., (X_N, L_N)}, where X_i = {x_{i1}, ..., x_{ij}, ..., x_{i,n_i}} ⊂ X is called a bag, L_i ∈ L = {−1, +1} is the label of X_i, and N is the number of training bags. Here x_{ij} ∈ X_i is an instance [x_{ij1}, ..., x_{ijk}, ..., x_{ijd}]^T, where x_{ijk} is the value of x_{ij} at the k-th attribute, n_i is the number of instances in X_i, and d is the dimensionality of the instance space X. If there exists p ∈ {1, ..., n_i} such that x_{ip} is a positive instance, then X_i is a positive bag and thus L_i = +1, although the concrete value of the index p is usually unknown; otherwise L_i = −1. The goal is to learn some concept from the training set for correctly labeling unseen bags.

After several years of development in different applications, the above unambiguous mechanism (i.e., a bag is positive if and only if at least one instance is positive) has been generalized to many forms [29][30]. These generalizations are necessary for some applications. For example, as indicated in [11], in object recognition, if a positive instance label indicates that the instance appears to be part of the object (due to the difficulty of performing perfect segmentation), a negative bag may contain positive instances as well. However, as pointed out by [21], even an explicit mechanism of how the instances within a bag determine the bag label can hardly benefit a MIL algorithm deterministically, because the instance labels are unknown. Moreover, this mechanism is implicit for some applications. As a result, general kernel methods, which do not explicitly take advantage of the specific mechanism, usually achieve superior results in real-world applications.
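To make the notation concrete, the following is a minimal sketch (our own illustration, not code from the paper) of how a multi-instance data set S can be represented, with each bag stored as a matrix of its instances plus a bag-level label:

```python
import numpy as np

# A bag X_i is an (n_i, d) array of instances; L_i in {-1, +1} is its label.
# The data set S is simply a list of (bag, label) pairs.

def make_dataset(bags, labels):
    """Pack bags and labels into the data set S = {(X_i, L_i)}."""
    assert len(bags) == len(labels)
    return list(zip(bags, labels))

# Toy example with d = 3 attributes per instance.
rng = np.random.default_rng(0)
X1 = rng.normal(size=(5, 3))   # a positive bag with n_1 = 5 instances
X2 = rng.normal(size=(7, 3))   # a negative bag with n_2 = 7 instances
S = make_dataset([X1, X2], [+1, -1])
```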

3.2 Locality-Sensitive Hashing. There are two equivalent formal definitions of LSH, which can be found in [22] and [18], respectively. The definition in [22] is given in the context of approximate nearest neighbor search. We adopt the formulation in [18], which fits our purpose better: a locality-sensitive hashing scheme is a distribution on a family H of hash functions operating on a collection of objects, such that the probability of two objects being mapped to the same value reflects the similarity between them. Formally, given a feature space X, if for all x, y ∈ X the following equation is satisfied,

(3.1)    \Pr_{h \in H}[h(x) = h(y)] = s_H(x, y)

then H is an LSH family. Therein, h is a hash function sampled from H, and s_H is a similarity measure on X, which is induced by the LSH family H. Given an arbitrary locality-sensitive hash function h ∈ H, we define the corresponding induced similarity measure as

(3.2)    s_h(x, y) = \delta_{h(x), h(y)}

Therein, the Kronecker delta δ means s_h(x, y) = 1 if h(x) = h(y), and s_h(x, y) = 0 otherwise. From (3.1) and (3.2), we can easily obtain the following equation,

(3.3)    E_{h \in H}[s_h(x, y)] = s_H(x, y)

Several LSH families have been devised for various distance or similarity measures, for example, bit sampling for Hamming distance [22], min-wise independent permutations for set similarity (Jaccard index) [24], random projection for L2 distance in Euclidean space [25], and random hyperplanes for cosine similarity [18]. As summarized in previous work [23], the choice of LSH family depends on the application. For example, in text document retrieval, the LSH family of random hyperplanes for cosine similarity would be a preferable choice. In our implementation, we apply the random projection for L2 distance due to its generality. (Users can adopt other LSH families according to their requirements.)

To complete the detailed description of FAMER afterwards, we give a brief introduction to the LSH family for L2 distance as follows [23]: pick a random projection of R^d onto a 1-dimensional line, shift it by a random value b ∈ [0, W), and chop the line into segments of length W. Formally, for all x ∈ R^d,

(3.4)    h_{r,b}(x) = \left\lfloor \frac{r \cdot x + b}{W} \right\rfloor \bmod 2

Therein, the projection vector r ∈ R^d follows a d-dimensional standard Gaussian distribution, and b ∈ R is sampled from the uniform distribution U[0, W). W is called the window size, which controls the distance range that the mapping is sensitive to. A statistical analysis in [27] reveals that the following relation holds:

(3.5)    s_H = E_{h \in H}[s_h] = 1 - \int_0^1 \!\! \int_0^1 \frac{W}{d} \sum_{j \in \mathbb{Z}} \phi\!\left[\frac{W}{d}(2j + x + y)\right] dx \, dy

Therein, φ is the probability density function of the standard Gaussian distribution, and j ∈ Z is an arbitrary integer. It should be noted that h_{r,b}(x) in Eq. (3.4) is usually called an atomic hash function. In practice, these atomic functions are usually concatenated to form the final hash function in order to enhance local sensitiveness.

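As an illustration of Eqs. (3.1)–(3.4), the sketch below (our own example with hypothetical parameter values, not code from the paper) draws atomic hash functions h_{r,b} by random projection and estimates the induced similarity s_H of two points as the empirical collision rate, i.e., the Monte Carlo average of s_h from Eq. (3.3):

```python
import numpy as np

def atomic_hash(x, r, b, W):
    """Atomic LSH function of Eq. (3.4): project, shift, quantize, take mod 2."""
    return int(np.floor((np.dot(r, x) + b) / W)) % 2

def estimate_similarity(x, y, d, W=1.0, num_hashes=10000, seed=0):
    """Monte Carlo estimate of s_H(x, y) = E_h[s_h(x, y)] over random (r, b)."""
    rng = np.random.default_rng(seed)
    collisions = 0
    for _ in range(num_hashes):
        r = rng.standard_normal(d)        # r ~ d-dimensional standard Gaussian
        b = rng.uniform(0.0, W)           # b ~ U[0, W)
        collisions += atomic_hash(x, r, b, W) == atomic_hash(y, r, b, W)
    return collisions / num_hashes

d = 8
x = np.zeros(d)
y = np.zeros(d); y[0] = 0.1               # nearby point
z = np.zeros(d); z[0] = 5.0               # distant point
print(estimate_similarity(x, y, d))        # close to 1: collisions are very likely
print(estimate_similarity(x, z, d))        # lower: collisions become less likely
```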
4 The Proposed FAMER

In this section, we propose a fast kernel for multi-instance learning. At first, we present a case study to reveal the importance of a correspondence scheme in multi-instance kernels, and design an LSH based similarity measure to capture the correspondence information between two bags. For the purpose of practical computing, we construct a histogram and map every bag into it. In order to fully take advantage of the "key" instances within each bag, we design a weighting scheme to implicitly weight each instance. This weighting scheme for the occurrence histogram respects the co-occurrence relations according to the instances' joint statistics.

In the rest of this section, we give the detailed description of FAMER. In Section 4.1, we design an LSH based metric and an efficient computing scheme, which captures the correspondence information. Then in Section 4.2, we propose a weighting scheme based on entropy. At last, we summarize the algorithm and analyze the time complexity in Section 4.3.

4.1 Construct Similarity Metric with Correspondence. Many typical multi-instance kernels easily neglect the important correspondence information between two bags, which undermines the accuracy of the computed kernel values. Therefore, in our work, we aim at detecting the correspondence using an LSH based technique.

4.1.1 LSH based Similarity Metrics for MIL. Previous works [8] [31] [20] indicate that the Normalized Set Kernel (NSK) between two bags in Eq. (4.6) usually gets decent results on most real-world datasets,

(4.6)    K(X_i, X_j) = \frac{1}{n_i n_j} \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} k(x_{ia}, x_{jb})

Table 1: Classification accuracy of NSK using linear and RBF instance kernels, respectively, under several tasks

Data set      Musk1    Musk2    Elephant    Fox      Tiger
NSK_Linear    86.1%    82.4%    79.0%       55.5%    77.3%
NSK_RBF       88.0%    89.3%    84.3%       60.3%    84.2%
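For reference, here is a minimal sketch of the NSK baseline of Eq. (4.6) with an RBF instance kernel (our own illustrative implementation, not the authors' code):

```python
import numpy as np

def rbf(x, y, gamma):
    """Instance-level RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    diff = x - y
    return np.exp(-gamma * np.dot(diff, diff))

def nsk(Xi, Xj, gamma):
    """Normalized Set Kernel of Eq. (4.6): average the instance kernel
    over all instance pairs of the two bags (cost O(n_i * n_j * d))."""
    total = 0.0
    for xa in Xi:
        for xb in Xj:
            total += rbf(xa, xb, gamma)
    return total / (len(Xi) * len(Xj))
```

The double loop over instances is exactly the pairwise cost that FAMER avoids later in this section.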

In Eq. (4.6), k(x_{ia}, x_{jb}) = exp(−γ ||x_{ia} − x_{jb}||^2) is the RBF kernel between instances, and n_i and n_j are the numbers of instances inside bags X_i and X_j, respectively. This observation can be explained in two ways: the preferable relaxed optimization constraint indicated in [20], and the implicit correspondence scheme obtained from the RBF kernel between instances, which is first pointed out in our work. As demonstrated by Figure 3, the value of the RBF kernel between a similar instance pair x_{ia} and x_{jb} makes a significant contribution to the summation through exponential amplification, which captures the correspondence information to some degree. The effectiveness of this soft correspondence is empirically validated in Table 1. A detailed description of the data sets used for this case study is given in Section 5. We can see that NSK employing an RBF kernel as the instance kernel achieves significantly higher classification accuracy than NSK employing a linear kernel.

Figure 3: The relation between the RBF kernel value and L2 distance, using the empirical rule of thumb γ = 1/(2d^2) indicated in [8]. For Musk2 (described in Section 5), γ = 0.000018. It should be noted that the distances between pairs of instances for Musk2 range in [0, 8027803]. Thus if the distance is larger than 250000, it hardly makes any contribution to the bag kernel.

The above case study indicates that the exponential amplification characteristic of the RBF kernel captures the implicit correspondence of similar instance pairs between two bags. Therefore, in order to exploit the desirable correspondence information as NSK_rbf does, we define an LSH based similarity metric between bags for multi-instance learning as follows,

(4.7)    S_{MI,H}(X_i, X_j) = \frac{1}{n_i n_j} \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} \exp\big(-\gamma (1 - s_H(x_{ia}, x_{jb}))\big)

Therein, s_H(x_{ia}, x_{jb}) is the similarity metric between the two points x_{ia} and x_{jb} as in Eq. (3.1), which is induced from the family of hash functions H. From Eq. (3.5), 1 − s_H(x_{ia}, x_{jb}) in Eq. (4.7) is monotonically increasing as the L2 distance between x_{ia} and x_{jb} increases. As a result, the LSH based similarity S_{MI,H}(X_i, X_j), which has a similar form to Eq. (4.6), is able to capture the correspondence information between two bags.

4.1.2 Statistical Approximation. As indicated in (3.3), s_H(x_{ia}, x_{jb}) is an expected value and can be calculated as the expectation of s_h(x_{ia}, x_{jb}), which is induced from a single hash function h. By analogy, given an arbitrary locality-sensitive hash function h ∈ H, we also define the corresponding similarity measure, analogous to Eq. (4.7), as follows,

(4.8)    S_{MI,h}(X_i, X_j) = \frac{1}{n_i n_j} \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} \exp\big(-\gamma (1 - s_h(x_{ia}, x_{jb}))\big)

The following lemma discloses the statistical connection between (4.7) and (4.8).

Lemma 1. S_{MI,H}(X_i, X_j) in (4.7) is a lower bound of E_{h∈H}[S_{MI,h}(X_i, X_j)].

Proof. The expectation of S_{MI,h}(X_i, X_j) can be rewritten as follows by the linearity of expectation,

E_{h \in H}[S_{MI,h}(X_i, X_j)] = \frac{1}{n_i n_j} \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} E_{h \in H}\big[\exp\big(-\gamma (1 - s_h(x_{ia}, x_{jb}))\big)\big]

Note that s_h(x_{ia}, x_{jb}) is a binary random variable that equals 0 or 1, and f(z) = exp(−γ(1 − z)) is a convex function for z ∈ [0, 1]. According to Jensen's inequality, we obtain that

E_{h \in H}[S_{MI,h}(X_i, X_j)] \geq \frac{1}{n_i n_j} \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} \exp\big(-\gamma E_{h \in H}[1 - s_h(x_{ia}, x_{jb})]\big)
    = \frac{1}{n_i n_j} \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} \exp\big(-\gamma (1 - s_H(x_{ia}, x_{jb}))\big)
    = S_{MI,H}(X_i, X_j)

which completes the proof.

Lemma 1 demonstrates that E_{h∈H}[S_{MI,h}(X_i, X_j)] can be used as a biased approximation of S_{MI,H}(X_i, X_j).

4.1.3 Practical Computing. A disadvantage of Eq. (4.8) is that it is computed in a pairwise manner, as with other typical multi-instance kernels, which is time consuming, especially when the average number of instances within each bag is large. Therefore, to solve this problem, we propose an efficient way to compute E_{h∈H}[S_{MI,h}(X_i, X_j)].

At first, to enhance local sensitiveness, we build the LSH family used in practice in our method from the 0/1-valued atomic LSH family. Let A be the atomic LSH family defined in (3.4); we concatenate B independent atomic hash functions to make a B-bit hash function: H = {(f_1, f_2, ..., f_B) | f_i ∈ A}. Concatenating B atomic hash functions enhances local sensitiveness. The similarity metric induced by A and H satisfies s_H(x_{ia}, x_{jb}) = [s_A(x_{ia}, x_{jb})]^B for instances x_{ia} ∈ X_i, x_{jb} ∈ X_j. Figure 4 shows the relation between B and the hash functions: the larger B is, the more sensitive the hash function becomes. In other words, as B increases, only the most similar pairs of instances between two bags can make a significant contribution to the bag similarity under the context of multi-instance learning.

Figure 4: The relation between the concatenated B-bit hash function and the atomic hash functions: the larger B is, the more sensitive the hash function becomes.

Then, for practical computing, we set up several independent histograms for each bag based on the B-bit hash function, and estimate E_{h∈H}[S_{MI,h}(X_i, X_j)] using these histograms. Now, we formalize how to embed a bag of instances into one histogram. Let X be the instance space, H be the family of B-bit hash functions mapping from X to D = {0, 1, ..., 2^B − 1}, and {e[i] | i ∈ D} be the standard basis of the 2^B-dimensional vector space. For any i, j ∈ D, the inner product e[i] · e[j] = δ_{i,j}. Hence, given h ∈ H, the embedding histogram T_h for any bag X_i ⊂ X is defined as follows,

(4.9)    T_h(X_i) = \sum_{x_{ij} \in X_i} e[h(x_{ij})]

Therein, T_h(X_i) is determined by the hash function h sampled from H. Through the definition of T_h, we can compute S_{MI,h}(X_i, X_j) in an efficient way. Assume we have already obtained the embedding histograms T_h(X_i) and T_h(X_j) for X_i and X_j, respectively. An important observation from Eq. (4.8) is that there are T_h(X_i) · T_h(X_j) terms with s_h(x_{ia}, x_{jb}) = 1 and [n_i n_j − T_h(X_i) · T_h(X_j)] terms with s_h(x_{ia}, x_{jb}) = 0 inside the summation. Therefore, Eq. (4.8) can be rewritten as follows,

(4.10)    S_{MI,h}(X_i, X_j) = \frac{1}{n_i n_j} \sum_{a=1}^{n_i} \sum_{b=1}^{n_j} \exp\big(-\gamma (1 - s_h(x_{ia}, x_{jb}))\big) = \frac{1}{n_i n_j} \Big\{ T_h(X_i) \cdot T_h(X_j) + \big[n_i n_j - T_h(X_i) \cdot T_h(X_j)\big] \cdot \exp(-\gamma) \Big\}

According to Eq. (4.10), S_{MI,h}(X_i, X_j) can be calculated directly from the embedding histograms. The above equation only needs to scan all the instances within each bag once and avoids considering the pairwise relations between instances, which saves a lot of computational overhead.

In implementation, we maintain several independent embedding histograms to estimate E_{h∈H}[S_{MI,h}(X_i, X_j)]. Specifically, we choose M hash functions H_M = (h_1, h_2, ..., h_M) independently; then the expectation is approached by the average of the values S_{MI,h_k}(X_i, X_j), as formalized in Eq. (4.11) below.
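The sketch below (our own NumPy-based illustration with hypothetical parameter values, not the authors' code) puts the pieces of Section 4.1.3 together: a B-bit hash obtained by concatenating atomic hashes from Eq. (3.4), the histogram embedding of Eq. (4.9), and the closed-form similarity of Eq. (4.10):

```python
import numpy as np

class BBitHash:
    """A B-bit hash h = (f_1, ..., f_B) built from atomic hashes of Eq. (3.4)."""
    def __init__(self, d, B, W, rng):
        self.W = W
        self.R = rng.standard_normal((B, d))   # one Gaussian projection per bit
        self.b = rng.uniform(0.0, W, size=B)   # one uniform shift per bit

    def __call__(self, x):
        bits = (np.floor((self.R @ x + self.b) / self.W).astype(int)) % 2
        # interpret the B bits as an integer bin index in {0, ..., 2^B - 1}
        return int(bits @ (1 << np.arange(len(bits))))

def embed_histogram(bag, h, B):
    """Histogram embedding T_h(X) of Eq. (4.9): count instances per bin."""
    T = np.zeros(2 ** B)
    for x in bag:
        T[h(x)] += 1
    return T

def similarity(Ti, Tj, ni, nj, gamma):
    """S_{MI,h}(X_i, X_j) of Eq. (4.10), computed from the histograms only."""
    match = Ti @ Tj
    return (match + (ni * nj - match) * np.exp(-gamma)) / (ni * nj)

# Toy usage with hypothetical settings d = 16, B = 4, W = 1.0.
rng = np.random.default_rng(0)
d, B, W, gamma = 16, 4, 1.0, 0.5
h = BBitHash(d, B, W, rng)
Xi, Xj = rng.normal(size=(6, d)), rng.normal(size=(9, d))
Ti, Tj = embed_histogram(Xi, h, B), embed_histogram(Xj, h, B)
print(similarity(Ti, Tj, len(Xi), len(Xj), gamma))
```

Note how Eq. (4.10) needs only one pass over each bag to build the histograms and a single dot product per bag pair, instead of the O(n_i n_j) instance comparisons of Eq. (4.8).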

(4.11)    S_{MI,H_M}(X_i, X_j) = \frac{1}{M} \sum_{k=1}^{M} S_{MI,h_k}(X_i, X_j)

According to the linearity of expectation, it can be easily verified that the expectation of S_{MI,H_M}(X_i, X_j) satisfies

E_{H_M \subset H}[S_{MI,H_M}(X_i, X_j)] = E_{h \in H}[S_{MI,h}(X_i, X_j)]

However, the variance of S_{MI,H_M}(X_i, X_j) is

D_{H_M \subset H}[S_{MI,H_M}(X_i, X_j)] = D_{H_M \subset H}\!\left[\frac{1}{M} \sum_{k=1}^{M} S_{MI,h_k}(X_i, X_j)\right] = \frac{1}{M^2} \sum_{k=1}^{M} D_{h \in H}[S_{MI,h_k}(X_i, X_j)] = \frac{1}{M} D_{h \in H}[S_{MI,h}(X_i, X_j)]

Therefore, by replacing S_{MI,h}(X_i, X_j) with S_{MI,H_M}(X_i, X_j), the expectation remains the same but the variance is lowered by a factor of 1/M, which means a more stable estimation.

If we define the super-histogram T_{H_M}(X_i) = (T_{h_1}(X_i), T_{h_2}(X_i), ..., T_{h_M}(X_i)), Eq. (4.11) can be rewritten as follows,

(4.12)    S_{MI,H_M}(X_i, X_j) = \frac{1}{M} \sum_{k=1}^{M} S_{MI,h_k}(X_i, X_j)
          = \frac{1}{M} \sum_{k=1}^{M} \frac{1}{n_i n_j} \Big\{ T_{h_k}(X_i) \cdot T_{h_k}(X_j) + \big[n_i n_j - T_{h_k}(X_i) \cdot T_{h_k}(X_j)\big] \cdot \exp(-\gamma) \Big\}
          = \frac{1}{M n_i n_j} \Big\{ T_{H_M}(X_i) \cdot T_{H_M}(X_j) + \big[M n_i n_j - T_{H_M}(X_i) \cdot T_{H_M}(X_j)\big] \cdot \exp(-\gamma) \Big\}

4.2 Entropy based Weighting Scheme. Most multi-instance kernels unreasonably impose an equal weight on all instances within a bag, which ignores the significance of "key" instances [15] and undermines the learning performance. Some investigations have been made into imposing weights on instances according to their discriminative abilities. For example, the Marginalized Kernel [12] employs the DD framework [6] to generate a weight for every instance which reflects the probabilistic label of the instance. The LASSO process [16], where an L1-norm SVM is usually used, is another scheme to weight instances [11]. The AP-Salience algorithm [17] obtains the salience of each instance in each bag with respect to the bag's label. However, most of these weighting schemes suffer from a heavy computing load. To cope with this problem, we propose an efficient weighting scheme for histogram bins in this subsection, which not only implicitly weights the instances within each bag, but also respects the co-occurrence relations.

According to the properties of locality-sensitive hashing, the instances that have been mapped into the same bin can be viewed as similar instances (the distance between them is small) with some specific probability. From another point of view, each bin in the histogram corresponds to a probabilistic region in the original instance space X. As revealed in the DD framework [6], regions containing a large number of "positive" instances (from positive bags) and few negative instances carry much discriminative information for positive bags, while regions containing few "positive" instances and a large number of negative instances carry much discriminative information for negative bags. Therefore, it is reasonable to employ entropy, which reflects the concept of purity, to quantify the discriminative power of each bin. The probability of a "positive" instance in a specific bin [i] is estimated as

(4.13)    P_+[i] = \frac{N_+[i] + \varepsilon}{N_+[i] + N_-[i] + 2\varepsilon}

where N_+[i] and N_−[i] are the numbers of "positive" instances and negative instances, respectively, in bin [i] at the current state, and ε is a small constant used for smoothing. Similarly, the probability of a negative instance in bin [i] is estimated as

(4.14)    P_-[i] = \frac{N_-[i] + \varepsilon}{N_+[i] + N_-[i] + 2\varepsilon}

Every bin is empty initially, and the prior probabilities of positive and negative are both 1/2, which is consistent with Eqs. (4.13) and (4.14). It deserves noting that although we regard all instances in positive bags as "positive" when estimating the probability, the influence of those "false positive" instances in positive bags can be offset by the negative instances from the corresponding negative regions in negative bags. Through the above definitions, the weight for bin [i] is defined as the variation of entropy through embedding, which is called the information gain. The initial entropy for bin [i] is H_0[i] = −(1/2) ln(1/2) − (1/2) ln(1/2) = ln 2. After embedding, the probability for "positive" instances is P_+[i] and the probability for negative instances is P_−[i]; then

(4.15)    W[i] = \Delta H[i] + C = H_0[i] - H_e[i] + C = \ln 2 - \left( P_+[i] \ln \frac{1}{P_+[i]} + P_-[i] \ln \frac{1}{P_-[i]} \right) + C

where C is a system parameter used both to control the impact of the information gain weight and to smooth the estimation.
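A minimal sketch of the bin statistics and entropy-based weights of Eqs. (4.13)–(4.15), using our own function names and illustrative values for the smoothing constant ε and the parameter C:

```python
import numpy as np

def bin_weights(N_pos, N_neg, eps=1e-3, C=0.1):
    """Information-gain weight W[i] per bin, following Eqs. (4.13)-(4.15).

    N_pos, N_neg: arrays with the counts of "positive" / negative instances
    per histogram bin."""
    p_pos = (N_pos + eps) / (N_pos + N_neg + 2 * eps)    # Eq. (4.13)
    p_neg = (N_neg + eps) / (N_pos + N_neg + 2 * eps)    # Eq. (4.14)
    entropy_after = -(p_pos * np.log(p_pos) + p_neg * np.log(p_neg))
    return np.log(2) - entropy_after + C                  # Eq. (4.15)

# Toy example: bin 0 is purely "positive", bin 1 is mixed, bin 2 is empty.
N_pos = np.array([10.0, 5.0, 0.0])
N_neg = np.array([0.0, 5.0, 0.0])
print(bin_weights(N_pos, N_neg))   # pure bins get larger weights than mixed ones
```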

In this way, all instances inside each bin are implicitly weighted by the weights of the corresponding bins. Another point worth mentioning is that the histogram naturally records the co-occurrence information of "positive"/negative instances. The designed weighting scheme employs this information to generate weights for the histogram, and therefore also respects the co-occurrence relations to some degree: instead of calculating the similarity between pairs of instances individually (e.g., NSK [8]), it takes advantage of the joint statistics of instances to calculate the similarity [19].

Let W be the learned weight vector for the super-histogram; then we can finally present the fast kernel for multi-instance learning as follows,

(4.16)    K_{MI,H_M}(X_i, X_j) = \frac{1}{M n_i n_j} \Big\{ (W \circ T_{H_M}(X_i)) \cdot T_{H_M}(X_j) + \big[M n_i n_j - (W \circ T_{H_M}(X_i)) \cdot T_{H_M}(X_j)\big] \cdot \exp(-\gamma) \Big\}

where ◦ denotes the Hadamard (entrywise) product.
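A sketch of the final FAMER kernel of Eq. (4.16) on super-histograms (the stacking of the M per-hash histograms, as in Eq. (4.12)); the names and the NumPy representation are our own illustration, not the released implementation:

```python
import numpy as np

def super_histogram(bag, hashes, B):
    """T_{H_M}(X): concatenate the M per-hash histograms of Eq. (4.9)."""
    T = np.zeros(len(hashes) * (2 ** B))
    for m, h in enumerate(hashes):
        offset = m * (2 ** B)
        for x in bag:
            T[offset + h(x)] += 1
    return T

def famer_kernel(Ti, Tj, W, M, ni, nj, gamma):
    """Weighted kernel K_{MI,H_M}(X_i, X_j) of Eq. (4.16).

    Ti, Tj: super-histograms; W: per-bin weight vector from Eq. (4.15)."""
    weighted_match = (W * Ti) @ Tj            # (W o T_i) . T_j
    return (weighted_match
            + (M * ni * nj - weighted_match) * np.exp(-gamma)) / (M * ni * nj)
```

Here `hashes` would be a list of M independent B-bit hash functions such as the `BBitHash` objects in the earlier sketch, and W has one entry per super-histogram bin.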

As we know, only positive semi-definite kernels guarantee an optimal solution for kernel methods based on convex optimization (e.g., SVMs). The following lemma proves that the proposed FAMER kernel (4.16) indeed satisfies Mercer's condition.

Lemma 2. K_{MI,H_M}(X_i, X_j) in (4.16) is a Mercer kernel.

Proof. (4.16) can be rewritten as

K_{MI,H_M}(X_i, X_j) = \frac{1}{M n_i n_j} \Big\{ \big[(W \circ T_{H_M}(X_i)) \cdot T_{H_M}(X_j)\big] \times \big[1 - \exp(-\gamma)\big] + \big[M n_i n_j \cdot \exp(-\gamma)\big] \Big\}

Given that Mercer kernels are closed under both addition and scaling operations (due to the closure properties of valid Mercer kernels), we only need to prove that (W ◦ T_{H_M}(X_i)) · T_{H_M}(X_j) is a Mercer kernel. We construct an explicit mapping Φ such that

Φ(X_i) = \sqrt{W} \circ T_{H_M}(X_i)

Therein, \sqrt{W} represents the entrywise square root of W. Therefore, Φ(X_i) · Φ(X_j) = (W ◦ T_{H_M}(X_i)) · T_{H_M}(X_j). As a result, (W ◦ T_{H_M}(X_i)) · T_{H_M}(X_j) is a positive semi-definite kernel according to Mercer's theorem, which completes the proof.

4.3 Summary of Algorithm and Complexity Analysis. According to the above descriptions, the detailed training procedure for FAMER is summarized in Algorithm 1. First, we initialize all the super-histograms for every bag with zeros. Then, we map every instance within each bag into the corresponding histogram and count the occurrences of positive instances and negative instances inside every bin, from which we compute the weight vector using Eq. (4.15). Finally, we compute the whole kernel matrix and employ the normal training process of an SVM. After training, we obtain the weight vector and the support vectors (represented by super-histograms).

Algorithm 1: Training Procedure for FAMER
Input: Training set {(X_1, L_1), ..., (X_i, L_i), ..., (X_N, L_N)}.
1: Initialize the super-histogram T_{H_M}(X_i) for every bag X_i with zeros.
2: For every bag X_i, map every instance x_{ij} ∈ X_i into the histogram T_{H_M}(X_i), and accumulate the statistics N_+[k] and N_−[k] for every bin [k].
3: Compute the weight vector W using Eq. (4.15).
4: Compute K_{MI,H_M}(X_i, X_j) using Eq. (4.16) for every pair of bags.
5: Run the normal training process of the SVM.
6: return the weight vector W and the support vectors \bigcup_{v=1}^{N_s} {T_{H_M}(X_v)}.

The detailed test procedure for FAMER is summarized in Algorithm 2. First, we also initialize all the super-histograms for every test bag with zeros. Then, we map every instance in each bag into the corresponding histogram. Finally, we compute the kernel value between each support vector and the super-histogram of the test bag using Eq. (4.16), and employ the normal prediction process of the SVM.

Algorithm 2: Test Procedure for FAMER
Input: Test set {X_{t1}, ..., X_{ti}, ..., X_{tN}}, weight vector W, support vectors \bigcup_{v=1}^{N_s} {T_{H_M}(X_v)}.
1: Initialize the super-histogram T_{H_M}(X_{ti}) for every bag X_{ti} with zeros.
2: For every bag X_{ti}, map every instance x_{ij} ∈ X_{ti} into the histogram T_{H_M}(X_{ti}).
3: Compute K_{MI,H_M}(X_{ti}, X_v) using Eq. (4.16) for every pair of bags between the test set and the support vectors.
4: Run the normal prediction process of the SVM.
5: return the predicted results {L_{t1}, ..., L_{ti}, ..., L_{tN}}.
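The two procedures can be sketched end to end with a precomputed-kernel SVM; the scikit-learn usage below is our own assumption for illustration and is not part of the paper, which reports results obtained with LIBSVM:

```python
import numpy as np
from sklearn.svm import SVC

def kernel_matrix(hists_a, hists_b, sizes_a, sizes_b, kernel_fn):
    """Gram matrix between two lists of super-histograms using kernel_fn,
    e.g. a closure over the famer_kernel sketch above (Eq. 4.16) with the
    learned W and the fixed M and gamma."""
    K = np.zeros((len(hists_a), len(hists_b)))
    for i, (Ti, ni) in enumerate(zip(hists_a, sizes_a)):
        for j, (Tj, nj) in enumerate(zip(hists_b, sizes_b)):
            K[i, j] = kernel_fn(Ti, Tj, ni, nj)
    return K

# Training (Algorithm 1): embed bags, learn bin weights, fit the SVM on the
# precomputed Gram matrix.
#   k = lambda Ti, Tj, ni, nj: famer_kernel(Ti, Tj, W, M, ni, nj, gamma)
#   K_train = kernel_matrix(train_hists, train_hists, train_sizes, train_sizes, k)
#   clf = SVC(kernel="precomputed").fit(K_train, train_labels)
#
# Test (Algorithm 2): evaluate the kernel between test bags and training bags.
#   K_test = kernel_matrix(test_hists, train_hists, test_sizes, train_sizes, k)
#   predictions = clf.predict(K_test)
```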

The time complexity of the proposed FAMER (both the training and the test procedures) includes two parts: generating the embedding histograms and computing the kernel values. Denote d as the dimensionality of the instance space X, n as the average bag size, and N as the total number of bags in the data set S. The embedding part involves calculating hash values and adding them to the histogram bins. We adopt a dense representation for every histogram, so that the initialization time of a single super-histogram for each bag is O(2^B M). Computing hash values and adding them to the histogram bins for a bag costs O(dBMn). As a result, the time complexity of the embedding part for the whole dataset is O(N · (dBMn + 2^B M)). Computing kernel values depends entirely on the histogram size, and takes O(2^B M) for every pair of bags and O(2^B M N^2) for an N × N symmetric kernel matrix. This part does not explicitly depend on the dimensionality d or the bag size n, which gives FAMER the potential to be fast. Most typical multi-instance kernels such as NSK, MIGraph, miGraph, the MG-ACC kernel, and the PPMM kernel adopt a pairwise manner to compute the kernel value. Therefore, their time complexity is at least quadratic in the bag size n, which becomes very slow as n increases.

5 Experiment

In this section, we focus on evaluating the effectiveness and efficiency of the proposed FAMER on three application domains. Several typical multi-instance kernels are compared with FAMER: the Normalized Set Kernel (NSK_rbf) [8], the Statistic Kernel (STK) [8], the MIGraph Kernel [13], the miGraph Kernel [13], the MG-ACC Kernel [12], and the PPMM Kernel [21]. Besides these typical kernels, we also compare the proposed FAMER with some other well-known multi-instance classification algorithms.

5.1 Experimental Methodology and Parameter Setting. In these experiments, we learn a kernel using FAMER and the other typical kernel methods on some well-known multi-instance learning applications. Then LIBSVM [32] is employed to obtain the corresponding classification accuracy for each learned kernel. During training, a 10-fold cross validation scheme is applied to tune the parameters for each method. In the testing procedure, we repeat the experiments several times to obtain an average classification accuracy. To fairly validate the efficiency of the proposed method, we implement FAMER, NSK_rbf, the MIGraph Kernel, and the miGraph Kernel as C++ programs. All these programs are compiled in Microsoft Visual Studio 2005 and executed on a PC with an Intel Core 2 Duo CPU (2.10GHz).

M is the number of random hash functions we use. According to the Law of Large Numbers, the larger M is, the more accurately we can estimate the expectation E_{h∈H}[S_{MI,h}(X_i, X_j)]. In all the experiments, we simply set M = 400, which already guarantees a stable estimation of the expectation in practical use. Intuitively, as the parameter B increases, only the most similar pairs of instances correspond to each other and contribute to the bag kernel. Therefore, B is an application-dependent parameter, which reflects the degree of tolerance for correspondence. In our experiments, we simply set B = 4 for most datasets (e.g., Musk1, Elephant, Fox, and Tiger), and B = 5 for Musk2 and B = 6 for Protein, both of which have a larger average number of instances per bag. The window size W in Eq. (3.4) is an important parameter. During the experiments, we follow the suggestion of [28] that W is set as a portion of the average r^T · x in Eq. (3.4).

5.2 Drug Activity Prediction. Drug activity was the motivating application for the multiple-instance representation [1]. In this subsection, we evaluate the proposed FAMER on the well-known datasets Musk1 and Musk2. More detailed information about the datasets is available in [1] or Table 5. On the Musk1 and Musk2 datasets, we use FAMER along with the other typical kernel methods to learn a kernel, and then apply LIBSVM to test the classification performance. The results are obtained using a ten-times 10-fold cross validation policy and listed in the top half of Table 2. To further validate the effectiveness of the proposed FAMER, we list the results of some other classification algorithms for multi-instance learning in the bottom half of Table 2. The best performances are highlighted in bold typeface. Although a little lower than PPMM [21] on Musk1, the proposed FAMER is better than the other methods, including most typical kernels for the multi-instance framework. As for Musk2, its classification accuracy is the best.

5.3 Automatic Image Annotation. In this subsection, we evaluate the proposed FAMER on three image annotation datasets: Elephant, Fox, and Tiger. The tasks proposed in [9] aim at detecting three categories of animals, Tiger, Elephant, and Fox, from background images. In this representation, an image (bag) consists of a set of segments (instances), each of which is characterized by color, texture, and shape descriptors. Each of the three datasets has 100 positive and 100 negative bags. More details can be found in [9] or Table 5. As in the last subsection, we use FAMER along with the other kernel methods to learn a kernel on Elephant, Fox, and Tiger. A ten-times 10-fold cross validation policy is used to obtain the average classification results, which are listed in Table 3. Some other classification algorithms for multi-instance learning are shown in the bottom half of Table 3 as well.

Table 2: Classification accuracy of NSK_rbf, STK, MIGraph, miGraph, MG-ACC, PPMM, the proposed FAMER, and other typical algorithms for multi-instance learning on Musk1 and Musk2.

Algorithm              Musk1    Musk2    Average
NSK_rbf [8]            88.0%    89.3%    88.7%
STK [8]                91.6%    86.3%    89.0%
MIGraph Kernel [13]    90.0%    90.0%    90.0%
miGraph Kernel [13]    88.9%    90.3%    89.6%
MG-ACC Kernel [12]     90.1%    90.4%    90.3%
PPMM Kernel [21]       95.6%    81.2%    88.4%
FAMER                  91.3%    93.3%    92.3%
MI-SVM [9]             77.9%    84.3%    81.1%
mi-SVM [9]             87.4%    83.6%    85.5%
MissSVM [33]           87.6%    80.0%    83.8%
DD [6]                 88.0%    84.0%    86.0%
EM-DD [34]             84.8%    84.9%    84.9%
APR [1]                92.4%    89.2%    90.8%
MI-Box [30]            91.2%    90.3%    90.8%

Table 3: Classification accuracy of NSK_rbf, STK, MIGraph, miGraph, PPMM, the proposed FAMER, and other typical algorithms for multi-instance learning on Elephant, Fox, and Tiger.

Algorithm         Elephant    Fox      Tiger
NSK_rbf           84.3%       60.3%    84.2%
STK               83.5%       63.0%    79.0%
MIGraph Kernel    85.1%       61.2%    81.9%
miGraph Kernel    86.8%       61.6%    86.0%
PPMM Kernel       82.4%       60.3%    80.2%
FAMER             87.5%       67.0%    87.0%
MI-SVM            81.4%       59.4%    84.0%
mi-SVM            82.0%       58.2%    78.9%
EM-DD             78.3%       56.1%    72.1%

The best performances are highlighted in bold typeface. The results show that the proposed FAMER achieves the best classification performance on all three datasets and has a significant improvement compared with the other methods. This is mainly because FAMER not only captures the valuable correspondence and co-occurrence information when computing the kernel value, but also imposes relatively high weights on those "key" instances with strong discriminability.

5.4 Identifying Trx-fold Proteins. In this subsection, we use a Protein dataset to evaluate the proposed FAMER. This task has been modeled as a multiple-instance problem in [30][35]. The objective is to classify given protein sequences according to whether they belong to the family of Trx proteins. The low conservation of the primary sequence in protein superfamilies such as Thioredoxin-fold (Trx-fold) makes the task difficult for conventional modeling methods (e.g., HMMs). Wang et al. [35] apply multiple-instance learning as a tool to identify new Trx-fold proteins. The given proteins are first aligned with respect to a motif which is known to be conserved in members of the family. Each aligned protein is represented by a bag. A bag is labeled positive if the protein belongs to the family, and negative otherwise. An instance in a bag corresponds to a position in a fixed-length sequence around the conserved motif. The dataset has 20 positive bags and 160 negative bags. Each instance inside a bag is represented by an 8-dimensional vector (i.e., d = 8). More details can be found in [35].

Due to the serious imbalance between positive and negative bags in the Protein data set, simply adopting a 10-fold cross validation scheme would lead to an extremely high True Negative rate and a very low True Positive rate, which is meaningless because the True Positive rate is more important in this protein identification task. (Actually, simply classifying every bag as negative obtains a trivial result: an average classification accuracy of 88.9%, a high TN of 100%, but a low TP of 0%.) Therefore, we adopt the same leave-one-out scheme used by Wang et al. [35]: in every turn, we hold out one positive bag for testing, and use the remaining 19 positive bags for training. As for the negative data, the whole set of negative bags is split into 8 equal-sized subsets (each containing 20 bags). In other words, we train our algorithms on 19 positive bags plus one subset of negative bags, and then test using the held-out positive bag plus the remaining 7 subsets of negative bags. We repeat this procedure 20 × 8 = 160 times when learning each kernel.

Table 4: True Positive rate (TP) and True Negative rate (TN) of NSK_rbf, STK, MIGraph, miGraph, and FAMER on the Trx Protein dataset.

Algorithm         TP       TN       Average
NSK_rbf           69.0%    72.6%    70.8%
STK               58.1%    66.1%    62.1%
MIGraph Kernel    68.1%    70.3%    69.2%
miGraph Kernel    64.1%    69.7%    66.9%
FAMER             70.1%    70.9%    70.5%

Table 4 lists the True Positive rate (TP), True Negative rate (TN), and average accuracy of the proposed FAMER and the other typical kernel methods. The best performances are highlighted in bold typeface. The results show that FAMER achieves the best True Positive rate, which is very critical to this application. As for the TN and average rates, FAMER is a little lower than NSK_rbf, but much better than the other methods.
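The evaluation protocol above can be sketched as follows (our own illustration of the split construction, with hypothetical list variables for the positive and negative bags):

```python
import numpy as np

def trx_splits(positive_bags, negative_bags, num_subsets=8, seed=0):
    """Yield (train, test) index splits for the leave-one-out protocol:
    19 positive bags + one negative subset for training, the held-out
    positive bag + the remaining subsets for testing (20 x 8 = 160 splits)."""
    rng = np.random.default_rng(seed)
    neg_idx = rng.permutation(len(negative_bags))
    neg_subsets = np.array_split(neg_idx, num_subsets)
    for held_out in range(len(positive_bags)):
        train_pos = [p for p in range(len(positive_bags)) if p != held_out]
        for s in range(num_subsets):
            rest = np.concatenate([neg_subsets[t]
                                   for t in range(num_subsets) if t != s])
            train = (train_pos, neg_subsets[s].tolist())  # 19 positive + 20 negative
            test = ([held_out], rest.tolist())            # 1 positive + 140 negative
            yield train, test
```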

Table 5: Detailed description of the various data sets used for evaluating efficiency

Dataset                                Musk1    Musk2    Elephant    Fox     Tiger    Protein
Number of bags                         92       102      200         200     200      180
Number of positive bags                47       39       100         100     100      20
Number of negative bags                45       63       100         100     100      160
Average number of instances per bag    5.2      64.7     6.96        6.6     6.1      137.8
Dimensionality of instance space       166      166      230         230     230      8

Table 6: Time durations of FAMER, NSK_rbf, MIGraph, and miGraph Kernel (FAMER is reported as offline embedding time + online kernel computation time)

Algorithm         Musk1                  Musk2                     Elephant                  Fox                       Tiger                     Protein
NSK_rbf           375ms                  67,562ms                  4,438ms                   4,015ms                   3,453ms                   161,844ms
MIGraph Kernel    1,218ms                784,422ms                 9,875ms                   9,203ms                   7,816ms                   7,123,830ms
miGraph Kernel    385ms                  77,485ms                  5,250ms                   4,703ms                   4,016ms                   199,461ms
FAMER             609 + 169 = 778ms      9,431 + 461 = 9,892ms     1,704 + 500 = 2,204ms     1,625 + 501 = 2,126ms     1,502 + 498 = 2,000ms     4,359 + 3,875 = 8,234ms

5.5 Efficiency. As we know, the time cost of kernel methods resides in two parts: the time to compute the complete kernel matrix, and the time to perform the convex optimization. We only focus on the first part, because the second part mainly depends on the specific optimization techniques adopted, which is out of the scope of this paper. The detailed descriptions of the various datasets used for evaluating efficiency are listed in Table 5. These data sets roughly fall into two categories according to the average number of instances per bag: the category containing a small number of instances (e.g., Musk1, Elephant, Fox, and Tiger) and the category containing a large number of instances (e.g., Musk2 and Protein). We record the execution times of FAMER, NSK_rbf, the MIGraph Kernel, and the miGraph Kernel for computing the complete kernel matrix on these datasets and list them in Table 6. We do not compare FAMER with STK because, as the previous experiments indicated, the classification performance of STK is rather poor, especially on databases that contain a large number of instances per bag. It should be pointed out that the execution time of FAMER includes two parts, an offline part for embedding and an online part for computing kernel values.

The proposed FAMER runs much faster than the other kernels. As the average number of instances per bag increases, the efficiency superiority of FAMER becomes more and more significant. For example, FAMER runs only about twice as fast as NSK_rbf and miGraph on small databases such as Elephant, Fox, and Tiger. However, when it comes to large datasets, it runs about 7 times faster than NSK_rbf and miGraph on Musk2, and even 20 times faster on Protein. MIGraph is really slow, because its computational complexity is O(dn^4 N^2). Another key point worth mentioning is that FAMER gains its high efficiency without losing effectiveness: it achieves the best classification accuracy on most tasks.

6 Conclusion

In this paper, we propose a fast kernel for multi-instance learning named FAMER. By designing an LSH based similarity metric and representing each bag as a histogram, FAMER is able to detect the correspondence information between two bags when computing the kernel value. Unlike many other typical kernels, FAMER employs a weighting scheme that not only implicitly weights instances according to their discriminative abilities, but also captures the co-occurrence information. Last but not least, without computing the pairwise relations of instances between two bags, FAMER achieves a low computing load compared with many other multi-instance kernels. The experiments show that the proposed FAMER achieves superior classification performance and high efficiency.

7 Acknowledgments

We thank the anonymous reviewers, Kristen Grauman from the University of Texas at Austin, Guoliang Li from Tsinghua University, and Linjun Yang from MSRA for their invaluable input.

References

[1] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Perez. Solving the multiple instance problem with axis-parallel rectangles. Artificial Intelligence, 89(1):31–71, 1997.
[2] Y. Chen and J. Z. Wang. Image categorization by learning and reasoning with regions. JMLR, 5:913–939, 2004.
[3] J. Rousu, C. Saunders, S. Szedmak, and J. Shawe-Taylor. Learning hierarchical multi-category text classification models. In ICML, pages 745–752, 2005.
[4] G.-J. Qi, X.-S. Hua, Y. Rui, T. Mei, J. Tang, and H.-J. Zhang. Concurrent multiple instance learning for image categorization. In CVPR, pages 1–8, 2007.
[5] G. Fung, M. Dundar, B. Krishnapuram, and R. B. Rao. Multiple instance learning for computer aided diagnosis. In NIPS, pages 425–432, 2007.
[6] O. Maron and T. Lozano-Perez. A framework for multiple-instance learning. In NIPS, pages 570–576, 1998.
[7] J. Wang and J.-D. Zucker. Solving the multiple-instance problem: a lazy learning approach. In ICML, pages 1119–1125, 2000.
[8] T. Gartner, P. A. Flach, A. Kowalczyk, and A. J. Smola. Multi-instance kernels. In ICML, pages 179–186, 2002.
[9] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In NIPS, pages 561–568, 2003.
[10] Z.-H. Zhou and M.-L. Zhang. Ensembles of multi-instance learners. In ECML, pages 492–502, 2003.
[11] Y. Chen, J. Bi, and J. Z. Wang. MILES: Multiple-instance learning via embedded instance selection. TPAMI, 28(12):1931–1947, 2006.
[12] J. T. Kwok and P.-M. Cheung. Marginalized multi-instance kernels. In IJCAI, pages 901–906, 2007.
[13] Z.-H. Zhou, Y.-Y. Sun, and Y.-F. Li. Multi-instance learning by treating instances as non-i.i.d. samples. In ICML, pages 1249–1256, 2009.
[14] W. Ping, Y. Xu, K. Ren, C.-H. Chi, and F. Shen. Non-i.i.d. multi-instance dimensionality reduction by learning a maximum bag margin subspace. In AAAI, pages 551–556, 2010.
[15] Y.-F. Li, J. Kwok, I. Tsang, and Z.-H. Zhou. A convex method for locating regions of interest with multi-instance learning. In ECML, pages 15–30, 2009.
[16] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, 58:267–288, 1996.
[17] K. L. Wagstaff and T. Lane. Salience assignment for multiple-instance regression. In ICML Workshop, pages 1–6, 2007.
[18] M. S. Charikar. Similarity estimation techniques from rounding algorithms. In STOC, pages 380–388, 2002.
[19] K. Grauman and T. Darrell. The pyramid match kernel: efficient learning with sets of features. JMLR, 8:725–760, 2007.

[20] R. Bunescu and R. Mooney. Multiple instance learning for sparse positive bags. In ICML, pages 105–112, 2007.
[21] H.-Y. Wang, Q. Yang, and H. Zha. Adaptive p-posterior mixture-model kernels for multiple instance learning. In ICML, pages 1136–1143, 2008.
[22] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In STOC, pages 604–613, 1998.
[23] A. Andoni and P. Indyk. Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. Communications of the ACM, 51(1):117–122, 2008.
[24] A. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. In STOC, pages 327–336, 1998.
[25] M. Datar, N. Immorlica, P. Indyk, and V. S. Mirrokni. Locality-sensitive hashing scheme based on p-stable distributions. In SCG, pages 253–262, 2004.
[26] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In VLDB, pages 518–529, 1999.
[27] W. Dong, M. Charikar, and K. Li. Asymmetric distance estimation with sketches for similarity search in high-dimensional spaces. In SIGIR, pages 123–130, 2008.
[28] W. Dong, Z. Wang, M. Charikar, and K. Li. Efficiently matching sets of features with random histograms. In SIGMM, pages 179–188, 2008.
[29] N. Weidmann, E. Frank, and B. Pfahringer. A two-level learning method for generalized multi-instance problems. In ECML, pages 468–479, 2003.
[30] Q. Tao, S. Scott, N. V. Vinodchandran, and T. T. Osugi. SVM-based generalized multiple-instance learning via approximate box counting. In ICML, pages 779–806, 2004.
[31] S. Ray and M. Craven. Supervised versus multiple instance learning: an empirical comparison. In ICML, pages 697–704, 2005.
[32] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[33] Z.-H. Zhou and J.-M. Xu. On the relation between multi-instance learning and semi-supervised learning. In ICML, pages 1167–1174, 2007.
[34] Q. Zhang and S. A. Goldman. EM-DD: An improved multiple-instance learning technique. In NIPS, pages 1073–1080, 2003.
[35] C. Wang, S. Scott, J. Wang, Q. Tao, D. E. Fomenko, and V. N. Gladyshev. A study in modeling low-conservation protein superfamilies. Technical Report TR-UNL-CSE-2004-0003.