# Discriminative and consistent similarities in instance-level Multiple Instance Learning

## Authors

## Abstract

In this paper we present a bottom-up method to instance-level Multiple Instance Learning (MIL) that learns to discover positive instances with globally constrained reasoning about local pairwise similarities. We discover positive instances by optimizing for a ranking such that positive (top rank) instances are highly and consistently similar to each other and dissimilar to negative instances. Our approach takes advantage of a discriminative notion of pairwise similarity coupled with a structural cue in the form of a consistency metric that measures the quality of each similarity. We learn a similarity function for every pair of instances in positive bags by how similarly they differ from instances in negative bags, the only certain labels in MIL. Our experiments demonstrate that our method consistently outperforms state-of-the-art MIL methods both at bag-level and instance-level predictions in standard benchmarks, image category recognition, and text categorization datasets.

## 1. Introduction

Multiple-instance learning (MIL) [13] addresses a variation of classification problems where complete labels of training examples are not available. In the MIL setup, training labels are assigned to bags of instances rather than individual instances. In most standard MIL setups, a bag is positive if it contains at least one positive instance, and is negative if all of its instances are negative. The standard task in MIL is to classify unknown bags of instances (e.g., [31, 48] ). However, several application domains require instance-level predictions (e.g., [43] ). For example, in image segmentation (instances are superpixels, bags are images) the main goal is to find the exact regions of an image that correspond to the objects of interest.

Instance-level MIL has been approached either by a complex joint optimization over bag and instance classifiers (e.g., [1] ) or by identifying positive instances followed by bag classification. Latter involves similarity-based rea- soning where most methods either use standard similarity functions (e.g., [48] ) or learn a global similarity function for all instances (e.g., [43] ). Standard similarity functions are not necessarily discriminative (orange dashed links in Fig. 1 ) and cannot discover common properties among positive instances. Globally learned similarity functions cannot encode different types of similarities that tie together positive instances within groups (Fig. 1) .

In this paper, we introduce a new method for the problem of instance-level MIL with globally-constrained reasoning about local pairwise discriminative similarities. We introduce a novel approach that learns similarity functions specific to each instance and reasons about the underlying structure of similarities between positive instances using our notion of consistent similarities (Green cliques in Fig. 1 ). We introduce a discriminative notion of similarity that enables learning a similarity function for each pair of instances in positive bags (similarity patterns in Fig. 1 ). Typically, learning a similarity function requires training labels for similar instances. However, instance-level labels are not available in MIL. We use negative bag labels as the only certain labels in MIL to learn our discriminative similarity function. Instances in positive bags are similar if they are similarly different form instances in negative bags.

Pairwise similarities are not always transitive and can be confused with coincidental patterns in a high-dimensional feature space [38] (Purple dashed links in Fig. 1 ). For example, two images a and c cannot be similar to each other only because they are similar to another image b; a might be similar to b because both show a sunset over an ocean, and c might be similar to b because of coincidental patterns of similarity. A reliable pairwise similarity should be globally consistent across several pairs (green links in Fig. 1 ).

We introduce a novel clique-based notion of similarity that measures global consistency of pairwise similarities. We formulate the discovery of positive instances as a ranking problem where top rank instances in positive bags are highly and consistently similar to each other. The bag labels provide constraints to our optimization problem; real positive instances inside each bag should rank higher than negative instances in negative bags. We show that a random-walk based ranking algorithm that uses our globally-consistent pairwise similarities outperforms stateof-the-art MIL results in MIL benchmarks, text categorization, and image segmentation.

Related Work: Over the course of the previous decades several interesting approaches address the problem of baglevel multiple instance learning. Some examples include [31, 32, 44, 45, 1, 42, 11, 4, 6, 29, 46, 16, 37, 45, 10, 48, 5, 39, 2, 36, 17] . Please see [47] for a complete survey; space does not allow a comprehensive literature review. A group of approaches to instance-level MIL use joint optimization over bags and instances. This optimization is modeled in mi-SVM [1] by a max margin formulation and [28] in a convex form. In MILES [9] , most discriminative instances are selected by an L1-regularized bag-level classifier. In these settings, solving a joint optimization can lead to an NP-hard problem that has to be approximated. To avoid this problem our method firs discovers positive instances and then uses them to predict bag-level labels.

Similar to our method, other instance-level MIL approaches separate the two steps of discovering positive instances and bag classification. Xiao et al. [43] design a method that assigns two values to instances by measuring their similarity to positive and negative classes and then use a similarity weighted large margin model to learn the final classifier. Jia and Zhang [24] use two different loss functions for negative and positive bags used in a semisupervised fashion. Fu and Robles [15] introduce an EMlike algorithm that iterates between selecting positive prototypes and updating classifiers. Kim and Torre [25] approach MIL by Gaussian process latent variable models. Deselaers and Ferrari [12] consider each bag as a node in a conditional random field. In contrast, our method takes both local and global information into account through discriminative and consistency similarities between instances. In addition, our method uses a global ranking method for optimization. Due to these properties, our results show consistent improvement over previous MIL methods in different domains.

Discriminative similarities have shown to be effective in the natural language processing community [18, 26] for aligning sentences to events. Unlike our method, previous work does not incorporate consistency of similarities.

2. Overview of Our Method: mi-Sim Fig. 2 sketches an overview of our method. At training, bags B tr of instances X tr along with bag-level labels b tr are known; Instance-level labels are not given. Our training algorithm has two main steps: discovering "correct" positive instances L + among all instances X tr + in positive bags based on the training bag labels (Fig. 2 Step 1.a) and training a final binary classifier using the discovered positive instances L + and instances X tr − in negative bags in the training set ( At test time neither bag labels nor instance-level labels are known. We test our method on how well it can predict both bag-level and instance-level labels. For testing ( Fig. 2 Step 2), we use the final binary classifier to predict labels of individual instances in the test set. Bag-level labels are then predicted using instance-level labels; a bag is positive if it includes at least one predicted positive instance.

## 3. Discovering Positive Instances

Training involves discovering positive instances L + using bag-level labels in the training set ( Fig. 2 Step 1.a). Following most previous works in MIL (e.g., [12, 40] ) we assume that negative instances are negative in their own specific ways while positive instances are similar to each other. More intuitively, if positive bags have negative instances that are consistently similar to other negative instances in other positive bags, the instance discovery becomes ill defined. Our experimental results suggest that this is a mild assumption.

We formulate the positive instance discovery problem as the problem of searching for assignments of positive instances that maximize similarities between discovered positive instances. In particular, our goal is to find the best labeling L to training instances such that discovered positive instances L + are highly and consistently similar. The constraint in the optimization enforces each positive bag to have at least one positive instance. A. CSG ← the graph in which instances x i and x j are connected with a weight S( in the training set, and F is a form of similarity function between pairs of instances x i and x j . Below we describe how to model

max L xi,xj ∈L + F(x i , x j ) (1) k∈B L k ≥ 1 ∀B ∈ B tr + , L l ∈ {0, 1} ∀l ∈ X

x i , x j ) + C(x i , x j ) B. R xi ←

F(x i , x j ) as a combination of discriminative S(x i , x j ) and consistent similarities C(x i , x j ).

Discriminative similarity: Positive instances should not only be similar to each other but also be different from negative instances. Learning a similarity function for positive instances require having a labeled set of positive instances, but instance-level labels are not available. We anchor the learning of our similarity function on the only certain labels in MIL setup; negative bag labels. All instances in the negative bags are known to be negative. We propose to encode the similarity between two positive instances based on how similarly they differ from negative instances (Sec. 3.1).

## Consistency Of Similarities:

Pairwise similarities are not always transitive, they can be confused with coincidental patterns in the feature space [38] . For example, two images may become similar because they both depict the same scene or may be similar because of some irrelevant accidents in the feature space. Coincidental patterns in the feature space, by definition, are not repetitive. This suggests that a reliable pairwise similarity is the one that is homogeneous across several pairs. We introduce the notion of consistent similarity and measure the consistency of a similarity based on the cardinality of the clique containing both instances (Sec. 3.2).

Positive instance discovery as ranking: We aim to discover assignments of positive instances such that it maximizes the similarities between positive instances, the dis-crimination between positive and negative instances, and the consistency of similarities between positive instances. Searching for the optimal assignments of instance labels is a challenging optimization problem that is even hard to approximate. We formulate this problem by optimizing for a global ranking R xu for each instance x u such that the top rank instances in each bag jointly maximize the similarity S and consistency C among positive instances:

max R xi,xj ∈L + S(x i , x j ) + C(x i , x j ) (2) R xu R xv ∀x u ∈ L + , x v ∈ (X tr + \ L + ) L + = ∪ n ∆ R B tr n

where X tr + corresponds to all the instances in positive bags, L + is the discovered set of positive instances, R xu is the ranking score of the instance x u in its bag, ∆ R B tr n corresponds to top-ranked instances based on ranking R in positive bag B tr n . Finding the optimal ordering of instances according to the above optimization (Equation 2) is a challenging combinatorial optimization problem (NP-Hard). We approximate the above optimization with a random-walk based ranking that respects both similarity and consistency of similarities (Sec.3.3). To this end, we build a Consistent Similarity Graph (CSG) for instances in positive bags in the training set. Nodes in this graph are instances in positive bags and edges represent discriminative pairwise similarities along with their consistency scores. The weight of an edge between two nodes x i and x j correspond to

S(x i , x j ) + C(x i , x j ).

Definition 1 (Consistent Similarity Graph (CSG)). A CSG is an undirected weighted graph CSG = (V, E) where a node x i ∈ V corresponds to a training instance x i in a positive bag and an edge e ij ∈ E connects

x i to x j , if S(x i , x j ) > 0. The weight e ij is α.S(x i , x j ) + C(x i , x j )

where α is the balancing factor that takes into account the differences between the scales of S and C.

We approximate the best ranking by performing a random walk on this graph. The main intuition is nodes that are adjacent to high rank nodes with large edge scores will get high ranks. There are several results on why a random walk based approach results in decent approximations of the optimal ranking [33, 40, 34, 3, 35, 18, 26] .

## 3.1. Pairwise Discriminative Similarity

In this section we describe how to derive similarity S(x i , x j ) between two instances x i and x j (Fig.2 step 1 .a.i). We base our notion of similarity on using negative labels which are the only certain labels in MIL. We learn what is unique about each instance in positive bags by discriminating it from all instances in the negative bags; these are the examples we are sure they are not similar to positive instances. We learn an instance by what it is not like. We do this by fitting an SVM with only one positive instance and a large number of negative instances.

Recently in computer vision, Exemplar Support Vector Machines have shown great success in learning what is unique about an image that can distinguish it from all other images [30] . Despite being susceptible to overfitting, the hard negative mining method in [30] gets away from this issue. By fitting a classifier to only one positive instance and a large number of negative instances we learn how to weight features against each other in a discriminative manner. If a learned model for each instance produces a positive score when applied to another instance, the two instances are considered to be similar.

Based on our discriminative notion of similarity, two instances in positive bags are similar if they are similarly different from negative instances. To setup notations, for an instance x i ∈ X tr + , we fit a linear SVM (called iSVM) to x i as a positive training example and use instances X tr − in negative bags as negative examples. We balance the positive and negative examples by weighting them accordingly. The confidence of applying the SVM over a new instance x j is computed as Υ i,j = w T i • x j , where w i is the learned weight vector for instance x i .

The discriminative similarity between two instances x i and x j is defined according to their mutual confidences. The confidence scores are not directly comparable. To calibrate, we use the order of each instance among all the other instances. The similarity of two instances x i and x j is 1/ϕ(i, j).ϕ(j, i) where ϕ(i, j) represents the order of x j among all the instances with positive confidence when classified by w i .

S(x i , x j ) = 1 ϕ(i,j)•ϕ(j,i) if Υ i,j > 0 and Υ j,i > 0 0 otherwise (3) ϕ(i, j)

is the rank of x j among all the training instances x k where Υ i,k > 0.

## 3.2. Consistency Of Similarities

In this section we describe how to compute the consistency C(x i , x j ) of the similarity between two instances x i and x j (Fig.2 step 1 .a.ii). Pairwise similarities are not always transitive, they can be confused with coincidental patterns in the feature space [38] . This exposes subtleties to reasoning based on pairwise similarities. For example, two images may become similar because they both depict the same scene or may be similar because of some irrelevant accidents in the feature space. These accident happen largely because feature spaces are not perfect representations of the world, and similarities are typically modeled by some sort of a global distance in a high dimensional space. Similarities typically expose several modalities out of which very few are desirable.

Coincidental patterns in the feature space, by definition, are not repetitive. This suggests that a reliable pairwise similarity is the one that is homogeneous across several pairs in a large clique of instances. We define a consistency score between two instances as the size of the largest maximal clique that contains both instances. If a pairwise similarity is consistent then several homogeneous similarities can be found, thus the maximal clique is large in size. If a pairwise similarity is not consistent the corresponding maximal clique is small, resulting in a small consistency score. In CSG the nodes are all the samples from positive bags and edges are only between nodes that their discriminative similarity are greater than zero.

The notion of a clique is slightly rigid for pairwise similarities due to uncertainties in our discriminative similarity measure and inherent variations among positive instances. For these reasons we use the notion of quasi-cliques [8] that relaxes the constraint of completeness in cliques.

## Definition 2 (Quasi Clique

). A graph G = (V, E) is a γ- quasi clique if |E| ≥ γ |V | 2 , where 0 < γ ≤ 1.

A quasi-clique essentially represents a group of instances that are densely similar to each other. Under settings of MIL, it is natural to assume that positive instances correspond to higher cardinality quasi-cliques with high degrees of inner-clique-similarity.

Definition 3 (Maximal Quasi-Clique). A maximal quasiclique in an undirected graph is a quasi-clique that cannot be extended by adding any more node.

For each subgraph G i induced by a node x i there always exists a maximal quasi-clique containing x i . Every node might appear in more than one maximal quasi-clique.

We adopt a greedy approach [8] to find maximal quasicliques in a graph constructed by connecting similar instances in positive bags. For every node x i in CSG = (V, E), our method iteratively collects a set of vertices Q i to form a quasi-clique corresponding to the node x i . Initially, Q i is set to x i . At each iteration a candidate set of instances P i is a set of nodes x j ∈ V \ Q i that are connected to a large portion of current nodes in Q i .

`EQUATION (4): Not extracted; please refer to original document.`

where 1[.] is a 0-1 indicator function of its boolean argument. Then, a node x * j ∈ P i will be added to Q i where

x * j = arg max xj ∈Pi x k ∈Qi e kj .

We iterate until there is no new node that can be added to Q i i.e., P i = ∅. Corollary 1. Q i generated by the above algorithm returns a maximal γ-quasi-clique. Finally, we model consistency similarity between two instances x i and x j as the size of the largest maximal quasiclique that includes the edge between x i and x j .

## Definition 4 (Consistency Score)

. Let e ij be the edge connecting x i and x j in the graph. Let Q h denote different maximal quasi-cliques that include nodes x i and x j . Consistency C(x i , x j ) of the edge e ij is the size of the largest maximal quasi-clique that includes both x i and x j .

`EQUATION (5): Not extracted; please refer to original document.`

## 3.3. Ranking Instances In Csg

In this section we describe how to rank instances to optimize for Equation 2 (Fig.2 step 1.a.iii) . The optimal ranking imposes high consistent similarities among top ranked instances in positive bags. To rank, we perform a randomwalk algorithm to propagate scores in the consistent similarity graph. Recall that in CSG, we assign a node for each instance in positive bags. The edge scores are combinations of discriminative similarity score and the consistency score.

We rank nodes in CSG by computing ranking scores of all the instances X tr + in positive bags. We then select highest-scoring nodes as positive instances. The ranker in CSG should assign high ranks to nodes which are highly and consistently similar to many high rank instances. We adopt a random walk algorithm similar to Google PageRank [7] that encourages propagating scores among edges that have high similarity score.

Our algorithm computes a ranking score R xi corresponding to every node in the graph. The algorithm iteratively computes the score of a node according to Equation 6. At every iteration, R xi is the expected sum (with probability d) of scores of the adjacent nodes (computed at the previous iteration) and the self confidence value:

`EQUATION (6): Not extracted; please refer to original document.`

where e ij is the normalized weight of an edge in CSG, Υ i,i (as defined in section 3.1) is the confidence of each instance classifier on its own instance, N (x i ) is the set of adjacent nodes to x i in CSG, and d is a damping factor. R xi is initialized by random values. At iteration 1, only direct edges (length 1) are considered; R xi only adds up the scores of the nodes that are directly linked to x i . In next iterations, longer paths are considered; the effect of indirectly linked nodes to x i is included in the scores of adjacent nodes. We control the expected length of the paths with a damping factor d.

Once the scores of each instance is known, the most probable positives in each bag would be the highest scoring ones. In our experiments, we select positive instances if their ranking score is higher than 0.9. Knowing the positive instances in each bag, bag-level and instance-level MIL become straight forward.

## 4. Experimental Setup

Tasks: We compare our method with bag-level and instance-level state-of-the-art methods in different datasets. Training sets in all the experiments only include bag-level labels. During training our method discovers positive instances in each bag. Once positive instances are discovered we train SV M f inal using discovered positive instances as positive examples and all the examples in the negative bags as negative instances. At test time, in bag-level classification, bag-level labels are inferred from the instance-level predictions. A bag will be positive if it contains at least one predicted positive instance. In addition, we evaluate our algorithm for positive instance discovery (instance-level predictions). Datasets: Benchmark datasets: We evaluate our method on benchmark datasets for MIL including Drug Activity Prediction (Musk1, Musk2) and Localized Content-based Image Retrieval (Elephant, Tiger and Fox). More details on these datasets can be found in [1] . COREL-2000: We evaluate our method on COREL-2000, which is a standard dataset for image categorization using MIL and has been studied by previous work in MIL. It contains 1000 images in 20 categories of COREL. For every image, regions are extracted via segmentation, and each segment represented by a 9-dimensional feature vector. To cast image categorization to an MIL problem, every image is a bag of regions. The regions are considered as instances. An image is positive if at least one of its regions contain the object of interest. More details on this dataset can be found in [10, 9] . IL-MSRC: In order to evaluate instance-level MIL on image categorization tasks we introduce the Instance-level MSRC (IL-MSRC) dataset. We use images of MSRC dataset (30 images per category). We perform segmentation on each image using superpixel extraction of [14] . We then use ground-truth segments provided by MSRC to label each superpixel. A superpixel is positive, if it has more than 50% overlap (pixel area) with ground-truth segment otherwise it is negative. Images are bags and superpixels are instances in each bag. 20 Newsgroup: These datasets include texts from 20 newsgroups corpora. This dataset has been introduced in mi-Graph [48] to evaluate the role of MIL in text categorization. Each dataset has 100 bags: 50 positive and 50 negative. For every category, every positive bag includes a few news which are randomly selected for that category and many unrelated news. The main characteristic of this dataset is that instance-level labels are available and it has low witness rate (3%) i.e., there is almost one positive instance in each positive bag.

Parameters: Our method is not sensitive to the choice of parameters across different domains. All the parameters are fixed across all the experiments except thresholding on the final SVM classifier. We have to do this because the literature does not provide precision-recall values.

We set the parameter γ for finding quasi-cliques to 0.9 in all the experiments across both texts and images. We set α in Definition 1 to exp(10, log

C(xi,xj ) S(xi,xj ) )

to take into account the differences between the scales of the similarity scores and consistency scores. Our ranking algorithm produces a ranking score for all instances between 0 and 1. Instances with a score higher than 0.9 is considered as positive instance. For training iSVMs, the trade off parameter c is default and we compensate for imbalanced data by weighting negative instances according to the size of the negative set. Damping factor is set to 0.8 in all the experiments as suggested by Google. The final threshold over SVM scores are determined by cross-validation following the experimental setting in [28, 48] . To have an accurate comparison, in each experiments we followed the settings proposed by previous works on each experiment. Results of other methods are reported from the published work with the exception of experiments on IL-MSRC where we use the publicly available codes. Table 1 . Accuracy of our method, mi-Sim, in predicting bag labels by our approach compared to state-of-the-art bag-level and instance-level MIL approaches on benchmark tasks.

## 5. Experimental Results

To show the generality of our method, we apply our system to different MIL benchmarks, image, and text domains and show improvement over general MIL methods.

## 5.1. Bag-Level Predictions

System Performance on Standard Benchmarks: We compare our method, mi-Sim, with previous bag-level and instance-level methods for MIL on benchmark datasets. We report the accuracy of prediction of bag labels in percent via ten times 10-fold cross validation. Our method outperforms all the other methods on all these datasets except PPMM [41] and APR [13] on Musk1. PPMM uses an exhaustive search that may be prohibitive in practice. APR has been designed specially for the drug activity prediction. System Performance on Text Categorization: MIL has shown to be very effective in text categorization. Table 2(a) shows the results of comparisons on text categorization over twenty datasets introduced in miGraph [48] . We report ten times 10-fold cross validation following the experimental setting of miGraph and SVR-SVM [28] . Our method outperforms both methods in most datasets with a large margin. The results can further be improved using more advanced textual features to compute discriminative similarities [23, 19, 22, 20] . System Performance on Image Categorization: Previ- Table 2 . Accuracy of our method, mi-Sim, vs. state of the art in predicting bag labels on (a) the text categorization task across different datasets (b) image categorization on the dataset COREL-2000, (c) image categorization on the dataset IL-MSRC, and (d) ablation study on the benchmark Tiger dataset.

ous researchers show an interesting application of MIL in image categorization. Tables 2(b) and (c) show the results of our method versus previous MIL techniques for image categorization on COREL2000 and IL-MSRC. We used the same experimental setting as the previous work [48, 9] , repeat five times 5-random partitioning, and report the overall accuracy of 95% confidence intervals. For our new dataset IL-MSRC, we compared our method with BoW (simple bag-of-words model), MI-SVM, and mi-Graph (features are BOW of SIFT with 1000 codebooks). Our method outperforms all the previous techniques with a large margin. The results can further be improved using more advanced vision-based features to compute discriminative similarities [27] .

Contribution of System Components: Table 2(d) shows the accuracy for different controls on benchmark datasets to show the importance of each of the components. (−consistency) removes consistent similarities and only considers pairwise discriminative similarities, (−discriminative) removes discriminative similarities and only considers consistent similarities on edges. (−ranking) replaces random walk ranking with a baseline of using the degree of each node as its final score; this baseline examines the importance of our random walk algorithm to approximate optimization 2. (Gaussian) replaces the discriminative similarity with -graph of Gaussian similarity in a same way as it is proposed in miGraph (highest competition with our system). Results shows that each component in our model plays an important role in our final system and removing each component drops the performance with a big margin. System Running Time: Our system, for every instance, trains one iSVM which is a linear classifier and can be learned efficiently. We compare the running time of our full system with miGraph and miSVM on Tiger dataset which has 1220 instances and 200 bags. The training time (in second) of our method, miGraph and miSVM are 3.23, 3.12, 856 respectively and testing times are 2.51, 2.71, 2.63. In addition, the training time of our method for each iSVM classifier in MSRC dataset takes less than 0.5 sec. Table 1 shows that our method outperforms previous instance-level MIL methods in bag-level predictions. To show that our approach is also successful in instance-level predictions, we evaluate our system on 20 Newsgroups and our new dataset IL-MSRC that include instance-level labels.

## 5.2. Instance-Level Predictions

20 Newsgroups: Figure 3 compares the instance-level accuracy for our method and mi-SVM [1] on all the datasets in the 20 newsgroup using F1-measure. Positive instances found by our method and mi-SVM are those that have highest score within each bag. We significantly outperform mi-SVM in all those datasets. -./0123" 45-6/172" 819:;82#3127" 67#/-5:8-50" 3-7#/-5:8-50" 819:;82#<" =;52->0" -?.;2" 3;.;57@7>02" A-20A->>" /;7B0@" 271#75@6." 271#0>07.5;9172" 271#30:" 271#26-70" 7/512C-9" 6;>1C72#4?92" 6;>1C72#31:0-2." 6;>1C72#3127" 50>141;9#3127" Instance-level MSRC: The task is to find which superpixel corresponds to the object of interest. Figure 4(right) , shows the precision-recall curve on the IL-MSRC dataset. The precision-recall curve is traced by a threshold on the instance-level scores. Positive instances are those that are above a threshold (T ) within each bag. We do this because there could be several positive instances in each bag. Our method significantly outperforms mi-SVM. Figure 4 (left) depicts qualitative examples of the regions discovered by our method and mi-SVM. Note that image segmentation is not the focus of this paper. We used this task to showcase the generality of our approach. Our features are very simple and we avoid any vision specific tweaks to be comparable with other MIL methods.