
Learning Everything about Anything: Webly-Supervised Visual Concept Learning


  • S. Divvala
  • Ali Farhadi
  • Carlos Guestrin
  • 2014 IEEE Conference on Computer Vision and Pattern Recognition
  • 2014


Recognition is graduating from labs to real-world applications. While it is encouraging to see its potential being tapped, it brings forth a fundamental challenge to the vision researcher: scalability. How can we learn a model for any concept that exhaustively covers all its appearance variations, while requiring minimal or no human supervision for compiling the vocabulary of visual variance, gathering the training images and annotations, and learning the models? In this paper, we introduce a fully-automated approach for learning extensive models for a wide range of variations (e.g., actions, interactions, attributes and beyond) within any concept. Our approach leverages vast resources of online books to discover the vocabulary of variance, and intertwines the data collection and modeling steps to alleviate the need for explicit human supervision in training the models. Our approach organizes the visual knowledge about a concept in a convenient and useful way, enabling a variety of applications across vision and NLP. Our online system has been queried by users to learn models for several interesting concepts including breakfast, Gandhi, beautiful, etc. To date, our system has models available for over 50,000 variations within 150 concepts, and has annotated more than 10 million images with bounding boxes.

1. Introduction

How can we learn everything (visual) about any concept? There are two main axes to this question. The everything axis corresponds to all possible appearance variations of a concept, while the anything axis corresponds to the span of different concepts for which visual models are to be learned.

The conventional paradigm for learning a concept model is to first discover the visual space of variance for the concept (variance discovery), then gather the training data, i.e., images and annotations, and finally design a powerful model that can handle all the discovered intra-concept variations (variance modeling). For variance discovery, the common practice is to assume a manually defined vocabulary by relying on benchmark datasets [47]. For variance modeling, which is often approached in isolation from the discovery step, the majority of methods use a divide-and-conquer strategy, where the training data within a category is grouped into smaller sub-categories of manageable visual variance [13]. A variety of cues have been used to partition the data: viewpoint [9], aspect-ratio [18], poselets [5], visual phrases [43], taxonomies [11], and attributes [16, 23].

While the above paradigm has helped advance the recognition community, two fundamental and pragmatic questions remain unanswered: First, how can we ensure everything about a concept is learned? More specifically, how can we gather an exhaustive vocabulary that covers all the visual variance within a concept? Second, how can we scale the above paradigm to learn everything about anything? I.e., is it possible to devise an approach that alleviates the need for human supervision in discovering the vocabulary, gathering training data and annotations, and learning the models?

In this paper, we introduce a novel "webly-supervised" approach to discover and model the intra-concept visual variance. We show how to automatically gather an exhaustive vocabulary of visual variance for any concept, and learn reliable visual models using no explicit human supervision.

1.1. Variance Discovery And Modeling

Almost all previous works have resorted to the use of explicit human supervision for variance discovery and modeling. Using explicit supervision for variance discovery has a couple of drawbacks:

Extensivity: a manually-defined vocabulary of variance cannot enumerate all the visual variances for a concept, and reflects the cultural, geographical, or temporal biases of the people compiling it. For example, 'firewalking' is a popular phenomenon only in some parts of the world, and thus may get excluded from the vocabulary of 'walking'. When sampling the visual space for collecting data, arbitrary and limited vocabularies can result in highly biased datasets [47]. There is always a trade-off between the exhaustiveness of the vocabulary discovered and the complexity of the model used to constrain the visual variance; a more exhaustive vocabulary results in limited variance within each group, and thereby potentially alleviates the need for sophisticated models.

Specificity: pre-defined vocabularies do not typically generalize to new concepts. For example, the action of 'rearing' can modify a 'horse' with very characteristic appearance, but does not extend to 'sheep', while 'shearing' applies to 'sheep' but not to 'horse'. This makes the task of manually defining a vocabulary even more burdensome as one needs to define these vocabularies per concept.

Using explicit supervision for variance modeling has the following additional drawbacks:

Flexibility: the act of explicit human annotation leads to rigid decisions at the time of dataset creation (e.g., the list of attributes [16, 23], or visual phrases [43]). These decisions can seldom be modified once the annotations have been collected and thus often end up dictating the methods used to process the data. For example, a grouping based on horse breeds ('sorrel horse', 'pommel horse', etc.) as in ImageNet [11] is not very useful for a shape (HOG)-based 'horse' detector [18]; a grouping based on actions ('jumping horse', 'reining horse', etc.) might be preferable. Thus, it would be beneficial if the annotations could be modified based on the feature representation and the learning algorithm.

Scalability: human annotation also presents a hurdle towards learning scalable models. Every new proposal to constrain the visual variance of the data poses a herculean task of either preparing a new dataset (e.g., ImageNet) or adding new annotations to an existing dataset. For example, in the case of phraselets [12] and attributes [16, 23], new annotations had to be added to all the PASCAL VOC images. Furthermore, as the modeling step is typically approached in isolation from the discovery step, the annotations obtained for modeling the intra-concept variance are often different and disjoint from those gathered during variance discovery.

1.2. Overview

In this work, we propose a new approach to automatically discover and model the visual space of a concept that circumvents the above limitations (see Figure 2). To discover the vocabulary of variance, we leverage vast resources of books available online (Google Books Ngrams [33]). This discovered vocabulary is not only extensive but also concept-specific. Given a term, e.g., 'horse', the corpus includes ngrams containing all aspects of the term such as actions ('rearing horse'), interactions ('barrel horse'), attributes ('bridled horse'), parts ('horse eye'), viewpoints ('front horse'), and beyond (see Figure 1, top row).

Figure 1. Not extracted; please refer to original document.
Figure 2: Approach Overview

To model the visual variance, we propose to intertwine the vocabulary discovery and the model learning steps. Our proposal alleviates the need for explicit human annotation of images, thus offering greater flexibility and scalability. To this end, we leverage recent progress in text-based web image search engines, and weakly supervised object localization methods. Image search has improved tremendously over the past few years; it is now possible to retrieve relevant sets of object-centric images (where the object of interest occupies most of the image) for a wide range of queries. While the results are not perfect, the top ranked images for most queries tend to be very relevant [32]. With the recent success of the Deformable Parts Model (DPM) detector [18], weakly-supervised object localization techniques [36, 39] have risen back to popularity. Although these methods do not work well when presented with a highly diverse and polluted set of images, e.g., images retrieved for 'horse', they work surprisingly well when presented with a relatively clean and constrained set of object-centric images, e.g., images retrieved for 'jumping horse'.

Our idea of intertwining the discovery and modeling steps is in part motivated by the observation that the VOC dataset was compiled by downloading images using an explicit set of query expansions for each object (see Table 1 in [15]). However, the VOC organizers discarded the keywords after retrieving the images, probably assuming that the keywords were useful only for creating the dataset and not for model learning purposes, or presuming that since the keywords were hand-chosen and limited, focusing too much on them would produce methods that would not generalize to new classes. In this work, we show how the idea of systematic query expansion helps not only in gathering less biased data, but also in learning more reliable models with no explicit supervision.

Our contributions include: (i) A novel approach for discovering a comprehensive vocabulary (covering actions, interactions, attributes, and beyond), and training a full-fledged detection model for any concept, including scenes, events, actions, places, etc., using no explicit supervision. (ii) Showing substantial improvement over existing weakly- [text not extracted; please refer to original document] An open-source online system (http://levan.cs.uw.edu) that, given any query concept, automatically learns everything visual about it. To date, our system has learned more than 50,000 visual models that span over 150 concepts, and has annotated more than 10 million images with bounding boxes.

Table 1: Examples of the vocabulary discovered and the relationships estimated for a few sample concepts.

2. Related Work

Taming intra-class variance: Previous works on constraining intra-class variance have considered simple annotations based on aspect-ratio [18], viewpoint [9], and feature-space clustering [13]. These annotations can only tackle simple appearance variations of an object [51]. Recent works have considered more complex annotations such as phrases [43], phraselets [12], and attributes [16, 23]. While explicit supervision is required to gather the list of phrases and their bounding boxes in [43], the work of [12] needs heavier supervision to annotate joint locations of all objects within the dataset. Although [24, 27] discover phrases directly using object bounding boxes, their phrasal vocabulary is limited to object compositions, and cannot discover complex actions, e.g., 'reining horse', 'bucking horse', etc. Moreover, all of the methods [12, 24, 27] discover phrases only involving the fully annotated objects within a dataset, i.e., they cannot discover 'horse tram' or 'barrel horse' when tram and barrel are not annotated. Attributes [16, 23] are often too ambiguous to be used independently of the corresponding object, e.g., a 'tall' rabbit is shorter than a 'short' horse; 'cutting' is an attribute referring to a sport for horses while it has a completely different meaning for sheep. To date, there exists no established schema for listing attributes for a given dataset [37].

Weakly-supervised object localization: The idea of training detection models from images and videos without bounding boxes has received renewed attention [2, 36, 39, 45] due to the recent success of the DPM detector [18]. While it is encouraging to see progress, there are a few limitations yet to be conquered. Existing image-based methods [2, 36, 45] fail to perform well when the object of interest is highly cluttered or when it occupies only a small portion of the image (e.g., bottle). Video-based methods [39] rely on motion cues, and thus cannot localize static objects (e.g., tvmonitor). Finally, all existing methods train their models on a weakly-labeled dataset where each training image or video is assumed to contain the object. To scale to millions of categories, it is desirable to adapt these methods to directly learn models from noisy web images.

Learning from web images: Due to the complexity of the detection task and the higher supervision requirements, most previous works [4, 19, 28, 38, 44, 48] on using web images have focused on learning models only for image classification. The work of [21, 41] focuses on discovering commonly occurring segments within a large pool of web images, but does not report localization results. The work of [49] uses active learning to gather bounding box annotations from Turkers. The work of [7] aims at discovering common sense knowledge from web images, while our work focuses on learning exhaustive semantically-rich models to capture intra-concept variance. Our method produces well-performing models that achieve state-of-the-art performance on the benchmark PASCAL VOC dataset.

3. Discovering The Vocabulary Of Variance

In order to obtain all the keywords that modify a concept, we use the Google Books Ngram English 2012 corpora [33]. We specifically use the dependency gram data, which contains parts-of-speech (POS) tagged head=>modifier dependencies between pairs of words, and is much richer than the raw ngram data (see section 4.3 in [30]). We choose ngram data over other lexical databases (such as WordNet or Wikipedia lists [1]) as it is much more exhaustive, general, and includes popularity (frequency) information. Using the books ngram data helps us cover all variations of any concept the human race has ever written down in books.

Given a concept and its corresponding POS tag, e.g., 'reading, verb', we find all its occurrences annotated with that POS tag within the dependency gram data. Using the POS tag helps partially disambiguate the context of the query, e.g., reading action (verb) vs. reading city (noun). Amongst all the ngram dependencies retrieved for a given concept, we select those where the modifiers are tagged either as noun, verb, adjective, or adverb 1. We marginalize over years by summing up the frequencies across different years. Using this procedure, we typically end up with around 5000 ngrams for a concept.
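The filtering and marginalization described above can be sketched in a few lines. This is only an illustration: the record layout `(head, head_pos, modifier, modifier_pos, year, count)`, the tag names, and the function `collect_ngrams` are assumptions made for the example, not the actual format of the dependency gram files.

```python
from collections import defaultdict

# Hypothetical record layout: (head, head_pos, modifier, modifier_pos, year, count).
# The real dependency-gram files use their own format; this is only an illustration.
KEPT_MODIFIER_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def collect_ngrams(records, concept, concept_pos):
    """Sum yearly counts of (modifier, concept) pairs whose tags pass the filter."""
    totals = defaultdict(int)
    for head, head_pos, modifier, modifier_pos, year, count in records:
        if head != concept or head_pos != concept_pos:
            continue  # different word, or the same word used with another POS tag
        if modifier_pos not in KEPT_MODIFIER_TAGS:
            continue  # drop determiners, pronouns, and similar modifiers
        totals[(modifier, head)] += count  # marginalize over years
    # Most frequent ngrams first; ties broken alphabetically.
    return sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))


records = [
    ("horse", "NOUN", "jumping", "VERB", 1990, 12),
    ("horse", "NOUN", "jumping", "VERB", 2001, 30),
    ("horse", "NOUN", "the", "DET", 2001, 900),
    ("horse", "VERB", "around", "ADV", 1985, 4),
    ("horse", "NOUN", "white", "ADJ", 1999, 25),
]
ngrams = collect_ngrams(records, "horse", "NOUN")
```

In this toy input, the two 'jumping horse' rows are summed across years, while the determiner row and the verb usage of 'horse' are dropped.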

Not all the ngrams gathered using the above procedure are visually salient, e.g., 'particular horse', 'last horse', etc. While our model learning procedure (section 4) is robust to such noise, it would be unnecessary to train full-fledged detectors for irrelevant ngrams. To avoid wasteful computation, we use a simple and fast image-classifier based pruning method. Our pruning step can be viewed as part of a cascade strategy that rejects irrelevant ngrams using a weak model before training strong models for relevant ngrams.

3.1. Classifier-Based Pruning

The goal here is to identify visually salient ngrams out of the pool of all discovered ngrams for a concept. Our main intuition is that visually salient ngrams should exhibit predictable visual patterns accessible to standard classifiers. This means that an image-based classifier trained for a visually salient ngram should accurately predict unseen samples of that ngram.

We start by retrieving a set of images $I_i$ for each ngram $i$. To maintain low latency, we only use thumbnails (64×64 pixels) of the first 64 images retrieved from Image Search. We ignore all near-duplicate images. We then randomly split this set into equal-sized training and validation sets $I_i = \{I^t_i, I^v_i\}$, and augment the training images $I^t_i$ with their mirrored versions. We also gather a random pool of background images $\bar{I} = \{\bar{I}^t, \bar{I}^v\}$. For each ngram, we train a linear SVM [6] $W_i$ with $I^t_i$ as positive and $\bar{I}^t$ as negative training images, using dense HOG features [18]. This classifier is then evaluated on a combined pool of validation images $\{I^v_i \cup \bar{I}^v\}$.

We declare an ngram $i$ to be visually salient if the Average Precision (A.P.) [15] of the classifier $W_i$ computed on $\{I^v_i \cup \bar{I}^v\}$ is above a threshold. We set the threshold to a low value (10%) to ensure all potentially salient ngrams are passed on to the next stage, and only the totally irrelevant ones are discarded. Although our data is noisy (the downloaded images are not manually verified to contain the concept of interest), and the HOG+linearSVM framework that we use is not the prevailing state-of-the-art for image classification, we found our method to be effective and sufficient in pruning irrelevant ngrams. After the pruning step, we typically end up with around 1000 ngrams for a concept.
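As a rough illustration of the saliency test, the sketch below scores a classifier's validation outputs with average precision and applies the 10% threshold. It assumes a simple non-interpolated A.P. (the mean of the precision at each positive's rank), which may differ from the PASCAL VOC variant cited in the text; the function names are hypothetical, and the classifier scores are taken as given rather than computed from HOG features.

```python
def average_precision(scores, labels):
    """Non-interpolated A.P.: mean of the precision at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


SALIENCY_THRESHOLD = 0.10  # the 10% threshold stated in the text


def is_visually_salient(scores, labels, threshold=SALIENCY_THRESHOLD):
    """An ngram passes if its classifier's validation A.P. exceeds the threshold."""
    return average_precision(scores, labels) > threshold


# Validation scores for one ngram classifier: True = ngram image, False = background.
scores = [0.9, 0.8, 0.4, 0.3, 0.1]
labels = [True, False, True, False, False]
ap = average_precision(scores, labels)
salient = is_visually_salient(scores, labels)
```

Because the threshold is deliberately low, this check only rejects ngrams whose classifiers rank their own validation images near the bottom of the pool.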

3.2. Space Of Visual Variance

Amongst the list of pruned ngrams there are several synonymous items, for example, 'sledge horse' and 'sleigh horse', 'plow horse' and 'plough horse', etc. Further, some non-synonymous ngrams correspond to visually similar entities, e.g., 'eating horse' and 'grazing horse' (see Figure 2) [31]. To avoid training separate models for visually similar ngrams, and to pool valuable training data across them, we need to sample the visual space of a concept more carefully. How can we identify representative ngrams that span the visual space of a concept? We focus on two main criteria: quality and coverage (diversity).

We represent the space of all ngrams by a graph $G = \{V, E\}$, where each node represents an ngram and each edge represents the visual similarity between them. Each node has a score $d_i$ that corresponds to the quality of the ngram classifier $W_i$. We set the score $d_i$ as the A.P. of the classifier $W_i$ on its validation data $\{I^v_i \cup \bar{I}^v\}$. Each edge weight $e_{i,j}$ corresponds to the visual distance between two ngrams $i, j$ and is measured by the score of the $j$th ngram classifier $W_j$ on the $i$th ngram validation set $\{I^v_i \cup \bar{I}^v\}$.

To avoid issues with uncalibrated classifier scores, we use a rank-based measure. Our ranking function ($R: \mathbb{R}^{|\bar{I}^v \cup I^v_i|} \to \mathbb{N}^{|I^v_i|}$) ranks instances in the validation set of an ngram against the pool of background images. In our notation, $R_{i,j}$ corresponds to the ranks of images in $I^v_i$ against $\bar{I}^v$ scored using $W_j$. We use the normalized median rank as the edge weight, $e_{i,j} = \frac{\mathrm{Median}(R_{i,j})}{|I^v_j|}$. We scale the $e_{i,j}$ values to lie in $[0, 1]$.
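The rank-based edge weight can be sketched as follows. Several details are assumptions made for illustration: ranks are taken in the pooled, descending-sorted score list with rank 1 as the best, tied scores share the best rank, and the final rescaling to [0, 1] is a min-max scaling over all edges. The text does not specify these choices.

```python
from statistics import median


def ranks_against_background(pos_scores, bg_scores):
    """Rank of each positive image within the pooled, descending-sorted scores (1 = best)."""
    pooled = sorted(pos_scores + bg_scores, reverse=True)
    # list.index returns the first match, so tied scores share the best rank.
    return [pooled.index(s) + 1 for s in pos_scores]


def normalized_median_rank(pos_scores, bg_scores, n_val_j):
    """Median rank of ngram i's validation images under classifier j, divided by |I^v_j|."""
    return median(ranks_against_background(pos_scores, bg_scores)) / n_val_j


def min_max_scale(values):
    """Map a list of edge weights onto [0, 1]; an assumed choice of scaling."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


pos = [2.0, 1.5, -0.5]        # classifier j's scores on ngram i's validation images
bg = [1.0, 0.0, -1.0, -2.0]   # classifier j's scores on the background pool
ranks = ranks_against_background(pos, bg)
e_ij = normalized_median_rank(pos, bg, n_val_j=3)
```

Using ranks instead of raw SVM scores means that two classifiers with different score ranges still produce comparable edge weights.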

The problem of finding a representative subset of ngrams can be formulated as searching for the subset S ⊆ V that maximizes the quality F of that subset:

EQUATION (1): Not extracted; please refer to original document.


EQUATION (2): Not extracted; please refer to original document.

O is a soft coverage function that implicitly pushes for diversity:

EQUATION (3): Not extracted; please refer to original document.

This formulation searches for a subset of ngrams that are visually manageable (have reliable ngram classifiers) and cover the space of variance within a concept (similar to [3, 22]). Fortunately, this objective function is submodular, hence there exists a greedy solution within a constant approximation of the optimal solution. We use an iterative greedy solution that adds at each stage the ngram $i$ that provides the maximum gain over the current subset ($\arg\max_i F(S \cup i) - F(S)$).

This algorithm provides the subset of representative ngrams that best describes the space of variance under a fixed budget $k$. We can use the same algorithm to also merge similar ngrams together to form superngrams. By setting the cost of adding similar ngrams in $S$ to a really high value, each ngram $l \notin S$ can be merged to its closest member in $S$. Our merging procedure reveals interesting relations between ngrams by merging visually similar actions, interactions, and attributes. For example, our method discovers the following ngrams of 'horse' as visually similar: {tang horse, dynasty horse}, {loping horse, cantering horse}, {betting horse, racing horse}, etc. Table 1 shows more examples for other concepts. Using this procedure, we reduce the number of ngrams to around 250 superngrams.
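A minimal sketch of the greedy selection and merging follows. Since the objective's equations are not available in this text, the sketch assumes a common form: $F(S) = \sum_i d_i \, O(i, S)$ with a noisy-OR coverage $O(i, S) = 1 - \prod_{j \in S} (1 - s_{i,j})$, where `sim[i][j]` is a similarity in [0, 1]. Both the form of the objective and the nearest-member merging rule are assumptions for illustration, not necessarily the authors' exact formulation.

```python
def coverage(i, subset, sim):
    """Soft coverage of node i by the chosen subset (assumed noisy-OR form)."""
    uncovered = 1.0
    for j in subset:
        uncovered *= 1.0 - sim[i][j]
    return 1.0 - uncovered


def quality(subset, d, sim):
    """Assumed objective: each node contributes its score times how well it is covered."""
    return sum(d[i] * coverage(i, subset, sim) for i in range(len(d)))


def greedy_select(d, sim, k):
    """Repeatedly add the node with the largest marginal gain F(S + {i}) - F(S)."""
    selected = []
    for _ in range(min(k, len(d))):
        base = quality(selected, d, sim)
        candidates = [i for i in range(len(d)) if i not in selected]
        best = max(candidates, key=lambda i: quality(selected + [i], d, sim) - base)
        selected.append(best)
    return selected


def merge_into_superngrams(selected, sim):
    """Attach every unselected node to its most similar selected node."""
    groups = {s: [s] for s in selected}
    for node in range(len(sim)):
        if node not in groups:
            groups[max(selected, key=lambda s: sim[node][s])].append(node)
    return groups


# Toy graph: nodes 0 and 1 look alike, as do nodes 2 and 3.
d = [0.9, 0.8, 0.6, 0.5]
sim = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
selected = greedy_select(d, sim, k=2)
groups = merge_into_superngrams(selected, sim)
```

In the toy graph, the second pick skips node 1 despite its high score, because node 0 already covers it; this is the diversity effect that the coverage term is meant to provide.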

4. Model Learning

The images for training the detectors are gathered using Image Search with the query phrases as the ngrams constituting the superngram. We download 200 full-sized, 'full' color, 'photo' type images per query. We resize images to a maximum of 500 pixels (preserving aspect-ratio), discard all near-duplicates, and ignore images with extreme aspect ratios (aspect ratio > 2.5 or < 0.4). We split the downloaded images into training and validation sets.

Training a mixture of roots: Pandey et al. [36] demonstrated the possibility of training the DPM detector [18] using weak supervision. Directly applying their method to all the images within a concept (pooled across all ngrams) results in a poor model 2. Therefore we train a separate DPM for each ngram where the visual variance is constrained.
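The image filtering rules stated above (maximum size of 500 pixels, aspect ratio between 0.4 and 2.5) can be sketched on image dimensions alone. The sketch assumes that aspect ratio means width divided by height and that the 500-pixel limit applies to the longer side; the function names are hypothetical, and near-duplicate removal is omitted.

```python
MAX_SIDE = 500                      # maximum size stated in the text
MIN_ASPECT, MAX_ASPECT = 0.4, 2.5   # aspect-ratio limits stated in the text


def keep_image(width, height):
    """Reject images whose aspect ratio (assumed to be width / height) is extreme."""
    aspect = width / height
    return MIN_ASPECT <= aspect <= MAX_ASPECT


def resized_dimensions(width, height, max_side=MAX_SIDE):
    """Shrink so the longer side is at most max_side, preserving the aspect ratio."""
    scale = min(1.0, max_side / max(width, height))
    return round(width * scale), round(height * scale)


downloads = [(1000, 750), (1000, 300), (300, 1000), (400, 300)]
kept = [resized_dimensions(w, h) for w, h in downloads if keep_image(w, h)]
```

Images that are already smaller than the limit are left at their original size rather than upscaled.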

The method of [36] initializes the DPM with the full image as the bounding box. We found that this initialization often leads to the bounding box getting stuck to the image boundary during the latent reclustering step 3. To circumvent this problem, we initialize our bounding box to a sub-image within the image that ignores the image boundaries. Using this initialization also avoids the two-stage training procedure used in [36], where latent root placements are identified and cropped in the first stage, for training the DPM in the second stage.

Similar to [18], [36] also initialized their components using the aspect-ratio heuristic. This is sub-optimal in the weakly supervised setting as image aspect-ratio is a poor heuristic for clustering object instances. To address this limitation, we initialize the model using feature space clustering as proposed in [13]. While our ngram vocabulary helps segregate the major appearance variations of a concept, the downloaded images per superngram still have some remaining appearance variations. For example, the 'jumping horse' ngram has images of horses jumping in different orientations. To deal with such appearance variations, we use a mixture of components initialized with feature space clustering. In the presence of noisy web images, this procedure provides a robust initialization. Some of the mixture components act as noise sinks, thereby allowing cleaner models to be learned [51]. In our experiments, we typically found 70% of the components per ngram to act as noise sinks. It is wasteful to train a full parts-based model for such noisy components. Therefore, we first train root filters for each component and subsequently prune the noisy ones.

Pruning noisy components: To prune noisy components, we run each component detector on its own validation set and evaluate its performance. Given that the positive instances within the validation set for each ngram have neither the ground-truth bounding boxes nor the component labels, we treat this task as a latent image classification problem. Specifically, we first run the ngram mixture-of-components detector on its full validation set (held-out positive images as well as a random pool of background images). We then record the top detection for each image and use the component label of that detection to segregate the images. We now have a segregated pool of validation images per ngram component. In the absence of ground-truth boxes, we assume the top detections of positive images are true and those of negative images are false, and therefore compute the average precision (A.P.) by only using the detection scores (ignoring overlap). We declare a component to be noisy either if its A.P. is below a threshold (10%) or if its training or validation data has too few (< 5) positive instances. The second condition helps us discard exemplar components that overfit to incidental images. While a root-only filter model is relatively weak compared to the parts model, we found that it does an effective job here for pruning noisy components.

Merging pruned components: Some of the components across the different ngram detectors end up learning the same visual concept. For example, a subset of 'hunter horse' instances are quite similar to a subset of 'jumping horse' instances. The merging step in section 3.2 considered a monolithic classifier trained with full image features.

Footnote (beginning not extracted; please refer to original document): ... value by preferring a larger box that includes the image boundary during the reclustering step.

As the mixture-of-components models are more refined (by way of localizing instances using detection windows), they can identify subtle similarities that cannot be found at the full image level. To pick a representative subset of the components and merge similar ones, we follow a procedure similar to the one outlined in section 3.2. Specifically, we represent the space of all ngram components by a graph $G = \{V, E\}$, where each node represents a component and each edge represents the visual similarity between them. The score $d_i$ for each node now corresponds to the quality of the component. We set it to the A.P. of the component (computed during the above pruning step). The weight on each edge $e_{i,j}$ is defined similarly as the median rank obtained by running the $j$th component detector on the $i$th component validation set. (We continue to use only the top detection score per image, assuming top detections on positives are true and on negatives are false.) We solve for the same objective function as outlined in equation 1 to select the representative subset. We found this subset selection step results in roughly 50% fewer components. The final number of components averages to about 250 per concept. Figure 3 shows some of our discovered components.
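The component pruning rule from this section (A.P. below 10%, or fewer than 5 positive instances) can be sketched as below. The input format, one `(component_id, score, is_positive)` tuple per validation image for its top detection, is an assumption made for illustration. The sketch uses a simple non-interpolated A.P. and only checks the validation-set count, whereas the text also checks the training set.

```python
from collections import defaultdict

AP_THRESHOLD = 0.10   # A.P. threshold stated in the text
MIN_POSITIVES = 5     # components with fewer positive instances are discarded


def average_precision(scores, labels):
    """Non-interpolated A.P. computed from detection scores only (overlap is ignored)."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def noisy_components(top_detections, n_components):
    """Return the ids of components that fail the A.P. or positive-count test."""
    by_component = defaultdict(list)
    for component, score, is_positive in top_detections:
        # Segregate validation images by the component that fired on them.
        by_component[component].append((score, is_positive))
    noisy = set()
    for c in range(n_components):
        detections = by_component[c]
        n_pos = sum(1 for _, is_positive in detections if is_positive)
        ap = average_precision([s for s, _ in detections], [p for _, p in detections])
        if n_pos < MIN_POSITIVES or ap < AP_THRESHOLD:
            noisy.add(c)
    return noisy


top_detections = (
    [(0, 0.9 - 0.01 * k, True) for k in range(6)]        # component 0: 6 confident positives
    + [(0, 0.10, False), (0, 0.05, False)]               # and 2 low-scoring backgrounds
    + [(1, 0.8, True), (1, 0.7, True), (1, 0.6, False)]  # component 1: too few positives
)
noisy = noisy_components(top_detections, n_components=3)
```

Here component 2 never produces a top detection, so it is discarded along with component 1, which has only two positive instances.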