
Learning Object Detection from Captions via Textual Scene Attributes



Object detection is a fundamental task in computer vision, requiring large annotated datasets that are difficult to collect, as annotators need to label objects and their bounding boxes. Thus, it is a significant challenge to use cheaper forms of supervision effectively. Recent work has begun to explore image captions as a source for weak supervision, but to date, in the context of object detection, captions have only been used to infer the categories of the objects in the image. In this work, we argue that captions contain much richer information about the image, including attributes of objects and their relations. Namely, the text represents a scene of the image, as described recently in the literature. We present a method that uses the attributes in this "textual scene graph" to train object detectors. We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets, outperforming recent approaches.


Object detection is one of the key tasks in computer vision. It requires detecting the bounding boxes of objects in a given image and identifying the category of each one. While object detection models have many real-life applications, they often also serve as a component in higher-level machine vision systems such as Image Captioning, Visual Question Answering (Anderson et al. 2018), Grounding of Referring Expressions (Hu et al. 2017), Scene Graph Generation (Xu et al. 2017; Zellers et al. 2017), Densely Packed Scenes Detection (Goldman et al. 2019), Video Understanding (Zhou et al. 2019; Herzig et al. 2019; Materzynska et al. 2020), and many more.

The simplest way to train an object detection system is via supervised learning on a dataset that contains images along with annotated bounding boxes for objects and their correct visual categories. However, collecting such data is time-consuming and costly, thus limiting the size of the resulting datasets. An alternative is to use weaker forms of supervision. The most common instance of this approach is the problem of Weakly Supervised Object Detection (WSOD), where images are only annotated with the set of object labels that appear in them, but without annotated bounding boxes. Such datasets are of course easier to collect (e.g., from images collected from social media, paired with their user-provided hashtags (Mahajan et al. 2018)), and thus much research has been devoted to designing methods that learn detection models in this setting. However, WSOD remains an open problem, as indicated by the large performance gap of >50% between WSOD (Singh, Najibi, and Davis 2018) and fully supervised detection approaches (Ren et al. 2020) on the PASCAL VOC detection benchmark (Everingham et al. 2010).

Figure 1: An illustration of our novel scene graph refinement process. The model makes use of the “black” attribute to localize the “laptop” object in the image at train time. This will result in improved object detection accuracy at test time.

An alternative, and potentially rich, source of weak supervision is image captions, namely textual descriptions of images that are fairly easy to collect from the web. The potential of captions for learning detectors was recently highlighted by Ye et al. (2019), who improve the extraction of object pseudo-labels from the captions.

In this work, we argue that captions contain much richer information than has been used so far, because a caption tells us more about an image than just the objects it contains. It can reveal the attributes of the objects in the image (e.g., a blue hat) and their relations (e.g., dog on chair). In the machine vision literature, such a description of objects, their attributes and relations is known as a scene graph (SG) (Johnson et al. 2015b) , and these have become central to many machine vision tasks (Johnson et al. 2015a,b; Xu et al. 2017; Liao et al. 2016; Zellers et al. 2017; Herzig et al. 2018) . This suggests that captions can be used to extract part of the scene graph of the image they accompany.

Knowing the scene graph of an image provides valuable information that can be used as weak supervision. To understand why, consider an image with two fruits that are hard to identify, and the caption "a red apple next to a pear". Since it is relatively easy to recognize a red object, we can use this knowledge to identify that the red fruit should have the "apple" label. An illustration of such possible usage of visual attributes in the object classification process, taken from the COCO Captions dataset (Chen et al. 2015), is shown in Figure 1. We propose a learning procedure that captures this intuition by extracting "textual scene graphs" from captions, and using them within a weak supervision framework to learn object detectors. Our approach uses a novel notion of an entanglement loss that weakly constrains visual objects to have certain visual attributes corresponding to those describing them in the caption. Empirical evaluation shows that our model achieves significantly superior results over baseline models across multiple datasets and benchmarks.

Our contributions are thus: (1) We introduce a novel approach that aligns the structured representation of captions and images. (2) We propose a novel architecture with an entanglement loss that uses textual SGs to enforce constraints on the visual SG prediction. (3) We demonstrate our approach and architecture on the challenging MS COCO and Pascal VOC 2007 object detection datasets, leading to state-of-the-art results and outperforming prior methods by a significant gap.

Related Work

Weakly Supervised Object Detection. WSOD is a specific task out of a broader class of problems, named Multiple Instance Learning (MIL). In MIL problems, instead of receiving a set of individually labeled instances, the learner receives a set of labeled bags, each containing many instances (Dietterich, Lathrop, and Lozano-Pérez 1997) . MIL is a valuable formalization for problems with a high input complexity and weak forms of supervision, such as medical images processing (Quellec et al. 2017) , action recognition in videos (Ali and Shah 2008) and sound event detection (Kumar and Raj 2016) . The motivation behind the MIL formalization stems from the fact that the correct label for the input bag can be predicted from a single instance. For example, in the WSOD task, an object label can be inferred from the specific image patch in which the object appears. This is referred to as the standard multiple instance (SMI) assumption (Amores 2013), i.e., every positive bag contains at least one positive instance, while in every negative bag all of the instances are negative.

Recent works such as Oquab et al. (2015) and Zhou et al. (2016) use the MIL formalization and propose a new Global Max (or Average) Pooling layer to learn the activation maps for object classes. Moreover, Bilen and Vedaldi (2016) introduced Weakly-Supervised Deep Detection Networks (WS-DDN), which use two different data streams for detection and classification, where the detection stream is used to weigh the classification predictions.

The works by Akbari et al. (2019) ; Gupta et al. (2020) aim to tackle the problem of Weakly Supervised Phrase Grounding, which also requires finding an alignment between image regions and caption words. However, as their task objective is to find relevant image regions rather than to train an object detector, the image captions are given as inputs also at test time, and the task does not aim to correctly identify all of the existing objects in an image but only those that are present in its caption. Moreover, their task setting assumes the existence of a pretrained Faster-RCNN object detector for the region proposal extraction, which is not allowed in our setting. Lately, the novel task of learning object detection directly from image captions was introduced by Ye et al. (2019) . Their work addresses the same task as ours, but as they focus on achieving better object pseudo-labels from the image captions, we show that using the captions data more efficiently and providing the model with better image understanding abilities is a more important direction. Furthermore, Ye et al. (2019) use additional supervision in the form of pairs which is costly to collect, and here we show that our use of captions obviates the need for this additional supervision.

Models for Images and Text. Even though the problem of modeling the relationship between images and text has attracted many research efforts throughout the years, the task of training an object detector from image captions is relatively novel. Recently, there has been a surge of works trying to build a unified general-purpose framework to tackle tasks that involve both visual and textual inputs (Lu et al. 2019; Su et al. 2019; Tan and Bansal 2019; Chen et al. 2019) . These works take large-scale parallel visual and textual data, pass it through an attention-based model to get contextualized representations, and apply a reconstruction loss to predict masked out data tokens (self-supervision). The resulting models were proven to achieve state-of-the-art results for various visual-linguistic tasks via transfer learning. This is related to recent advances in self-supervision in natural language processing, which train a transferable general-purpose language model with a similar masking objective (Devlin et al. 2018) . However, while these works are useful for scenarios that do not require prediction of the alignment between the visual and textual data, our objective is to explicitly classify the objects in the input image. Moreover, as we aim to train an object detector which naturally does not receive any textual input, we cannot use a model that requires both visual and textual inputs.

Scene Graphs. The machine vision community has been using scene graphs of images for representation of information in visual scenes. SGs are used in various computer vision tasks including image retrieval (Johnson et al. 2015b; Schuster et al. 2015) , relationship modeling (Krishna et al. 2018; Raboh et al. 2020; Schroeder, Tripathi, and Tang 2019) , image captioning (Xu et al. 2019) , image generation (Herzig et al. 2020; Johnson, Gupta, and Fei-Fei 2018) , and recently even video generation (Bar et al. 2020) .

Textual scene graph prediction is the problem of predicting SG representations from image captions. While the problem of generating semantically meaningful directed graphs from a sentence is well known in NLP research as Dependency Parsing, in the SG context the only three meaningful word classifications are objects, attributes and relations.

Figure 2: An illustration of our Textual Label Extraction (TLE) module. Given an image and its caption, we apply exact string matching to detect objects, and use a text-to-scene-graph model to generate a scene graph from the captions and aggregate object and attribute pairs.

Early approaches to this problem use a dependency parser as a basis for the SG prediction (Schuster et al. 2015; Wang et al. 2018). Recently, Andrews, Chia, and Witteveen (2019) proposed to train a transformer model on a parallel dataset of image region descriptions and scene graphs, taken from the Visual Genome dataset (Krishna et al. 2017).

The SG2Det Model

We next describe our approach for using image captions to learn an object detector. We emphasize that image captions are our only source of supervision, and no manually annotated bounding boxes or even ground-truth image-level categories are used. We refer to our model as SG2Det since it uses textual scene graphs for learning object detectors.

Image captions can provide rich and informative data about visual scenes. Namely, in addition to providing the categories of objects in the image, the captions can suggest the relations between the objects, their positions within the image and even their visual attributes. In this work, we claim that by aligning the scene graph structure as extracted from the captions to the different image regions, we can provide the model with an improved understanding of the visual scenes and obtain superior object detection results.

The key element of our approach is an "entanglement loss" that aligns the visual attributes in the image with those described in the text. As an example, consider the image caption "a red stop sign is glowing against the dark sky". Instead of extracting only the "stop sign" object pseudo-label and discarding the rest of the caption information, we propose to use the "red" attribute to enhance our supervision. Namely, instead of training the model to find the image region that is the most probable to be a "stop sign", our training objective is now to find the one that is the most probable to be both a "stop sign" and "red". Technically, this is achieved by multiplying object and attribute probabilities, as described later in the subsection on the Attribute Entanglement Loss.

Our proposed SG2Det model is, therefore, composed of the following three components:

1. The Textual Label Extraction (TLE) module, which extracts the object pseudo-labels and the scene graph information from the text captions.

2. The Visual Scores Extraction module, which extracts region proposals from the image and computes object category and attribute scores for each of them.

3. The Attribute Entanglement Loss, which enforces agreement between the textual and visual representations of the image.

Textual Label Extraction

In this module, we extract the object pseudo-labels and the scene graph data from the image captions. Figure 2 provides a high-level illustration of this module. The object category labels can be extracted in several different ways (e.g., string matching, synonym dictionary, a trained text classifier, etc.) as explored by Ye et al. (2019), but we choose simple string matching to highlight the usefulness of our novel loss. In addition to the object category labels, we also extract a textual scene graph representation for each of the captions, and use the pairs aggregated from them within the entanglement loss described later. For SG extraction, we use an off-the-shelf textual scene graph parser based on Schuster et al. (2015). We also experimented with scene graphs extracted by Wang et al. (2018), but found these to achieve inferior results. We choose to split the object attributes into categories (e.g., color, shape, material, etc.) based on a categorization taken from the GQA dataset (Hudson and Manning 2019). This dataset contains visual questions that are automatically generated using scene graphs. By looking at the semantic representation (logical form) of the questions, we can derive the 22 different attribute categories that were used by the authors for dataset generation. Some of these attribute categories are general (e.g., color, size), while others are specific to certain object classes (e.g., shape, pose, sportActivity).
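
The string-matching step can be illustrated with a minimal sketch. The function name `extract_object_labels`, the category list, and the use of word boundaries are all assumptions made for illustration; the text above only states that simple string matching is used.

```python
import re

def extract_object_labels(caption, categories):
    """Return the categories whose name appears verbatim in the caption."""
    text = caption.lower()
    found = set()
    for name in categories:
        # Word boundaries (an assumption) keep "cat" from matching "category".
        if re.search(r"\b" + re.escape(name.lower()) + r"\b", text):
            found.add(name)
    return found

# Hypothetical category vocabulary and caption.
categories = ["stop sign", "cat", "dog", "laptop"]
caption = "A red stop sign is glowing against the dark sky"
print(extract_object_labels(caption, categories))  # {'stop sign'}
```

A matcher this strict misses plurals and synonyms (e.g., "cats", "kitten"), which is the kind of gap the synonym-dictionary and text-classifier alternatives mentioned above would address.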

Figure 3: An overview of our attribute entanglement loss. First, region proposals (bounding boxes) are extracted from the image, and convolutional features are calculated for each. The image captions are used to extract a list of object and attribute pairs. The scores $s_{i,c}$ and $s_{i,a,v}$ are calculated for the object categories and attributes, and they are used within the entanglement loss that captures the product scores of the object-attribute pairs.

The advantage of this categorization is that it provides the model with additional knowledge. Unlike object labels, each region proposal can have more than one attribute. For example, a cat can be both large and brown. Thus, we cannot model the attributes prediction as a multi-class problem. By using categories, we enforce mutual exclusivity within each category, preventing the model, for example, from predicting that an object is both black and white; while allowing multiple labels across different categories.
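
One way to realize mutual exclusivity within a category while allowing several labels across categories is to apply an independent softmax per attribute category. The sketch below assumes this design; the vocabulary `ATTRIBUTE_VALUES` and the logits are hypothetical.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attribute vocabulary, grouped by category.
ATTRIBUTE_VALUES = {
    "color": ["black", "white", "brown"],
    "size": ["small", "large"],
}

def attribute_scores(logits_by_category):
    """Apply an independent softmax inside each attribute category."""
    return {
        category: dict(zip(ATTRIBUTE_VALUES[category], softmax(logits)))
        for category, logits in logits_by_category.items()
    }

# Hypothetical logits for a single region proposal.
region_logits = {"color": [0.2, -1.0, 2.5], "size": [-0.5, 1.5]}
scores = attribute_scores(region_logits)
best = {category: max(vals, key=vals.get) for category, vals in scores.items()}
print(best)  # {'color': 'brown', 'size': 'large'}
```

Because the scores within "color" sum to one, the region cannot be confidently both black and white, yet it can still be brown and large at the same time.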

The output of this stage for each image is as follows:

• A set O of object categories. For example O = {cat, dog} indicates that the text describes a cat and a dog.

• For each $o \in O$ we have a set $A_o$ containing attribute-value pairs $(a, v) \in A_o$. Thus, $A_{\text{cat}} = \{(\text{color}, \text{brown}), (\text{size}, \text{large})\}$ indicates that the cat is brown and large.
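
The two outputs above can be assembled as in the following sketch. It assumes the scene graph parser has already produced (object, attribute) pairs; `build_labels` and the lookup table `ATTRIBUTE_CATEGORY` are hypothetical names standing in for the GQA-based categorization.

```python
from collections import defaultdict

# Hypothetical lookup from attribute value to its category.
ATTRIBUTE_CATEGORY = {"brown": "color", "black": "color", "large": "size"}

def build_labels(object_labels, sg_pairs):
    """Build O and A_o from matched objects and parsed (object, attribute) pairs."""
    O = set(object_labels)
    A = defaultdict(set)
    for obj, attr in sg_pairs:
        category = ATTRIBUTE_CATEGORY.get(attr)
        # Keep a pair only if the object is a known pseudo-label and the
        # attribute belongs to a known category.
        if obj in O and category is not None:
            A[obj].add((category, attr))
    return O, dict(A)

O, A = build_labels(
    {"cat", "dog"},
    [("cat", "brown"), ("cat", "large"), ("sofa", "black")],
)
print(sorted(O))         # ['cat', 'dog']
print(sorted(A["cat"]))  # [('color', 'brown'), ('size', 'large')]
```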

Visual Scores Extraction

We now consider the input image and extract bounding boxes from it. Then, fully connected (FC) classifiers are applied to these bounding boxes, resulting in model scores for object categories and attributes. Below we elaborate on this process.

First, we generate object region proposals $r_1, \dots, r_m$ using the Selective Search algorithm (Uijlings et al. 2013), and compute a convolutional feature map for each input image by feeding it to a convolutional network pre-trained on ImageNet (Deng et al. 2009). Importantly, the convolutional backbone is not trained on an object detection dataset since our objective is to learn detection from the captions only. Then, we apply a ROIAlign layer (He et al. 2017) for cropping the proposals to fixed-sized convolutional feature maps. Finally, a box feature extractor is applied to extract a fixed-length descriptor $\phi(r_i)$ for each proposal $r_i$.

Denote by $C$ the number of different object classes and by $m$ the number of different regions. We now apply an FC classifier followed by a softmax layer to obtain an object score $s_{i,c}$ for each $i \in \{1, \dots, m\}$ and $c \in \{1, \dots, C + 1\}$, where class $C + 1$ is the background class. Similarly, for each region $i$, attribute type $a$ and attribute value $v$, we obtain a score $s_{i,a,v}$ via an FC prediction head for attribute $a$.

The output of this stage is as follows:

• A score $s_{i,c}$ for each bounding box (region) $i$ and category value $c$.

• A score $s_{i,a,v}$ for each bounding box (region) $i$, attribute type $a$ and attribute value $v$. E.g., $a$ can be "shape" and $v$ can be "rectangular".

The Attribute Entanglement Loss

The core element of our approach is a loss that enforces agreement between the textual and visual representations of the image. To achieve this, we adapt the MIL approach to also capture the attribute information. We begin by describing the standard loss used in MIL for the case where only object category information is available in both the textual and visual descriptions. Namely, assume that from the text side we only have the set of object categories $O$, and from the image side we only have the model scores $s_{i,c}$ for all categories and bounding boxes. The intuition in this case is that each category in $O$ should have at least one bounding box describing it. Thus, if $\text{cat} \in O$ then $s_{i,\text{cat}}$ should be high for some $i$. This motivates the use of the following loss:

EQUATION (1): Not extracted; please refer to original document.
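
Because Eq. (1) is missing from this copy, the following is only a plausible reconstruction and not the authors' verified formula. It is inferred from the description above (each category in $O$ should be scored highly by at least one box) and from Eq. (2) with the attribute term dropped. The name $L_{\text{obj}}$ is taken from the Other Losses subsection, and the $1/|O|$ normalization is an assumption carried over from Eq. (2):

$$L_{\text{obj}} = -\frac{1}{|O|} \sum_{c \in O} \max_i \log s_{i,c}$$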

However, in our case we wish to go beyond object information and use attributes. Namely, if we have $\text{cat} \in O$ and $(\text{color}, \text{brown}) \in A_{\text{cat}}$, we would like some bounding box to both contain a cat and have the attribute brown. That is, there should be a box $i$ where both $s_{i,\text{cat}}$ and $s_{i,\text{color},\text{brown}}$ are high. The use of "and" here is important, since a violation of either of these conditions would imply that this box does not contain the brown cat. The following entanglement loss precisely captures the intuition that attributes and categories should be dependent:

$$L_{\text{entang}} = -\frac{1}{|O|} \sum_{c \in O,\ (a,v) \in A_c} \max_i \left\{ \log \left( s_{i,c} \cdot s_{i,a,v} \right) \right\} \tag{2}$$
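
Eq. (2) can be read off directly as code. The sketch below is a plain-Python illustration, not the authors' implementation; the data layout (one dictionary of scores per region) and the toy numbers are assumptions.

```python
import math

def entanglement_loss(obj_scores, attr_scores, O, A):
    """Eq. (2): for every (object, attribute) pair from the caption, take the
    region whose joint log-score is highest, then normalize by |O|.

    obj_scores[i][c]        -> s_{i,c}
    attr_scores[i][(a, v)]  -> s_{i,a,v}
    """
    total = 0.0
    for c in O:
        for (a, v) in A.get(c, ()):
            total += max(
                math.log(obj_scores[i][c] * attr_scores[i][(a, v)])
                for i in range(len(obj_scores))
            )
    return -total / len(O)

# Toy scores for two regions (hypothetical numbers).
obj_scores = [{"cat": 0.9}, {"cat": 0.6}]
attr_scores = [{("color", "brown"): 0.1}, {("color", "brown"): 0.8}]
loss = entanglement_loss(
    obj_scores, attr_scores, {"cat"}, {"cat": {("color", "brown")}}
)
print(round(loss, 4))  # 0.734
```

In this toy case the second region is selected even though the first has the higher "cat" score, because only the second is also likely to be brown.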

We note that although the log in the above can be written as $\log(s_{i,c}) + \log(s_{i,a,v})$, this objective is very different from using two objectives like (1), one for attributes and one for objects. This is because the maximum over $i$ is applied to the log, and thus if one of $s_{i,c}$ or $s_{i,a,v}$ is very low, the bounding box $i$ will not be the maximizer. We conclude by emphasizing that the whole training procedure operates without any explicit supervision of bounding boxes. Despite this, we shall see that the model succeeds in learning both detection (i.e., finding the right bounding boxes) as well as classification for both object categories and attributes.
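
To make the difference concrete, consider a toy case (the numbers are illustrative assumptions, not taken from the paper) with two boxes, where box 1 has $s_{1,c} = 0.9$, $s_{1,a,v} = 0.1$ and box 2 has $s_{2,c} = 0.6$, $s_{2,a,v} = 0.8$. In general,

$$\max_i \left[ \log s_{i,c} + \log s_{i,a,v} \right] \le \max_i \log s_{i,c} + \max_i \log s_{i,a,v},$$

with equality only when a single box attains both maxima. In the toy case the joint term gives

$$\max \{ \log(0.9 \cdot 0.1),\ \log(0.6 \cdot 0.8) \} = \log 0.48 \approx -0.73,$$

which is attained by box 2, while the two separate terms give

$$\log 0.9 + \log 0.8 = \log 0.72 \approx -0.33,$$

which is attained by two different boxes and therefore says nothing about whether any single box is both the object and the attribute.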

Other Losses

In addition to the losses $L_{\text{entang}}$ and $L_{\text{obj}}$, we use a loss that promotes high scores for categories in $O$ and low scores for categories outside $O$. This is referred to by Tang et al. (2017) as the Multiple Instance Detection (MID) loss. We now pass our region descriptors through two parallel FC layers to get two different $(m \times C)$-dimensional matrices. Then, we pass one of the matrices through a sigmoid function and the other through a softmax layer over the different regions, and multiply them element-wise to get a score for each pair. We denote these scores by $s^{\text{mid}}_{i,c}$ for each $i \in \{1, \dots, m\}$ and $c \in \{1, \dots, C\}$. Note that we do not add a "background" class here since the MID loss cannot propagate gradients for this class. Next, we aggregate the scores $s^{\text{mid}}_{i,c}$ from all different boxes to a single image-level soft-binary score $\hat{y}_c \in [0, 1]$ as follows:

EQUATION (3): Not extracted; please refer to original document.

where $\sigma$ is the sigmoid function. Now, we consider the binary cross entropy between $\hat{y}_c$ and the indicator corresponding to $O$. This results in the following loss term:

$$L_{\text{mid}} = -\sum_{c=1}^{C} \Big( \mathbb{I}[c \in O] \log \hat{y}_c + \mathbb{I}[c \notin O] \log (1 - \hat{y}_c) \Big) \tag{4}$$
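
The MID branch can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' code: since Eq. (3) is not available in this copy, the `aggregate` function assumes that $\hat{y}_c$ is the sigmoid of the per-class sum of $s^{\text{mid}}_{i,c}$ over regions, and all function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mid_scores(logits_a, logits_b):
    """s_mid[i][c] = sigmoid(logits_a[i][c]) * softmax over regions of logits_b[:, c]."""
    m, C = len(logits_a), len(logits_a[0])
    scores = [[0.0] * C for _ in range(m)]
    for c in range(C):
        column = [logits_b[i][c] for i in range(m)]
        top = max(column)
        exps = [math.exp(x - top) for x in column]
        z = sum(exps)
        for i in range(m):
            scores[i][c] = sigmoid(logits_a[i][c]) * exps[i] / z
    return scores

def aggregate(scores):
    """ASSUMPTION standing in for Eq. (3): sigmoid of the per-class sum over regions."""
    C = len(scores[0])
    return [sigmoid(sum(row[c] for row in scores)) for c in range(C)]

def mid_loss(y_hat, present):
    """Eq. (4): binary cross entropy; present[c] is True when class c is in O."""
    return -sum(
        math.log(y) if p else math.log(1.0 - y)
        for y, p in zip(y_hat, present)
    )

# Two regions, two classes, hypothetical logits from the two FC layers.
s_mid = mid_scores([[2.0, -1.0], [0.5, 0.0]], [[1.0, 0.0], [-1.0, 0.0]])
y_hat = aggregate(s_mid)
print(mid_loss(y_hat, [True, False]))
```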

Our overall loss is thus:

$$L_{\text{total}} = L_{\text{mid}} + \lambda_1 L_{\text{obj}} + \lambda_2 L_{\text{entang}} \tag{5}$$

Online Refinement

So far we assumed that for each category $c \in O$ only one bounding box contains the object pseudo-label (namely, the one that maximizes $s_{i,c}$). In practice, there could be other bounding boxes that highly overlap with the maximizing one, and should therefore be included in the loss. This intuition was used by the Online Instance Classifier Refinement (OICR) method introduced in Tang et al. (2017), and also in Ye et al. (2019). In order to provide a fair comparison to Ye et al. (2019) we also use OICR here. The OICR method uses $K$ different score functions $s^k_{i,c}$ for $k \in \{1, \dots, K\}$ and corresponding losses $L^k$. The loss $L^k$ is similar to Eq. (1) but with two differences: it uses all boxes that sufficiently overlap the maximizing box, and the max operator is applied to the scores from the previous OICR step. This provides a refinement process where each FC classifier uses the output of its predecessor as an instance-level supervision. The first score function $s^0_{i,c}$ is obtained by applying a softmax function over the $s^{\text{mid}}_{i,c}$ scores from the MID step.

Here, we extend OICR to also use attribute scores $s^k_{i,a,v}$ and similarly extend the loss in Eq. (2). Note that we do not apply an MID loss to the attributes as we do for objects (i.e., aggregating attribute scores from all regions, and comparing to the attribute pseudo-labels extracted from the captions) since we found this to harm the performance of our model. We hypothesize that this is because the assumption that a label is present in the caption if and only if it is present in the image is improbable for attributes, as image caption data is often sparse with attribute annotations. Thus, we do not have initial MID scores for the attributes as we do for the objects. Because of that, at the first OICR iteration we only train the attribute classifiers using the object MID scores, and we do not apply the entanglement loss. E.g., if the image contains a "brown cat", we use the box that maximizes the "cat" probability as a supervision for the color classifier with the label "brown".
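
The box-selection rule at the heart of each refinement step can be sketched as follows. The helper names and the overlap threshold of 0.5 are assumptions for illustration; the text only requires that the selected boxes "sufficiently overlap" the box that maximized the previous step's score.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def refinement_targets(boxes, prev_scores, threshold=0.5):
    """Indices of boxes that overlap the previous step's top-scoring box."""
    top = max(range(len(boxes)), key=lambda i: prev_scores[i])
    return [i for i, box in enumerate(boxes) if iou(box, boxes[top]) >= threshold]

# Hypothetical proposals and previous-step scores for one category c.
boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
prev_scores = [0.7, 0.2, 0.1]
print(refinement_targets(boxes, prev_scores))  # [0, 1]
```

The second box has a low score of its own but is still treated as a positive for step $k$ because it overlaps the top-scoring box from step $k - 1$.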


Figure 4: Qualitative examples for our objects and attributes predictions. It can be seen that our attribute classifiers provide meaningful classification results across different categories, which validates the success of our attribute classification objective.

In this section, we show both qualitative and quantitative results for our SG2Det model. We show that compared to