Learning Object Detection from Captions via Textual Scene Attributes

Achiya Jerbi
Roei Herzig
Jonathan Berant
Gal Chechik
A. Globerson
ArXiv
2020

Abstract

Object detection is a fundamental task in computer vision, requiring large annotated datasets that are difficult to collect, as annotators need to label objects and their bounding boxes. Thus, it is a significant challenge to use cheaper forms of supervision effectively. Recent work has begun to explore image captions as a source for weak supervision, but to date, in the context of object detection, captions have only been used to infer the categories of the objects in the image. In this work, we argue that captions contain much richer information about the image, including attributes of objects and their relations. Namely, the text represents a scene of the image, as described recently in the literature. We present a method that uses the attributes in this "textual scene graph" to train object detectors. We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets, outperforming recent approaches.

Introduction

Object detection is one of the key tasks in computer vision. It requires detecting the bounding boxes of objects in a given image and identifying the category of each one. While object detection models have many real-life applications, they often also serve as a component in higher-level machine vision systems such as Image Captioning, Visual Question Answering (Anderson et al. 2018), Grounding of Referring Expressions (Hu et al. 2017), Scene Graph Generation (Xu et al. 2017; Zellers et al. 2017), Densely Packed Scenes Detection (Goldman et al. 2019), Video Understanding (Zhou et al. 2019; Herzig et al. 2019; Materzynska et al. 2020), and many more.

The simplest way to train an object detection system is via supervised learning on a dataset that contains images along with annotated bounding boxes for objects and their correct visual categories. However, collecting such data is time-consuming and costly, thus limiting the size of the resulting datasets. An alternative is to use weaker forms of supervision. The most common instance of this approach is the problem of Weakly Supervised Object Detection (WSOD), where images are only annotated with the set of object labels that appear in them, but without annotated bounding boxes. Such datasets are of course easier to collect (e.g., from images collected from social media, paired with their user-provided hashtags (Mahajan et al. 2018)), and thus much research has been devoted to designing methods that learn detection models in this setting. However, WSOD remains an open problem, as indicated by the large performance gap of >50% between WSOD (Singh, Najibi, and Davis 2018) and fully supervised detection approaches (Ren et al. 2020) on the PASCAL VOC detection benchmark (Everingham et al. 2010).

Figure 1: An illustration of our novel scene graph refinement process. The model makes use of the “black” attribute to localize the “laptop” object in the image at train time. This will result in improved object detection accuracy at test time.

An alternative, and potentially rich, source of weak supervision is image captions, namely textual descriptions of images, which are fairly easy to collect from the web. The potential of captions for learning detectors was recently highlighted by Ye et al. (2019), who improve the extraction of object pseudo-labels from the captions.

In this work, we argue that captions contain much richer information than has been used so far, because a caption tells us more about an image than just the objects it contains. It can reveal the attributes of the objects in the image (e.g., a blue hat) and their relations (e.g., dog on chair). In the machine vision literature, such a description of objects, their attributes and relations is known as a scene graph (SG) (Johnson et al. 2015b), and these have become central to many machine vision tasks (Johnson et al. 2015a,b; Xu et al. 2017; Liao et al. 2016; Zellers et al. 2017; Herzig et al. 2018). This suggests that captions can be used to extract part of the scene graph of the image they accompany.

Knowing the scene graph of an image provides valuable information that can be used as weak supervision. To understand why, consider an image with two fruits that are hard to identify, and the caption "a red apple next to a pear". Since it is relatively easy to recognize a red object, we can use this knowledge to identify that the red fruit should have the "apple" label. An illustration of such possible usage of visual attributes in the object classification process, taken from the COCO Captions dataset (Chen et al. 2015), is shown in Figure 1. We propose a learning procedure that captures this intuition by extracting "textual scene graphs" from captions, and using them within a weak supervision framework to learn object detectors. Our approach uses a novel notion of an entanglement loss that weakly constrains visual objects to have certain visual attributes corresponding to those describing them in the caption. Empirical evaluation shows that our model achieves significantly superior results over baseline models across multiple datasets and benchmarks.

Our contributions are thus: (1) We introduce a novel approach that aligns the structured representation of captions and images. (2) We propose a novel architecture with an entanglement loss that uses textual SGs to enforce constraints on the visual SG prediction. (3) We demonstrate our approach and architecture on the challenging MS COCO and Pascal VOC 2007 object detection datasets, leading to state-of-the-art results and outperforming prior methods by a significant gap.

Related Work

Weakly Supervised Object Detection. WSOD is a specific task within a broader class of problems named Multiple Instance Learning (MIL). In MIL problems, instead of receiving a set of individually labeled instances, the learner receives a set of labeled bags, each containing many instances (Dietterich, Lathrop, and Lozano-Pérez 1997). MIL is a valuable formalization for problems with a high input complexity and weak forms of supervision, such as medical image processing (Quellec et al. 2017), action recognition in videos (Ali and Shah 2008) and sound event detection (Kumar and Raj 2016). The motivation behind the MIL formalization stems from the fact that the correct label for the input bag can be predicted from a single instance. For example, in the WSOD task, an object label can be inferred from the specific image patch in which the object appears. This is referred to as the standard multiple instance (SMI) assumption (Amores 2013), i.e., every positive bag contains at least one positive instance, while in every negative bag all of the instances are negative.

Recent works such as Oquab et al. (2015) and Zhou et al. (2016) use the MIL formalization and propose a new Global Max (or Average) Pooling layer to learn the activation maps for object classes. Moreover, Bilen and Vedaldi (2016) introduced Weakly-Supervised Deep Detection Networks (WS-DDN), which use two different data streams for detection and classification, where the detection stream is used to weigh the classification predictions.

The works by Akbari et al. (2019) and Gupta et al. (2020) aim to tackle the problem of Weakly Supervised Phrase Grounding, which also requires finding an alignment between image regions and caption words. However, as their task objective is to find relevant image regions rather than to train an object detector, the image captions are given as inputs also at test time, and the task does not aim to correctly identify all of the existing objects in an image but only those that are present in its caption. Moreover, their task setting assumes the existence of a pretrained Faster-RCNN object detector for the region proposal extraction, which is not allowed in our setting. Lately, the novel task of learning object detection directly from image captions was introduced by Ye et al. (2019). Their work addresses the same task as ours, but while they focus on achieving better object pseudo-labels from the image captions, we show that using the caption data more efficiently and providing the model with better image understanding abilities is a more important direction. Furthermore, Ye et al. (2019) use additional supervision in the form of paired image captions and object annotations, which is costly to collect, and here we show that our use of captions obviates the need for this additional supervision.

Models for Images and Text. Even though the problem of modeling the relationship between images and text has attracted many research efforts throughout the years, the task of training an object detector from image captions is relatively novel. Recently, there has been a surge of works trying to build a unified general-purpose framework to tackle tasks that involve both visual and textual inputs (Lu et al. 2019; Su et al. 2019; Tan and Bansal 2019; Chen et al. 2019). These works take large-scale parallel visual and textual data, pass it through an attention-based model to get contextualized representations, and apply a reconstruction loss to predict masked-out data tokens (self-supervision). The resulting models were shown to achieve state-of-the-art results for various visual-linguistic tasks via transfer learning. This is related to recent advances in self-supervision in natural language processing, which train a transferable general-purpose language model with a similar masking objective (Devlin et al. 2018). However, while these works are useful for scenarios that do not require prediction of the alignment between the visual and textual data, our objective is to explicitly classify the objects in the input image. Moreover, as we aim to train an object detector which naturally does not receive any textual input, we cannot use a model that requires both visual and textual inputs.

Scene Graphs. The machine vision community has been using scene graphs of images for representation of information in visual scenes. SGs are used in various computer vision tasks including image retrieval (Johnson et al. 2015b; Schuster et al. 2015), relationship modeling (Krishna et al. 2018; Raboh et al. 2020; Schroeder, Tripathi, and Tang 2019), image captioning (Xu et al. 2019), image generation (Herzig et al. 2020; Johnson, Gupta, and Fei-Fei 2018), and recently even video generation (Bar et al. 2020).

Textual scene graph prediction is the problem of predicting SG representations from image captions. While the problem of generating semantically meaningful directed graphs from a sentence is well known in NLP research as Dependency Parsing, in the SG context the only three meaningful word classifications are objects, attributes and relations.

Figure 2: An illustration of our Textual Label Extraction (TLE) module. Given an image and its caption, we apply exact string matching to detect objects, and use a text-to-scene-graph model to generate a scene graph from the captions and aggregate object and attribute pairs.

Early approaches to this problem use a dependency parser as a basis for the SG prediction (Schuster et al. 2015; Wang et al. 2018). Recently, Andrews, Chia, and Witteveen (2019) proposed to train a transformer model on a parallel dataset of image region descriptions and scene graphs, taken from the Visual Genome dataset (Krishna et al. 2017).

The SG2Det Model

We next describe our approach for using image captions to learn an object detector. We emphasize that image captions are our only source of supervision, and no manually annotated bounding boxes or even ground-truth image-level categories are used. We refer to our model as SG2Det since it uses textual scene graphs for learning object detectors.

Image captions can provide rich and informative data about visual scenes. Namely, in addition to providing the categories of objects in the image, the captions can suggest the relations between the objects, their positions within the image and even their visual attributes. In this work, we claim that by aligning the scene graph structure as extracted from the captions to the different image regions, we can provide the model with an improved understanding of the visual scenes and obtain superior object detection results.

The key element of our approach is an "entanglement loss" that aligns the visual attributes in the image with those described in the text. As an example, consider the image caption "a red stop sign is glowing against the dark sky". Instead of extracting only the "stop sign" object pseudo-label and discarding the rest of the caption information, we propose to use the "red" attribute to enhance our supervision. Namely, instead of training the model to find the image region that is the most probable to be a "stop sign", our training objective is now to find the one that is the most probable to be both a "stop sign" and "red". Technically, this is achieved by multiplying object and attribute probabilities, as described later in The Attribute Entanglement Loss subsection.

Our proposed SG2Det model is, therefore, composed of the following three components:

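1. The Textual Label Extraction (TLE) module, which extracts the object pseudo-labels and the scene graph information from the text captions.

2. The Visual Scores Extraction (VSE) module, which finds bounding box proposals and outputs logits for the categories and attributes of these boxes.

3. The Attribute Entanglement Loss (AEL) module, which enforces agreement between the textual and visual representations.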

Textual Label Extraction

In this module, we extract the object pseudo-labels and the scene graph data from the image captions. Figure 2 provides a high-level illustration of this module. The object category labels can be extracted in several different ways (e.g., string matching, a synonym dictionary, a trained text classifier, etc.), as explored by Ye et al. (2019), but we choose simple string matching to highlight the usefulness of our novel loss. In addition to the object category labels, we also extract a textual scene graph representation for each of the captions, and use the object-attribute pairs aggregated from them within the entanglement loss described later. For SG extraction, we use an off-the-shelf textual scene graph parser based on Schuster et al. (2015). We also experimented with scene graphs extracted by Wang et al. (2018), but found these to achieve inferior results. We choose to split the object attributes into categories (e.g., color, shape, material, etc.) based on a categorization taken from the GQA dataset (Hudson and Manning 2019). This dataset contains visual questions that are automatically generated using scene graphs. By looking at the semantic representation (logical form) of the questions, we can derive the 22 different attribute categories that were used by the authors for dataset generation. Some of these attribute categories are general (e.g., color, size), while others are specific to certain object classes (e.g., shape, pose, sportActivity).

Figure 3: An overview of our attribute entanglement loss. First, region proposals (bounding boxes) are extracted from the image, and convolutional features are calculated for each. The image captions are used to extract a list of object and attribute pairs. The scores $s_{i,c}$ and $s_{i,a,v}$ are calculated for the object categories and attributes, and they are used within the entanglement loss that captures the product scores of the object-attribute pairs.

The advantage of this categorization is that it provides the model with additional knowledge. Unlike object labels, each region proposal can have more than one attribute. For example, a cat can be both large and brown. Thus, we cannot model attribute prediction as a single multi-class problem. By using categories, we enforce mutual exclusivity within each category (preventing the model, for example, from predicting that an object is both black and white) while allowing multiple labels across different categories.

The output of this stage for each image is as follows:

• A set $O$ of object categories. For example, $O = \{\text{cat}, \text{dog}\}$ indicates that the text describes a cat and a dog.

• For each $o \in O$, a set $A_o$ containing attribute-value pairs $(a, v) \in A_o$. Thus, $A_{\text{cat}} = \{(\text{color}, \text{brown}), (\text{size}, \text{large})\}$ indicates that the cat is brown and large.
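The output format above can be illustrated with a minimal Python sketch. It assumes a tiny hypothetical object vocabulary and attribute-to-category table (the paper uses the COCO object classes and the GQA attribute categories), and it takes the (object, attribute) pairs as given, standing in for the output of an external text-to-scene-graph parser that is not implemented here. Matching is done on single tokens only, so multi-word labels such as "stop sign" are not handled.

```python
# Hypothetical vocabularies for illustration only.
OBJECT_VOCAB = {"cat", "dog", "laptop"}
ATTRIBUTE_CATEGORY = {
    "brown": "color", "black": "color",
    "large": "size", "small": "size",
}


def extract_labels(caption, parsed_pairs):
    """Return (O, A) for one caption.

    caption: raw caption string.
    parsed_pairs: (object, attribute) pairs, assumed to come from an
        external text-to-scene-graph parser.
    """
    tokens = caption.lower().replace(".", " ").replace(",", " ").split()
    # Exact string matching against the object vocabulary.
    objects = {tok for tok in tokens if tok in OBJECT_VOCAB}
    # One (possibly empty) set of (category, value) pairs per object.
    attributes = {obj: set() for obj in objects}
    for obj, attr in parsed_pairs:
        category = ATTRIBUTE_CATEGORY.get(attr)
        if obj in objects and category is not None:
            attributes[obj].add((category, attr))
    return objects, attributes


O, A = extract_labels(
    "A large brown cat sits next to a dog.",
    [("cat", "large"), ("cat", "brown")],
)
print(sorted(O))
print(sorted(A["cat"]))
```

Objects without any parsed attribute (the dog in this caption) keep an empty attribute set, so they contribute only an object pseudo-label.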

Visual Scores Extraction

We now consider the input image and extract bounding boxes from it. Then, fully connected (FC) classifiers are applied to these bounding boxes, resulting in model scores for object categories and attributes. Below we elaborate on this process.

First, we generate object region proposals $r_1, \dots, r_m$ using the Selective Search algorithm (Uijlings et al. 2013), and compute a convolutional feature map for each input image by feeding it to a convolutional network pre-trained on ImageNet (Deng et al. 2009). Importantly, the convolutional backbone is not trained on an object detection dataset since our objective is to learn detection from the captions only. Then, we apply a ROIAlign layer (He et al. 2017) for cropping the proposals to fixed-sized convolutional feature maps. Finally, a box feature extractor is applied to extract a fixed-length descriptor $\phi(r_i)$ for each proposal $r_i$.

Denote by $C$ the number of different object classes and by $m$ the number of different regions. We now apply an FC classifier followed by a softmax layer to obtain an object score $s_{i,c}$ for each $i \in \{1, \dots, m\}$ and $c \in \{1, \dots, C+1\}$, where $C+1$ is the background class. Similarly, for each region $i$, attribute type $a$ and attribute value $v$, we obtain a score $s_{i,a,v}$ via an FC prediction head for attribute $a$.

The output of this stage is as follows:

• A score $s_{i,c}$ for each bounding box (region) $i$ and category value $c$.

• A score $s_{i,a,v}$ for each bounding box (region) $i$, attribute type $a$ and attribute value $v$. E.g., $a$ can be "shape" and $v$ can be "rectangular".
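As a rough illustration of these two outputs for a single region, the following sketch applies one softmax over the object logits and a separate softmax inside each attribute category. Normalizing the attribute scores with a per-category softmax is an assumption made here to reflect the mutual exclusivity within categories described earlier; the logits are taken as given rather than produced by real FC heads, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def region_scores(object_logits, attribute_logits):
    """Scores for one region.

    object_logits: list of C + 1 logits (last entry is background).
    attribute_logits: dict mapping category -> {value: logit}.
    Returns (s_obj, s_attr) with s_attr[category][value] normalized
    independently within each category.
    """
    s_obj = softmax(object_logits)
    s_attr = {}
    for category, value_logits in attribute_logits.items():
        values = list(value_logits)
        probs = softmax([value_logits[v] for v in values])
        s_attr[category] = dict(zip(values, probs))
    return s_obj, s_attr


s_obj, s_attr = region_scores(
    [2.0, 0.5, -1.0],
    {"color": {"black": 1.0, "white": -1.0},
     "size": {"large": 0.3, "small": 0.3}},
)
```

Because each category has its own normalization, a region can score highly for both "black" and "large", but not for both "black" and "white".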

The Attribute Entanglement Loss

The core element of our approach is a loss that enforces agreement between the textual and visual representations of the image. To achieve this, we adapt the MIL approach to also capture the attribute information. We begin by describing the standard loss used in MIL for the case where only object category information is available in both the textual and visual descriptions. Namely, assume that from the text side we only have the set of object categories $O$, and from the image side we only have the model scores $s_{i,c}$ for all categories and bounding boxes. The intuition in this case is that each category in $O$ should have at least one bounding box describing it. Thus, if $\text{cat} \in O$ then $s_{i,\text{cat}}$ should be high for some $i$. This motivates the use of the following loss:

$$L_{obj} = -\frac{1}{|O|} \sum_{c \in O} \max_i \log s_{i,c} \tag{1}$$

However, in our case we wish to go beyond object information and use attributes. Namely, if we have $\text{cat} \in O$ and $(\text{color}, \text{brown}) \in A_{\text{cat}}$, we would like some bounding box to both contain a cat and have the attribute brown. That is, there should be a box $i$ where both $s_{i,\text{cat}}$ and $s_{i,\text{color},\text{brown}}$ are high. The use of "and" here is important, since a violation of either of these conditions would imply that this box does not contain the brown cat. The following entanglement loss precisely captures the intuition that attributes and categories should be dependent:

$$L_{entang} = -\frac{1}{|O|} \sum_{c \in O,\ (a,v) \in A_c} \max_i \left\{ \log \left( s_{i,c} \cdot s_{i,a,v} \right) \right\} \tag{2}$$

We note that although the log in the above can be written as $\log(s_{i,c}) + \log(s_{i,a,v})$, this objective is very different from using two objectives like (1), one for attributes and one for objects. This is because the maximum over $i$ is applied to the log, and thus if one of $s_{i,c}$ or $s_{i,a,v}$ is very low, the bounding box $i$ will not be the maximizer. We conclude by emphasizing that the whole training procedure operates without any explicit supervision of bounding boxes. Despite this, we shall see that the model succeeds in learning both detection (i.e., finding the right bounding boxes) as well as classification for both object categories and attributes.
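The difference between a joint maximum and two separate maxima can be seen in a small Python sketch of Eq. (2). The data layout (one dictionary of scores per region) and the function names are illustrative assumptions, not the authors' implementation, and all scores are assumed to be strictly positive, as softmax outputs are.

```python
import math


def best_region(s_obj, s_attr, c, a, v):
    """Index of the region maximizing log(s_{i,c} * s_{i,a,v})."""
    return max(
        range(len(s_obj)),
        key=lambda i: math.log(s_obj[i][c] * s_attr[i][(a, v)]),
    )


def entanglement_loss(s_obj, s_attr, objects, attributes):
    """Eq. (2): s_obj[i][c] and s_attr[i][(a, v)] are scores of region i.

    objects: the set O; attributes: dict c -> set of (a, v) pairs.
    """
    total = 0.0
    for c in objects:
        for (a, v) in attributes.get(c, ()):
            i = best_region(s_obj, s_attr, c, a, v)
            total += math.log(s_obj[i][c] * s_attr[i][(a, v)])
    return -total / len(objects)


# Region 0 looks most like a cat but is not brown;
# region 1 is a plausible cat and clearly brown.
s_obj = [{"cat": 0.9}, {"cat": 0.6}]
s_attr = [{("color", "brown"): 0.1}, {("color", "brown"): 0.8}]

chosen = best_region(s_obj, s_attr, "cat", "color", "brown")
loss = entanglement_loss(
    s_obj, s_attr, {"cat"}, {"cat": {("color", "brown")}}
)
```

With separate maxima, the object term would select region 0 and the attribute term region 1, so no single box would be pushed to be a brown cat. The joint maximum selects region 1, where both scores are reasonably high.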

Other Losses

In addition to the losses $L_{entang}$ and $L_{obj}$, we use a loss that promotes high scores for categories in $O$ and low scores for categories outside $O$. This is referred to by Tang et al. (2017) as the Multiple Instance Detection (MID) loss. We now pass our region descriptors through two parallel FC layers to get two different $(m \times C)$-dimensional matrices. Then, we pass one of the matrices through a sigmoid function and the other through a softmax layer over the different regions, and multiply them element-wise to get a score for each (region, category) pair. We denote these scores by $s^{mid}_{i,c}$ for each $i \in \{1, \dots, m\}$ and $c \in \{1, \dots, C\}$. Note that we do not add a "background" class here since the MID loss cannot propagate gradients for this class. Next, we aggregate the scores $s^{mid}_{i,c}$ from all different boxes to a single image-level soft-binary score $\hat{y}_c \in [0, 1]$ as follows:

$$\hat{y}_c = \sigma\left( \sum_{i=1}^{m} s^{mid}_{i,c} \right) \tag{3}$$

where $\sigma$ is the sigmoid function. Now, we consider the binary cross entropy between $\hat{y}_c$ and the indicator corresponding to $O$. This results in the following loss term:

$$L_{mid} = -\sum_{c=1}^{C} \Big( \mathbb{I}[c \in O] \log \hat{y}_c + \mathbb{I}[c \notin O] \log \left(1 - \hat{y}_c\right) \Big) \tag{4}$$

Our overall loss is thus:

$$L_{total} = L_{mid} + \lambda_1 L_{obj} + \lambda_2 L_{entang} \tag{5}$$
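Eq. (4) and Eq. (5) can be sketched in a few lines of Python. The image-level scores are taken as given here, the clamping constant `eps` is an added numerical safeguard rather than part of the paper, and the default weights are the values reported later in the Implementation Details subsection.

```python
import math


def mid_loss(y_hat, present, eps=1e-12):
    """Binary cross entropy of Eq. (4).

    y_hat: list of C image-level scores in [0, 1].
    present: set of class indices that belong to O.
    eps: clamp to avoid log(0); an assumption for numerical safety.
    """
    loss = 0.0
    for c, y in enumerate(y_hat):
        y = min(max(y, eps), 1.0 - eps)
        loss -= math.log(y) if c in present else math.log(1.0 - y)
    return loss


def total_loss(l_mid, l_obj, l_entang, lambda_1=0.5, lambda_2=1e-2):
    """Weighted sum of Eq. (5)."""
    return l_mid + lambda_1 * l_obj + lambda_2 * l_entang


# Class 0 is mentioned in the caption, class 1 is not.
example = mid_loss([0.9, 0.2], present={0})
```

The loss is small when mentioned classes receive high image-level scores and unmentioned classes receive low ones, which is the behaviour the MID term is meant to encourage.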

Online Refinement

So far we assumed that for each category $c \in O$ only one bounding box contains the object pseudo-label (namely, the one that maximizes $s_{i,c}$). In practice, there could be other bounding boxes that highly overlap with the maximizing one, and should therefore be included in the loss. This intuition was used by the Online Instance Classifier Refinement (OICR) method introduced in Tang et al. (2017), and also in Ye et al. (2019). In order to provide a fair comparison to Ye et al. (2019), we also use OICR here. The OICR method uses $K$ different score functions $s^k_{i,c}$ for $k \in \{1, \dots, K\}$ and $K$ corresponding losses $L^k$. The loss $L^k$ is similar to Eq. (1) but with two differences: it uses all boxes that sufficiently overlap the maximizing box, and the max operator is applied to the scores from the previous OICR step. This provides a refinement process where each FC classifier uses the output of its predecessor as instance-level supervision. The first score function $s^0_{i,c}$ is obtained by applying a softmax function over the $s^{mid}_{i,c}$ scores from the MID step. Here, we extend OICR to also use attribute scores $s^k_{i,a,v}$ and similarly extend the loss in Eq. (2). Note that we do not apply an MID loss to the attributes as we do for objects (i.e., aggregating attribute scores from all regions, and comparing to the attribute pseudo-labels extracted from the captions) since we found this to harm the performance of our model. We hypothesize that this is because the assumption that a label is present in the caption if and only if it is present in the image is improbable for attributes, as image caption data is often sparse in attribute annotations. Thus, we do not have initial MID scores for the attributes as we do for the objects. Because of that, at the first OICR iteration we only train the attribute classifiers using the object MID scores, and we do not apply the entanglement loss. E.g., if the image contains a "brown cat", we use the box that maximizes the "cat" probability as supervision for the color classifier with the label "brown".
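The instance-level supervision used in one refinement step can be sketched as follows. This is a simplified illustration rather than the OICR implementation: the overlap threshold of 0.5 is an assumption (the text only says "sufficiently overlap"), boxes are `(x1, y1, x2, y2)` tuples, and when a box overlaps the top boxes of two classes the later class simply overwrites the earlier one.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def refine_labels(boxes, prev_scores, objects, overlap=0.5):
    """Pseudo-labels for step k from the scores of step k - 1.

    prev_scores[i][c]: score of class c for box i at the previous step.
    Returns a dict mapping box index -> class for supervised boxes.
    """
    labels = {}
    for c in objects:
        # Box that the previous classifier considers most likely to be c.
        top = max(range(len(boxes)), key=lambda i: prev_scores[i][c])
        # Every box that overlaps it enough inherits the same label.
        for i, box in enumerate(boxes):
            if iou(box, boxes[top]) >= overlap:
                labels[i] = c
    return labels


boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
prev_scores = [{"cat": 0.9}, {"cat": 0.5}, {"cat": 0.1}]
labels = refine_labels(boxes, prev_scores, ["cat"])
```

In this toy example the second box overlaps the top-scoring box with an IoU of 0.81 and is therefore also labeled "cat", while the distant third box receives no label.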

Experiments

In this section, we show both qualitative and quantitative results for our SG2Det model. We show that compared to prior work, the additional scene graph information we extract from the image captions is indeed helpful, and provides significantly better object detection results on all of the benchmarks we evaluate on, without using any additional training data. Our model achieves state-of-the-art results on the COCO detection test-dev set and the PASCAL VOC 2007 detection test set, when training on multiple captions datasets. Specifically, when training on COCO captions, we achieve results on the PASCAL VOC 2007 test set that are comparable to those of a state-of-the-art WSOD model trained on COCO ground-truth labels.

Figure 4: Qualitative examples for our object and attribute predictions. It can be seen that our attribute classifiers provide meaningful classification results across different categories, which validates the success of our attribute classification objective.

Implementation Details

To the best of our knowledge, the only existing work to tackle the problem of training an object detector from image captions is Ye et al. (2019). Therefore, to ensure a fair comparison between our works, we use the same algorithm and configuration for proposal box extraction and the same convolutional backbone and feature layers, and our model is based on their official paper implementation (https://github.com/yekeren/Cap2Det). Specifically, we use the Selective Search algorithm (Van de Sande et al. 2011), taken from the OpenCV library, to extract (at most) 500 proposals for each image. We compute the region descriptor vectors by using the (Conv2d1a7x7 to Mixed4e) layers from InceptionV2 (Szegedy et al. 2016) for extracting the convolutional feature maps from the images. In addition, we use the (Mixed5a to Mixed5c) layers in the same model to extract the region descriptors after the ROIAlign (He et al. 2017) operation. Finally, the convolutional backbone network was pre-trained on ImageNet (Deng et al. 2009).

We use the AdaGrad optimizer with a learning rate of 0.01 and set the batch size to 2. For training data augmentation, we randomly flip the image left to right and resize each image randomly to one of the four scales $s \in \{400, 600, 800, 1200\}$. We set the number of OICR iterations to $K = 3$, as we found this to yield the best performance. We use non-maximum suppression (NMS) at the post-processing stage with an intersection-over-union (IoU) threshold of 0.4. We follow Ye et al. (2019) by weighing the $L_{obj}$ term by $\lambda_1 = 0.5$. For the $L_{entang}$ term we experiment with different values in $\{10^{-2}, 3 \times 10^{-3}, 10^{-3}\}$ and find $\lambda_2 = 10^{-2}$ to yield the best results.

Unlike Ye et al. (2019) , we found that the performance of our SG2Det model continues to improve until 1M training steps when training on COCO Captions (∼17 epochs) and 300K steps for Flickr30K (∼19 epochs). We hypothesize that this is a result of the more complex optimization objective of our model. We pick the best model based on the validation set for each dataset. Our models are trained on a single Titan XP GPU for 7 days when training on COCO, and 2 days when training on Flickr30K.

As we experienced instability in the results when training the same model several times with different random seeds, all of the caption model results we report were obtained by training the model 3 times and choosing the best run based on the validation set. Therefore, for almost all of the baseline models we report improved results over what is reported by Ye et al. (2019), and for some of the models the improvement is significant.

Table 1: Average precision on the VOC 2007 test set (learning from ground-truth annotations, COCO and Flickr30K captions). We train the detection models using the 80 COCO objects, but evaluate on only the overlapping 20 VOC objects.

Datasets

For training the SG2Det model, we use two popular image captions datasets: COCO Captions (Chen et al. 2015) and Flickr30K (Young et al. 2014). For training on the COCO Captions dataset, we use 118,287 images, each paired with five different human-generated captions, which sum up to 591,435 captions in total. For training on the Flickr30K dataset, we use 31,783 images, each paired with five different human-generated captions, which sum up to 158,915 captions in total. When evaluating on the VOC 2007 dataset, we use the train and validation sets for validation and report our results on the test set. As the object label vocabulary differs between the COCO and VOC datasets, when evaluating on VOC we use only the twenty overlapping VOC objects. We report the mAP@0.5 for each of the object labels and its mean across the labels for this dataset. When evaluating on COCO, we use the val2017 set for validation and test our model by submitting our object detection predictions to the COCO test-dev2017 evaluation server (https://competitions.codalab.org/competitions/20794). We report the metrics provided by the server, where mAP@0.5:0.95 is the primary evaluation metric of the dataset.

Table 2: COCO test-dev mean average precision results when training on COCO captions. These numbers are computed by submitting our detection results to the COCO evaluation server. The best method is shown in bold.

Models

On all different benchmarks, we report our results for the following three models:

1. The EXACTMATCH (EM) baseline model proposed by Ye et al. (2019). This model performs a simple string matching to extract object pseudo-labels from the captions.

2. The EM + TEXTCLSF model. This is the best model reported by Ye et al. (2019). This model performs object pseudo-label extraction by training a text classifier on additional data of parallel image captions and object annotation pairs, taken from the COCO detection dataset.

3. Our novel EM + SG LOSS model. Our model also performs exact string matching for object label extraction, but additionally applies our novel SG entanglement loss. Note that our model is identical to EXACTMATCH when weighting the SG entanglement loss by 0.

Results

Figure 4 shows qualitative examples for predictions of our SG2Det model. For consistency, we only visualize objects with detection confidence > 5%, and attribute categories that are meaningful across different object classes. We can see that the model identifies most of the objects and their attributes correctly. This further validates our claim that our SG loss gives the model better scene understanding abilities, which in turn allow it to obtain improved object detection results. This figure also shows the quality of the attribute classifiers we obtain as a by-product of our training process.

Table 1 shows the results of our models on the VOC 2007 test set. At the top of the table, we show the results of models trained using the gold object annotations on the COCO and VOC datasets (without bounding box annotations), as reported by Ye et al. (2019). These results can be viewed as an upper bound for what is achievable by training from image captions using our method, since we use a weaker form of supervision.

In the middle part of the table, we report the VOC 2007 test results for models that were trained on COCO captions. Our novel EM + SG LOSS model achieves state-of-the-art results on this dataset. It is worth noting that our model trained only on COCO Captions (45.9 mAP) achieves results comparable to the WSOD model trained on the ground-truth COCO labels (46.3 mAP). This implies that our SG loss utilizes the image captions with close-to-maximal efficiency.

At the bottom of the table, we report the VOC 2007 test results for models trained on captions from the Flickr30K dataset. As before, our model achieves state-of-the-art results, which validates the contribution of our novel SG loss. Table 2 shows the results of our model on the COCO test-dev dataset. The main metric for evaluation for this dataset is mAP@0.5:0.95, which is reported in the leftmost column. Our novel EM + SG LOSS model achieves state-of-the-art results on this benchmark too.

To summarize, both our work and Ye et al. (2019) seek to improve over the simple EXACTMATCH baseline. Our method uses contextual scene understanding for this purpose, while the EM + TEXTCLSF method focuses on achieving better object pseudo-labels. When training on COCO Captions and predicting on VOC, our model's gain over the baseline is almost double that of the text classifier model. When training on Flickr30K and evaluating on VOC, or training on COCO Captions and evaluating on COCO test-dev, our model's gain is about 4 times that of the text classifier, without using any additional training data. This validates our hypothesis that better scene understanding is a more critical factor for WSOD models than extraction of better pseudo-labels.

Discussion

We present a novel weakly supervised object detection approach that uses only image captions as supervision. Unlike previous approaches to this problem, we make use of the rich information available in the text in the form of visual attribute descriptions. We propose a novel entanglement loss that captures the coupling between the objects and attributes.

Our evaluation on the COCO and VOC datasets demonstrates state-of-the-art results, using less supervision than the previous caption-based method of Ye et al. (2019). Moreover, it shows the power of using grounding information from the text when analyzing images. Here our focus was on attributes only, although textual scene relations can also be explored. These are more technically challenging to handle since the number of potential relations is quadratic in the number of region proposals. However, these can be pruned in different ways. Finally, here we used a fixed pretrained text-to-scene-graph model. An exciting question is how to learn this jointly with the detection model. We leave these directions for future work.
