Abstract
Reasoning about complex visual scenes involves perception of entities and their relations. Scene Graphs (SGs) provide a natural representation for reasoning tasks, by assigning labels to both entities (nodes) and relations (edges). Reasoning systems based on SGs are typically trained in a two-step procedure: first, a model is trained to predict SGs from images, and then a separate model is trained to reason based on the predicted SGs. However, it would seem preferable to train such systems in an end-to-end manner. The challenge, which we address here, is that scene-graph representations are non-differentiable and therefore it isn't clear how to use them as intermediate components. Here we propose Differentiable Scene Graphs (DSGs), an image representation that is amenable to differentiable end-to-end optimization, and requires supervision only from the downstream tasks. DSGs provide a dense representation for all regions and pairs of regions, and do not spend modelling capacity on regions of the image that do not contain objects or relations of interest. We evaluate our model on the challenging task of identifying referring relationships (RR) in three benchmark datasets: Visual Genome, VRD and CLEVR. Using DSGs as an intermediate representation leads to new state-of-the-art performance. The full code is available at https://github.com/shikorab/DSG.
1. Introduction
Understanding the full semantics of rich visual scenes is a complex task that involves detecting individual entities, as well as reasoning about the joint combination of the entities and the relations between them. To represent entities and their relations jointly, it is natural to view them as a graph, where nodes are entities and edges represent relations. Such representations are often called Scene Graphs (SGs) [16]. Because SGs allow explicit reasoning about images, substantial efforts have been made to infer them from raw images [15, 16, 40, 26, 46, 11, 48].

Figure 1. Differentiable Scene Graphs: an intermediate "graph-like" representation that provides a distributed representation for each entity and pair of entities in an image. Differentiable scene graphs can be learned with gradient descent in an end-to-end manner, using only supervision about a downstream visual reasoning task (referring relationships here).
While scene graphs have been shown to be useful for various tasks [15, 16, 13] , using them as a component in a visual reasoning system is challenging: (a) Because scene graphs are discrete and non-differentiable, it is difficult to learn them end-to-end from a downstream task. (b) The alternative is to pre-train SG predictors separately from supervised data, but this requires arduous and prohibitive manual annotation. Moreover, pre-trained SG predictors have low coverage, because the set of labels they are pre-trained on rarely fits the needs of a downstream task. For example, given an image of a parade and a question "point to the officer on the black horse", that horse might not be a node in the graph, and the term "officer" might not be in the vocabulary. Given these limitations, it is an open question how to make scene graphs useful for visual reasoning applications.
In this work, we describe Differentiable Scene-Graphs (DSG), which address the above challenges (Figure 1). DSGs are an intermediate representation trained end-to-end using the supervision of a downstream reasoning task. The key idea is to relax the discrete properties of scene graphs such that each entity and relation is described with a dense differentiable descriptor.
We demonstrate the benefits of DSGs in the task of resolving referring relationships (RR) [21] (see Figure 1). Here, given an image and a triplet query ⟨subject, relation, object⟩, a model has to find the bounding boxes of the subject and object that participate in the relation.
We train an RR model with DSGs as an intermediate component. As such, DSGs are not trained with direct supervision about entities and relations, but using several supervision signals about the downstream RR task. We evaluate our approach on three standard RR datasets: Visual Genome [22] , VRD [27] and CLEVR [14] , and find that DSGs substantially improve performance compared to state-of-the-art approaches [27, 21] .
To conclude, our novel contributions are: (1) A new Differentiable Scene-Graph representation for visual reasoning, which captures information about multiple entities in an image and their relations. We describe how DSGs can be trained end-to-end with a downstream visual reasoning task without direct supervision of pre-collected scene-graphs.
(2) A new architecture for the task of referring relationships, using a DSG as its central component. (3) New state-of-the-art results on the task of referring relationships on the Visual Genome, VRD and CLEVR datasets.
2. Referring Relationship: The Learning Setup
In the referring relationship task [21] we are given an image $I$ and a subject-relation-object query $q = \langle s, r, o \rangle$. The goal is to output a bounding box $B^s$ for the subject, and another bounding box $B^o$ for the object. In practice, there may be several correct boxes for each. See Fig. 1 for a sample query and expected output.
Following [21], we focus on training a referring relationship predictor from labeled data. Namely, we use a training set consisting of images, queries and the correct boxes for these queries, denoted by $\{(I_j, q_j, (B^s_j, B^o_j))\}_{j=1}^{N}$. As in [21], we assume that the vocabulary of query components (subject, object and relation) is fixed.
In our model, we break this task into two components that we optimize in parallel. We fine-tune the positions of bounding boxes such that they cover entities tightly, and we also label each of these boxes with one of four possible labels. The labels "Subject" and "Object" disambiguate between the 's' and 'o' entities in the query. The label "Other" refers to boxes corresponding to additional entities not mentioned in the query, and the label "Background" refers to cases where the box does not describe an entity. We refer to these two optimization goals as the Box Refiner and the Referring Relationships Classifier.
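To make the two objectives concrete, here is a minimal PyTorch-style sketch of the corresponding output heads; the descriptor dimension, layer sizes, and the absence of query conditioning are illustrative simplifications, not the values or design used in the paper.

```python
import torch
import torch.nn as nn

LABELS = ["Subject", "Object", "Other", "Background"]

class BoxRefiner(nn.Module):
    """Refines a proposal box given its per-box descriptor (assumed dimension `dim`)."""
    def __init__(self, dim):
        super().__init__()
        self.offset_head = nn.Linear(dim, 4)

    def forward(self, descriptors, boxes):
        # Predict coordinate offsets relative to the proposal boxes.
        return boxes + self.offset_head(descriptors)

class RRClassifier(nn.Module):
    """Classifies each box as Subject / Object / Other / Background."""
    def __init__(self, dim):
        super().__init__()
        self.label_head = nn.Linear(dim, len(LABELS))

    def forward(self, descriptors):
        return self.label_head(descriptors)  # logits of shape (num_boxes, 4)
```

In the full model the classification also depends on the query; that conditioning is omitted here for brevity.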
3. Differentiable Scene Graphs
We start by discussing the motivation and potential advantages of using intermediate scene-graph-like representations, as compared to standard scene graphs. Then, we explain how DSGs fit into the full architecture of our model.
3.1. Why Use Intermediate DSG Layers?
A scene graph (SG) represents entities and relations in an image as a set of nodes and edges. A "perfect" SG (representing all entities and relations) captures most of the information needed for visual reasoning, and thus should be useful as an intermediate representation: downstream reasoning algorithms could then take the predicted SG as an input. Unfortunately, learning to predict "perfect" scene graphs for any downstream task is unlikely to succeed, due to the aforementioned challenges: first, there is rarely enough data to train good SG predictors, and second, learning to predict SGs in a way that is independent of the downstream task tends to yield SGs that are less relevant to that task.
Instead, we propose an intermediate representation, termed a "Differentiable Scene Graph" layer (DSG), which captures the relational information as in a scene graph but can be trained end-to-end in a task-specific manner (Fig. 2). Like SGs, a DSG keeps descriptors for visual entities and their relations. Unlike SGs, whose nodes and edges are annotated with discrete values (labels), a DSG contains a dense distributed representation vector for each detected entity (termed a node descriptor) and each pair of entities (termed an edge descriptor). These representations are themselves learned functions of the input image, as we explain in the supplemental material. Like SGs, a DSG only describes candidate boxes that cover entities of interest and their relations. Unlike SGs, each DSG descriptor encompasses not only the local information about a node, but also information about its context. Most importantly, because DSGs are differentiable, they can be fed as input to downstream visual-reasoning modules, in our case a referring relationships module.
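As a rough illustration of the data structure this implies, the following sketch contrasts a DSG with a discrete SG; the descriptor dimension `d` and the dense N×N edge layout are assumptions made for illustration.

```python
from dataclasses import dataclass
import torch

@dataclass
class DifferentiableSceneGraph:
    boxes: torch.Tensor      # (N, 4) candidate boxes covering entities of interest
    node_desc: torch.Tensor  # (N, d) dense descriptor per detected entity
    edge_desc: torch.Tensor  # (N, N, d) dense descriptor per ordered pair of entities

# Unlike a discrete scene graph, whose nodes and edges carry hard labels,
# every field here is a tensor, so gradients from a downstream reasoning
# module can flow back through the whole representation.
N, d = 8, 256
dsg = DifferentiableSceneGraph(torch.rand(N, 4), torch.randn(N, d), torch.randn(N, N, d))
```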
DSGs provide several computational and modelling advantages:

Differentiability. Because node and edge descriptors are differentiable functions of detected boxes, and are fed into a differentiable reasoning module, the entire pipeline can be trained with gradient descent.

Dense descriptors. By keeping dense descriptors for nodes and edges, the DSG retains more information about the possible semantics of nodes and edges, instead of committing too early to hard sparse representations. This allows it to better fit downstream tasks.

Supervision using downstream tasks. Collecting supervised labels for training scene graphs is hard and costly. DSGs can be trained using training data that is available for downstream tasks, saving costly labeling efforts. With that said, when labeled scene graphs are available for given images, that data can be used when training the DSG, via an added loss component.

Holistic representation. DSG descriptors are computed by integrating global information from the entire image using graph neural networks (see supplemental materials). Combining information across the image increases the accuracy of object and relation descriptors.

Figure 2. The DSG is used both for refining the original box proposals and as input to a Referring Relationships Classifier, which classifies each bounding box proposal as either Subject, Object, Other or Background. The ground-truth label of a proposal box is Other if the proposal is involved in another query relationship over this image; otherwise its ground-truth label is Background.
3.2. The DSG Model For Referring Relationships
We now describe how DSGs can be combined with other modules to solve a visual reasoning task. The architecture of the model is illustrated in Fig. 2. First, the model extracts bounding boxes for entities and relations in the image. Next, it creates a differentiable scene graph over these bounding boxes. Then, the DSG features are used by two output modules, aimed at answering a referring-relationship query: a Box Refiner module that refines the bounding boxes of the relevant entities, and a Referring Relationships Classifier module that classifies each box as Subject, Object, Other or Background. We now describe these components in more detail.
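Before the component-by-component description, the following schematic sketch shows how the pieces compose in a forward pass; the module interfaces are assumptions introduced only for illustration, not the paper's exact signatures.

```python
def referring_relationships_forward(image, query, detector, dsg_generator,
                                    box_refiner, rr_classifier):
    # Object Detector: candidate boxes plus per-box and per-pair features.
    boxes, box_feats, pair_feats = detector(image)
    # Differentiable Scene Graph: context-aware node and edge descriptors.
    node_desc, edge_desc = dsg_generator(box_feats, pair_feats)
    # Box Refiner: tighter boxes for the detected entities.
    refined_boxes = box_refiner(node_desc, boxes)
    # Referring Relationships Classifier: Subject / Object / Other / Background per box.
    rr_logits = rr_classifier(node_desc, query)
    return refined_boxes, rr_logits
```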
Object Detector. We detect candidate entities using a standard region proposal network (RPN) [35], and denote their bounding boxes by $b_1, \ldots, b_B$ ($B$ may vary between images). We also extract a feature vector $f_i$ for each box and concatenate it with the box coordinates, yielding $z_i = [f_i; b_i]$. See details in the supplemental material.
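A minimal sketch of constructing the per-entity inputs $z_i = [f_i; b_i]$; the feature dimension below is an assumption, and in practice the per-box features would come from the detector's feature extractor.

```python
import torch

def build_entity_inputs(box_features: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
    """box_features: (B, F) per-box features; boxes: (B, 4) coordinates -> z of shape (B, F + 4)."""
    return torch.cat([box_features, boxes], dim=1)

# Example with made-up sizes: 10 proposals with 512-dimensional features.
z = build_entity_inputs(torch.randn(10, 512), torch.rand(10, 4))
```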
Relation Feature Extractor. Given any two bounding boxes $b_i$ and $b_j$, we consider the smallest box that contains both of them (their "union" box). We denote this "relation box" by $b_{i,j}$ and its features by $f_{i,j}$. Finally, we denote the concatenation of the features $f_{i,j}$ and the box coordinates $b_{i,j}$ by $z_{i,j} = [f_{i,j}; b_{i,j}]$.
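The union box is simple to compute for boxes in (x1, y1, x2, y2) format, as sketched below; extracting the relation features $f_{i,j}$ from that box (e.g. by pooling over the union region) is assumed to mirror the entity case and is omitted.

```python
import torch

def union_box(b_i: torch.Tensor, b_j: torch.Tensor) -> torch.Tensor:
    """Smallest box containing both b_i and b_j, each given as (x1, y1, x2, y2)."""
    x1 = torch.minimum(b_i[0], b_j[0])
    y1 = torch.minimum(b_i[1], b_j[1])
    x2 = torch.maximum(b_i[2], b_j[2])
    y2 = torch.maximum(b_i[3], b_j[3])
    return torch.stack([x1, y1, x2, y2])

def build_relation_input(f_ij: torch.Tensor, b_ij: torch.Tensor) -> torch.Tensor:
    """z_{i,j} = [f_{i,j}; b_{i,j}] for a single pair of boxes."""
    return torch.cat([f_ij, b_ij], dim=0)
```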
Differentiable Scene-Graph Generator. As discussed above, the goal of the DSG Generator is to transform the above features $z_i$ and $z_{i,j}$ into differentiable representations of the underlying scene graph, namely, to map these features into a new set of dense vectors $z'_i$ and $z'_{i,j}$ representing entities and relations. This mapping is intended to incorporate the relevant context of each feature vector: the representation $z'_i$ contains information about the $i$-th entity, together with its image-wide context.
There are various possible approaches to achieve this mapping. Here we use the model proposed by [11] , which uses a graph neural network for this transformation. See supplemental materials for details on this network.
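For intuition, the sketch below shows one generic message-passing layer over node and edge descriptors; it is not the architecture of [11], only an illustration of how $z_i$, $z_{i,j}$ can be mapped to context-aware $z'_i$, $z'_{i,j}$.

```python
import torch
import torch.nn as nn

class SceneGraphLayer(nn.Module):
    """One round of message passing between entity and relation descriptors."""
    def __init__(self, dim):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU())
        self.node_mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())

    def forward(self, z_nodes, z_edges):
        # z_nodes: (N, d); z_edges: (N, N, d)
        N, d = z_nodes.shape
        senders = z_nodes.unsqueeze(1).expand(N, N, d)
        receivers = z_nodes.unsqueeze(0).expand(N, N, d)
        # Update each edge descriptor from its two endpoints and its current value.
        z_edges_new = self.edge_mlp(torch.cat([senders, receivers, z_edges], dim=-1))
        # Aggregate incoming edge messages into a per-node context vector.
        context = z_edges_new.mean(dim=0)
        z_nodes_new = self.node_mlp(torch.cat([z_nodes, context], dim=-1))
        return z_nodes_new, z_edges_new
```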
Multi-task objective. In many domains, training with multi-task objectives can improve the accuracy of individual tasks, because auxiliary tasks operate as regularizers, pushing internal representations away from overfitting and towards capturing useful properties of the input. We follow this idea here and define a multi-task objective with three components: (a) a Referring Relationships Classifier that matches boxes to the subject and object query terms; (b) a Box Refiner that predicts accurate, tight bounding boxes; (c) a Box Labeler that recognizes the visual entities in boxes when relevant ground truth is available. We also fine-tune the RPN object detector that produces box proposals for our model. Fig. 3 illustrates the effect of the first two components and how they operate together to refine the bounding boxes and match them to the query terms. Specifically, Fig. 3c shows how box refinement produces boxes that are tight around objects and subjects, and Fig. 3d shows how RR classification matches boxes to query terms.
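A minimal sketch of how such loss terms could be combined; the specific loss functions (cross-entropy, smooth L1) and the weights are assumptions for illustration, not necessarily the paper's exact choices.

```python
import torch.nn.functional as F

def multi_task_loss(rr_logits, rr_targets,        # RR classification: (B, 4) logits, (B,) labels
                    refined_boxes, gt_boxes,      # box refinement: (B, 4) each
                    label_logits=None, label_targets=None,  # optional entity-label supervision
                    w_rr=1.0, w_box=1.0, w_label=1.0):
    loss = w_rr * F.cross_entropy(rr_logits, rr_targets)
    loss = loss + w_box * F.smooth_l1_loss(refined_boxes, gt_boxes)
    if label_logits is not None:  # only when ground-truth entity labels are available
        loss = loss + w_label * F.cross_entropy(label_logits, label_targets)
    return loss
```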