Abstract
Reasoning about complex visual scenes involves perception of entities and their relations. Scene Graphs (SGs) provide a natural representation for reasoning tasks, by assigning labels to both entities (nodes) and relations (edges). Reasoning systems based on SGs are typically trained in a two-step procedure: first, a model is trained to predict SGs from images, and then a separate model is trained to reason based on the predicted SGs. However, it would seem preferable to train such systems in an end-to-end manner. The challenge, which we address here, is that scene-graph representations are non-differentiable, and therefore it isn't clear how to use them as intermediate components. Here we propose Differentiable Scene Graphs (DSGs), an image representation that is amenable to differentiable end-to-end optimization and requires supervision only from the downstream tasks. DSGs provide a dense representation for all regions and pairs of regions, and do not spend modelling capacity on regions of the image that do not contain objects or relations of interest. We evaluate our model on the challenging task of identifying referring relationships (RR) in three benchmark datasets: Visual Genome, VRD and CLEVR. Using DSGs as an intermediate representation leads to new state-of-the-art performance. The full code is available at https://github.com/shikorab/DSG.
1. Introduction
Understanding the full semantics of rich visual scenes is a complex task that involves detecting individual entities, as well as reasoning about the joint combination of the entities and the relations between them. To represent entities and their relations jointly, it is natural to view them as a graph, where nodes are entities and edges represent relations. Such representations are often called Scene Graphs (SGs) [16]. Because SGs allow explicit reasoning about images, substantial efforts have been made to infer them from raw images [15, 16, 40, 26, 46, 11, 48].

Figure 1. Differentiable Scene Graphs: an intermediate "graph-like" representation that provides a distributed representation for each entity and pair of entities in an image. Differentiable scene graphs can be learned with gradient descent in an end-to-end manner, using only supervision about a downstream visual reasoning task (referring relationships here).
While scene graphs have been shown to be useful for various tasks [15, 16, 13], using them as a component in a visual reasoning system is challenging: (a) Because scene graphs are discrete and non-differentiable, it is difficult to learn them end-to-end from a downstream task. (b) The alternative is to pre-train SG predictors separately on supervised data, but this requires arduous and prohibitively expensive manual annotation. Moreover, pre-trained SG predictors have low coverage, because the set of labels they are pre-trained on rarely fits the needs of a downstream task. For example, given an image of a parade and a question "point to the officer on the black horse", that horse might not be a node in the graph, and the term "officer" might not be in the vocabulary. Given these limitations, it is an open question how to make scene graphs useful for visual reasoning applications.
In this work, we describe Differentiable Scene-Graphs (DSG), which address the above challenges (Figure 1). DSGs are an intermediate representation trained end-to-end from the supervision for a downstream reasoning task. The key idea is to relax the discrete properties of scene graphs such that each entity and relation is described with a dense differentiable descriptor.
We demonstrate the benefits of DSGs in the task of resolving referring relationships (RR) [21] (see Figure 1). Here, given an image and a triplet query ⟨subject, relation, object⟩, a model has to find the bounding boxes of the subject and object that participate in the relation.
We train an RR model with DSGs as an intermediate component. As such, DSGs are not trained with direct supervision about entities and relations, but using several supervision signals about the downstream RR task. We evaluate our approach on three standard RR datasets: Visual Genome [22] , VRD [27] and CLEVR [14] , and find that DSGs substantially improve performance compared to state-of-the-art approaches [27, 21] .
To conclude, our novel contributions are: (1) A new Differentiable Scene-Graph representation for visual reasoning, which captures information about multiple entities in an image and their relations. We describe how DSGs can be trained end-to-end with a downstream visual reasoning task, without direct supervision from pre-collected scene graphs.
(2) A new architecture for the task of referring relationships, using a DSG as its central component. (3) New state-of-the-art results on the task of referring relationships on the Visual Genome, VRD and CLEVR datasets.
2. Referring Relationship: The Learning Setup
In the referring relationship task [21] we are given an image I and a subject-relation-object query $q = \langle s, r, o \rangle$. The goal is to output a bounding box $B^s$ for the subject and another bounding box $B^o$ for the object. In practice, there may be several correct boxes for each. See Fig. 1 for a sample query and expected output.
Following [21] , we focus on training a referring relationship predictor from labeled data. Namely, we use a training set consisting of images, queries and the correct boxes for these queries. We denote these by
$\{(I_j, q_j, (B^s_j, B^o_j))\}_{j=1}^{N}$. As in [21], we assume that the vocabulary of query components (subject, object and relation) is fixed.
In our model, we break this task into two components that we optimize in parallel: we fine-tune the positions of bounding boxes so that they cover entities tightly, and we label each box with one of four possible labels. The labels "Subject" and "Object" disambiguate between the 's' and 'o' entities in the query. The label "Other" refers to boxes corresponding to additional entities not mentioned in the query, and the label "Background" refers to cases where the box does not describe an entity. We refer to these two optimization goals as the Box Refiner and the Referring Relationships Classifier.
3. Differentiable Scene Graphs
We start by discussing the motivation and potential advantages of using intermediate scene-graph-like representations, as compared to standard scene graphs. Then, we explain how DSGs fit into the full architecture of our model.
3.1. Why Use Intermediate DSG Layers?
A scene graph (SG) represents entities and relations in an image as a set of nodes and edges. A "perfect" SG (representing all entities and relations) captures most of the information needed for visual reasoning, and thus should be useful as an intermediate representation; downstream reasoning algorithms could then take the predicted SG as input. Unfortunately, learning to predict "perfect" scene graphs for any downstream task is unlikely to succeed, due to the aforementioned challenges: first, there is rarely enough data to train good SG predictors, and second, learning to predict SGs independently of the downstream task tends to yield less relevant SGs.
Instead, we propose an intermediate representation, termed a "Differentiable Scene Graph" layer (DSG), which captures the relational information as in a scene graph but can be trained end-to-end in a task-specific manner (Fig. 2). Like SGs, a DSG keeps descriptors for visual entities and their relations. Unlike SGs, whose nodes and edges are annotated by discrete values (labels), a DSG contains a dense distributed representation vector for each detected entity (termed node descriptor) and each pair of entities (termed edge descriptor). These representations are themselves learned functions of the input image, as we explain in the supplemental material. Like SGs, a DSG only describes candidate boxes that cover entities of interest and their relations. Unlike SGs, each DSG descriptor encompasses not only the local information about a node, but also information about its context. Most importantly, because DSGs are differentiable, they are used as input to downstream visual-reasoning modules, in our case a referring relationships module.
DSGs provide several computational and modelling advantages: Differentiability. Because node and edge descriptors are differentiable functions of detected boxes, and are fed into a differentiable reasoning module, the entire pipeline can be trained with gradient descent. Dense descriptors. By keeping dense descriptors for nodes and edges, the DSG keeps more information about the possible semantics of nodes and edges, instead of committing too early to hard sparse representations. This allows it to better fit downstream tasks. Supervision using downstream tasks. Collecting supervised labels for training scene graphs is hard and costly. DSGs can be trained using training data that is available for downstream tasks, saving costly labeling efforts. With that said, when labeled scene graphs are available for given images, that data can be used when training the DSG, via an added loss component. Holistic representation. DSG descriptors are computed by integrating global information from the entire image using graph neural networks (see supplemental materials). Combining information across the image increases the accuracy of object and relation descriptors.

Figure 2. The DSG is used both for refining the original box proposals and for the Referring Relationships Classifier, which classifies each bounding box proposal as either Subject, Object, Other or Background. The ground-truth label of a proposal box is Other if the proposal is involved in another query relationship over this image; otherwise the ground-truth label is Background.
3.2. The DSG Model for Referring Relationships
We now describe how DSGs can be combined with other modules to solve a visual reasoning task. The architecture of the model is illustrated in Fig. 2. First, the model extracts bounding boxes for entities and relations in the image. Next, it creates a differentiable scene graph over these bounding boxes. Then, the DSG features are used by two output modules aimed at answering a referring-relationship query: a Box Refiner module that refines the bounding boxes of the relevant entities, and a Referring Relationships Classifier module that classifies each box as Subject, Object, Other or Background. We now describe these components in more detail.
Object Detector. We detect candidate entities using a standard region proposal network (RPN) [35], and denote their bounding boxes by $b_1, \ldots, b_B$ ($B$ may vary between images). We also extract a feature vector $f_i$ for each box and concatenate it with the box coordinates, yielding $z_i = [f_i; b_i]$. See details in the supplemental material.
Relation Feature Extractor. Given any two bounding boxes $b_i$ and $b_j$, we consider the smallest box that contains both boxes (their "union" box). We denote this "relation box" by $b_{i,j}$ and its features by $f_{i,j}$. Finally, we denote the concatenation of the features $f_{i,j}$ and the box coordinates $b_{i,j}$ by $z_{i,j} = [f_{i,j}; b_{i,j}]$.
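The PyTorch-style sketch below illustrates these two feature extractors. It is not the paper's implementation (that is in the supplemental material and the released code): box coordinates are assumed to be in (x1, y1, x2, y2) format, and the union-box features $f_{i,j}$, which would normally be pooled from the image over the union box, are stood in for by a simple placeholder.

```python
import torch

def union_box(b_i: torch.Tensor, b_j: torch.Tensor) -> torch.Tensor:
    """Smallest box containing both b_i and b_j, in (x1, y1, x2, y2) format."""
    x1 = torch.minimum(b_i[..., 0], b_j[..., 0])
    y1 = torch.minimum(b_i[..., 1], b_j[..., 1])
    x2 = torch.maximum(b_i[..., 2], b_j[..., 2])
    y2 = torch.maximum(b_i[..., 3], b_j[..., 3])
    return torch.stack([x1, y1, x2, y2], dim=-1)

def entity_and_relation_descriptors(boxes, roi_features):
    """boxes: (B, 4) RPN proposals; roi_features: (B, d) pooled features per box.
    Returns z_i of shape (B, d+4) and z_ij of shape (B, B, d+4)."""
    B = boxes.shape[0]
    # z_i = [f_i; b_i]
    z_nodes = torch.cat([roi_features, boxes], dim=-1)
    # Union ("relation") box for every ordered pair (i, j).
    b_union = union_box(boxes[:, None, :].expand(B, B, 4),
                        boxes[None, :, :].expand(B, B, 4))
    # In practice f_ij would be pooled from the image over b_union;
    # here we stand in with a placeholder (averaged node features).
    f_union = 0.5 * (roi_features[:, None, :] + roi_features[None, :, :])
    z_edges = torch.cat([f_union, b_union], dim=-1)   # z_ij = [f_ij; b_ij]
    return z_nodes, z_edges
```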
Differentiable Scene-Graph Generator. As discussed above, the goal of the DSG Generator is to transform the above features $z_i$ and $z_{i,j}$ into differentiable representations of the underlying scene graph, i.e., to map these features into a new set of dense vectors $z'_i$ and $z'_{i,j}$ representing entities and relations. This mapping is intended to incorporate the relevant context of each feature vector; namely, the representation $z'_i$ contains information about the $i$-th entity together with its image-wide context.
There are various possible approaches to achieve this mapping. Here we use the model proposed by [11] , which uses a graph neural network for this transformation. See supplemental materials for details on this network.
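As a rough illustration only, the sketch below shows one generic round of message passing over node and edge descriptors. It is a stand-in for the graph network of [11] rather than its actual architecture; the layer widths and the mean aggregation are arbitrary choices.

```python
import torch
import torch.nn as nn

class DSGLayer(nn.Module):
    """One round of message passing that contextualizes node and edge
    descriptors (a generic stand-in for the graph network of [11])."""
    def __init__(self, d_node: int, d_edge: int, d_hidden: int = 512):
        super().__init__()
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * d_node + d_edge, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_edge))
        self.node_mlp = nn.Sequential(
            nn.Linear(d_node + d_edge, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_node))

    def forward(self, z_nodes, z_edges):
        # z_nodes: (B, d_node), z_edges: (B, B, d_edge)
        B = z_nodes.shape[0]
        senders = z_nodes[:, None, :].expand(B, B, -1)
        receivers = z_nodes[None, :, :].expand(B, B, -1)
        # Update each edge from its two endpoints and its current descriptor.
        z_edges = self.edge_mlp(torch.cat([senders, receivers, z_edges], dim=-1))
        # Update each node from an aggregate of its incident edges.
        incoming = z_edges.mean(dim=0)           # (B, d_edge)
        z_nodes = self.node_mlp(torch.cat([z_nodes, incoming], dim=-1))
        return z_nodes, z_edges
```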
Multi-task objective. In many domains, training with multi-task objectives can improve the accuracy of individual tasks, because auxiliary tasks operate as regularizers, pushing internal representations away from overfitting and towards capturing useful properties of the input. We follow this idea here and define a multi-task objective that has three components: (a) a Referring Relationships Classifier matches boxes to subject and object query terms; (b) a Box Refiner predicts accurate, tight bounding boxes; (c) a Box Labeler recognizes visual entities in boxes when relevant ground truth is available. We also fine-tune the object-detector RPN that produces box proposals for our model. Fig. 3 illustrates the effect of the first two components, and how they operate together to refine the bounding boxes and match them to the query terms. Specifically, Fig. 3c shows how box refinement produces boxes that are tight around objects and subjects, and Fig. 3d shows how RR classification matches boxes to query terms.
(A) Referring Relationships Classifier. Given a DSG representation, we use it to answer referring relationship queries. Recall that the output of an RR query ⟨subject, relation, object⟩ should be bounding boxes $B^s, B^o$ containing subjects and objects that participate in the query relation. Our model has already computed $B$ bounding boxes $b_i$, as well as representations $z_i$ for each box. We next use a prediction model $F_{RRC}(z_i, q)$ that takes as input features describing a bounding box and the query, and outputs one of four labels {Subject, Object, Other, Background}, where Other refers to a bounding box that is not the query Subject or Object, and Background refers to a false entity proposal. Denote the logits generated by this classifier for the $i$-th box by $r_i \in \mathbb{R}^4$. The output set $B^s$ (or $B^o$) is simply the set of bounding boxes classified as Subject (or Object). See supplemental materials for further implementation details.
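A minimal sketch of such a classifier is shown below, assuming the query terms are embedded and concatenated with each box descriptor; the embedding sizes and the MLP are illustrative, and the actual head is described in the supplemental material.

```python
import torch
import torch.nn as nn

class ReferringRelationshipClassifier(nn.Module):
    """F_RRC(z_i, q): classifies each box as Subject / Object / Other / Background.
    Query terms are embedded and concatenated with the DSG node descriptor;
    vocabulary sizes and layer widths here are illustrative."""
    def __init__(self, d_node, n_entities, n_relations, d_emb=64, d_hidden=512):
        super().__init__()
        self.ent_emb = nn.Embedding(n_entities, d_emb)
        self.rel_emb = nn.Embedding(n_relations, d_emb)
        self.mlp = nn.Sequential(
            nn.Linear(d_node + 3 * d_emb, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 4))              # 4 logits per box: r_i

    def forward(self, z_nodes, query):
        # query = (subject_id, relation_id, object_id), each a scalar LongTensor
        s, r, o = query
        q = torch.cat([self.ent_emb(s), self.rel_emb(r), self.ent_emb(o)], dim=-1)
        q = q.expand(z_nodes.shape[0], -1)       # broadcast the query to all B boxes
        return self.mlp(torch.cat([z_nodes, q], dim=-1))   # (B, 4)
```

The output sets $B^s$ and $B^o$ are then obtained by taking the argmax over the four logits of each box and collecting the boxes labeled Subject and Object, respectively.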
(B) Box Refiner. The DSG is also used for further refinement of the bounding boxes generated by the RPN. The idea is that additional knowledge about image context can be used to improve the coordinates of a given entity. This is done via a network $F_{BR}(b_i, z_i)$ that takes as input the RPN box coordinates and the differentiable representation $z_i$ for box $i$, and outputs new bounding box coordinates. See Fig. 3 for an illustration of box refinement, and the supplemental material for further implementation details.
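One possible form of this refinement head, sketched under the assumption that it regresses coordinate offsets added to the RPN box, is:

```python
import torch
import torch.nn as nn

class BoxRefiner(nn.Module):
    """F_BR(b_i, z_i): predicts refined box coordinates from the RPN box and its
    contextualized DSG descriptor (an illustrative delta-regression head)."""
    def __init__(self, d_node, d_hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(d_node + 4, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, 4))

    def forward(self, boxes, z_nodes):
        # boxes: (B, 4); z_nodes: (B, d_node)
        deltas = self.mlp(torch.cat([z_nodes, boxes], dim=-1))
        return boxes + deltas                    # refined (x1, y1, x2, y2)
```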
(C) Optional auxiliary losses: Scene-Graph Labeling. In addition to the Box Refiner and Referring Relationships Classifier modules described above, one can also use supervision about the labels of entities and relations if these are available at training time. Specifically, we train an object-recognition classifier operating on boxes, which predicts the label of every box for which a label is available. This classifier is trained as an auxiliary loss, in a multi-task fashion, and is described in detail below.
4. Training With Multiple Losses
We next explain how our model is trained for the RR task, and how we can also use the RR training data to supervise the DSG component. We train with a weighted sum of three losses: (1) the Referring Relationships Classification loss, (2) the Box Refiner loss, and (3) an optional Scene-Graph Labeling loss. We now describe each of these components; additional details are provided in the supplemental material.
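A sketch of how such a weighted sum might be assembled is given below; the weights W_RRC, W_BR and W_SGL are hypothetical, and proposals whose ground-truth label is left undetermined (see Sec. 4.1) are skipped via ignore_index.

```python
import torch.nn.functional as F

# Hypothetical loss weights; in practice they would be tuned on validation data.
W_RRC, W_BR, W_SGL = 1.0, 1.0, 0.5

def total_loss(rrc_logits, rrc_targets, refined_boxes, gt_boxes, matched,
               sgl_logits=None, sgl_targets=None):
    """Weighted sum of the three training losses of Sec. 4 (a sketch).
    rrc_targets uses -1 for proposals excluded from training (see Sec. 4.1);
    `matched` is a boolean mask of proposals matched to a ground-truth box."""
    loss = W_RRC * F.cross_entropy(rrc_logits, rrc_targets, ignore_index=-1)
    loss = loss + W_BR * F.smooth_l1_loss(refined_boxes[matched], gt_boxes[matched])
    if sgl_logits is not None:   # optional Scene-Graph Labeling loss (Sec. 4.3)
        loss = loss + W_SGL * F.cross_entropy(sgl_logits, sgl_targets)
    return loss
```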
4.1. Referring Relationship Classification Loss
The Referring Relationships Classifier (Sec. 3.2) outputs logits $r_i$ for each box, corresponding to its prediction (Subject, Object, etc.). To train these logits, we need to extract their ground-truth values from the training data. Recall that a given image in the training data may have multiple queries, and so may have multiple boxes tagged as subject or object for the corresponding queries. To obtain the ground truth for box $i$ and query $q = \langle s, r, o \rangle$ we take the following steps. First, we find the ground-truth box that has maximal overlap with box $i$. If this box is either the subject or the object of the query $q$, we set $r^{gt}_i$ to Subject or Object, respectively. Otherwise, if the overlap with a ground-truth box of a different query on this image is greater than 0.5, we set $r^{gt}_i$ = Other (since there is some other entity in the box), and we set $r^{gt}_i$ = Background if the overlap is less than 0.3. If the overlap is in [0.3, 0.5] we do not use the box for training. For instance, given a query ⟨woman, feeding, giraffe⟩ with ground-truth boxes for "woman" and "giraffe", consider the RPN box that is closest to the ground-truth box for "woman" and assume its index is 7. Similarly, assume that the box closest to the ground-truth box for "giraffe" has index 5. We would then have $r^{gt}_7$ = Subject, $r^{gt}_5$ = Object, and the remaining $r^{gt}_i$ values would be either Other or Background. Given these ground-truth values, the Referring Relationship Classification loss is simply the sum of cross entropies between the logits $r_i$ and the one-hot vectors corresponding to $r^{gt}_i$.
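The assignment rule can be summarized with the following sketch. The exact precedence between the best-match rule and the 0.3/0.5 thresholds is our assumption, and the label constants are placeholders.

```python
import torch

SUBJECT, OBJECT, OTHER, BACKGROUND, IGNORE = 0, 1, 2, 3, -1

def assign_rr_targets(iou, subj_idx, obj_idx):
    """Ground-truth labels r_i^gt for one query (a sketch of Sec. 4.1).
    iou: (B, G) IoU between the B proposals and the image's G ground-truth boxes;
    subj_idx / obj_idx: indices of the query's subject / object ground-truth boxes."""
    best_iou, best_gt = iou.max(dim=1)                   # best match per proposal
    targets = torch.full((iou.shape[0],), IGNORE, dtype=torch.long)  # [0.3, 0.5]: unused
    targets[best_iou < 0.3] = BACKGROUND                 # box does not cover an entity
    targets[best_iou > 0.5] = OTHER                      # covers an entity of another query
    # Assumption: the query's subject/object labels also require a confident match.
    targets[(best_gt == subj_idx) & (best_iou > 0.5)] = SUBJECT
    targets[(best_gt == obj_idx) & (best_iou > 0.5)] = OBJECT
    return targets
```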
4.2. Box Refiner Loss
To train the Box Refiner, we use a smooth $L_1$ loss between the coordinates of the refined (predicted) boxes and their ground-truth counterparts.
4.3. Scene-Graph Labeling Loss
When ground-truth data about entity and relation labels is available, we can use it as an additional source of supervision to train the DSG. Specifically, we train two classifiers: a classifier from the features of entity boxes $z_i$ to the set of entity labels, and a classifier from the features of relation boxes $z_{i,j}$ to the set of relation labels. We then add a loss that maximizes the accuracy of these classifiers with respect to the ground-truth box labels.
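A minimal sketch of these two auxiliary heads, with illustrative names and single linear layers, is:

```python
import torch.nn as nn

class SceneGraphLabeler(nn.Module):
    """Auxiliary heads that predict entity labels from node descriptors z_i and
    relation labels from edge descriptors z_ij (Sec. 4.3); sizes are illustrative."""
    def __init__(self, d_node, d_edge, n_entities, n_relations):
        super().__init__()
        self.entity_head = nn.Linear(d_node, n_entities)
        self.relation_head = nn.Linear(d_edge, n_relations)

    def forward(self, z_nodes, z_edges):
        # z_nodes: (B, d_node); z_edges: (B, B, d_edge)
        return self.entity_head(z_nodes), self.relation_head(z_edges)
```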
4.4. Tuning The Object Detector
In addition to training the DSG and its downstream visual-reasoning predictors, the object-detector RPN is also trained. The output of the RPN is a set of bounding boxes, and the ground truth contains boxes that are known to contain entities. The goal of this loss is to encourage the RPN to include these boxes among its proposals. Concretely, we use a sum of two losses: first, an RPN classification loss, which is a cross entropy over RPN anchors in which anchors with an overlap of 0.8 or higher with a ground-truth box are considered positive; second, an RPN box-regression loss, which is a smooth $L_1$ loss between the ground-truth boxes and the proposal boxes.
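A schematic version of these two RPN losses (omitting anchor sampling and the box-delta encoding used by standard detectors) might look as follows:

```python
import torch
import torch.nn.functional as F

def rpn_losses(objectness_logits, pred_deltas, target_deltas, anchor_iou):
    """Sketch of the two RPN tuning losses of Sec. 4.4.
    objectness_logits: (A, 2); pred_deltas, target_deltas: (A, 4);
    anchor_iou: (A, G) IoU of each anchor with each ground-truth box."""
    labels = (anchor_iou.max(dim=1).values >= 0.8).long()    # positives: IoU >= 0.8
    cls_loss = F.cross_entropy(objectness_logits, labels)
    pos = labels.bool()
    if pos.any():
        reg_loss = F.smooth_l1_loss(pred_deltas[pos], target_deltas[pos])
    else:
        reg_loss = pred_deltas.sum() * 0.0                   # no positives in this batch
    return cls_loss + reg_loss
```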
5. Experiments
In the following sections we provide details about the datasets, training, baseline models, evaluation metrics, model ablations and results. Due to space considerations, the implementation details of the model are provided in the supplemental material.
5.1. Datasets
We evaluate the model in the task of referring relationships across three datasets, each exhibiting a unique set of characteristics and challenges. CLEVR [14]. A synthetic dataset generated from scene graphs with four spatial relations: "left", "right", "front" and "behind", and 48 entity categories. It has over 5M relationships, 33% of which involve ambiguous entities (multiple entities of the same type in an image). VRD [27]. The Visual Relationship Detection dataset contains 5,000 images with 100 entity categories and 70 relation categories. In total, VRD contains 37,993 relationship annotations with 6,672 unique relationship types and 24.25 relations per entity category. 60.3% of these relationships refer to ambiguous entities. Visual Genome [22]. VG is the largest public corpus for visual relationships in real images, with 108,077 images annotated with bounding boxes, entities and relations. On average, each image has 12 entities and 7 relations. In total, there are over 2.3M relationships, 61% of which refer to ambiguous entities.
For a proper comparison with previous results [21] , we used the data from [21] including the same entity and relation categories, query relationships and data splits.
5.2. Evaluation Metrics
We compare our model to previous work using the average IOU for subjects and for objects. To compute the average subject IOU, we first generate two L × L binary attention maps: one that includes all the ground-truth boxes labeled as Subject (recall that several entities might be labeled as Subject), and the other that includes all the box proposals predicted as Subject. If no box is predicted as Subject, the box with the highest score for the label Subject is included in the predicted attention map. We then compute the Intersection-Over-Union between the two binary attention maps. For a proper comparison with previous work [21], we use L = 14. The object boxes are evaluated in the exact same manner.
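The metric can be computed as in the sketch below, which assumes box coordinates normalized to [0, 1] and a simple rasterization of each box onto the L × L grid; the fallback to the highest-scoring box is omitted, and the exact rasterization rule is an assumption.

```python
import torch

def attention_map(boxes, L=14):
    """Rasterize a set of boxes (normalized to [0, 1]) into an L x L binary map."""
    grid = torch.zeros(L, L, dtype=torch.bool)
    for x1, y1, x2, y2 in boxes.tolist():
        c1, r1 = int(x1 * L), int(y1 * L)
        c2, r2 = int(x2 * L), int(y2 * L)
        grid[r1:r2 + 1, c1:c2 + 1] = True
    return grid

def subject_iou(pred_boxes, gt_boxes, L=14):
    """IoU between the binary attention maps of predicted and ground-truth
    Subject boxes; Object boxes are evaluated in exactly the same way."""
    pred, gt = attention_map(pred_boxes, L), attention_map(gt_boxes, L)
    inter = (pred & gt).sum().float()
    union = (pred | gt).sum().float()
    return (inter / union).item() if union > 0 else 0.0
```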
5.3. Baselines
The Referring Relationship task was introduced recently [21], and the SSAS model was proposed as a possible approach (see below). We report the results for the baseline models in [21]. When evaluating our Differentiable Scene-Graph model, we use exactly the same evaluation setting as in [21] (i.e., same data splits, entity and relation categories). The baselines reported are:
1. SYMMETRIC STACKED ATTENTION SHIFTING (SSAS) [21]: An iterative model that localizes the relationship entities using an attention-shift component learned for each relation. 2. SPATIAL SHIFTS [23]: Same as SSAS, but with no iterations and with the attention-shift mechanism replaced by a statistically learned shift per relation that ignores the semantic meaning of entities.
3. CO-OCCURRENCE [6]: Uses an embedding of the subject and object pair to attend over the image features.
4. VISUAL RELATIONSHIP DETECTION (VRD) [27]: Similar to the Co-Occurrence model, but with an additional relationship embedding.
6. Results
Table 1 provides the average IOU for Subject and Object over the three datasets described in Sec. 5.1, comparing our model to the four baselines described in Sec. 5.3. Our Differentiable Scene-Graph approach outperforms all baselines in terms of average IOU.
Our results for the CLEVR dataset are significantly better than those in [21]. Because CLEVR objects have a small set of distinct colors (Fig. 5), object detection in CLEVR is much easier than in natural images, making it easier to achieve high IOU. Unlike [21], the baseline model without the DSG layer (NO-DSG) is an end-to-end model with a two-stage detector; it already improves strongly over prior work, reaching 93.7%, and our DSG approach further improves this to 96.3% (reducing error by 50%).
6.1. Analysis of Success and Failure Cases
6.2. Model Ablations
We explored the power of DSGs through model ablations. First, since the model is trained with three loss components, we quantify the contribution of the Box Refinement loss and the Scene-Graph Labeling loss (it is not possible to omit the Referring Relationships Classifier loss). We further evaluate the contribution of the DSG compared with a two-step approach which first predicts an SG, and then reasons over it. We compare the following models:
1. DSG: The Differentiable Scene-Graph model described in Sec. 3.2 and trained as described in Sec. 4.
2. TWO STEPS: A two-step model that first predicts a scene graph and then matches the query against it. The SG predictor consists of the same components used in the DSG model: a box detector, DSG dense descriptors, and an SG labeler. It is trained with the same set of SG labels used for training the DSG. Details are in the supplemental material.
3. DSG -SGL: DSG without the Scene-Graph Labeling loss described in Sec. 4.3.
4. DSG -BR: DSG without the Box Refiner loss described in Sec. 4.2.
Table 2 provides the results of the ablation experiments on the Visual Genome [22] validation set. All model variants based on the DSG scene representation achieve a better average IOU over subject and object than the model that does not use the DSG representation (NO-DSG), demonstrating the power of a contextualized scene representation. The full DSG model outperforms all ablations, illustrating the improvement achieved by using partial supervision for training the differentiable scene graph. Fig. 7 illustrates the effect of ablating various components of the model.
Figure 6. As mentioned in Sec. 4.3 ("Scene-Graph Labeling Loss"), the DSG can be used to generate a labeled scene graph over a fixed set of entities and relations. Panel (b) shows this scene graph (the output of the classifiers predicting entity labels and relations), restricted to the highest-confidence relations; most relations are correct, despite the model not having been trained on complete scene graphs.
6.3. Inferring SGs from DSGs
While the DSG layer is not designed for predicting scene graphs from images, it can be used for inferring scene graphs. We decoded SG nodes that are contained in the RR vocabulary by constructing two classifiers: a 1-layer classifier mapping node descriptors to logits over the entity vocabulary, and a 1-layer classifier mapping edge descriptors to logits over the relation vocabulary. Fig. 6 illustrates the result of this inference, showing a scene graph inferred from the DSG trained using this loss. The predicted graph is indeed largely correct, despite the fact that the model was not directly trained for this task.
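A sketch of this decoding step, with illustrative names for the two classifier heads and an arbitrary top-k cutoff on relation confidence, is:

```python
import torch

def decode_scene_graph(z_nodes, z_edges, entity_head, relation_head, top_k=10):
    """Decode a discrete scene graph from DSG descriptors using the two 1-layer
    classifiers described above (head names and the top_k cutoff are illustrative)."""
    ent_labels = entity_head(z_nodes).argmax(dim=-1)                        # (B,)
    rel_conf, rel_labels = relation_head(z_edges).softmax(dim=-1).max(dim=-1)  # (B, B)
    k = min(top_k, rel_conf.numel())
    idx = rel_conf.flatten().topk(k).indices                                # most confident pairs
    B = z_nodes.shape[0]
    triplets = []
    for flat in idx.tolist():
        i, j = divmod(flat, B)                                              # subject i, object j
        triplets.append((ent_labels[i].item(),
                         rel_labels[i, j].item(),
                         ent_labels[j].item()))
    return triplets
```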
We further analyzed the accuracy of the predicted SGs by comparing them to ground-truth SGs on Visual Genome (complete SGs were not used for training, only for this analysis). SGs decoded from DSGs achieve an accuracy of 76% for object labels and 70% for relations (calculated for proposals with IOU ≥ 0.8).
Figure 7. In the first column, the scene graph of the TWO STEPS model did not include the shirt of one of the men, so this "subject" prediction was missed. In the second column, DSG -SGL failed to distinguish between the entity classes "woman" and "child". In the third column, the DSG refines the box of "sky" to cover the entire sky area. In the last column, NO-DSG did not classify the "object" box correctly.
7. Related Work
Graph Neural Networks. Recently, major progress has been made in constructing graph neural networks (GNN). These refer to a class of neural networks that operate directly on graph-structured data by passing local messages [7, 24] . Variants of GNNs have been shown to be highly effective at relational reasoning tasks [38] , classification of graphs [2, 3, 31, 4] , and classification of nodes in large graphs [18, 9] . The expressive power of GNNs has also been studied in [11, 45] . GNNs have also been applied to visual understanding in [11, 41, 39, 10] and control [37, 1] . Similar aggregation schemes have also been applied to object detection [12] .
Visual Relationships. Earlier work aimed to leverage visual relationships for improving detection [36], action recognition and pose estimation [5], semantic image segmentation [8], and detection of human-object interactions [42, 32, 25]. Lu et al. [27] were the first to formulate the detection of visual relationships as a separate task. They learn a likelihood function that uses a language prior based on word embeddings for scoring visual relationships and constructing scene graphs, while other recent works proposed improved methods for relationship detection [19, 47].
Scene Graphs. Scene graphs provide a compact representation of the semantics of an image, and have been shown to be useful for semantic-level interpretation and reasoning about a visual scene [13] . Extracting scene graphs from images provides a semantic representation that can later be used for reasoning, question answering [43] , and image retrieval [15, 34] .
Previous scene-graph prediction work used attention [33] or neural message passing [40]. [30] suggested predicting graphs directly from pixels in an end-to-end manner. NeuralMotif [46] captures global context by sequentially reading the independent predictions for each entity and relation with an RNN and then refining those predictions.
Referring Relationships. Several recent studies looked into the task of detecting an entity based on a referring expression [17, 20] , while taking context into account. [28] described a model that has two parts: one for generating expressions that point to an entity in a discriminative fashion and a second for understanding these expressions and detecting the referred entity. [44] explored the role of context and visual comparison with other entities in referring expressions.
Modelling context was also the focus of [29], using a multi-instance-learning objective. Recently, [21] introduced an explicit iterative model that localizes the two entities in the referring relationship task, conditioned on one another, using attention from one entity to the other. In contrast, we present an implicit model that uses latent scene context, achieving new state-of-the-art results on three vision datasets that contain visual relationships.
8. Conclusion
This work is motivated by the assumption that accurate reasoning about images may require access to a detailed representation of the image. While scene graphs provide a natural structure for representing relational information, it is hard to train very dense SGs in a fully supervised manner, and for any given image, the resulting SGs may not be appropriate for downstream reasoning tasks. Here we advocate DSGs, an alternative representation that captures the information in SGs but is continuous and can be trained jointly with downstream tasks. Our results, both qualitative (Fig. 4) and quantitative (Tables 1 and 2), suggest that DSGs effectively capture scene structure and that this can be used for downstream tasks such as referring relationships.
One natural next step is to study such representations in additional downstream tasks that require integrating information across the image. Some examples are caption generation and visual question answering. DSGs can be particularly useful for VQA, since many questions are easily answerable by scene graphs (e.g., counting questions and questions about relations). Another important extension to DSGs would be a model that captures high-order interactions, as in a hyper-graph. Finally, it will be interesting to explore other approaches to training the DSG, and in particular finding ways for using unlabeled data for this task.