Neural Motifs: Scene Graph Parsing with Global Context

Authors

  • Rowan Zellers
  • Mark Yatskar
  • Sam Thomson
  • Yejin Choi
  • 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition

Abstract

We investigate the problem of producing structured graph representations of visual scenes. Our work analyzes the role of motifs: regularly appearing substructures in scene graphs. We present new quantitative insights on such repeated structures in the Visual Genome dataset. Our analysis shows that object labels are highly predictive of relation labels but not vice-versa. We also find that there are recurring patterns even in larger subgraphs: more than 50% of graphs contain motifs involving at least two relations. Our analysis motivates a new baseline: given object detections, predict the most frequent relation between object pairs with the given labels, as seen in the training set. This baseline improves on the previous state-of-the-art by an average of 3.6% relative improvement across evaluation settings. We then introduce Stacked Motif Networks, a new architecture designed to capture higher order motifs in scene graphs that further improves over our strong baseline by an average 7.1% relative gain. Our code is available at github.com/rowanz/neural-motifs.

1. Introduction

We investigate scene graph parsing: the task of producing graph representations of real-world images that provide semantic summaries of objects and their relationships. For example, the graph in Figure 1 encodes the existence of key objects such as people ("man" and "woman"), their possessions ("helmet" and "motorcycle", both possessed by the woman), and their activities (the woman is "riding" the "motorcycle"). Predicting such graph representations has been shown to improve natural language based image tasks [17, 43, 51] and has the potential to significantly expand the scope of applications for computer vision systems. Compared to object detection [36, 34] , object interactions [48, 3] and activity recognition [13] , scene graph parsing poses unique challenges since it requires reasoning about the complex dependencies across all of these components.

Figure 1. A ground truth scene graph containing entities, such as woman, bike or helmet, that are localized in the image with bounding boxes, color coded above, and the relationships between those entities, such as riding, the relation between woman and motorcycle or has the relation between man and shirt.

Elements of visual scenes have strong structural regularities. For instance, people tend to wear clothes, as can be seen in Figure 1. We examine these structural repetitions, or motifs, using the Visual Genome [22] dataset, which provides annotated scene graphs for 100k images from COCO [28], consisting of over 1M instances of objects and 600k relations. Our analysis leads to two key findings. First, there are strong regularities in the local graph structure such that the distribution of the relations is highly skewed once the corresponding object categories are given, but not vice versa. Second, structural patterns exist even in larger subgraphs; we find that over half of images contain previously occurring graph motifs.

Based on our analysis, we introduce a simple yet powerful baseline: given object detections, predict the most frequent relation between object pairs with the given labels, as seen in the training set. The baseline improves over prior state-of-the-art by 1.4 mean recall points (3.6% relative), suggesting that an effective scene graph model must capture both the asymmetric dependence between objects and their relations, along with larger contextual patterns.

Figure 2. Types of edges between high-level categories in Visual Genome. Geometric, possessive and semantic edges cover 50.9%, 40.9%, and 8.7%, respectively, of edge instances in scene graphs. The majority of semantic edges occur between people and vehicles, artifacts and locations. Less than 2% of edges between clothes and people are semantic.
Figure 3. The likelihood of guessing, in the top-k, head, tail, or edge labels in a scene graph, given other graph components (i.e. without image features). Neither head nor tail labels are strongly determined by other labels, but given the identity of head and tail, edges (edge | head, tail) can be determined with 97% accuracy in under 5 guesses. Such strong biases make it critical to condition on objects when predicting edges.
Figure 4. On the left, the percent of images that have a graph motif found in Visual Genome using pointwise mutual information, composed of at least a certain length (the number of edges it contains). Over 50% of images have at least one motif involving two relationships. On the right, example motifs, where structures repeating many times is indicated with plate notation. For example, the second motif is length 8 and consists of 8 flower-in-vase relationships. Graph motifs commonly result from groups (e.g., several instances of “leaf on tree”), and correlation between parts (e.g., “elephant has head,” “leg,” “trunk,” and “ear.”).

We introduce the Stacked Motif Network (MOTIFNET), a new neural network architecture that complements existing approaches to scene graph parsing. We posit that the key challenge in modeling scene graphs lies in devising an efficient mechanism to encode the global context that can directly inform the local predictors (i.e., objects and relations). While previous work has used graph-based inference to propagate information in both directions between objects and relations [47, 25, 24] , our analysis suggests strong independence assumptions in local predictors limit the quality of global predictions. Instead, our model predicts graph elements by staging bounding box predictions, object classifications, and relationships such that the global context encoding of all previous stages establishes rich context for predicting subsequent stages, as illustrated in Figure 5 . We represent the global context via recurrent sequential architectures such as Long Short-term Memory Networks (LSTMs) [15] .

Figure 5. A diagram of a Stacked Motif Network (MOTIFNET). The model breaks scene graph parsing into stages predicting bounding regions, labels for regions, and then relationships. Between each stage, global context is computed using bidirectional LSTMs and is then used for subsequent stages. In the first stage, a detector proposes bounding regions and then contextual information among bounding regions is computed and propagated (object context). The global context is used to predict labels for bounding boxes. Given bounding boxes and labels, the model constructs a new representation (edge context) that gives global context for edge predictions. Finally, edges are assigned labels by combining contextualized head, tail, and union bounding region information with an outer product.

Our model builds on Faster-RCNN [36] for predicting bounding regions, fine-tuned and adapted for Visual Genome. Global context across bounding regions is computed and propagated through bidirectional LSTMs, which is then used by another LSTM that labels each bounding region conditioned on the overall context and all previous labels. Another specialized layer of bidirectional LSTMs then computes and propagates information for predicting edges given bounding regions, their labels, and all other computed context. Finally, we classify all n^2 edges in the graph, combining globally contextualized head, tail, and image representations using low-rank outer products [19]. The method can be trained end-to-end.

Experiments on Visual Genome demonstrate the efficacy of our approach. First, we update existing work by pretraining the detector on Visual Genome, setting a new state-of-the-art (improving by 14.0 absolute points on average across evaluation settings). Our new simple baseline improves over previous work, using our updated detector, by a mean improvement of 1.4 points. Finally, experiments show that Stacked Motif Networks are effective at modeling global context, with a mean improvement of 2.9 points (7.1% relative improvement) over our new strong baseline.

2. Formal Definition

A scene graph, G, is a structured representation of the semantic content of an image [17] . It consists of:

• a set B = {b_1, . . . , b_n} of bounding boxes, b_i ∈ R^4,

• a corresponding set O = {o_1, . . . , o_n} of objects, assigning a class label o_i ∈ C to each b_i, and

• a set R = {r_1, . . . , r_m} of binary relationships between those objects.

Each relationship r_k ∈ R is a triplet of a start node (b_i, o_i) ∈ B × O, an end node (b_j, o_j) ∈ B × O, and a relationship label x_{i→j} ∈ R, where R is the set of all predicate types, including the "background" predicate, BG, which indicates that there is no edge between the specified objects. See Figure 1 for an example scene graph.
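To make the definition concrete, the following is a minimal sketch of this data structure in Python; the class and field names (Box, Relationship, SceneGraph) are ours for illustration and do not come from the paper or its released code.

```python
from dataclasses import dataclass
from typing import List, Tuple

BG = "BG"  # background predicate: "no edge between these objects"

@dataclass
class Box:
    # (x1, y1, x2, y2) in image coordinates, i.e. b_i in R^4
    coords: Tuple[float, float, float, float]

@dataclass
class Relationship:
    head: int       # index i of the start node (b_i, o_i)
    tail: int       # index j of the end node (b_j, o_j)
    predicate: str  # x_{i->j}, drawn from the predicate vocabulary (or BG)

@dataclass
class SceneGraph:
    boxes: List[Box]               # B = {b_1, ..., b_n}
    labels: List[str]              # O = {o_1, ..., o_n}, one class label per box
    relations: List[Relationship]  # R = {r_1, ..., r_m}

# Example: the woman-riding-motorcycle fragment of Figure 1
g = SceneGraph(
    boxes=[Box((10, 20, 110, 220)), Box((40, 120, 200, 260))],
    labels=["woman", "motorcycle"],
    relations=[Relationship(head=0, tail=1, predicate="riding")],
)
```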

3. Scene Graph Analysis

In this section, we seek quantitative insights on the structural regularities of scene graphs. In particular, (a) how different types of relations correlate with different objects, and (b) how higher order graph structures recur over different scenes. These insights motivate both the new baselines we introduce in this work and our model that better integrates the global context, described in Section 4.

3.1. Prevalent Relations In Visual Genome

To gain insight into the Visual Genome scene graphs, we first categorize objects and relations into high-level types. As shown in Table 1 , the predominant relations are geometric and possessive, with clothing and parts making up over one third of entity instances. Such relations are often obvious, e.g., houses tend to have windows. In contrast, semantic relations, which correspond to activities, are less frequent and less obvious. Although nearly half of relation types are semantic in nature, they comprise only 8.7% of relation instances. The relations "using" and "holding" account for 32.2% of all semantic relation instances.

Table 1. Object and relation types in Visual Genome, organized by super-type. The most common entities are parts (25.2% of entity instances), and 90.9% of relations are geometric or possessive.


Using our high-level types, we visualize the distribution of relation types between object types in Figure 2. Clothing and part entities are almost exclusively linked through possessive relations, while furniture and building entities are almost exclusively linked through geometric relations. Geometric and possessive relationships between certain entities are largely interchangeable: when a "part" is the head object, it tends to connect to other entities through a geometric relation (e.g., wheel on bike); when a "part" is the tail object, it tends to be connected with possessive relations (e.g., bike has wheel). Nearly all semantic relationships are headed by people, with the majority of edges relating to artifacts, vehicles, and locations. Such structural predictability and the prevalence of geometric and part-object relations suggest that common sense priors play an important role in generating accurate scene graphs.

In Figure 3, we examine how much information is gained by knowing the identity of different parts of a scene graph. In particular, we consider how many guesses are required to determine the labels of head (h), edge (e), or tail (t) given labels of the other elements, only using label statistics computed on scene graphs. Higher curves imply that the element is highly determined given the other values. The graph shows that the local distribution of relationships has significant structure. In general, the identity of edges involved in a relationship is not highly informative of other elements of the structure, while the identities of head or tail provide significant information, both to each other and to edge labels. Adding edge information to already given head or tail information provides minimal gain. Finally, the graph shows edge labels are highly determined given the identity of object pairs: the most frequent relation is correct 70% of the time, and the five most frequent relations for the pair contain the correct label 97% of the time.

3.2. Larger Motifs

Scene graphs not only have local structure but have higher order structure as well. We conducted an analysis of repeated motifs in scene graphs by mining combinations of object-relation-object labels that have high pointwise mutual information with each other. Motifs were extracted iteratively: first we extracted motifs of two combinations, replaced all instances of that motif with an atomic symbol and mined new motifs given previously identified motifs. Combinations of graph elements were selected as motifs if both elements involved occurred at least 50 times in the Visual Genome training set and were at least 10 times more likely to occur together than apart. Motifs were mined until no new motifs were extracted. Figure 4 contains example motifs we extracted on the right, and the prevalence of motifs of different lengths in images on the left. Many motifs correspond to either combinations of parts, or objects that are commonly grouped together. Over 50% of images in Visual Genome contain a motif involving at least two combinations of object-relation-object, and some images contain motifs involving as many as 16 elements.
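The mining procedure can be sketched as follows. This is a simplified, greedy reconstruction under the thresholds stated above (both symbols occurring at least 50 times, and at least 10 times more likely to occur together than apart); the function name and data layout are assumptions, not the authors' implementation.

```python
from collections import Counter
from itertools import combinations

def mine_motifs(image_symbols, min_count=50, min_lift=10.0):
    """image_symbols: one set of symbols per image; each symbol starts out as an
    'object-relation-object' string (e.g. 'flower-in-vase'). A pair of symbols is
    merged into an atomic motif when both occur at least `min_count` times and the
    pair is at least `min_lift` times more likely to occur together than apart.
    Mining repeats on the rewritten graphs until no new motif is found."""
    motifs = []
    while True:
        n = len(image_symbols)
        counts = Counter(s for syms in image_symbols for s in syms)
        joint = Counter(p for syms in image_symbols
                        for p in combinations(sorted(syms), 2))
        best, best_lift = None, min_lift
        for (a, b), j in joint.items():
            if counts[a] < min_count or counts[b] < min_count:
                continue
            lift = (j * n) / (counts[a] * counts[b])  # P(a,b) / (P(a) P(b))
            if lift >= best_lift:
                best, best_lift = (a, b), lift
        if best is None:
            return motifs
        a, b = best
        motifs.append(best)
        motif_symbol = f"({a} & {b})"
        # replace co-occurring instances with the new atomic motif symbol
        image_symbols = [
            (syms - {a, b}) | {motif_symbol} if a in syms and b in syms else syms
            for syms in image_symbols
        ]
```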

4. Model

Here we present our novel model, Stacked Motif Network (MOTIFNET). MOTIFNET decomposes the probability of a graph G (made up of a set of bounding regions B, object labels O, and labeled relations R) into three factors:

Pr(G | I) = Pr(B | I) Pr(O | B, I) Pr(R | B, O, I). (1)

Note that this factorization makes no independence assumptions. Importantly, predicted object labels may depend on one another, and predicted relation labels may depend on predicted object labels. The analyses in Section 3 make it clear that capturing these dependencies is crucial.

The bounding box model (Pr(B | I)) is a fairly standard object detection model, which we describe in Section 4.1. The object model (Pr(O | B, I); Section 4.2) conditions on a potentially large set of predicted bounding boxes, B. To do so, we linearize B into a sequence that an LSTM then processes to create a contextualized representation of each box. Likewise, when modeling relations (Pr(R | B, O, I); Section 4.3), we linearize the set of predicted labeled objects, O, and process them with another LSTM to create a representation of each object in context. Figure 5 contains a visual summary of the entire model architecture.

4.1. Bounding Boxes

We use Faster R-CNN as an underlying detector [36] . For each image I, the detector predicts a set of region proposals B = {b 1 , . . . , b n }. For each proposal b i ∈ B it also outputs a feature vector f i and a vector l i ∈ R |C| of (noncontextualized) object label probabilities. Note that because BG is a possible label, our model has not yet committed to any bounding boxes. See Section 5.1 for details.

4.2. Objects

Context We construct a contextualized representation for object prediction based on the set of proposal regions B. Elements of B are first organized into a linear sequence, [(b_1, f_1, l_1), . . . , (b_n, f_n, l_n)].¹ The object context, C, is then computed using a bidirectional LSTM [15]:

C = biLSTM([f_i; W_1 l_i]_{i=1,...,n}), (2)

where C = [c_1, . . . , c_n] contains the states for each element of B at the final layer of the biLSTM, and W_1 is a parameter matrix mapping the class distribution l_i into R^100.

Decoding The context C is used to sequentially decode labels for each proposal bounding region, conditioning on previously decoded labels. We use an LSTM to decode a category label for each contextualized representation in C:

h_i = LSTM([c_i; ô_{i−1}]), (3)

ô_i = argmax(W_o h_i) ∈ R^{|C|} (one-hot). (4)

We then discard the hidden states h_i and use the object class commitments ô_i in the relation model (Section 4.3).
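As a rough illustration of Equations 2-4, the PyTorch sketch below contextualizes ordered RoI features with a biLSTM and then decodes labels sequentially, feeding each predicted label back into the decoder. Hidden sizes, module names, and the use of plain (non-highway) LSTMs are illustrative assumptions; see Section 5.1 and the released code for the actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ObjectContext(nn.Module):
    """Sketch of Eqs. 2-4: a biLSTM contextualizes ordered RoI features (object
    context C), then an LSTM decodes object labels one region at a time,
    conditioning on the previously decoded label."""
    def __init__(self, num_classes, feat_dim=4096, hidden=512, label_emb=100):
        super().__init__()
        self.label_proj = nn.Linear(num_classes, label_emb)     # W_1
        self.context_lstm = nn.LSTM(feat_dim + label_emb, hidden,
                                    bidirectional=True, batch_first=True)
        self.decoder = nn.LSTMCell(2 * hidden + num_classes, hidden)
        self.out = nn.Linear(hidden, num_classes)               # W_o
        self.num_classes = num_classes

    def forward(self, feats, label_dists):
        # feats: (n, feat_dim) RoI features f_i, already ordered (Section 5.1)
        # label_dists: (n, num_classes) detector label distributions l_i
        x = torch.cat([feats, self.label_proj(label_dists)], dim=-1)
        C, _ = self.context_lstm(x.unsqueeze(0))                # Eq. 2
        C = C.squeeze(0)                                        # (n, 2 * hidden)
        h = feats.new_zeros(1, self.decoder.hidden_size)
        c = feats.new_zeros(1, self.decoder.hidden_size)
        prev = feats.new_zeros(1, self.num_classes)             # one-hot of previous label
        logits, preds = [], []
        for i in range(feats.size(0)):
            h, c = self.decoder(torch.cat([C[i:i + 1], prev], dim=-1), (h, c))  # Eq. 3
            step_logits = self.out(h)
            pred = step_logits.argmax(dim=-1)                   # Eq. 4 (greedy decoding)
            prev = F.one_hot(pred, self.num_classes).float()
            logits.append(step_logits)
            preds.append(pred)
        return C, torch.cat(logits), torch.cat(preds)
```

The returned context and predicted labels are then consumed by the relation stage of Section 4.3.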

4.3. Relations

Context We construct a contextualized representation of bounding regions, B, and objects, O, using additional bidirectional LSTM layers:

D = biLSTM([c_i; W_2 ô_i]_{i=1,...,n}), (5)

where the edge context D = [d_1, . . . , d_n] contains the states for each bounding region at the final layer, and W_2 is a parameter matrix mapping ô_i into R^100.

Decoding There is a quadratic number of possible relations in a scene graph. For each possible edge, say between b_i and b_j, we compute the probability that the edge will have label x_{i→j} (including BG). The distribution uses the global context, D, and a feature vector for the union of boxes,² f_{i,j}:

EQUATION (6): Not extracted; please refer to original document.
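Equation 6 was not extracted here; as a stand-in, the sketch below shows the kind of combination the text and Figure 5 describe: contextualized head and tail representations and the union-box feature f_{i,j} are fused with elementwise (low-rank outer-product style) products and scored over predicates, plus a per-label-pair bias in the spirit of the FREQ baseline (Section 5.4). All module names and dimensions are assumptions; refer to the original paper for the exact equation.

```python
import torch
import torch.nn as nn

class EdgeDecoder(nn.Module):
    """Sketch of the relation head: score every ordered pair (i, j) of regions."""
    def __init__(self, num_predicates, ctx_dim=1024, union_dim=4096, rank=4096,
                 num_obj_classes=151):
        super().__init__()
        self.head_proj = nn.Linear(ctx_dim, rank)       # applied to head context d_i
        self.tail_proj = nn.Linear(ctx_dim, rank)       # applied to tail context d_j
        self.union_proj = nn.Linear(union_dim, rank)    # applied to union features f_{i,j}
        self.rel_out = nn.Linear(rank, num_predicates)
        # bias per ordered object-label pair (o_i, o_j), e.g. from training statistics
        self.freq_bias = nn.Embedding(num_obj_classes * num_obj_classes,
                                      num_predicates)
        self.num_obj_classes = num_obj_classes

    def forward(self, D, obj_preds, union_feats, pair_idx):
        # D: (n, ctx_dim) edge context d_i; obj_preds: (n,) predicted labels
        # union_feats: (m, union_dim); pair_idx: (m, 2) long tensor of (head, tail) indices
        h = self.head_proj(D[pair_idx[:, 0]])
        t = self.tail_proj(D[pair_idx[:, 1]])
        u = self.union_proj(union_feats)
        fused = h * t * u                               # low-rank outer-product style fusion
        logits = self.rel_out(fused)
        pair_labels = (obj_preds[pair_idx[:, 0]] * self.num_obj_classes
                       + obj_preds[pair_idx[:, 1]])
        return logits + self.freq_bias(pair_labels)     # softmax over predicates downstream
```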


5. Experimental Setup

In the following sections we explain (1) details of how we construct the detector, order bounding regions, and implement the final edge classifier (Section 5.1), (2) details of training (Section 5.2), and (3) evaluation (Section 5.3).

5.1. Model Details

Detectors Similar to prior work in scene graph parsing [47, 25], we use Faster R-CNN with a VGG backbone as our underlying object detector [36, 40]. Our detector is given images that are scaled and then zero-padded to 592x592. We adjust the bounding box proposal scales and dimension ratios to account for different box shapes in Visual Genome, similar to YOLO-9000 [34]. To control for detector performance in evaluating different scene graph models, we first pretrain the detector on Visual Genome objects. We optimize the detector using SGD with momentum on 3 Titan Xs, with a batch size of b = 18 and a learning rate of lr = 1.8 · 10^−2 that is divided by 10 after validation mAP plateaus. For each batch we sample 256 RoIs per image, of which 75% are background. The detector gets 20.0 mAP (at 50% IoU) on Visual Genome; the same model, but trained and evaluated on COCO, gets 47.7 mAP at 50% IoU. Following [47], we integrate the detector by freezing the convolution layers and duplicating the fully connected layers, resulting in separate branches for object and edge features.

Alternating Highway LSTMs To mitigate vanishing gradient problems as information flows upward, we add highway connections to all LSTMs [14, 41, 58] . To additionally reduce the number of parameters, we follow [14] and alternate the LSTM directions. Each alternating highway LSTM step can be written as the following wrapper around the conventional LSTM equations [15] :

r_i = σ(W_g[h_{i−δ}, x_i] + b_g), (8)

h_i = r_i • LSTM(x_i, h_{i−δ}) + (1 − r_i) • W_i x_i, (9)

where x_i is the input, h_i represents the hidden state, and δ is the direction: δ = 1 if the current layer is even, and −1 otherwise. For MOTIFNET, we use 2 alternating highway LSTM layers for object context, and 4 for edge context.
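The following PyTorch sketch implements one alternating highway LSTM layer following Equations 8-9, with the direction δ realized as a `reverse` flag; the inner LSTM is a standard LSTMCell and the class and argument names are ours, not the paper's.

```python
import torch
import torch.nn as nn

class HighwayLSTMLayer(nn.Module):
    """One alternating-highway layer (Eqs. 8-9): a gate r_i mixes the LSTM output
    with a linear transform of the input; direction alternates between layers."""
    def __init__(self, in_dim, hidden, reverse=False):
        super().__init__()
        self.cell = nn.LSTMCell(in_dim, hidden)
        self.gate = nn.Linear(hidden + in_dim, hidden)      # W_g, b_g
        self.skip = nn.Linear(in_dim, hidden, bias=False)   # the W x_i term
        self.reverse = reverse

    def forward(self, xs):
        # xs: (seq_len, in_dim); returns (seq_len, hidden)
        idx = range(xs.size(0) - 1, -1, -1) if self.reverse else range(xs.size(0))
        h = xs.new_zeros(1, self.cell.hidden_size)
        c = xs.new_zeros(1, self.cell.hidden_size)
        out = [None] * xs.size(0)
        for i in idx:
            x = xs[i:i + 1]
            r = torch.sigmoid(self.gate(torch.cat([h, x], dim=-1)))  # Eq. 8
            h_lstm, c = self.cell(x, (h, c))
            h = r * h_lstm + (1 - r) * self.skip(x)                  # Eq. 9
            out[i] = h
        return torch.cat(out, dim=0)
```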

RoI Ordering for LSTMs We consider several ways of ordering the bounding regions:

(1) LEFTRIGHT (default): Our default option is to sort the regions left-to-right by the central x-coordinate: we expect this to encourage the model to predict edges between nearby objects, which is beneficial as objects appearing in relationships tend to be close together.

(2) CONFIDENCE: Another option is to order bounding regions based on the confidence of the maximum non-background prediction from the detector, max_{j≠BG} l_i^(j), as this lets the detector commit to "easy" regions, obtaining context for more difficult regions.³

(3) SIZE: Here, we sort the boxes in descending order by size, possibly predicting global scene information first.

(4) RANDOM: Here, we randomly order the regions.
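The four ordering schemes can be sketched as a single helper that returns a permutation of RoI indices; the function name, the background-index convention, and the tensor layout are assumptions.

```python
import torch

def order_rois(boxes, label_dists, scheme="leftright", bg_idx=0):
    """Return a permutation of RoI indices for the context LSTMs.
    boxes: (n, 4) as (x1, y1, x2, y2); label_dists: (n, num_classes) detector scores."""
    if scheme == "leftright":          # sort by central x-coordinate
        key = (boxes[:, 0] + boxes[:, 2]) / 2
        return torch.argsort(key)
    if scheme == "confidence":         # max non-background score, most confident first
        scores = label_dists.clone()
        scores[:, bg_idx] = -1.0
        return torch.argsort(scores.max(dim=1).values, descending=True)
    if scheme == "size":               # largest boxes first
        area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        return torch.argsort(area, descending=True)
    if scheme == "random":
        return torch.randperm(boxes.size(0))
    raise ValueError(f"unknown ordering scheme: {scheme}")
```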

Predicate Visual Features To extract visual features for a predicate between boxes b i , b j , we resize the detector's features corresponding to the union box of b i , b j to 7x7x256. We model geometric relations using a 14x14x2 binary input with one channel per box. We apply two convolution layers to this and add the resulting 7x7x256 representation to the detector features. Last, we apply finetuned VGG fully connected layers to obtain a 4096 dimensional representation. 4
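A sketch of this feature extractor is shown below: detector features RoI-aligned to the union box (7x7x256) are added to a 7x7x256 encoding of a 14x14x2 binary box mask produced by two convolutions, then mapped to 4096 dimensions by VGG-style fully connected layers. The channel counts follow the text; kernel sizes, strides, and names are illustrative guesses.

```python
import torch
import torch.nn as nn

class UnionBoxFeatures(nn.Module):
    """Predicate visual features f_{i,j}: detector features for the union box,
    plus a rasterized 2-channel mask of the two boxes, mapped to a 4096-d vector."""
    def __init__(self):
        super().__init__()
        self.mask_conv = nn.Sequential(           # 14x14x2 -> 7x7x256
            nn.Conv2d(2, 128, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(128, 256, kernel_size=3, stride=1, padding=1),
        )
        self.fc = nn.Sequential(                  # VGG-style head, fine-tuned in the paper
            nn.Linear(256 * 7 * 7, 4096), nn.ReLU(),
            nn.Linear(4096, 4096),                # final ReLU removed (see footnote 4)
        )

    def forward(self, union_roi_feats, box_masks):
        # union_roi_feats: (m, 256, 7, 7) RoI-aligned detector features of union boxes
        # box_masks: (m, 2, 14, 14) binary masks, one channel per box
        x = union_roi_feats + self.mask_conv(box_masks)
        return self.fc(x.flatten(1))
```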

5.2. Training

We train MOTIFNET on ground truth boxes, with the objective to predict object labels and to predict edge labels given ground truth object labels. For an image, we include all annotated relationships (sampling if more than 64) and sample 3 negative relationships per positive. In cases with multiple edge labels per directed edge (5% of edges), we sample the predicates. Our loss is the sum of the cross entropy for predicates and cross entropy for objects predicted by the object context layer. We optimize using SGD with momentum on a single GPU, with lr = 6 · 10^−3 and b = 6.
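A minimal sketch of this objective and the relationship sampling, assuming the positive/negative bookkeeping shown (the helper names and data layout are hypothetical):

```python
import random
import torch
import torch.nn.functional as F

def scene_graph_loss(obj_logits, obj_targets, rel_logits, rel_targets):
    """Sum of object and predicate cross entropies over the sampled edges."""
    return F.cross_entropy(obj_logits, obj_targets) + \
           F.cross_entropy(rel_logits, rel_targets)

def sample_relationships(positive_pairs, all_pairs, max_pos=64, neg_per_pos=3, bg_label=0):
    """positive_pairs: {(i, j): predicate_label}; all_pairs: list of candidate (i, j).
    Returns a list of ((i, j), label) with at most `max_pos` positives and
    `neg_per_pos` background pairs sampled per positive."""
    pos = list(positive_pairs.items())
    if len(pos) > max_pos:
        pos = random.sample(pos, max_pos)
    negatives = [p for p in all_pairs if p not in positive_pairs]
    n_neg = min(len(negatives), neg_per_pos * len(pos))
    neg = [(p, bg_label) for p in random.sample(negatives, n_neg)]
    return pos + neg
```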

Adapting to Detection Using the above protocol gets good results when evaluated on scene graph classification, but models that incorporate context underperform when suddenly introduced to non-gold proposal boxes at test time.

To alleviate this, we fine-tune using noisy box proposals from the detector. We use per-class non-maximal suppression (NMS) [38] at 0.3 IoU to pass 64 proposals to the object context branch of our model. We also enforce NMS constraints during decoding given object context. We then sample relationships between proposals that intersect with ground truth boxes and use relationships involving these boxes to finetune the model until detection convergence.

We also observe that in detection our model gets swamped with many low-quality RoI pairs as possible relationships, which slows the model and makes training less stable. To alleviate this, we observe that nearly all annotated relationships are between overlapping boxes, 5 and classify all relationships with non-overlapping boxes as BG.

5.3. Evaluation

We train and evaluate our models on Visual Genome, using the publicly released preprocessed data and splits from [47], containing 150 object classes and 50 relation classes, but sample a development set of 5000 images from the training set. We follow three standard evaluation modes: (1) predicate classification (PREDCLS): given a ground truth set of boxes and labels, predict edge labels; (2) scene graph classification (SGCLS): given ground truth boxes, predict box labels and edge labels; and (3) scene graph detection (SGDET): predict boxes, box labels, and edge labels. The annotated graphs are known to be incomplete, thus systems are evaluated using recall@K metrics.⁶ In all three modes, recall is calculated for relations; a ground truth edge (b_h, o_h, x, b_t, o_t) is counted as a "match" if there exist predicted boxes i, j such that b_i and b_j respectively have sufficient overlap with b_h and b_t,⁷ and the objects and relation labels agree. We follow previous work in enforcing that for a given head and tail bounding box, the system must not output multiple edge labels [47, 29].
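For reference, the triplet-matching recall@K described above can be sketched as follows; the data layout (per-image lists of confidence-sorted predicted triplets) is an assumption.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda t: (t[2] - t[0]) * (t[3] - t[1])
    return inter / (area(a) + area(b) - inter + 1e-12)

def recall_at_k(pred_triplets, gt_triplets, k, iou_thresh=0.5):
    """pred_triplets: list of (head_box, head_label, predicate, tail_box, tail_label),
    sorted by decreasing confidence. gt_triplets: same format (ground truth).
    Returns the fraction of ground-truth edges matched by the top-k predictions."""
    top = pred_triplets[:k]
    matched, used = 0, set()
    for gh, gho, gx, gtb, gto in gt_triplets:
        for p_idx, (ph, pho, px, ptb, pto) in enumerate(top):
            if p_idx in used:
                continue
            if (pho, px, pto) == (gho, gx, gto) and \
               iou(ph, gh) >= iou_thresh and iou(ptb, gtb) >= iou_thresh:
                matched += 1
                used.add(p_idx)
                break
    return matched / max(1, len(gt_triplets))
```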

5.4. Frequency Baselines

To support our finding that object labels are highly predictive of edge labels, we additionally introduce several frequency baselines built off training set statistics. The first, FREQ, uses our pretrained detector to predict object labels for each RoI. To obtain predicate probabilities between boxes i and j, we look up the empirical distribution over relationships between objects o i and o j as computed in the training set. 8 Intuitively, while this baseline does not look at the image to compute Pr(x i→j |o i , o j ), it displays the value of conditioning on object label predictions o. The second, FREQ-OVERLAP, requires that the two boxes intersect in order for the pair to count as a valid relation.
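A sketch of the FREQ baseline's statistics: count, for every ordered pair of object labels in the training set, how often each predicate (including BG) links them, and normalize that count at test time for the detector's predicted labels. FREQ-OVERLAP would additionally discard box pairs with zero overlap. Function names and the training-graph format are assumptions.

```python
from collections import Counter, defaultdict

def build_freq_table(training_graphs, bg="BG"):
    """training_graphs: iterable of (labels, edges) where labels is a list of object
    class names and edges maps (i, j) -> predicate. Returns
    {(head_label, tail_label): Counter(predicate -> count)}, with BG counted for
    every ordered pair that has no annotated edge."""
    table = defaultdict(Counter)
    for labels, edges in training_graphs:
        n = len(labels)
        for i in range(n):
            for j in range(n):
                if i == j:
                    continue
                predicate = edges.get((i, j), bg)
                table[(labels[i], labels[j])][predicate] += 1
    return table

def freq_predict(table, head_label, tail_label, bg="BG"):
    """Empirical distribution Pr(x_{i->j} | o_i, o_j) from training statistics."""
    counts = table.get((head_label, tail_label), Counter({bg: 1}))
    total = sum(counts.values())
    return {pred: c / total for pred, c in counts.items()}
```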

Table 2. Results table, adapted from [47], which ran VRD [29] without language priors. All numbers in %. Since past work doesn't evaluate on R@20, we compute the mean by averaging performance on the 3 evaluation modes over R@50 and R@100. *: results in [31] are without scene graph constraints; we evaluated performance with constraints using saved predictions given to us by the authors (see Table 3 in supp).

6. Results

We present our results in Table 2. We compare MOTIFNET to previous models not directly incorporating context (VRD [29] and ASSOC EMBED [31]), a state-of-the-art approach for incorporating graph context via message passing (MESSAGE PASSING) [47], and its reimplementation using our detector, edge model, and NMS settings (MESSAGE PASSING+). Unfortunately, many scene graph models are evaluated on different versions of Visual Genome; see Table 3 in the supp for more analysis.

Table 3. Results with and without scene graph constraints. Horizontal lines indicate different dataset preprocessing settings (the "other split" results, to the best of our knowledge, are reported on different splits). *: [25] authors acknowledge that their paper results aren't reproducible for SGCLS and PREDCLS; their current best reproducible numbers are one line below. MSDN-FREQ: Results from using node prediction from [25] and edge prediction from FREQ.
Table 2. Not extracted; please refer to original document.

Our best frequency baseline, FREQ-OVERLAP, improves over prior state-of-the-art by 1.4 mean recall, primarily due to improvements in detection and predicate classification, where it outperforms MESSAGE PASSING+ by 5.5 and 6.5 mean points respectively. MOTIFNET improves even further, by 2.9 additional mean points over the baseline (a 7.1% relative gain).

Ablations To evaluate the effectiveness of our main model, MOTIFNET, we consider several ablations in Table 2. In MOTIFNET-NOCONTEXT, we predict objects based on the fixed detector, and feed non-contextualized embeddings of the head and tail label into Equation 6. Our results suggest that there is signal in the vision features for edge predictions, as MOTIFNET-NOCONTEXT improves over FREQ-OVERLAP. Incorporating context is also important: our full model MOTIFNET improves by 1.2 mean points, with largest gains at the lowest recall threshold of R@20.⁹ We additionally validate the impact of the ordering method used, as discussed in Section 5.1; the results vary less than 0.3 recall points, suggesting that MOTIFNET is robust to the RoI ordering scheme used.

7. Qualitative Results

Qualitative examples of our approach, shown in Figure 6 , suggest that MOTIFNET is able to induce graph motifs from detection context. Visual inspection of the results suggests that the method works even better than the quantitative results would imply, since many seemingly correct edges are predicted that do not exist in the ground truth.

Figure 6. Qualitative examples from our model in the Scene Graph Detection setting. Green boxes are predicted and overlap with the ground truth, orange boxes are ground truth with no match. Green edges are true positives predicted by our model at the R@20 setting, orange edges are false negatives, and blue edges are false positives. Only predicted boxes that overlap with the ground truth are shown.

There are two common failure cases of our model. The first, as exhibited by the middle left image in Figure 6 of a skateboarder carrying a surfboard, stems from predicate ambiguity ("wearing" vs. "wears"). The second common failure case occurs when the detector fails, resulting in a cascading failure to predict any edges to that object. For example, the failure to predict "house" in the lower left image resulted in five false negative relations.

8. Related Work

Context Many methods have been proposed for modeling semantic context in object recognition [7]. Our approach is most closely related to work that models object co-occurrence using graphical models to combine many sources of contextual information [33, 11, 26, 10]. While our approach is a type of graphical model, it is unique in that it stages the incorporation of context, allowing for meaningful global context from large conditioning sets.

Actions and relations have been a particularly fruitful source of context [30, 50] , especially when combined with pose to create human-object interactions [48, 3] . Recent work has shown that object layouts can provide sufficient context for captioning COCO images [52, 28] ; our work suggests the same for parsing Visual Genome scene graphs. Much of the context we derive could be interpreted as commonsense priors, which have commonly been extracted using auxiliary means [59, 39, 5, 49, 55 ]. Yet for scene graphs, we are able to directly extract such knowledge.

Structured Models Structured models in visual understanding have been explored for language grounding, where language determines the graph structures involved in prediction [32, 20, 42, 16]. Our problem is different as we must reason over all possible graph structures. Deep sequential models have demonstrated strong performance for tasks such as captioning [4, 9, 45, 18] and visual question answering [1, 37, 53, 12, 8], including for problems not traditionally thought of as sequential, such as multilabel classification [46]. Indeed, graph linearization has worked surprisingly well for many problems in vision and language, such as generating image captions from object detections [52], language parsing [44], and generating text from abstract meaning graphs [21]. Our work leverages the ability of RNNs to memorize long sequences in order to capture graph motifs in Visual Genome. Finally, recent works incorporate recurrent models into detection and segmentation [2, 35], and our methods contribute evidence that RNNs provide effective context for consecutive detection predictions.

Scene Graph Methods Several works have explored the role of priors by incorporating background language statistics [29, 54] or by attempting to preprocess scene graphs [56]. Instead, we allow our model to directly learn to use scene graph priors effectively. Furthermore, recent graph-propagation methods converge quickly and bottleneck information through edges, significantly limiting information exchange [47, 25, 6, 23]. In contrast, our method allows global exchange of information about context through conditioning and avoids uninformative edge predictions until the end. Others have explored creating richer models between image regions, introducing new convolutional features and new objectives [31, 57, 25, 27]. Our work is complementary and instead focuses on the role of context. See the supplemental section for a comprehensive comparison to prior work.

9. Conclusion

We presented an analysis of the Visual Genome dataset showing that motifs are prevalent, and hence important to model. Motivated by this analysis, we introduced strong baselines that improve over prior state-of-the-art models by modeling these intra-graph interactions, while mostly ignoring visual cues. We also introduced our model MOTIFNET for capturing higher order structure and global interactions in scene graphs, which achieves additional significant gains over our already strong baselines.

Supplemental

Current work in scene graph parsing is largely inconsistent in terms of evaluation and experiments across papers are not completely comparable. In this supplementary material, we attempt to classify some of the differences and put the works together in the most comparable light.

Setup

In our paper, we compared against papers that (to the best of our knowledge) evaluated in the same way as [47] . Variation in evaluation consists of two types:

• Custom data handling, such as creating paper-specific dataset splits, changing the data pre-processing, or using different label sets.

• Omitting graph constraints, namely, allowing a head-tail pair to have multiple edge labels in system output. We hypothesize that omitting graph constraints should always lead to higher numbers, since the model is then allowed multiple guesses for challenging objects and relations.

Table 3 provides a best-effort comprehensive review against all prior work that we are aware of. Other works also introduce slight variations in the tasks that are evaluated:¹⁰

• Predicate Detection (PREDDET). The model is given a list of labeled boxes, as in predicate classification, and a list of head-tail pairs that have edges in the ground truth (the model makes no edge predictions for head-tail pairs not in the ground truth).

• Phrase Detection (PHRDET). The model must produce a set of objects and edges, as in scene graph detection. An edge is counted as a match if the objects and predicate match the ground truth, with the IOU between the union-boxes of the prediction and the ground truth over 0.5 (in contrast to scene graph detection where each object box must independently overlap with the corresponding ground truth box).

Models Considered

In Table 3 , we list the following additional methods:

• MSDN [25] : This model is an extension of the message passing idea from [47] . In addition to using an RPN to propose boxes for objects, an additional RPN is used to propose regions for captioning. The caption generator is trained using an additional loss on the annotated regions from Visual Genome.

• MSDN-FREQ: To benchmark the performance on [25] 's split (with more aggressive preprocessing than [47] and with small objects removed), we evaluated a version of our FREQ baseline in [25] 's codebase. We took a checkpoint from the authors and replaced all edge predictions with predictions from the training set statistics from [25] 's split.

• SCR [23] : This model uses an RPN to generate triplet proposals. Messages are then passed between the head, tail, and predicate for each triplet.

• DR-NET [6] : Similar to [47] , this model uses an object detector to propose regions, and then messages are passed between relationship components using an approximation to CRF inference.

• VRL [27] : This model constructs a scene graph incrementally. During training, a reinforcement learning loss is used to reward the model when it predicts correct components.

• VTE [57] : This model learns subject, predicate, and object embeddings. A margin loss is used to reward the model for predicting correct triplets over incorrect ones.

• LKD [54] : This model uses word vectors to regularize a CNN that predicts relationship triplets.

Summary

The amount of variation in Table 3 requires extremely cautious interpretation. As expected, removing graph constraints significantly increases reported performance, and both predicate detection and phrase detection are significantly less challenging than predicate classification and scene graph detection, respectively. On [25]'s split, the MSDN-FREQ baseline outperforms MSDN on all evaluation settings, suggesting the baseline is robust across alternative data settings. In total, the results suggest that our model and baselines are at least competitive with other approaches on different configurations of the task.

Table 3 (results with and without scene graph constraints across SGDET, SGCLS, PREDCLS, PHRDET, and PREDDET at R@50 and R@100): table body not extracted; please refer to the original document.

Footnotes

1. We consider several strategies to order the regions; see Section 5.1.

2. A union box is the convex hull of the union of two bounding boxes.

3. When sorting by confidence, the edge layer's regions are ordered by the maximum non-background object prediction as given by Equation 4.

4. We remove the final ReLU to allow more interaction in Equation 6.

5. A hypothetical model that perfectly classifies relationships, but only between boxes with nonzero IoU, gets 91% recall.

6. Past work has considered these evaluation modes at recall thresholds R@50 and R@100, but we also report results on R@20.

7. As in prior work, we compute the intersection-over-union (IoU) between the boxes and use a threshold of 0.5.

8. Since we consider an edge x_{i→j} to have label BG if o_i has no edge to o_j, this gives us a valid probability distribution.

9. The larger improvement at the lower thresholds suggests that our models mostly improve on relationship ordering rather than classification. Indeed, it is often unnecessary to order relationships at the higher thresholds: 51% of images have fewer than 50 candidates and 78% have fewer than 100.

10. We use task names from [29], despite inconsistency in whether the underlying task actually involves classification or detection.