Abstract
We investigate the problem of producing structured graph representations of visual scenes. Our work analyzes the role of motifs: regularly appearing substructures in scene graphs. We present new quantitative insights on such repeated structures in the Visual Genome dataset. Our analysis shows that object labels are highly predictive of relation labels, but not vice-versa. We also find that there are recurring patterns even in larger subgraphs: more than 50% of graphs contain motifs involving at least two relations. Our analysis motivates a new baseline: given object detections, predict the most frequent relation between object pairs with the given labels, as seen in the training set. This baseline improves on the previous state-of-the-art by an average relative gain of 3.6% across evaluation settings. We then introduce Stacked Motif Networks, a new architecture designed to capture higher-order motifs in scene graphs, which further improves over our strong baseline by an average relative gain of 7.1%. Our code is available at github.com/rowanz/neural-motifs.
1. Introduction
We investigate scene graph parsing: the task of producing graph representations of real-world images that provide semantic summaries of objects and their relationships. For example, the graph in Figure 1 encodes the existence of key objects such as people ("man" and "woman"), their possessions ("helmet" and "motorcycle", both possessed by the woman), and their activities (the woman is "riding" the "motorcycle"). Predicting such graph representations has been shown to improve natural language based image tasks [17, 43, 51] and has the potential to significantly expand the scope of applications for computer vision systems. Compared to object detection [36, 34], object interactions [48, 3], and activity recognition [13], scene graph parsing poses unique challenges since it requires reasoning about the complex dependencies across all of these components.
Figure 1. A ground truth scene graph containing entities, such as woman, bike, or helmet, that are localized in the image with bounding boxes (color coded above), and the relationships between those entities, such as riding, the relation between woman and motorcycle, or has, the relation between man and shirt.

Elements of visual scenes have strong structural regularities. For instance, people tend to wear clothes, as can be seen in Figure 1. We examine these structural repetitions, or motifs, using the Visual Genome [22] dataset, which provides annotated scene graphs for 100k images from COCO [28], consisting of over 1M instances of objects and 600k relations. Our analysis leads to two key findings. First, there are strong regularities in the local graph structure, such that the distribution of relations is highly skewed once the corresponding object categories are given, but not vice versa. Second, structural patterns exist even in larger subgraphs; we find that over half of images contain previously occurring graph motifs.
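To make this asymmetry concrete, one simple statistic (our illustration here, not necessarily the paper's exact analysis) is the fraction of instances explained by the single most frequent outcome in each conditional direction: relation given object-label pair versus object-label pair given relation. A minimal sketch over (subject, relation, object) training triples:

```python
from collections import Counter, defaultdict

def top1_coverage(counts_by_key):
    """Fraction of all instances explained by the most frequent value per key."""
    covered = sum(c.most_common(1)[0][1] for c in counts_by_key.values())
    total = sum(sum(c.values()) for c in counts_by_key.values())
    return covered / total

def skew_report(triples):
    """triples: iterable of (subject_label, relation, object_label) annotations."""
    rel_given_pair = defaultdict(Counter)   # relations seen for each object-label pair
    pair_given_rel = defaultdict(Counter)   # object-label pairs seen for each relation
    for subj, rel, obj in triples:
        rel_given_pair[(subj, obj)][rel] += 1
        pair_given_rel[rel][(subj, obj)] += 1
    # A high first number with a low second number indicates the asymmetry:
    # object labels predict relations far better than relations predict objects.
    return top1_coverage(rel_given_pair), top1_coverage(pair_given_rel)
```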
Based on our analysis, we introduce a simple yet powerful baseline: given object detections, predict the most frequent relation between object pairs with the given labels, as seen in the training set. The baseline improves over the prior state-of-the-art by 1.4 mean recall points (3.6% relative), suggesting that an effective scene graph model must capture both the asymmetric dependence between objects and their relations and the larger contextual patterns around them.
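For concreteness, this baseline amounts to a lookup table over label pairs built from training triples. A minimal sketch, using an invented toy triple list in place of the actual Visual Genome annotations:

```python
from collections import Counter, defaultdict

# Hypothetical training annotations: (subject_label, relation, object_label) triples.
train_triples = [
    ("man", "wearing", "shirt"),
    ("man", "has", "shirt"),
    ("man", "wearing", "shirt"),
    ("woman", "riding", "motorcycle"),
]

# Count relation frequencies for every ordered pair of object labels.
freq = defaultdict(Counter)
for subj, rel, obj in train_triples:
    freq[(subj, obj)][rel] += 1

def predict_relation(subj_label, obj_label):
    """Return the most frequent training relation for this label pair, if any."""
    counts = freq.get((subj_label, obj_label))
    return counts.most_common(1)[0][0] if counts else None

print(predict_relation("man", "shirt"))  # -> "wearing"
```

At test time, detected object labels index directly into this table; no image features beyond the detections are used.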
We introduce the Stacked Motif Network (MOTIFNET), a new neural network architecture that complements existing approaches to scene graph parsing. We posit that the key challenge in modeling scene graphs lies in devising an efficient mechanism to encode the global context that can directly inform the local predictors (i.e., objects and relations). While previous work has used graph-based inference to propagate information in both directions between objects and relations [47, 25, 24], our analysis suggests that strong independence assumptions in the local predictors limit the quality of global predictions. Instead, our model predicts graph elements by staging bounding box predictions, object classifications, and relationships such that the global context encoding of all previous stages establishes rich context for predicting subsequent stages, as illustrated in Figure 5. We represent the global context via recurrent sequential architectures such as Long Short-term Memory networks (LSTMs) [15].
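To illustrate the staging idea, the sketch below contextualizes detected regions with a bidirectional LSTM, predicts object labels from that context, and then feeds those predictions into a second LSTM whose states score relations for every ordered pair. The dimensions, prediction heads, and pairwise scoring are illustrative assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class StagedContext(nn.Module):
    """Minimal sketch of staged prediction with recurrent global context.
    All sizes and decoding heads are illustrative, not MOTIFNET's exact design."""
    def __init__(self, feat_dim=512, hidden=256, n_classes=151, n_rels=51):
        super().__init__()
        # Stage 1: object context over detected boxes (a sequence of ROI features).
        self.obj_ctx = nn.LSTM(feat_dim, hidden, bidirectional=True, batch_first=True)
        self.obj_head = nn.Linear(2 * hidden, n_classes)
        # Stage 2: edge context, conditioned on the object-label predictions.
        self.edge_ctx = nn.LSTM(2 * hidden + n_classes, hidden,
                                bidirectional=True, batch_first=True)
        self.rel_head = nn.Linear(4 * hidden, n_rels)

    def forward(self, roi_feats):                    # (1, n_boxes, feat_dim)
        obj_states, _ = self.obj_ctx(roi_feats)      # contextualized object reps
        obj_logits = self.obj_head(obj_states)       # labels informed by global context
        edge_in = torch.cat([obj_states, obj_logits.softmax(-1)], dim=-1)
        edge_states, _ = self.edge_ctx(edge_in)      # context for relation prediction
        n = edge_states.size(1)
        # Score every ordered pair (i, j) by concatenating their edge contexts.
        heads = edge_states.unsqueeze(2).expand(-1, n, n, -1)
        tails = edge_states.unsqueeze(1).expand(-1, n, n, -1)
        rel_logits = self.rel_head(torch.cat([heads, tails], dim=-1))
        return obj_logits, rel_logits
```

The key design point the sketch preserves is the staging: object labels are decoded before relations, so the edge context conditions on earlier predictions rather than making them independently.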