Neural Motifs: Scene Graph Parsing with Global Context

Authors

  • Rowan Zellers
  • Mark Yatskar
  • Sam Thomson
  • Yejin Choi
  • 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Abstract

We investigate the problem of producing structured graph representations of visual scenes. Our work analyzes the role of motifs: regularly appearing substructures in scene graphs. We present new quantitative insights on such repeated structures in the Visual Genome dataset. Our analysis shows that object labels are highly predictive of relation labels, but not vice versa. We also find that there are recurring patterns even in larger subgraphs: more than 50% of graphs contain motifs involving at least two relations. Our analysis motivates a new baseline: given object detections, predict the most frequent relation between object pairs with the given labels, as seen in the training set. This baseline improves on the previous state of the art by an average relative improvement of 3.6% across evaluation settings. We then introduce Stacked Motif Networks, a new architecture designed to capture higher-order motifs in scene graphs, which further improves over our strong baseline by an average relative gain of 7.1%. Our code is available at github.com/rowanz/neural-motifs.

1. Introduction

We investigate scene graph parsing: the task of producing graph representations of real-world images that provide semantic summaries of objects and their relationships. For example, the graph in Figure 1 encodes the existence of key objects such as people ("man" and "woman"), their possessions ("helmet" and "motorcycle", both possessed by the woman), and their activities (the woman is "riding" the "motorcycle"). Predicting such graph representations has been shown to improve natural-language-based image tasks [17, 43, 51] and has the potential to significantly expand the scope of applications for computer vision systems. Compared to object detection [36, 34], object interactions [48, 3], and activity recognition [13], scene graph parsing poses unique challenges, since it requires reasoning about the complex dependencies across all of these components.

Figure 1. A ground truth scene graph containing entities, such as woman, bike, or helmet, that are localized in the image with bounding boxes (color coded above), and the relationships between those entities, such as riding, the relation between woman and motorcycle, or has, the relation between man and shirt.

Elements of visual scenes have strong structural regularities. For instance, people tend to wear clothes, as can be seen in Figure 1. We examine these structural repetitions, or motifs, using the Visual Genome [22] dataset, which provides annotated scene graphs for 100k images from COCO [28], consisting of over 1M instances of objects and 600k relations. Our analysis leads to two key findings. First, there are strong regularities in the local graph structure, such that the distribution of relations is highly skewed once the corresponding object categories are given, but not vice versa. Second, structural patterns exist even in larger subgraphs; we find that over half of images contain previously occurring graph motifs.

Based on our analysis, we introduce a simple yet powerful baseline: given object detections, predict the most frequent relation between object pairs with the given labels, as seen in the training set. The baseline improves over the prior state of the art by 1.4 mean recall points (3.6% relative), suggesting that an effective scene graph model must capture both the asymmetric dependence between objects and their relations and larger contextual patterns.
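To make the baseline concrete, the following is a minimal sketch of such a frequency prior; the (head, relation, tail) triple format, the `train_triples` input, and the function names are illustrative assumptions for this sketch, not the paper's released implementation.

```python
from collections import Counter, defaultdict

def build_frequency_baseline(train_triples):
    """Count how often each relation links a (head, tail) label pair.

    train_triples: iterable of (head_label, relation, tail_label)
    strings, e.g. ("woman", "riding", "motorcycle").
    Hypothetical input format for illustration.
    """
    counts = defaultdict(Counter)
    for head, rel, tail in train_triples:
        counts[(head, tail)][rel] += 1
    return counts

def predict_relation(counts, head, tail, k=1):
    """Return the k most frequent training relations for these labels."""
    return [rel for rel, _ in counts[(head, tail)].most_common(k)]

# Toy usage: given detections labeled "woman" and "motorcycle", the
# baseline predicts whichever relation was most frequent in training.
triples = [("woman", "riding", "motorcycle"),
           ("woman", "on", "motorcycle"),
           ("woman", "riding", "motorcycle")]
counts = build_frequency_baseline(triples)
print(predict_relation(counts, "woman", "motorcycle"))  # ['riding']
```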

Figure 2. Types of edges between high-level categories in Visual Genome. Geometric, possessive, and semantic edges cover 50.9%, 40.9%, and 8.7% of edge instances in scene graphs, respectively. The majority of semantic edges occur between people and vehicles, artifacts, and locations. Less than 2% of edges between clothes and people are semantic.
Figure 3. The likelihood of correctly guessing head, tail, or edge labels in a scene graph within the top k, given other graph components (i.e., without image features). Neither head nor tail labels are strongly determined by other labels, but given the identities of the head and tail, the edge (edge | head, tail) can be determined with 97% accuracy in under 5 guesses. Such strong biases make it critical to condition on objects when predicting edges.
Figure 4. On the left, the percent of images containing a graph motif, found in Visual Genome using pointwise mutual information, of at least a given length (the number of edges it contains). Over 50% of images have at least one motif involving two relationships. On the right, example motifs; structures that repeat many times are indicated with plate notation. For example, the second motif has length 8 and consists of 8 flower-in-vase relationships. Graph motifs commonly result from groups (e.g., several instances of “leaf on tree”) and from correlations between parts (e.g., “elephant has head,” “leg,” “trunk,” and “ear”).
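As a rough illustration of mining motif candidates with pointwise mutual information, the sketch below scores pairs of edge types that co-occur in the same image. Treating motifs as unordered pairs of edge-type strings and the `image_edge_types` input format are simplifying assumptions for this sketch, not the paper's actual subgraph-mining procedure.

```python
import math
from collections import Counter
from itertools import combinations

def high_pmi_edge_pairs(image_edge_types, min_count=5):
    """Score co-occurring edge types by pointwise mutual information:
    PMI(a, b) = log( p(a, b) / (p(a) * p(b)) ).

    image_edge_types: list with one set per image of edge-type strings
    such as "leaf-on-tree" (a hypothetical, simplified representation).
    """
    n = len(image_edge_types)
    single, pair = Counter(), Counter()
    for edges in image_edge_types:
        for e in edges:
            single[e] += 1
        for a, b in combinations(sorted(edges), 2):
            pair[(a, b)] += 1

    scores = {}
    for (a, b), c in pair.items():
        if c < min_count:
            continue  # ignore rare co-occurrences
        scores[(a, b)] = math.log((c / n) / ((single[a] / n) * (single[b] / n)))
    # Highest-PMI pairs are the strongest motif candidates.
    return sorted(scores.items(), key=lambda kv: -kv[1])
```

Under this scoring, correlated part relations such as “elephant has head” and “elephant has trunk” would surface together as motif candidates.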

We introduce the Stacked Motif Network (MOTIFNET), a new neural network architecture that complements existing approaches to scene graph parsing. We posit that the key challenge in modeling scene graphs lies in devising an efficient mechanism to encode the global context that can directly inform the local predictors (i.e., objects and relations). While previous work has used graph-based inference to propagate information in both directions between objects and relations [47, 25, 24], our analysis suggests that strong independence assumptions in local predictors limit the quality of global predictions. Instead, our model predicts graph elements in stages (bounding boxes, then object classifications, then relationships) such that the global context encoding of all earlier stages establishes rich context for predicting subsequent stages, as illustrated in Figure 5. We represent the global context via recurrent sequential architectures such as Long Short-term Memory networks (LSTMs) [15].
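The sketch below illustrates the staged-context idea in PyTorch: detector features are contextualized by one bidirectional LSTM, the resulting label predictions feed a second LSTM that produces edge context, and pairs of contextualized states score relations. All dimensions, module names, and the pairwise scoring head are illustrative simplifications, not the exact MOTIFNET architecture described later.

```python
import torch
import torch.nn as nn

class StagedContextSketch(nn.Module):
    """Toy version of staged prediction: object context informs object
    labels, and both feed an edge context that scores relations."""

    def __init__(self, feat_dim=256, hidden=256, n_obj=150, n_rel=50):
        super().__init__()
        # Stage 1: contextualize detector features across all boxes.
        self.obj_ctx = nn.LSTM(feat_dim, hidden, bidirectional=True,
                               batch_first=True)
        self.obj_cls = nn.Linear(2 * hidden, n_obj)
        # Stage 2: contextualize again with predicted labels mixed in.
        self.obj_embed = nn.Embedding(n_obj, feat_dim)
        self.edge_ctx = nn.LSTM(feat_dim + 2 * hidden, hidden,
                                bidirectional=True, batch_first=True)
        # Stage 3: score a relation for each (head, tail) pair.
        self.rel_score = nn.Linear(4 * hidden, n_rel)

    def forward(self, box_feats):
        # box_feats: (1, num_boxes, feat_dim) detector features.
        ctx, _ = self.obj_ctx(box_feats)
        obj_logits = self.obj_cls(ctx)
        labels = obj_logits.argmax(-1)
        edge_in = torch.cat([self.obj_embed(labels), ctx], dim=-1)
        ectx, _ = self.edge_ctx(edge_in)
        # Score every ordered (head, tail) pair of boxes.
        n = ectx.size(1)
        heads = ectx[0].unsqueeze(1).expand(n, n, -1)
        tails = ectx[0].unsqueeze(0).expand(n, n, -1)
        rel_logits = self.rel_score(torch.cat([heads, tails], dim=-1))
        return obj_logits, rel_logits

# Usage on random features for 5 detected boxes:
model = StagedContextSketch()
obj_logits, rel_logits = model(torch.randn(1, 5, 256))
print(obj_logits.shape, rel_logits.shape)  # (1, 5, 150), (5, 5, 50)
```

Staging lets the edge predictor condition on the label decisions of the object stage, which is exactly the asymmetric dependence between objects and relations identified in the analysis above.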