# Mapping Images to Scene Graphs with Permutation-Invariant Structured Prediction

## Abstract

Structured prediction is concerned with predicting multiple inter-dependent labels simultaneously. Classical methods like CRF achieve this by maximizing a score function over the set of possible label assignments. Recent extensions use neural networks to either implement the score function or in maximization. The current paper takes an alternative approach, using a neural network to generate the structured output directly, without going through a score function. We take an axiomatic perspective to derive the desired properties and invariances of a such network to certain input permutations, presenting a structural characterization that is provably both necessary and sufficient. We then discuss graph-permutation invariant (GPI) architectures that satisfy this characterization and explain how they can be used for deep structured prediction. We evaluate our approach on the challenging problem of inferring a {\em scene graph} from an image, namely, predicting entities and their relations in the image. We obtain state-of-the-art results on the challenging Visual Genome benchmark, outperforming all recent approaches.

## 1 Introduction

Understanding the semantics of a complex visual scene is a fundamental problem in machine perception. It often requires recognizing multiple objects in a scene, together with their spatial and functional relations. The set of objects and relations is sometimes represented as a graph, connecting objects (nodes) with their relations (edges) and is known as a scene graph (Figure 1 ). Scene graphs provide a compact representation of the semantics of an image, and can be useful for semantic-level interpretation and reasoning about a visual scene Johnson et al. (2018) . Scene-graph prediction is the problem of inferring the joint set of objects and their relations in a visual scene.

Since objects and relations are inter-dependent (e.g., a person and chair are more likely to be in relation "sitting on" than "eating"), a scene graph predictor should capture this dependence in order to improve prediction accuracy. This goal is a special case of a more general problem, namely, inferring multiple inter-dependent labels, which is the research focus of the field of structured prediction. Structured prediction has attracted considerable attention because it applies to many learning problems and Figure 1 : An image and its scene graph from the Visual Genome dataset (Krishna et al., 2017) . The scene graph captures the entities in the image (nodes, blue circles) like dog and their relations (edges, red circles) like hat, on, dog .

poses unique theoretical and algorithmic challenges (e.g., see Belanger et al., 2017; Chen et al., 2015; Taskar et al., 2004) . It is therefore a natural approach for predicting scene graphs from images.

Structured prediction models typically define a score function s(x, y) that quantifies how well a label assignment y is compatible with an input x. In the case of understanding complex visual scenes, x is an image, and y is a complex label containing the labels of objects detected in an image and the labels of their relations. In this setup, the inference task amounts to finding the label that maximizes the compatibility score y * = arg max y s(x, y). This score-based approach separates a scoring component -implemented by a parametric model, from an optimization component -aimed at finding a label that maximizes that score. Unfortunately, for a general scoring function s(•), the space of possible label assignments grows exponentially with input size. For instance, for scene graphs the set of possible object label assignments is too large even for relatively simple images, since the vocabulary of candidate objects may contain thousands of objects. As a result, inferring the label assignment that maximizes a scoring function is computationally hard in the general case.

An alternative approach to score-based methods is to map an input x to a structured output y with a "black box" neural network, without explicitly defining a score function. This raises a natural question: what is the right architecture for such a network? Here we take an axiomatic approach and argue that one important property such networks should satisfy is invariance to a particular type of input permutation. We then prove that this invariance is equivalent to imposing certain structural constraints on the architecture of the network, and describe architectures that satisfy these constraints.

To evaluate our approach, we first demonstrate on a synthetic dataset that respecting permutation invariance is important, because models that violate this invariance need more training data, despite having a comparable model size. Then, we tackle the problem of scene graph generation. We describe a model that satisfies the permutation invariance property, and show that it achieves state-of-the-art results on the competitive Visual Genome benchmark (Krishna et al., 2017) , demonstrating the power of our new design principle.

In summary, the novel contributions of this paper are: a) Deriving sufficient and necessary conditions for graph-permutation invariance in deep structured prediction architectures. b) Empirically demonstrating the benefit of graph-permutation invariance. c) Developing a state-of-the-art model for scene graph prediction on a large dataset of complex visual scenes.

## 2 Structured Prediction

Scored-based methods in structured prediction define a function s(x, y) that quantifies the degree to which y is compatible with x, and infer a label by maximizing s(x, y) (e.g., see Belanger et al., 2017; Chen et al., 2015; Lafferty et al., 2001; Meshi et al., 2010; Taskar et al., 2004) . Most score functions previously used decompose as a sum over simpler functions, s(x, y) = i f i (x, y), making it possible to optimize max y f i (x, y) efficiently. This local maximization forms the basic building block of algorithms for approximately maximizing s(x, y). One way to decompose the score function is to restrict each f i (x, y) to depend only on a small subset of the y variables.

The renewed interest in deep learning led to efforts to integrate deep networks with structured prediction, including modeling the f i functions as deep networks. In this context, the most widely-used score functions are singleton f i (y i , x) and pairwise f ij (y i , y j , x). The early work taking this approach used a two-stage architecture, learning the local scores independently of the structured prediction Figure 2 : Left: Graph permutation invariance. A graph labeling function F is graph permutation invariant (GPI) if permuting the node features maintains the output. Right: a schematic representation of the GPI architecture in Theorem 1. Singleton features zi are omitted for simplicity. (a) First, the features zi,j are processed element-wise by φ. (b) Features are summed to create a vector si, which is concatenated with zi. (c) A representation of the entire graph is created by applying α n times and summing the created vector. (d) The graph representation is then finally processed by ρ together with z k .

goal Farabet et al., 2013) . Later studies considered end-to-end architectures where the inference algorithm is part of the computation graph (Chen et al., 2015; Pei et al., 2015; Zheng et al., 2015) . Recent studies go beyond pairwise scores, also modelling global factors (Belanger et al., 2017; Gygli et al., 2017) .

Score-based methods provide several advantages. First, they allow intuitive specification of local dependencies between labels and how these translate to global dependencies. Second, for linear score functions, the learning problem has natural convex surrogates Lafferty et al. (2001) ; Taskar et al. (2004) . Third, inference in large label spaces is sometimes possible via exact algorithms or empirically accurate approximations. However, with the advent of deep scoring functions s(x, y; w), learning is no longer convex. Thus, it is worthwhile to rethink the architecture of structured prediction models, and consider models that map inputs x to outputs y directly without explicitly maximizing a score function. We would like these models to enjoy the expressivity and predictive power of neural networks, while maintaining the ability to specify local dependencies between labels in a flexible manner. In the next section, we present such an approach and consider a natural question: what should be the properties of a deep neural network used for structured prediction.

## 3 Permutation-Invariant Structured Prediction

In what follows we define the permutation-invariance property for structured prediction models, and argue that permutation invariance is a natural principle for designing their architecture.

We first introduce our notation. We focus on structures with pairwise interactions, because they are simpler in terms of notation and are sufficient for describing the structure in many problems. We denote a structured label by y = [y 1 , . . . , y n ]. In a score-based approach, the score is defined via a set of singleton scores f i (y i , x) and pairwise scores f ij (y i , y j , x), where the overall score s(x, y) is the sum of these scores. For brevity, we denote f ij = f ij (y i , y j , x) and f i = f i (y i , x). An inference algorithm takes as input the local scores f i , f ij and outputs an assignment that maximizes s(x, y). We can thus view inference as a black-box that takes node-dependent and edge-dependent inputs (i.e., the scores f i , f ij ) and returns a label y, even without an explicit score function s(x, y). While numerous inference algorithms exist for this setup, including belief propagation (BP) and mean field, here we develop a framework for a deep labeling algorithm (we avoid the term "inference" since the algorithm does not explicitly maximize a score function). Such an algorithm will be a black-box, taking the f functions as input and the labels y 1 , . . . , y n as output. We next ask what architecture such an algorithm should have.

We follow with several definitions. A graph labeling function F : (V, E) → Y is a function whose input is an ordered set of node features V = [z 1 , . . . , z n ] and an ordered set of edge features E = [z 1,2 . . . , z i,j , . . . , z n,n−1 ]. For example, z i can be the array of values f i , and z i,j can be the table of values f i,j . Assume z i ∈ R d and z i,j ∈ R e . The output of F is a set of node labels y = [y 1 , . . . , y n ]. Thus, algorithms such as BP are graph labeling functions. However, graph labeling functions do not necessarily maximize a score function. We denote the joint set of node features and edge features by z (i.e., a set of n + n(n − 1) = n 2 vectors). In Section 3.1 we discuss extensions to this case where only a subset of the edges is available.

A natural requirement is that the function F produces the same result when given the same features, up to a permutation of the input. For example, consider a label space with three variables y 1 , y 2 , y 3 , and assume that F takes as input

z = (z 1 , z 2 , z 3 , z 12 , z 13 , z 23 ) = (f 1 , f 2 , f 3 , f 12 , f 13 , f 23 )

, and outputs a label y = (y * 1 , y * 2 , y * 3 ). When F is given an input that is permuted in a consistent way, say, z = (f 2 , f 1 , f 3 , f 21 , f 23 , f 13 ), this defines exactly the same input. Hence, the output should still be y = (y * 2 , y * 1 , y * 3 ). Most inference algorithms, including BP and mean field, satisfy this symmetry requirement by design, but this property is not guaranteed in general in a deep model. Here, our goal is to design a deep learning black-box, and hence we wish to guarantee invariance to input permutations. A black-box that violates this invariance "wastes" capacity on learning it at training time, which increases sample complexity, as shown in Sec. 5.1. We proceed to formally define the permutation invariance property. Definition 1. Let z be a set of node features and edge features, and let σ be a permutation of {1, . . . , n}. We define σ(z) to be a new set of node and edge features given

by [σ(z)] i = z σ(i) and [σ(z)] i,j = z σ(i),σ(j) .

We also use the notation σ([y 1 , . . . , y n ]) = [y σ(1) , . . . , y σ(n) ] for permuting the labels. Namely, σ applied to a set of labels yields the same labels, only permuted by σ. Be aware that applying σ to the input features is different from permuting labels, because edge input features must permuted in a way that is consistent with permuting node input features. We now provide our key definition of a function whose output is invariant to permutations of the input. See Figure 2 (left). Definition 2. A graph labeling function F is said to be graph-permutation invariant (GPI), if for all permutations σ of {1, . . . , n} and for all z it satisfies: F(σ(z)) = σ(F(z)).

## 3.1 Characterizing Permutation Invariance

Motivated by the above discussion, we ask: what structure is necessary and sufficient to guarantee that F is GPI? Note that a function F takes as input an ordered set z. Therefore its output on z could certainly differ from its output on σ(z). To achieve permutation invariance, F should contain certain symmetries. For instance, one permutation invariant architecture could be to define y i = g(z i ) for any function g, but this architecture is too restrictive and does not cover all permutation invariant functions. Theorem 1 below provides a complete characterization (see Figure 2 for the corresponding architecture). Intuitively, the architecture in Theorem 1 is such that it can aggregate information from the entire graph, and do so in a permutation invariant manner. Theorem 1. Let F be a graph labeling function. Then F is graph-permutation invariant if and only if there exist functions α, ρ, φ such that for all k = 1, . . . , n:

EQUATION (1): Not extracted; please refer to original document.

where φ :

R 2d+e → R L , α : R d+L → R W and ρ : R W +d → R.

Proof. First, we show that any F satisfying the conditions of Theorem 1 is GPI. Namely, for any permutation σ,

[F(σ(z))] k = [F(z)] σ(k)

. To see this, write [F(σ(z))] k using Eq. 1 and Definition 1:

EQUATION (2): Not extracted; please refer to original document.

The second argument of ρ above is invariant under σ, because it is a sum over nodes and their neighbors, which is invariant under permutation. Thus Eq. 2 is equal to:

ρ(z σ(k) , i α(z i , j =i φ(z i , z i,j , z j ))) = [F(z)] σ(k)

where equality follows from Eq. 1. We thus proved that Eq. 1 implies graph permutation invariance.

Next, we prove that any given GPI function F 0 can be expressed as a function F in Eq. 1. Namely, we show how to define φ, α and ρ that can implement F 0 . Note that in this direction of the proof the function F 0 is a black-box. Namely, we only know that it is GPI, but do not assume anything else about its implementation.

The key idea is to construct φ, α such that the second argument of ρ in Eq. 1 contains the information about all the graph features z. Then, the function ρ corresponds to an application of F 0 to this representation, followed by extracting the label y k . To simplify notation assume edge features are scalar (e = 1). The extension to vectors is simple, but involves more indexing.

We assume WLOG that the black-box function F 0 is a function only of the pairwise features z i,j (otherwise, we can always augment the pairwise features with the singleton features). Since z i,j ∈ R we use a matrix R n,n to denote all the pairwise features.

Finally, we assume that our implementation of F 0 will take additional node features z k such that no two nodes have the same feature (i.e., the features identify the node).

Our goal is thus to show that there exist functions α, φ, ρ such that the function in Eq. 2 applied to Z yields the same labels as F 0 (Z).

Let H be a hash function with L buckets mapping node features z i to an index (bucket). Assume that H is perfect (this can be achieved for a large enough L). Define φ to map the pairwise features to a vector of size L. Let 1 [j] be a one-hot vector of dimension R L , with one in the j th coordinate. Recall that we consider scalar z i,j so that φ is indeed in R L , and define φ as:

φ(z i , z i,j , z j ) = 1 [H(z j )] z i,j , i.

e., φ "stores" z i,j in the unique bucket for node j.

Let

s i = zi,j ∈E φ(z i , z i,j , z j ) be the second argument of α in Eq. 1 (s i ∈ R L ).

Then, since all z j are distinct, s i stores all the pairwise features for neighbors of i in unique positions within its L coordinates. Since s i (H(z k )) contains the feature z i,k whereas s j (H(z k )) contains the feature z j,k , we cannot simply sum the s i , since we would lose the information of which edges the features originated from. Instead, we define α to map s i to R L×L such that each feature is mapped to a distinct location. Formally:

EQUATION (3): Not extracted; please refer to original document.

α outputs a matrix that is all zeros except for the features corresponding to node i that are stored in row H(z i ). The matrix M = i α(z i , s i ) (namely, the second argument of ρ in Eq. 1) is a matrix with all the edge features in the graph including the graph structure.

To complete the construction we set ρ to have the same outcome as F 0 . We first discard rows and columns in M that do not correspond to original nodes (reducing M to dimension n × n). Then, we use the reduced matrix as the input z to the black-box F 0 .

Assume for simplicity that M does not need to be contracted (this merely introduces another indexing step). Then M corresponds to the original matrix Z of pairwise features, with both rows and columns permuted according to H. We will thus use M as input to the function F 0 . Since F 0 is GPI, this means that the label for node k will be given by F 0 (M ) in position H(z k ). Thus we set

ρ(z k , M ) = [F 0 (M )] H(z k )

, and by the argument above this equals [F 0 (Z)] k , implying that the above α, φ and ρ indeed implement F 0 .

Extension to general graphs So far, we discussed complete graphs, where edges correspond to valid feature pairs. However, many graphs of interest might be incomplete. For example, an n-variable chain graph in sequence labeling has only n − 1 edges. For such graphs, the input to F would not contain all z i,j pairs but rather only features corresponding to valid edges of the graph, and we are only interested in invariances that preserve the graph structure, namely, the automorphisms of the graph. Thus, the desired invariance is that σ(F(z)) = F(σ(z)), where σ is not an arbitrary permutation but an automorphism. It is easy to see that a simple variant of Theorem 1 holds in this case. All we need to do is replace in Eq. 2 the sum j =i with j∈N (i) , where N (i) are the neighbors of node i in the graph. The arguments are then similar to the proof above.

Implications of Theorem 1 Our result has interesting implications for deep structured prediction. First, it highlights that the fact that the architecture "collects" information from all different edges of the graph, in an invariant fashion via the α, φ functions. Specifically, the functions φ (after summation) aggregate all the features around a given node, and then α (after summation) can collect them. Thus, these functions can provide a summary of the entire graph that is sufficient for downstream algorithms. This is different from one round of message passing algorithms which would not be sufficient for collecting global graph information. Note that the dimensions of φ, α may need to be large to aggregate all graph information (e.g., by hashing all the features as in the proof of Theorem 1), but the architecture itself can be shallow.

Second, the architecture is parallelizable, as all φ functions can be applied simultaneously. This is in contrast to recurrent models Zellers et al. (2017) which are harder to parallelize and are thus slower in practice.

Finally, the theorem suggests several common architectural structures that can be used within GPI. We briefly mention two of these. 1) Attention: Attention is a powerful component in deep learning architectures (Bahdanau et al., 2015) , but most inference algorithms do not use attention. Intuitively, in attention each node i aggregates features of neighbors through a weighted sum, where the weight is a function of the neighbor's relevance. For example, the label of an entity in an image may depend more strongly on entities that are spatially closer. Attention can be naturally implemented in our GPI characterization, and we provide a full derivation for this implementation in the appendix. It plays a key role in our scene graph model described below. 2) RNNs: Because GPI functions are closed under composition, for any GPI function F we can run F iteratively by providing the output of one step of F as part of the input to the next step and maintain GPI. This results in a recurrent architecture, which we use in our scene graph model.

## 4 Related Work

The concept of architectural invariance was recently proposed in DEEPSETS (Zaheer et al., 2017) . The invariance we consider is much less restrictive: the architecture does not need to be invariant to all permutations of singleton and pairwise features, just those consistent with a graph re-labeling. This characterization results in a substantially different set of possible architectures.

Deep structured prediction. There has been significant recent interest in extending deep learning to structured prediction tasks. Much of this work has been on semantic segmentation, where convolutional networks (Shelhamer et al., 2017 ) became a standard approach for obtaining "singleton scores" and various approaches were proposed for adding structure on top. Most of these approaches used variants of message passing algorithms, unrolled into a computation graph (Xu et al., 2017) . Some studies parameterized parts of the message passing algorithm and learned its parameters (Lin et al., 2015) . Recently, gradient descent has also been used for maximizing score functions (Belanger et al., 2017; Gygli et al., 2017) . An alternative to deep structured prediction is greedy decoding, inferring each label at a time based on previous labels. This approach has been popular in sequencebased applications (e.g., parsing (Chen & Manning, 2014) ), relying on the sequential structure of the input, where BiLSTMs are effectively applied. Another related line of work is applying deep learning to graph-based problems, such as TSP (Bello et al., 2016; Gilmer et al., 2017; Khalil et al., 2017) . Clearly, the notion of graph invariance is important in these, as highlighted in (Gilmer et al., 2017) . They however do not specify a general architecture that satisfies invariance as we do here, and in fact focus on message passing architectures, which we strictly generalize. Furthermore, our focus is on the more general problem of structured prediction, rather than specific graph-based optimization problems.

Scene graph prediction. Extracting scene graphs from images provides a semantic representation that can later be used for reasoning, question answering, and image retrieval (Johnson et al., 2015; Lu et al., 2016; Raposo et al., 2017) . It is at the forefront of machine vision research, integrating challenges like object detection, action recognition and detection of human-object interactions (Liao et al., 2016; Plummer et al., 2017) . Prior work on scene graph predictions used neural message passing algorithms (Xu et al., 2017) as well as prior knowledge in the form of word embeddings (Lu et al., 2016) . Other work suggested to predict graphs directly from pixels in an end-to-end manner . NeuralMotif (Zellers et al., 2017) , currently the state-of-the-art model for scene graph prediction on Visual Genome, employs an RNN that provides global context by sequentially reading the independent predictions for each entity and relation and then refines those predictions. The NEURALMOTIF model maintains GPI by fixing the order in which the RNN reads its inputs and thus only a single order is allowed. However, this fixed order is not guaranteed to be optimal.

## 5 Experimental Evaluation

We empirically evaluate the benefit of GPI architectures. First, using a synthetic graph-labeling task, and then for the problem of mapping images to scene graphs.

## 5.1 Synthetic Graph Labeling

We start with studying GPI on a synthetic problem, defined as follows. An input graph G = (V, E) is given, where each node i ∈ V is assigned to one of K sets. The set for node i is denoted by Γ(i). The goal is to compute for each node the number of neighbors that belong to the same set. Namely, the label of a node is

y i = j∈N (i) 1[Γ(i) = Γ(j)].

We generated random graphs with 10 nodes (larger graphs produced similar results) by sampling each edge independently and uniformly, and sampling Γ(i) for every node uniformly from {1, . . . , K}. The node features z i ∈ {0, 1} K are one-hot vectors of Γ(i) and the edge features z i,j ∈ {0, 1} indicate whether ij ∈ E. We compare two standard non-GPI architectures and one GPI architecture: (a) A GPI-architecture for graph prediction, described in detail in Section 5.2. We used the basic version without attention and RNN. (b) LSTM:

We replace φ(•) and α(•), which perform aggregation in Theorem 1 with two LSTMs with a state size of 200 that read their input in random order. (c) A fully-connected (FC) feed-forward network with 2 hidden layers of 1000 nodes each. The input to the fully connected model is a concatenation of all node and pairwise features. The output is all node predictions. The focus of the experiment is to study sample complexity. Therefore, for a fair comparison, we use the same number of parameters for all models. Figure 3 , shows the results, demonstrating that GPI requires far fewer samples to converge to the correct solution. This illustrates the advantage of an architecture with the correct inductive bias for the problem.

## 5.2 Scene-Graph Classification

We evaluate the GPI approach on the motivating task of this paper, inferring scene graphs from images (Figure 1) . In this problem, the input is an image annotated with a set of bounding boxes for the entities in the image. 2 The goal is to label each bounding box with the correct entity category and every pair of entities with their relation, such that they form a coherent scene graph.

We begin by describing our Scene Graph Predictor (SGP) model. We aim to predict two types of variables. The first is entity variables [y 1 , . . . , y n ] for all bounding boxes. Each y i can take one of L values (e.g., "dog", "man"). The second is relation variables [y n+1 , . . . , y n 2 ] for every pair of bounding boxes. Each such y j can take one of R values (e.g., "on", "near"). Our graph connects variables that are expected to be inter-related. It contains two types of edges: 1) entity-entity edge connecting every two entity variables (y i and y j for 1 ≤ i = j ≤ n. 2) entity-relation edges connecting every relation variable y k (where k > n) to its two entity variables. Thus, our graph is not a complete graph and our goal is to design an architecture that will be invariant to any automorphism of the graph, such as permutations of the entity variables.

For the input features z, we used the features learned by the baseline model from Zellers et al. (2017). 3 Specifically, the entity features z i included (1) The confidence probabilities of all entities for y i as learned by the baseline model. (2) Bounding box information given as (left, bottom, width, height); (3) The number of smaller entities (also bigger); (4) The number of entities to the left, right, above and below. (5) The number of entities with higher and with lower confidence;

## Constrained Evaluation

Unconstrained Evaluation SGCls PredCls SGCls PredCls R@50 R@100 R@50 R@100 R@50 R@100 R@50 R@100 Lu et al., 2016 (Lu et al., 2016 11.8 14.1 35.0 27.9 ---- Xu et al., 2017 (Xu et al., 2017 21 Table 1 : Test set results for graph-constrained evaluation (i.e., the returned triplets must be consistent with a scene graph) and for unconstrained evaluation (triplets need not be consistent with a scene graph).

(6) For the linguistic model only: word embedding of the most probable class. Word vectors were learned with GLOVE from the ground-truth captions of Visual Genome.

Similarly, the relation features z j ∈ R R contained the probabilities of relation entities for the relation j. For the Linguistic model, these features were extended to include word embedding of the most probable class. For entity-entity pairwise features z i,j , we use the relation probability for each pair. Because the output of SGP are probability distributions over entities and relations, we use them as an the input z to SGP, once again in a recurrent manner and maintain GPI.

We next describe the main components of the GPI architecture. First, we focus on the parts that output the entity labels. φ ent is the network that integrates features for two entity variables y i and y j . It simply takes z i , z j and z i,j as input, and outputs a vector of dimension n 1 . Next, the network α ent takes as input the outputs of φ ent for all neighbors of an entity, and uses the attention mechanism described above to output a vector of dimension n 2 . Finally, the ρ ent network takes these n 2 dimensional vectors and outputs L logits predicting the entity value. The ρ rel network takes as input the α ent representation of the two entities, as well as z i,j and transforms the output into R logits. See appendix for specific network architectures.

## 5.2.1 Experimental Setup And Results

Dataset. We evaluated our approach on Visual Genome (VG) (Krishna et al., 2017) , a dataset with 108,077 images annotated with bounding boxes, entities and relations. On average, images have 12 entities and 7 relations per image. For a proper comparison with previous results Xu et al., 2017; Zellers et al., 2017) , we used the data from (Xu et al., 2017) , including the train and test splits. For evaluation, we used the same 150 entities and 50 relations as in Xu et al., 2017; Zellers et al., 2017) . To tune hyper-parameters, we also split the training data into two by randomly selecting 5K examples, resulting in a final 70K/5K/32K split for train/validation/test sets.

Training. All networks were trained using Adam (Kingma & Ba, 2014) with batch size 20. Hyperparameter values below were chosen based on the validation set. The SGP loss function was the sum of cross-entropy losses over all entities and relations in the image. In the loss, we penalized entities 4 times more strongly than relations, and penalized negative relations 10 times more weakly than positive relations.

Evaluation. In (Xu et al., 2017) three different evaluation settings were considered. Here we focus on two of these: (1) SGCls: Given ground-truth bounding boxes for entities, predict all entity categories and relations categories.

(2) PredCls: Given bounding boxes annotated with entity labels, predict all relations. Following (Lu et al., 2016) , we used Recall@K as the evaluation metric. It measures the fraction of correct ground-truth triplets that appear within the K most confident triplets proposed by the model. Two evaluation protocols are used in the literature differing in whether they enforce graph constraints over model predictions. The first graph-constrained protocol requires that the top-K triplets assign one consistent class per entity and relation. The second unconstrained protocol does not enforce any such constraints. We report results on both protocols, following (Zellers et al., 2017) . Models and baselines. We compare four variants of our GPI approach with the reported results of four baselines that are currently the state-of-the-art on various scene graph prediction problems (all models use the same data split and pre-processing as (Xu et al., 2017) to produce a full graph from the image. 4) YANG ET AL., 2018 (YANG ET AL., 2018): The GRAPH R-CNN model uses object-relation regularities to sparsify and reason over scene graphs. 5) ZELLERS ET AL., 2017 (ZELLERS ET AL., 2017): The NEURALMOTIF method encodes global context for capturing high-order motifs in scene graphs, and the BASELINE outputs the entities and relations distributions without using the global context. The following variants of GPI were compared: 1) GPI: NO ATTENTION: Our GPI model, but with no attention mechanism. Instead, following Theorem 1, we simply sum the features. 2) GPI: NEIGHBORATTENTION: Our GPI model, with attention over neighbors features. 3) GPI: LINGUISTIC: Same as GPI: NEIGHBORATTENTION but also concatenating the word embedding vector, as described above.

Results. Table 1 shows recall@50 and recall@100 for three variants of our approach, and compared with five baselines. All GPI variants performs well, with LINGUISTIC outperforming all baselines for SGCls and being comparable to the state-of-the-art model for PredCls. Note that PredCl is an easier task, which makes less use of the structure, hence it is not surprising that GPI achieves similar accuracy to Zellers et al. (2017) . Figure 4 illustrates the model behavior. Predicting isolated labels with z i (4c) mislabels several entities, but these are corrected at the final output (4d). Figure 4e shows that the system learned to attend more to nearby entities (the window and building are closer to the tree), and 4f shows that stronger attention is learned for the class bird, presumably because it is usually more informative than common classes like tree.

Implementation details. The φ and α networks were each implemented as a single fully-connected (FC) layer with a 500-dimensional outputs. ρ was implemented as a FC network with 3 500dimensional hidden layers, with one 150-dimensional output for the entity probabilities, and one 51-dimensional output for relation probabilities. The attention mechanism was implemented as a network like to φ and α, receiving the same inputs, but using the output scores for the attention . The full code is available at https://github.com/shikorab/SceneGraph

## 6 Conclusion

We presented a deep learning approach to structured prediction, which constrains the architecture to be invariant to structurally identical inputs. As in score-based methods, our approach relies on pairwise features, capable of describing inter-label correlations, and thus inheriting the intuitive aspect of score-based approaches. However, instead of maximizing a score function (which leads to computationally-hard inference), we directly produce an output that is invariant to equivalent representations of the pairwise terms.

This axiomatic approach to model architecture can be extended in many ways. For image labeling, geometric invariances (shift or rotation) may be desired. In other cases, invariance to feature permutations may be desirable. We leave the derivation of the corresponding architectures to future work. Finally, there may be cases where the invariant structure is unknown and should be discovered from data, which is related to work on lifting graphical models Bui et al. (2013) . It would be interesting to explore algorithms that discover and use such symmetries for deep structured prediction.

Zellers, Rowan, Yatskar, Mark, Thomson, Sam, and Choi, Yejin. Neural motifs: Scene graph parsing with global context. arXiv preprint arXiv:1711.06640, abs/1711.06640, 2017.

Zheng, Shuai, Jayasumana, Sadeep, Romera-Paredes, Bernardino, Vineet, Vibhav, Su, Zhizhong, Du, Dalong, Huang, Chang, and Torr, Philip HS. Conditional random fields as recurrent neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1529-1537, 2015.

## 7 Supplementary Material

This supplementary material includes: (1) Visual illustration of the proof of Theorem 1. (2) Explaining how to integrate an attention mechanism in our GPI framework. 3Additional evaluation method to further analyze and compare our work with baselines.

7.1 Theorem 1: Illustration of Proof Figure 5 : Illustration of the proof of Theorem 1 using a specific construction example. Here H is a hash function of size L = 5 such that H(1) = 1, H(3) = 2, H(2) = 4, G is a three-node input graph, and zi,j ∈ R are the pairwise features (in purple) of G. (a) φ is applied to each zi,j. Each application yields a vector in R 5 . The three dark yellow columns correspond to φ(z1,1), φ(z1,2) and φ(z1,3). Then, all vectors φ(zi,j) are summed over j to obtain three si vectors. (b) α's (blue matrices) are an outer product between 1 [H(zi)] and si resulting in a matrix of zeros except one row. The dark blue matrix corresponds for α(z1, s1). (c) All α's are summed to a 5 × 5 matrix, isomorphic to the original zi,j matrix.

## 7.2 Characterizing Permutation Invariance: Attention

Attention is a powerful component which naturally can be introduced into our GPI model. We now show how attention can be introduced in our framework. Formally, we learn attention weights for the neighbors j of a node i, which scale the features z i,j of that neighbor. We can also learn different attention weights for individual features of each neighbor in a similar way.

Let w i,j ∈ R be an attention mask specifying the weight that node i gives to node j: w i,j (z i , z i,j , z j ) = e β(zi,zi,j ,zj ) / t e β(zi,zi,t,zt)

where β can be any scalar-valued function of its arguments (e.g., a dot product of z i and z j as in standard attention models). To introduce attention we wish α ∈ R e to have the form of weighting w i,j over neighboring feature vectors z i,j , namely, α = j =i w i,j z i,j .

To achieve this form we extend φ by a single entry, defining φ ∈ R e+1 (namely we set L = e + 1) as φ 1:e (z i , z i,j , z j ) = e β(zi,zi,j ,zj ) z i,j (here φ 1:e are the first e elements of φ) and φ e+1 (z i , z i,j ; z j ) = e β (zi,zi,j ,zj ) . We keep the definition of s i = j =i φ(z i , z i,j , z j ). Next, we define α = si,1:e si,e+1 and substitute s i and φ to obtain the desired form as attention weights w i,j over neighboring feature vectors z i,j : zi,zi,j ,zj ) z i,j j =i e β(zi,zi,j ,zj ) = j =i w i,j z i,j

α(z i , s i ) = s i,1:e s i,e+1 = j =i e β(

A similar approach can be applied over α and ρ to model attention over the outputs of α as well (graph nodes).

## 7.3 Scene Graph Results

In the main paper, we described the results for the two prediction tasks: SGCls and PredCls, as defined in section 5.2.1: "Experimental Setup and Results". To further analyze our module, we compare the best variant, GPI: LINGUISTIC, per relation to two baselines: (Lu et al., 2016) and Xu et al. (2017) . Table 2 : Recall@5 of PredCls for the 20-top relations ranked by their frequency, as in (Xu et al., 2017)

For simplicity, we focus on the task where boxes are given.3 The baseline does not use any LSTM or context, and is thus unrelated to the main contribution of Zellers et al. (2017).