Video Relationship Reasoning Using Gated Spatio-Temporal Energy Graph
Authors
Abstract
Visual relationship reasoning is a crucial yet challenging task for understanding rich interactions across visual concepts. For example, a relationship {man, open, door} involves a complex relation {open} between concrete entities {man, door}. While much of the existing work has studied this problem in the context of still images, understanding visual relationships in videos has received limited attention. Due to their temporal nature, videos enable us to model and reason about a more comprehensive set of visual relationships, such as those requiring multiple (temporal) observations (e.g., {man, lift up, box} vs. {man, put down, box}), as well as relationships that are often correlated through time (e.g., {woman, pay, money} followed by {woman, buy, coffee}). In this paper, we construct a Conditional Random Field on a fully-connected spatio-temporal graph that exploits the statistical dependency between relational entities spatially and temporally. We introduce a novel gated energy function parametrization that learns adaptive relations conditioned on visual observations. Our model optimization is computationally efficient, and its space and computation complexity is significantly amortized through our proposed parameterization. Experimental results on benchmark video datasets (ImageNet Video and Charades) demonstrate state-of-the-art performance across three standard relationship reasoning tasks: Detection, Tagging, and Recognition.
1. Introduction
Relationship reasoning is a challenging task that involves not only detecting low-level entities (subjects, objects, etc.) but also recognizing the high-level interactions between them (actions, sizes, parts, etc.). Successfully reasoning about relationships not only enables us to build richer question-answering models (e.g., Which objects are larger than a car?), but also helps in improving image retrieval [20] (e.g., images with elephants drawing a cart), scene graph parsing [41] (e.g., woman has helmet), captioning [42], and many other visual reasoning tasks.
Figure 1. Visual relationship reasoning in images (top) vs. videos (bottom): Given a single image, it is ambiguous whether the monkey is creeping up or down the car. Using a video not only helps to unambiguously recognize a richer set of relations, but also to model temporal correlations across them (e.g., creep down and jump left).

Most contemporary research in visual relationship reasoning has focused on the domain of static images. While this has resulted in several exciting and attractive reasoning modules [26, 20, 42, 18, 40, 45, 3, 17], it lacks the ability to reason about complex relations that are inherently temporal and/or correlated in nature. For example, in Fig. 1 it is ambiguous to infer from a static image whether the monkey is creeping down or up the car. Also, it is difficult to model relations that are often correlated through time, such as man enters room and man opens door.
In this paper, we present a novel approach for reasoning about visual relationships in videos. Our proposed approach jointly models the spatial and temporal structure of relationships in videos by constructing a fully-connected spatio-temporal graph (see Fig. 2 ). We refer to our model as a Gated Spatio-Temporal Energy Graph. In our graph, each node represents an entity and the edges between them denote the statistical relations. Unlike much of the previous work [15, 43, 27, 4, 31] that assumed a predefined or globally-learned pairwise energy function, we introduce an observation-gated version that allows us to make the statistical dependency between entities adaptive (conditioned on the observation).
Figure 2. An overview of our proposed Gated Spatio-Temporal Energy Graph. Given an input instance (a video clip), we predict the output relationships (e.g., {monkey, creep down, car}) by reasoning over a fully-connected spatio-temporal graph with nodes S (Subject), P (Predicate), and O (Object). Unlike previous works that assumed a non-gated (i.e., predefined or globally-learned) pairwise energy function, we explore the use of gated energy functions (i.e., conditioned on the specific visual observation). Best viewed zoomed in and in color.

Our adaptive parameterization of the energy function helps us model the natural diversification of relationships in videos. For instance, the dependency between man and cooking should be different conditioned on the observation (i.e., whether the location is a kitchen or a gym). However, given the large state space of observations (in videos), directly maintaining observation-dependent statistical dependencies may be computationally intractable [22, 35]. Towards this end, we develop an amortized parameterization of our new gated pairwise energy function, which combines ideas from clique templates [33, 34, 21], neural networks [8, 35], and tensor factorization [14] to achieve efficient inference and learning. We evaluate our model on two benchmark datasets, ImageNet Video [24] and Charades [32]. Our method achieves state-of-the-art performance across three standard relationship reasoning tasks: detection, tagging, and recognition. We also study the utility of our model in the zero-shot setting and in learning from semantic priors.
2. Related Work
Video Activity Recognition. The notion of activity in a video represents the interaction between objects [9, 12] or the interaction between an object and a scene [32]. While related to our task of relationship reasoning, activity recognition does not require explicit prediction of all entities, such as subject, object, scene, and their relationships. The term relation used in activity recognition and in relationship reasoning has different connotations. In the visual relationship reasoning literature, it refers to the correlation between different entities, such as object, verb, and scene, while in activity recognition it refers either to correlation between activity predictions (i.e., a single entity) or to correlation between video segments. For example, [44] proposed the Temporal Relation Network to reason about temporal 'relations' across frames at multiple time scales. [6] introduced a spatio-temporal aggregation of local convolutional features for better learning video representations. [38] proposed Non-Local Neural Networks to model pairwise relations for every pixel in the feature space, from lower layers to higher layers. This work was extended in [39] to construct a Graph Convolutional Layer that further models relations between object-level features.
Visual Relationship Reasoning. Most recent works in relationship reasoning have focused their analysis on static images [40, 45, 3, 17]. For example, [26] introduced the idea of visual phrases for composing the visual concepts of subject, predicate, and object. [20] decomposed the direct visual phrase detection task into individual detection of subject, predicate, and object, leading to improved performance. [4] further applied conditional random fields on top of the individual predictions to leverage their statistical correlations. [18] proposed a deep variation-structured reinforcement learning framework that forms a directed semantic action graph; the global interdependency in this graph facilitates predictions in local regions of the image. One of the key challenges of learning relationships in videos has been the lack of relevant annotated datasets. In this context, the recent work of [29] is inspiring as it contributes manually annotated relations for the ImageNet video dataset. Our work improves upon [29] on multiple fronts: (1) Instead of assuming no temporal contingency between relationships, we introduce a gated fully-connected spatio-temporal energy graph for modeling the inherently rich structure of videos; (2) We extend the study of relation triplets from subject/predicate/object to a more general setting, such as object/verb/scene [32]; (3) We consider a new task, relation recognition (apart from relation detection and tagging), which requires the model to make predictions in a fine-grained manner; (4) For various metrics and tasks, our model demonstrates improved performance.
Deep Conditional Random Fields. Conditional Random Fields (CRFs) have been widely used to model the statistical dependencies among predictions in images [10, 43, 27, 25, 4] and videos [23, 31]. Several extensions have recently been introduced for fully-connected CRF graphs. For example, [43, 27, 31] expressed fully-connected CRFs as recurrent neural networks and made the whole network end-to-end trainable, which has led to interesting applications in image segmentation [43, 27] and video activity recognition [31]. In the characterization of CRFs, the unary energy function represents the inverse likelihood of assigning a label, while the binary energy function measures the cost of assigning multiple labels jointly. However, most existing parameterizations of binary energy functions [15, 43, 27, 4, 31] have limited or no connection to the observed variables. Such parameterizations may not be optimal for video relationship reasoning, where the statistical dependencies between entities are inherently adaptive. To address this issue, we instead propose an observation-gated pairwise energy function with an efficient and amortized parameterization.
3. Proposed Approach
The task of video relationship reasoning not only requires modeling entity predictions spatially and temporally, but also maintaining a changeable correlation structure between entities across videos with varying content. To this end, we propose a Gated Spatio-Temporal Fully-Connected Energy Graph that captures this inherently rich video structure.
We start by defining our notation using Fig. 2 as a running example. The input instance X is a video segment and consists of K synchronous input streams $X = \{X^k\}_{k=1}^{K}$. In this example, the input streams are {object trajectories, predicate trajectories, subject trajectories}, and thus K = 3, where trajectories refer to consecutive frames or bounding boxes in the video segment. Each input stream contains observations for T time steps (i.e., $X^k = \{X^k_t\}_{t=1}^{T}$), where, for example, object trajectories represent object bounding boxes through time. For each input stream, our goal is to predict a sequence of entities (labels) $Y^k = \{Y^k_t\}_{t=1}^{T}$. In Fig. 2, the output sequence of the predicate stream represents predicate labels through time. Hence we formulate the data-entities tuple as (X, Y) with $Y = \{Y^1_t, Y^2_t, \cdots, Y^K_t\}_{t=1}^{T}$ representing the set of entity sequences.
The entity $Y^k_t$ should spatially relate to the entities $\{Y^1_t, Y^2_t, \cdots, Y^K_t\} \setminus \{Y^k_t\}$ and temporally relate to the entities $\{Y^k_1, Y^k_2, \cdots, Y^k_T\} \setminus \{Y^k_t\}$. For example, suppose that the visual relationships observed in a grocery store are {{mother, pay, money}, {infant, get, milk}, {infant, drink, milk}}; spatial correlation must exist between mother/pay/money, and temporal correlation must exist between pay/get/drink. We also note that implicit correlation may exist between $Y^k_t$ and $Y^{k'}_{t'}$ for $t \neq t'$, $k \neq k'$. Based on these structural dependencies between entities, we propose to construct a Spatio-Temporal Fully-Connected Energy Graph (see Sec. 3.1), where each node represents an entity and each edge denotes the statistical dependencies between the connected nodes. To further account for the fact that the statistical dependency between "get" and "drink" may differ under different observations (i.e., location in a grocery store vs. at home), we introduce an observation-gated parameterization for the pairwise energy functions. In the new parameterization, we amortize the potentially large computational cost by using clique templates [33, 34, 21], neural network approximation [22, 35], and tensor factorization [14] (see Sec. 3.2).
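To make the graph structure concrete, the following is a minimal Python sketch (ours, not from the paper) that enumerates the nodes (t, k) and groups every pairwise edge of the fully-connected spatio-temporal graph into the spatial, temporal, and implicit categories described above; the stream and chunk counts are illustrative.

```python
from itertools import combinations

def build_st_graph(T, K):
    """Enumerate nodes (t, k) and all pairwise edges of the
    fully-connected spatio-temporal graph."""
    nodes = [(t, k) for t in range(T) for k in range(K)]
    edges = {"spatial": [], "temporal": [], "implicit": []}
    for (t, k), (t2, k2) in combinations(nodes, 2):
        if t == t2:                      # same time step, different streams
            edges["spatial"].append(((t, k), (t2, k2)))
        elif k == k2:                    # same stream, different time steps
            edges["temporal"].append(((t, k), (t2, k2)))
        else:                            # different time step and stream
            edges["implicit"].append(((t, k), (t2, k2)))
    return nodes, edges

# Example: K = 3 streams (subject, predicate, object) and T = 4 chunks.
nodes, edges = build_st_graph(T=4, K=3)
print(len(nodes), {name: len(e) for name, e in edges.items()})
```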
3.1. Spatio-Temporal Fully-Connected Graph
By treating the predictions of entities as random variables, the construction of the graph can be realized by forming a Markov Random Field (MRF) conditioned on a global observation, which is the input instance (i.e., X). Then, the tuple (X, Y ) can be modeled as a Conditional Random Field (CRF) parametrized by a Gibbs distribution of the form:
$$P(Y = y \mid X) = \frac{1}{Z(X)} \exp\big(-E(y \mid X)\big),$$

where Z(X) is the partition function and E(y|X) is the energy of assigning the labels $Y = y = \{y^1_t, y^2_t, \cdots, y^K_t\}_{t=1}^{T}$ conditioned on X.
Assuming only pairwise cliques in the graph (i.e., $P(y|X) := P_{\psi,\phi}(y|X)$ and $E(y|X) := E_{\psi,\phi}(y|X)$), the energy can be expressed as:

$$E_{\psi,\phi}(y \mid X) = \sum_{t,k} \psi_{t,k}(y^k_t \mid X) + \sum_{t,k} \sum_{(t',k') \neq (t,k)} \phi_{t,k,t',k'}(y^k_t, y^{k'}_{t'} \mid X), \quad (1)$$
where $\psi_{t,k}$ and $\phi_{t,k,t',k'}$ are the unary and pairwise energies, respectively. In Eq. (1), the unary energy, which is defined on each node of the graph, captures the inverse likelihood of assigning $Y^k_t = y^k_t$ conditioned on the observation X. Typically, this term can be derived from an arbitrary classifier or regressor, such as a deep neural network [16]. On the other hand, the pairwise energy models interactions of the label assignments across nodes, $Y^k_t = y^k_t$ and $Y^{k'}_{t'} = y^{k'}_{t'}$, conditioned on the observation X. Therefore, the pairwise term determines the statistical dependencies between entities spatially and temporally. However, the parameterization in most previous works on fully-connected CRFs [43, 27, 31, 4] assumes that the pairwise energy function is non-adaptive to the current observation, which may not be ideal for modeling changeable dependencies between entities across videos. In the following Sec. 3.2, we propose an observation-gated parametrization of the pairwise energy function to address this issue.
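As a rough illustration of how the energy in Eq. (1) decomposes, here is a minimal NumPy sketch, assuming the unary and pairwise terms have already been evaluated and stored per node and per edge; the data structures and names are our own, not the authors' implementation.

```python
import numpy as np

def crf_energy(y, unary, pairwise):
    """Energy of a joint label assignment y (cf. Eq. 1).

    y        : dict mapping node (t, k) -> assigned label index
    unary    : dict mapping (t, k) -> 1-D array psi_{t,k}(. | X)
    pairwise : dict mapping ((t, k), (t2, k2)) ->
               2-D array phi_{t,k,t2,k2}(., . | X)
    """
    energy = sum(unary[n][y[n]] for n in y)          # unary terms
    for (n1, n2), phi in pairwise.items():           # pairwise terms
        energy += phi[y[n1], y[n2]]
    return energy
```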
3.2. Gated Pairwise Energy Function
Much of the existing work uses a simplified parameterization of the pairwise energy function and typically considers only the smoothness of the joint label assignment. For instance, in Asynchronous Temporal Fields [31], $\phi_{\bullet}(y^k_t, y^{k'}_{t'} \mid X)$ is defined as $\mu(y^k_t, y^{k'}_{t'})K(t, t')$, where $\mu$ represents the label compatibility matrix and $K(t, t')$ is an affinity kernel measurement which represents the discounting factor between t and t'. Similarly, in the image segmentation domain [43, 27], $\phi_{\bullet}(s_i, s_j \mid I)$ is defined as $\mu(s_i, s_j)K(I_i, I_j)$, where $s_{\{i,j\}}$ is the segment label and $I_{\{i,j\}}$ is the input feature for location $\{i, j\}$ in image I. In these models, the pairwise energy comprises an observation-independent label compatibility matrix followed by a spatial or temporal discounting factor. We argue that the parametrization of the pairwise energy function should be more expressive. To this end, we define the pairwise energy as:
$$\phi_{t,k,t',k'}(y^k_t, y^{k'}_{t'} \mid X) = f_{\phi}(X)_{t,k,t',k'}\big[y^k_t, y^{k'}_{t'}\big], \quad (2)$$
where $f_{\phi}$ can be seen as a discrete lookup table that takes an input X from a space of size |X| and outputs transition matrices of total size $(T^2K^2 - 1) \times |\mathcal{Y}^k_t| \times |\mathcal{Y}^{k'}_{t'}|$, which is intractable to maintain directly given the large observation space of videos. Inspired by clique templates [33, 34, 21], deep neural networks [22, 35], and tensor factorization [14], our workaround is to parametrize and approximate $f_{\phi}$ as $f_{\phi_\theta}$ with learnable parameters $\theta$ as follows:
$$f_{\phi_\theta}(X)_{t,k,t',k'} = g^{kk'}_{\theta}(X^k_t) \otimes h^{kk'}_{\theta}(X^{k'}_{t'}) + K_{\sigma}(t, t')\, r^{kk'}_{\theta}(X^k_t) \otimes s^{kk'}_{\theta}(X^{k'}_{t'}), \quad (3)$$
where $g^{kk'}_{\theta}(\cdot), r^{kk'}_{\theta}(\cdot) \in \mathbb{R}^{|\mathcal{Y}^k_t| \times r}$ and $h^{kk'}_{\theta}(\cdot), s^{kk'}_{\theta}(\cdot) \in \mathbb{R}^{|\mathcal{Y}^{k'}_{t'}| \times r}$ represent the rank-$r$ projections from $X^k_t$ and $X^{k'}_{t'}$, each modeled by a deep neural network. $A \otimes B = AB^{\top}$ denotes the function on matrices A and B that results in a transition matrix of size $|\mathcal{Y}^k_t| \times |\mathcal{Y}^{k'}_{t'}|$. $K_{\sigma}(t, t')$ is the Gaussian kernel with bandwidth $\sigma$ representing the discounting factor between different time steps.
The intuition behind our parametrization is as follows. First, we note that clique templates [33, 34, 21] are adopted both spatially and temporally, which leads to scalable learning and inference. Second, using neural networks to approximate the lookup table ensures both parameter efficiency and generalization [8, 35]; the lookup table maintains the state transitions $\mathcal{X} \rightarrow \mathcal{Y}^k \times \mathcal{Y}^{k'}$, where calligraphic font denotes the corresponding state space. Finally, we choose $r \ll \min_k\{|\mathcal{Y}^k_t|\}$ so that a low-rank decomposition is performed on the transition matrix from $Y^k_t$ to $Y^{k'}_{t'}$, which allows us to substantially reduce the number of learnable parameters. To summarize, our design for $f_{\phi_\theta}$ amortizes the large space complexity of $f_{\phi}$ and is gated by the observation.
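The following PyTorch-style sketch illustrates the amortized, observation-gated parameterization: small networks produce rank-r projections from the observed features, an outer product recovers the $|\mathcal{Y}^k_t| \times |\mathcal{Y}^{k'}_{t'}|$ transition matrix, and a Gaussian kernel discounts temporally distant pairs. The module structure, dimensions, and the exact way the two outer-product terms are combined are illustrative assumptions rather than the authors' released code.

```python
import math
import torch
import torch.nn as nn

class GatedPairwiseEnergy(nn.Module):
    """Observation-gated, low-rank pairwise energy (sketch in the spirit of Eq. 3)."""
    def __init__(self, feat_dim, num_labels_k, num_labels_k2, rank=8, sigma=1.0):
        super().__init__()
        # rank-r projection networks (single linear layers for illustration)
        self.g = nn.Linear(feat_dim, num_labels_k * rank)
        self.h = nn.Linear(feat_dim, num_labels_k2 * rank)
        self.r = nn.Linear(feat_dim, num_labels_k * rank)
        self.s = nn.Linear(feat_dim, num_labels_k2 * rank)
        self.rank, self.sigma = rank, sigma
        self.nk, self.nk2 = num_labels_k, num_labels_k2

    def forward(self, x_tk, x_t2k2, t, t2):
        # project the two observed feature vectors to rank-r factors
        G = self.g(x_tk).view(self.nk, self.rank)
        H = self.h(x_t2k2).view(self.nk2, self.rank)
        R = self.r(x_tk).view(self.nk, self.rank)
        S = self.s(x_t2k2).view(self.nk2, self.rank)
        # Gaussian temporal discounting factor K_sigma(t, t')
        kernel = math.exp(-(t - t2) ** 2 / (2 * self.sigma ** 2))
        # A (x) B = A B^T: outer products yield a |Y^k| x |Y^k'| transition matrix
        return G @ H.T + kernel * (R @ S.T)
```

With r much smaller than the label-space sizes, each edge template needs only on the order of $(|\mathcal{Y}^k| + |\mathcal{Y}^{k'}|)\,r$ output parameters per observation instead of $|\mathcal{Y}^k|\,|\mathcal{Y}^{k'}|$, which is the amortization argued for above.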
3.3. Inference, Message Passing, And Learning
Minimizing the CRF energy in Eq. (1) yields the most probable label assignment $Y = \{y^1_t, y^2_t, \cdots, y^K_t\}_{t=1}^{T}$ given the observation X. However, exact inference in a fully-connected CRF is often computationally intractable, even with variable enumeration or elimination [13]. In this work, we adopt the commonly used mean-field algorithm [13] for approximate inference, which finds the approximate posterior distribution Q(Y) such that $Q(\cdot)$ is closest to $P_{\psi,\phi}(Y|X)$ in terms of $\mathrm{KL}(Q\,\|\,P_{\psi,\phi})$ within the class of distributions representable as a product of independent marginals, $Q(Y) = \prod_{t,k} Q(Y^k_t)$. Following [13], inference can now be realized as naive mean-field updates with coordinate descent optimization, expressed in terms of the fixed-point message passing equations:
$$Q(y^k_t) = \frac{1}{Z^k_t}\, \Psi_{t,k}(y^k_t \mid X) \prod_{(t',k') \neq (t,k)} m_{t',k'}(y^k_t), \quad (4)$$
with $\Psi_{t,k} = \exp(-\psi_{t,k})$ representing the unary potential, $Z^k_t$ the normalization constant, and $m_{\bullet}(\cdot)$ denoting the message, having the form¹

$$m_{t',k'}(y^k_t) = \exp\Big(-\sum_{y^{k'}_{t'}} \phi_{t,k,t',k'}(y^k_t, y^{k'}_{t'} \mid X)\, Q(y^{k'}_{t'})\Big). \quad (5)$$
To parametrize the unary energy function, we use a similar formulation:
$$\psi_{t,k}(y^k_t \mid X) = -\, w^k_{\theta}(X^k_t)\big[y^k_t\big], \quad (6)$$
where $w^k_{\theta}(X^k_t) \in \mathbb{R}^{|\mathcal{Y}^k_t|}$ represents the projection from $X^k_t$ to logits of size $|\mathcal{Y}^k_t|$, modeled by a deep neural network. Lastly, we cast the learning problem as minimizing the conditional cross-entropy between the proposed distribution and the true one, where $\theta$ denotes the parameters of our model:

$$\theta^{*} = \arg\min_{\theta}\ \mathbb{E}_{X,Y}\big[-\log Q(Y)\big].$$
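To summarize the inference and learning procedure, here is a small NumPy sketch of the naive mean-field updates in Eqs. (4)-(5) together with the cross-entropy objective; it is a generic mean-field loop under our own data layout, not the authors' exact implementation, and a handful of iterations is typically enough in practice.

```python
import numpy as np

def mean_field(unary, pairwise, num_iters=5):
    """Naive mean-field inference (sketch of Eqs. 4-5).

    unary    : dict (t, k) -> psi_{t,k}(. | X), shape [num_labels]
    pairwise : dict ((t, k), (t2, k2)) -> phi, shape [n1_labels, n2_labels];
               each undirected edge appears once.
    Returns approximate marginals Q[(t, k)].
    """
    # initialize Q with the unary-only distributions
    Q = {n: np.exp(-psi) / np.exp(-psi).sum() for n, psi in unary.items()}
    for _ in range(num_iters):
        for n in Q:
            log_q = -unary[n].copy()
            for (n1, n2), phi in pairwise.items():
                if n1 == n:                       # message from n2 to n (Eq. 5)
                    log_q -= phi @ Q[n2]
                elif n2 == n:                     # message from n1 to n
                    log_q -= phi.T @ Q[n1]
            q = np.exp(log_q - log_q.max())       # stabilized normalization (Eq. 4)
            Q[n] = q / q.sum()
    return Q

# Learning (Sec. 3.3): minimize E_{X,Y}[-log Q(Y)], i.e. the cross-entropy of the
# mean-field marginals at the ground-truth labels:
# loss = -sum(np.log(Q[n][y_true[n]]) for n in Q)
```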
4. Experimental Results & Analysis
In this section, we report our quantitative and qualitative analyses for validating the benefit of our proposed method. Our experiments are designed to compare different baselines and ablations for detecting and tagging relationships given a video as well as recognizing relationships in a fine-grained manner.
Datasets. We perform our analysis on two datasets: ImageNet Video [24] and Charades [32]. (a) ImageNet Video [24] contains videos (from daily life as well as in the wild) with manually labeled bounding boxes for objects. We utilize the annotations from [29], in which a subset of videos having rich visual relationships was selected (1,000 videos in total, with 800 for training and the rest for evaluation).

Relationship Recognition. Given an input instance (with object and subject trajectories in a time segment), we measure the recognition accuracy of subject, predicate, object, and the full relationship, a task we term relationship recognition. As the Charades dataset does not consider object localization, we instead perform recognition on object, verb, scene, and the relationship within a time segment (i.e., relation recognition happens at the scale of a time segment in the video). We use Accuracy@K (with K = 1) to emphasize whether the model gives the correct recognition result as its top-1 relationship prediction.

Object Trajectory Proposals. An object detector (Faster-RCNN, whose label probability outputs also serve as features below) is trained on the 35 object categories in the annotations, using the MS-COCO [19] and ImageNet Detection [24] datasets. Next, the method described in [5] is used to link frame-level detections into chunk-level object proposals. Then, non-maximum suppression (NMS) with vIoU > 0.5 is performed to reduce the number of generated chunk-level proposals. During training, proposals that have vIoU > 0.5 with the ground-truth trajectories are selected as training proposals, while all generated proposals are preserved for evaluation.
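The proposal filtering above depends on a volumetric IoU (vIoU) between trajectories. Below is a minimal sketch assuming the common definition of vIoU (summed per-frame intersection over summed per-frame union across the frames either trajectory covers), followed by greedy NMS; the exact definitions used in [5, 29] may differ in detail.

```python
def viou(traj_a, traj_b):
    """Volumetric IoU of two trajectories (dicts: frame_id -> (x1, y1, x2, y2))."""
    area = lambda r: max(0.0, r[2] - r[0]) * max(0.0, r[3] - r[1])
    inter_vol, union_vol = 0.0, 0.0
    for f in set(traj_a) | set(traj_b):
        a, b = traj_a.get(f), traj_b.get(f)
        if a is not None and b is not None:
            ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
            ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
            inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        else:
            inter = 0.0
        inter_vol += inter
        union_vol += (area(a) if a else 0.0) + (area(b) if b else 0.0) - inter
    return inter_vol / union_vol if union_vol > 0 else 0.0

def nms_trajectories(proposals, scores, thresh=0.5):
    """Greedy NMS: keep high-scoring trajectories, suppress those with vIoU > thresh."""
    order = sorted(range(len(proposals)), key=lambda i: -scores[i])
    keep = []
    for i in order:
        if all(viou(proposals[i], proposals[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```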
• Feature Representation. Following the notation of Sec. 3, we express the input instance X as K synchronous streams of features. For ImageNet Video, K = 3 and the synchronous streams of features are $\{X^s_t, X^p_t, X^o_t\}_{t=1}^{T}$, where s, p, o, and T denote subject, predicate, object, and the number of chunks in the input instance, respectively. Note that each instance may have a different number of chunks, i.e., a different T, due to the varying duration of relationships. The outputs $Y^s_t$, $Y^p_t$, and $Y^o_t$ follow categorical distributions. As in [29], in the t-th chunk of the input instance, we choose the subject and object features (i.e., $X^s_t$ and $X^o_t$) to be the averaged Faster-RCNN label probability distribution outputs. $X^p_t$, on the other hand, is the concatenation of three features: the improved dense trajectory (iDT) feature [37] for the subject tracklet, the iDT feature for the object tracklet, and the relative spatio-temporal positions [29] between the subject and object tracklets. See Suppl. for more details.
For Charades, the input instance X is expressed as $\{X^o_t, X^v_t, X^s_t\}_{t=1}^{T}$, with o, v, and s denoting object, verb, and scene, respectively. Since we perform relationship reasoning directly over the entire video, we let $Y^o_t$ and $Y^v_t$ be multinomial distributions while $Y^s_t$ remains a categorical distribution. The multinomial distribution reflects that each chunk may contain zero or more objects or verbs. We set $X^o_t$, $X^v_t$, and $X^s_t$ to identical features: the output feature layer of an I3D network [2]. See Suppl. for more details.
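To make the difference between the two output spaces concrete, here is a hedged PyTorch sketch of per-stream unary prediction heads: softmax (categorical) outputs for single-label streams such as ImageNet Video's subject/predicate/object or Charades' scene, and sigmoid (multi-label) outputs for Charades' object and verb streams. Feature dimensions, layer choices, and label counts are placeholders, not the actual configuration.

```python
import torch
import torch.nn as nn

class UnaryHeads(nn.Module):
    """Per-stream unary prediction heads (sketch; sizes are placeholders)."""
    def __init__(self, feat_dims, num_labels, multilabel_streams=()):
        super().__init__()
        self.heads = nn.ModuleDict(
            {k: nn.Linear(feat_dims[k], num_labels[k]) for k in feat_dims})
        self.multilabel = set(multilabel_streams)

    def forward(self, feats):
        out = {}
        for k, x in feats.items():              # x: [T, feat_dims[k]]
            logits = self.heads[k](x)
            # multi-label streams (e.g., Charades object/verb): per-class sigmoid
            # single-label streams (e.g., scene, or ImageNet Video s/p/o): softmax
            out[k] = (torch.sigmoid(logits) if k in self.multilabel
                      else torch.softmax(logits, dim=-1))
        return out

# Charades-style usage; label counts below are illustrative, not the dataset's.
heads = UnaryHeads(
    feat_dims={"object": 1024, "verb": 1024, "scene": 1024},
    num_labels={"object": 38, "verb": 33, "scene": 16},
    multilabel_streams=("object", "verb"))
```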
Baselines. The closest baseline to our proposed model is VidVRD [29]. Beyond comparisons to [29], we also perform a detailed ablation study of our method as well as relate it to image-based visual relationship reasoning methods (when applicable).

VidVRD. VidVRD [29] adopted a structured loss on the multiplication of three features (i.e., $X^s_t$, $X^p_t$, and $X^o_t$ for ImageNet Video). The loss took a softmax over all training triplets, which resembles the training objective of Visual Phrases [26] (designed for image-based visual relationship reasoning). Note that VidVRD fails to consider the temporal structure of relationship predictions.

GSTEG (Ours). We denote our proposed method as GSTEG (Gated Spatio-Temporal Energy Graph). For the ablation study, we consider the following Energy Graph (EG) variants with different energy function designs.

STEG. The Spatio-Temporal Energy Graph (STEG) takes into account the spatial and temporal structure of video entities. However, it assumes fixed statistical dependencies between entities; specifically, it is the non-gated version of our full model. STEG can be seen as a modified version of Asynchronous Temporal Fields (AsyncTF) [31] in which (1) AsyncTF's output is a relationship prediction, and (2) the spatial graph is fully connected.

SEG. Compared to STEG, the Spatio Energy Graph (SEG) does not consider the temporal structure of video entities. Specifically, SEG assumes a spatially fully-connected graph, and thus the relationship predictions are made temporally independently. The counterpart among image-based visual relationship reasoning methods is Deep Relational Networks (DRN) [4]. We can view SEG as casting DRN to (1) take video-based input features and (2) consider continuous object bounding boxes through time instead of a bounding box in a single frame.

UEG and UEG†. The Unary Energy Graph (UEG) makes the predictions of entities both spatially and temporally independently. The counterpart among image-based methods is the Visual Relationship Detection (VRD) method of [20] without language priors (denoted as $\mathrm{VRD}_V$). Similar to the modification from DRN to SEG, the accommodation from $\mathrm{VRD}_V$ to UEG is to have $\mathrm{VRD}_V$ take video-based features and consider object trajectories. We also perform experiments that extend UEG with the additional triplet loss defined with language priors [20], which we denote as UEG†. The counterpart among image-based methods is the full VRD model of [20].
(Please see Suppl. for more details about parameterizations and training for all the methods and datasets).
4.1. Quantitative Analysis
ImageNet Video. Table 1 shows our results and comparisons to the baselines. We first observe that, for every metric across the three tasks (detection, tagging, and recognition), our proposed method (GSTEG) outperforms all the competing methods. Comparing the numbers between UEG and UEG†, we find that language priors can help promote visual relationship reasoning. We also observe a performance improvement from UEG to SEG, which could be explained by the fact that SEG explicitly models the spatial statistical dependency in {subject, predicate, object} and thus learns better relations between different entities. However, comparing SEG to STEG, the performance drops in some metrics, indicating that modeling temporal statistical dependency using a fixed pairwise energy parameterization may not be ideal. For example, although STEG gives much better relationship recognition results than SEG, it becomes worse in R@50 for detection and P@5 for tagging. Our full model, in contrast, improves over both, indicating that the observation-gated parametrization of the pairwise energy is able to capture different structures for different videos. When comparing to the energy graph models, VidVRD is able to outperform all our ablation baselines (except for the full version) in relation detection and tagging. However, it suffers in relation recognition, which requires a fine-grained understanding of the visual relation in the given object and subject tracklets.

Apart from the standard evaluation, we also considered the zero-shot setting, where zero-shot refers to evaluation on the relative complement of the training triplets within the evaluation triplets. More specifically, in the ImageNet Video dataset, the number of all possible relation triplets is 35 × 132 × 35 = 161,700. The training set contains 2,961 relation triplets (i.e., 1.83% of 161,700), while the evaluation set has 1,011 relation triplets (i.e., 0.63% of 161,700). The number of zero-shot relation triplets is 258, which is 25.5% of the evaluation set. Zero-shot evaluation is very challenging since we need to infer relationships never seen in the training set. We observe that, for most cases, our proposed method reaches the best performance compared to the various baselines. The exception is mAP, where VidVRD attains the best performance using its structural objective. However, the overall trend of the zero-shot evaluation mirrors the standard evaluation.
Charades. Our results and comparisons are shown in Table 2. We find that our method outperforms all relevant baselines. We also note some interesting differences between the trend of results on Charades vs. ImageNet Video. First, comparing UEG to UEG†, we observe that language priors do not really help visual relationship reasoning in Charades. We argue that this may be because of the larger inter-class distinction in Charades' category set. For example, dog/cat, horse/zebra, or sit front/front/jump front share some similarity in the ImageNet Video category set, while the categories are less semantically similar in Charades. Second, STEG consistently outperforms SEG, which indicates that modeling a fixed temporal statistical dependency between entities can aid visual relationship reasoning in Charades. We hypothesize that, compared to the ImageNet Video dataset, which has a diversified set of in-the-wild videos of animals and inorganic substances, Charades contains videos of human indoor activities where relations between entities are much easier to model with a fixed dependency. Finally, we observe that VidVRD performs substantially worse than all the other models. In the Supplementary, we also provide results when leveraging language priors in our model, as well as comparisons with Structural-RNN [11] and Graph Convolutional Network [39].
4.2. Qualitative Analysis
We next illustrate our qualitative results on the ImageNet Video dataset in Fig. 3. For relationship detection, in a scene with a person interacting with a horse, our model successfully detects 5 out of 6 relationships, while failing to include horse-stand right-person in the top 100 detected relationships. In another scene with a car interacting with a person, our model detects only 1 of the 7 ground-truth relationships; we argue that the reason may be the sand occlusion and the small size of the person. For relationship tagging, in a scene with a person riding a bike over another person, our model successfully tags all four relationships in the top 5 tagged results. Nevertheless, the third tagged result, person-sit above-bicycle, also looks visually plausible in this video. In another scene with a person playing with a dog on a sofa, our model fails to tag any correct relationships in the top 5 tagged results; it incorrectly identifies the dog as a cat, which is the main reason it fails.
Since the pairwise energy in a graphical model represents the negative statistical dependency between entities, in Fig. 4 we illustrate, for a video from the Charades dataset, the pairwise energies under our gated and non-gated parameterizations. Observe that the pairwise energies between related entities are lower for the gated parameterization than for the non-gated one, suggesting that the gating mechanism aids video relationship reasoning by strengthening the statistical dependency between spatially or temporally correlated entities.
5. Conclusion
In this paper, we have presented a Gated Spatio-Temporal Energy Graph (GSTEG) model for the task of visual relationship reasoning in videos. In the graph, we consider a spatially and temporally fully-connected structure with an amortized observation-gated parameterization for the pairwise energy functions. The gated design enables the model to detect adaptive relations between entities conditioned on the current observation (i.e., current video). On two benchmark video datasets (ImageNet Video and Charades), our method achieves state-of-the-art performance across three relationship reasoning tasks (Detection, Tagging, and Recognition).
¹ In the Supplementary, we make a connection between our gated amortized parametrization of the pairwise energy function in message form and Self-Attention [36] in machine translation and Non-Local Means [1] in image denoising.
The performance reported in [31] refers to the mean Average Precision (mAP) over 157 activities, while ours considers the detection of relation triplets. Although not being our focus, our method with the 157 activities