Grounded Situation Recognition



We introduce Grounded Situation Recognition (GSR), a task that requires producing structured semantic summaries of images describing: the primary activity, entities engaged in the activity with their roles (e.g. agent, tool), and bounding-box groundings of entities. GSR presents important technical challenges: identifying semantic saliency, categorizing and localizing a large and diverse set of entities, overcoming semantic sparsity, and disambiguating roles. Moreover, unlike in captioning, GSR is straightforward to evaluate. To study this new task, we create the Situations With Groundings (SWiG) dataset, which adds 278,336 bounding-box groundings to the 11,538 entity classes in the imSitu dataset. We propose a Joint Situation Localizer and find that jointly predicting situations and groundings with end-to-end training handily outperforms independent training on the entire grounding metric suite, with relative gains between 8% and 32%. Finally, we show initial findings on three exciting future directions enabled by our models: conditional querying, visual chaining, and grounded semantic-aware image retrieval. Code and data available at this https URL.

arXiv:2003.12058v1 [cs.CV] 26 Mar 2020

1 Introduction

Situation Recognition [60] is the task of recognizing the activity happening in an image, the actors and objects involved in this activity, and the roles they play. The structured image descriptions produced by situation recognition are drawn from FrameNet [5], a formal verb lexicon that pairs every verb with a frame of semantic roles, as shown in Figure 1. These semantic roles describe how objects in the image participate in the activity described by the verb.

Fig. 1. (A) Two examples from our dataset: semantic frames describe primary activities and relevant entities. Groundings are bounding-boxes colored to match roles. (B) Output of our model (dev set image). (C) Top-4 nearest neighbors to B using model predictions. Beyond visual similarity, these images are clearly semantically similar. (D) Output of the conditional model: given a bounding-box (yellow-dashed), predicts a relevant frame. (E) Example of grounded semantic chaining: given query boxes we are able to chain situations together, e.g. the teacher teaches students so they may work on a project.

As such, situation recognition generalizes several computer vision tasks such as image classification, activity recognition, and human-object interaction. It is related to the task of image captioning, which also typically describes the salient objects and activities in an image using natural language. However, in contrast to captioning, it has the advantage of always producing a structured and complete (with regard to semantic roles) output, and it does not suffer from the well-known challenges of evaluating natural language captions.

While situation recognition addresses what is happening in an image, who is playing a part in this and what their roles are, it does not address a critical aspect of visual understanding: where the involved entities lie in the image. We address this shortcoming and present Grounded Situation Recognition (GSR), a task that builds upon situation recognition and requires one to not just identify the situation observed in the image but also visually ground the identified roles within the corresponding image. GSR presents the following technical challenges. Semantic saliency: in contrast to recognizing all entities in the image, it requires identifying the key objects and actors in the context of the primary activity being presented. Semantic sparsity: grounded situation recognition suffers from the problem of semantic sparsity [59], with many combinations of roles and groundings rarely seen in training. This challenge requires models to learn from limited data. Ambiguity: grounding roles into images often requires disambiguating between multiple observed entities of the same category. Scale: the scales of the grounded entities vary vastly, with some entities also being absent in the image (in which case models are responsible for detecting this absence). Hallucination: labeling semantic roles and grounding them often requires hallucinating the presence of objects since they may be fully occluded or off screen.

To train and benchmark models on GSR, we present the Situations With Groundings dataset (SWiG) that builds upon the large imSitu dataset by adding 278,336 bounding-box-based visual groundings to the annotated frames. SWiG contains groundings for most of the more than 10k entity classes in imSitu and exhibits a long-tailed distribution of grounded object classes. In addition to the aforementioned technical challenges of GSR, the diversity of activities, images, and grounded classes makes SWiG particularly challenging for existing approaches.

Training neural networks for grounded situation recognition using the challenging SWiG dataset requires localizing roughly 10k categories, a task that modern object detection models like RetinaNet [34] struggle to scale to out of the box. We first propose modifications to RetinaNet that enable us to train large-class-cardinality object detectors. Using these modifications, we then create a strong baseline, the Independent Situation Localizer (ISL), that independently predicts the situation and groundings and uses late fusion to produce the desired outputs. Our proposed model, the Joint Situation Localizer (JSL), jointly predicts the situation and grounding conditioned on the context of the image. During training, JSL backpropagates gradients through the entire network. JSL demonstrates the effectiveness of joint structured semantic prediction and grounding by improving both semantic role prediction and grounding, obtaining relative gains of between 8% and 32% over ISL on the entire suite of grounding metrics.
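Since the gains over ISL are reported as relative improvements rather than absolute percentage points, a minimal sketch of the difference may be useful. The function name `relative_gain` and the two scores below are hypothetical and chosen only for illustration; they are not results from this paper.

```python
def relative_gain(baseline: float, improved: float) -> float:
    """Return the improvement of `improved` over `baseline` as a fraction of `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) / baseline


# Hypothetical metric values in percent (illustration only, not from the paper).
baseline_score = 25.0
improved_score = 30.0

# Absolute gain is a difference in percentage points.
absolute_gain = improved_score - baseline_score

# Relative gain normalizes that difference by the baseline score.
rel = relative_gain(baseline_score, improved_score)
```

With these illustrative numbers, the absolute gain is 5 percentage points while the relative gain is 20%, which is why the two ways of reporting should not be mixed.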

Grounded situation recognition opens up several exciting avenues for future research. First, it enables us to build a Conditional Situation Localizer (CSL), a model that outputs a grounded situation conditioned on an input image and a specified region of interest within the image. CSL allows us to query what is happening in an image with regard to a specified query object or region. This is particularly revealing when entities are involved in multiple situations within an image or when an image contains a large number of visible entities. Second, we show that such pointed conditioning models enable us to tackle higher-order semantic relations amongst activities in images via visual chaining. Third, we show that grounded situation recognition models can serve as effective image retrieval mechanisms that can condition on linguistic as well as visual inputs and are able to retrieve images with the desired semantics.

In summary, our contributions include: (i) proposing Grounded Situation Recognition, a task to identify the observed salient situation and ground the corresponding roles within the image; (ii) presenting the SWiG dataset for building and benchmarking models for this task; (iii) showing that joint structured semantic prediction and grounding models improve both semantic role prediction and grounding by large margins, while noting that there is still considerable room for future improvement; (iv) revealing several exciting avenues for future research that exploit grounded situation recognition data to build models for semantic querying, visual chaining, and image retrieval. Our new dataset, code, and trained model weights will be publicly released.

2 Related Work

Grounded Situation Recognition is related to several areas of research at the intersection of vision and language, which we review below.

Describing Activities in Images. While recognizing actions in videos has been a major focus area [50, 25, 21, 48, 47], describing activities from images has also received a lot of attention (see Gella et al. [15] for a more detailed overview).

Early works [23, 19, 10, 57, 58, 29, 13] framed this as a classification problem amongst a few verbs (running/walking/etc.) or a few verb-object tuples (riding bike/riding horse/etc.). More recent work has focused on human-object interactions [8, 30, 61, 45] with more classes, but the classes are either arbitrarily chosen or obtained by starting with a set of images and then labeling them with actions. Also, the relationships include Subject-Verb-Object triples or subsets thereof. In contrast, the imSitu dataset for situation recognition uses linguistic resources to define a large and more comprehensive space of possible situations, ensuring a fairly balanced dataset despite the large number of verbs (roughly 500) and modeling a detailed set of semantic roles per verb obtained from FrameNet [5].

Image captioning is another popular setup for describing the salient actions taking place in an image, with several datasets [9, 46, 1] and many recent neural models that perform well [53, 3, 24]. One serious drawback of image captioning is the well-known challenge of evaluation, which has led to a number of proposed metrics [6, 52, 2, 32, 38], but these problems persist. Situation recognition does not face this issue and has clearly established metrics for evaluation owing to its structured frame output.

Other relevant works include visual sense disambiguation [16], visual semantic role labelling [20], and scene graph generation [28], with the latter two described in more detail below.

Visual Grounding. In contrast to associating full images with actions or captions, past works have also associated regions with parts of captions. This includes visual grounding, i.e. associating words in a caption with regions in an image, and referring expression generation, i.e. producing a caption that unambiguously describes a region of interest. Several interesting datasets exist for these tasks.

Flickr30k-Entities [40] is a large dataset for grounded captioning. v-COCO [20] is more focused on semantic role labeling for human interactions, with human groundings, action labels, and relevant object groundings. Compared to SWiG, it has fewer verbs (26 vs 504) and fewer semantic roles per verb (up to 2 vs up to 6). HICO-Det [7] has 117 actions, but they only involve 80 objects, compared to nearly 10,000 objects in SWiG. In addition to these human-centric datasets, SWiG also contains actions by animals and objects.

Large [51] achieved state-of-the-art accuracy using attention graph neural nets. Our proposed grounded models build upon the RNN-based approach of [36] owing to its simplicity and high accuracy, but our methods to combine situation recognition models with detectors can be applied to any of the aforementioned approaches.

Large-Class-Cardinality Object Detection. While most popular object detectors are built and evaluated on datasets [35, 14] with few classes, some past works have addressed the problem of building detectors for thousands of classes. This includes YOLO-9000 [43], DLM-FA [56], R-FCN-3000 [49], and CS-R-FCN [18]. Our modifications to RetinaNet borrow some ideas from these works.

Task. Grounded Situation Recognition (GSR) builds upon situation recognition and requires one to identify the salient activity, the entities involved, the semantic roles they play, and the locations of each entity in the image. The frame representation is drawn from the linguistic resource FrameNet, and the visual groundings are akin to bounding boxes produced by object detectors. More formally, given an input image, the goal is to produce three outputs. (a) Verb: classifying the salient activity into one of 504 visually groundable verbs (verbs whose action can be seen; for example, talking is visible, but thinking is not). (b) Frame: consists of 1 to 6 semantic role values, i.e. nouns associated with the verb (each verb has its own pre-defined set of roles). For example, Fig. 2 shows that kneading consists of 3 roles: Agent, Item, and Place. Every image labeled with the verb kneading will have the same roles but may have different nouns filled in at each role based on the contents of the image. A role value can also be ∅, indicating that a role does not exist in an image (Fig. 2c). (c) Groundings: each grounding is described with coordinates $[x_1, y_1, x_2, y_2]$ if the noun is grounded in the image. It is possible for a noun to be labeled in the frame but not grounded, for example in cases of occlusion.
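The three outputs described above can be illustrated with a small data structure. The following is a minimal sketch under stated assumptions, not the representation used in the SWiG release: the class names `RoleValue` and `GroundedSituation`, the `validate` helper, the use of `None` for ∅ and for ungrounded nouns, and the example nouns and box coordinates are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# A box is (x1, y1, x2, y2); treating these as top-left and bottom-right
# corners is an assumption made for this sketch.
Box = Tuple[float, float, float, float]


@dataclass
class RoleValue:
    noun: Optional[str]        # None stands in for the empty role value (∅)
    box: Optional[Box] = None  # None when the noun is not grounded (e.g. occluded)


@dataclass
class GroundedSituation:
    verb: str
    roles: Dict[str, RoleValue]  # role name -> noun and optional grounding

    def validate(self, verb_roles: Dict[str, Tuple[str, ...]]) -> None:
        """Check the frame against the pre-defined role set of its verb."""
        expected = verb_roles[self.verb]
        if not 1 <= len(expected) <= 6:
            raise ValueError("a verb must define between 1 and 6 roles")
        if set(self.roles) != set(expected):
            raise ValueError("frame roles do not match the verb's role set")
        for name, value in self.roles.items():
            if value.noun is None and value.box is not None:
                raise ValueError(f"role {name!r} is empty but has a box")
            if value.box is not None:
                x1, y1, x2, y2 = value.box
                if x1 > x2 or y1 > y2:
                    raise ValueError(f"role {name!r} has a malformed box")


# Hypothetical role lexicon and annotation, loosely following the kneading example.
VERB_ROLES = {"kneading": ("Agent", "Item", "Place")}

situation = GroundedSituation(
    verb="kneading",
    roles={
        "Agent": RoleValue("person", (10.0, 20.0, 200.0, 300.0)),
        "Item": RoleValue("dough", (50.0, 150.0, 180.0, 280.0)),
        "Place": RoleValue("kitchen"),  # labeled in the frame but not grounded
    },
)
situation.validate(VERB_ROLES)
```

Keeping the noun and the box as separate optional fields lets the same structure express all three cases from the task definition: a grounded noun, a noun that is labeled but not grounded, and an empty role.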