
From Recognition to Cognition: Visual Commonsense Reasoning

Authors

  • Rowan Zellers
  • Yonatan Bisk
  • Ali Farhadi
  • Yejin Choi
  • 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2019

Abstract

Visual understanding goes well beyond object recognition. With one glance at an image, we can effortlessly imagine the world beyond the pixels: for instance, we can infer people's actions, goals, and mental states. While this task is easy for humans, it is tremendously difficult for today's vision systems, requiring higher-order cognition and commonsense reasoning about the world. We formalize this task as Visual Commonsense Reasoning. Given a challenging question about an image, a machine must answer correctly and then provide a rationale justifying its answer. Next, we introduce a new dataset, VCR, consisting of 290k multiple choice QA problems derived from 110k movie scenes. The key recipe for generating non-trivial and high-quality problems at scale is Adversarial Matching, a new approach to transform rich annotations into multiple choice questions with minimal bias. Experimental results show that while humans find VCR easy (over 90% accuracy), state-of-the-art vision models struggle (~45%). To move towards cognition-level understanding, we present a new reasoning engine, Recognition to Cognition Networks (R2C), that models the necessary layered inferences for grounding, contextualization, and reasoning. R2C helps narrow the gap between humans and machines (~65%); still, the challenge is far from solved, and we provide analysis that suggests avenues for future work.

Given an image, a list of regions, and a question, a model must answer the question and provide a rationale explaining why its answer is right. Our questions challenge computer vision systems to go beyond recognition-level understanding, towards a higher-order cognitive and commonsense understanding of the world depicted by the image.

The task is agnostic to the representation of the mask, but it can be thought of as a list of polygons p, with each polygon p_j = {(x_t, y_t)}_t consisting of a sequence of 2D vertices inside the box.

Namely, Fandango MovieClips: youtube.com/user/movieclips.

We annotated images for 'interestingness' and trained a classifier using CNN features and detection statistics; details in the appendix, Sec B.

This additional clip-level context helps workers ask and answer about what will happen next.

More details in the appendix, Sec B.

We tuned this hyperparameter by asking crowd workers to answer multiple-choice questions at several thresholds, and chose the value for which human performance is above 90%; details in appendix Sec C.

Our code is also available online at visualcommonsense.com.

For VQA, the model is trained by sampling positive or negative answers for a given question; for our dataset, we simply use the result of the perceptron (for response r^(i)) as the i-th logit.

To match the other GRUs used in [4, 42, 6], which encode q.

As we find in Appendix D, including additional language context tends to boost model performance.

This was used to create the SWAG dataset, a multiple choice NLP dataset for natural language inference.

Table 2: Ablations for R2C, over the validation set. ‘No query’ tests the importance of integrating the query during contextualization; removing this reduces Q→AR performance by 20%. In ‘no reasoning’, the LSTM in the reasoning stage is removed; this hurts performance by roughly 1%. Removing the visual features during grounding, or using GloVe embeddings rather than BERT, lowers performance significantly, by 10% and 25% respectively.


1. Introduction

With one glance at an image, we can immediately infer what is happening in the scene beyond what is visually obvious. For example, in the top image of Figure 1, not only do we see several objects (people, plates, and cups), we can also reason about the entire situation: three people are dining together, they have already ordered their food before the photo has been taken, [person3] is serving and not eating with them, and what [person1] ordered are the pancakes and bacon (as opposed to the cheesecake), because [person4] is pointing to [person1] while looking at the server, [person3].

Figure 1. Not extracted; please refer to original document.

Visual understanding requires seamless integration between recognition and cognition: beyond recognition-level perception (e.g., detecting objects and their attributes), one must perform cognition-level reasoning (e.g., inferring the likely intents, goals, and social dynamics of people) [13]. State-of-the-art vision systems can reliably perform recognition-level image understanding, but struggle with complex inferences, like those in Figure 1. We argue that as the field has made significant progress on recognition-level building blocks, such as object detection, pose estimation, and segmentation, now is the right time to tackle cognition-level reasoning at scale.

As a critical step toward complete visual understanding, we present the task of Visual Commonsense Reasoning. Given an image, a machine must answer a question that requires a thorough understanding of the visual world evoked by the image. Moreover, the machine must provide a rationale justifying why that answer is true, referring to the details of the scene, as well as background knowledge about how the world works. These questions, answers, and rationales are expressed using a mixture of rich natural language as well as explicit references to image regions. To support clean-cut evaluation, all our tasks are framed as multiple choice QA.

Our new dataset for this task, VCR, is the first of its kind and is large-scale: 290k pairs of questions, answers, and rationales, over 110k unique movie scenes. A crucial challenge in constructing a dataset of this complexity at this scale is how to avoid annotation artifacts. A recurring problem in most recent QA datasets has been that human-written answers contain unexpected but distinct biases that models can easily exploit. Often these biases are so prominent that models can select the right answers without even looking at the questions [28, 61, 72].

Thus, we present Adversarial Matching, a novel QA assignment algorithm that allows for robust multiple-choice dataset creation at scale. The key idea is to recycle each correct answer for a question exactly three times -as a negative answer for three other questions. Each answer thus has the same probability (25%) of being correct: this resolves the issue of answer-only biases, and disincentivizes machines from always selecting the most generic answer.

We formulate the answer recycling problem as a constrained optimization based on the relevance and entailment scores between each candidate negative answer and the gold answer, as measured by state-of-the-art natural language inference models [10, 57, 15] . A neat feature of our recycling algorithm is a knob that can control the tradeoff between human and machine difficulty: we want the problems to be hard for machines while easy for humans.

Narrowing the gap between recognition- and cognition-level image understanding requires grounding the meaning of the natural language passage in the visual data, understanding the answer in the context of the question, and reasoning over the shared and grounded understanding of the question, the answer, the rationale, and the image. In this paper we introduce a new model, Recognition to Cognition Networks (R2C). Our model performs three inference steps. First, it grounds the meaning of a natural language passage with respect to the image regions (objects) that are directly referred to. It then contextualizes the meaning of an answer with respect to the question that was asked, as well as the global objects not mentioned. Finally, it reasons over this shared representation to arrive at an answer.

Experiments on VCR show that R2C greatly outperforms state-of-the-art visual question-answering systems: obtaining 65% accuracy at question answering, 67% at answer justification, and 44% at staged answering and justification. Still, the task and dataset are far from solved: humans score roughly 90% on each. We provide detailed insights and an ablation study to point to avenues for future research.

In sum, our major contributions are fourfold: (1) we formalize a new task, Visual Commonsense Reasoning, and (2) present a large-scale multiple-choice QA dataset, VCR, (3) that is automatically assigned using Adversarial Matching, a new algorithm for robust multiple-choice dataset creation. (4) We also propose a new model, R2C, that aims to mimic the layered inferences from recognition to cognition; this also establishes baseline performance on our new challenge. The dataset is available to download, along with code for our model, at visualcommonsense.com.

2. Task Overview

We present VCR, a new task that challenges vision systems to holistically and cognitively understand the content of an image. For instance, in Figure 1 , we need to understand the activities ( [person3 ] is delivering food), the roles of people ([person1 ] is a customer who previously ordered food), the mental states of people ( [person1 ] wants to eat), and the likely events before and after the scene ( [person3 ] will serve the pancakes next). Our task covers these categories and more: a distribution of the inferences required is in Figure 2 .

Figure 2: Overview of the types of inference required by questions in VCR. Of note, 38% of the questions are explanatory 'why' or 'how' questions, 24% involve cognition-level activities, and 13% require temporal reasoning (i.e., what might come next). These categories are not mutually exclusive; an answer might require several hops of different types of inferences (see appendix Sec A).

Visual understanding requires not only answering questions correctly, but doing so for the right reasons. We thus require a model to give a rationale that explains why its answer is true. Our questions, answers, and rationales are written in a mixture of rich natural language as well as detection tags, like '[person2 ] ': this helps to provide an unambiguous link between the textual description of an object ('the man on the left in the white shirt') and the corresponding image region.

To make evaluation straightforward, we frame our ultimate task, staged answering and justification, in a multiple-choice setting. Given a question along with four answer choices, a model must first select the right answer. If its answer was correct, it is then provided four rationale choices (that could purportedly justify its correct answer), and it must select the correct rationale. We call this Q→AR, since a prediction counts as correct only if both the chosen answer and the chosen rationale are correct.

Our task can be decomposed into two multiple-choice sub-tasks that correspond to answering (Q→A) and justification (QA→R) respectively.

Definition (VCR subtask). A single example of a VCR subtask consists of an image I, and:

  • A sequence o of object detections. Each object detection o_i consists of a bounding box b, a segmentation mask m, and a class label ℓ_i ∈ L.
  • A query q, posed using a mix of natural language and pointing. Each word q_i in the query is either a word in a vocabulary V, or a tag referring to an object in o.
  • A set of N responses, where each response r^(i) is written in the same manner as the query: with natural language and pointing. Exactly one response is correct. The model chooses a single (best) response.

In question-answering (Q→A), the query is the question and the responses are answer choices. In answer justification (QA→R), the query is the concatenated question and correct answer, while the responses are rationale choices.

In this paper, we evaluate models in terms of accuracy and use N=4 responses. Baseline accuracy on each subtask is then 25% (1/N). In the holistic setting (Q→AR), baseline accuracy is 6.25% (1/N²) as there are two subtasks.
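To make the protocol concrete, the following minimal Python sketch (with illustrative field names, not the dataset's actual schema) shows one way to represent an example and to compute the staged Q→AR metric; with N=4, chance is 25% per subtask and 6.25% overall.

```python
# A minimal sketch of a VCR example and the staged Q->AR metric
# (field names here are illustrative, not the dataset's actual schema).
from dataclasses import dataclass
from typing import List

@dataclass
class VCRExample:
    image_id: str
    objects: List[str]            # class labels of detected regions, e.g. ["person", "person", "cup"]
    question: List[str]           # tokens; detection tags point into `objects`
    answers: List[List[str]]      # N candidate answers
    rationales: List[List[str]]   # N candidate rationales
    answer_label: int             # index of the correct answer
    rationale_label: int          # index of the correct rationale

def staged_accuracy(pred_answers, pred_rationales, examples):
    """Q->AR accuracy: a prediction counts only if answer AND rationale are both right."""
    correct = sum(
        int(pa == ex.answer_label and pr == ex.rationale_label)
        for pa, pr, ex in zip(pred_answers, pred_rationales, examples)
    )
    return correct / len(examples)

# With N = 4 choices, random guessing scores 1/4 = 25% on Q->A and on QA->R,
# and (1/4)**2 = 6.25% on the staged Q->AR metric.
```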

3. Data Collection

In this section, we describe how we collect the questions, correct answers and correct rationales for VCR. Our key insight -towards collecting commonsense visual reasoning problems at scale -is to carefully select interesting situations. We thus extract still images from movie clips. The images from these clips describe complex situations that humans can decipher without additional context: for instance, in Figure 1 , we know that [person3 ] will serve [person1 ] pancakes, whereas a machine might not understand this unless it sees the entire clip.

Interesting and Diverse Situations To ensure diversity, we make no limiting assumptions about the predefined set of actions. Rather than searching for predefined labels, which can introduce search engine bias [76, 16, 20], we collect images from movie scenes. The underlying scenes come from the Large Scale Movie Description Challenge [67] and YouTube movie clips. 2 To avoid simple images, we train and apply an 'interestingness filter' (e.g., a closeup of a syringe in Figure 3). 3

We center our task around challenging questions requiring cognition-level reasoning. To make these cognition-level questions simple to ask, and to avoid the clunkiness of referring expressions, VCR's language integrates object tags ([person2]) and explicitly excludes referring expressions ('the woman on the right'). These object tags are detected from Mask-RCNN [29, 24], and the images are filtered so as to have at least three high-confidence tags.

Figure 3: An overview of the construction of VCR. Using a state-of-the-art object detector [29, 24], we identify the objects in each image. The most interesting images are passed to crowd workers, along with scene-level context in the form of scene descriptions (MovieClips) and video captions (LSMDC, [67]). The crowd workers use a combination of natural language and detection tags to ask and answer challenging visual questions, also providing a rationale justifying their answer.

Crowdsourcing Quality Annotations Workers on Amazon Mechanical Turk were given an image with detections, along with additional context in the form of video captions. 4 They then ask one to three questions about the image; for each question, they provide a reasonable answer and a rationale. To ensure top-tier work, we used a system of quality checks and paid our workers well. 5 The result is an underlying dataset with high agreement and diversity of reasoning. Our dataset contains a myriad of interesting commonsense phenomena ( Figure 2 ) and a great diversity in terms of unique examples (Supp Section A); almost every answer and rationale is unique.

Figure 3 pipeline (shot + object detection → interestingness filter → crowd workers ask and answer questions), with the example shown in the figure. Question: 'What is [person1] doing?' Answer: '[person1] is injecting a needle into someone on the floor.' Rationale: '[person1] has a needle in his hand and is aggressively lowering it, in a stabbing motion.'

4. Adversarial Matching

We cast VCR as a four-way multiple choice task, to avoid the evaluation difficulties of language generation or captioning tasks, where current metrics often prefer incorrect machine-written text over correct human-written text [49]. However, it is not obvious how to obtain high-quality incorrect choices, or counterfactuals, at scale. While past work has asked humans to write several counterfactual choices for each correct answer [75, 46], this process is expensive. Moreover, it has the potential of introducing annotation artifacts: subtle patterns that are by themselves highly predictive of the 'correct' or 'incorrect' label [72, 28, 61].

In this work, we propose Adversarial Matching: a new method that allows for any 'language generation' dataset to be turned into a multiple choice test, while requiring minimal human involvement. An overview is shown in Figure 4 . Our key insight is that the problem of obtaining good counterfactuals can be broken up into two subtasks: the counterfactuals must be as relevant as possible to the context (so that they appeal to machines), while they cannot be overly similar to the correct response (so that they don't become correct answers incidentally). We balance between these two objectives to create a dataset that is challenging for machines, yet easy for humans.

Figure 4: Overview of Adversarial Matching. Incorrect choices are obtained via maximum-weight bipartite matching between queries and responses; the weights are scores from state-of-the-art natural language inference models. Assigned responses are highly relevant to the query, while they differ in meaning versus the correct responses.

Formally, our procedure requires two models: one to compute the relevance between a query and a response, P_rel, and another to compute the similarity between two response choices, P_sim. Here, we employ state-of-the-art models for Natural Language Inference: BERT [15] and ESIM+ELMo [10, 57], respectively. (We finetune P_rel (BERT) on the annotated data, taking steps to avoid data leakage, whereas P_sim (ESIM+ELMo) is trained on entailment and paraphrase data; details in appendix Sec C.) Then, given dataset examples (q_i, r_i), 1≤i≤N, we obtain a counterfactual for each q_i by performing maximum-weight bipartite matching [55, 40] on a weight matrix W ∈ ℝ^(N×N), given by

W_{i,j} = log P_rel(q_i, r_j) + λ · log(1 − P_sim(r_i, r_j)).    (1)

Here, λ>0 controls the tradeoff between similarity and relevance. 7 To obtain multiple counterfactuals, we perform several bipartite matchings. To ensure that the negatives are diverse, during each iteration we replace the similarity term with the maximum similarity between a candidate response r_j and all responses currently assigned to q_i.

Ensuring dataset integrity To guarantee that there is no question/answer overlap between the training and test sets, we split our full dataset (by movie) into 11 folds. We match the answers and rationales individually for each fold. Two folds are pulled aside for validation and testing.
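For illustration, one round of the matching can be implemented with an off-the-shelf assignment solver. The sketch below is our own, using scipy's linear_sum_assignment; p_rel and p_sim stand in for the score matrices produced by the relevance and similarity models, and the weighting follows Eq. (1).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def adversarial_matching_round(p_rel, p_sim, lam=1.0, eps=1e-8):
    """One round of Adversarial Matching (illustrative sketch).

    p_rel[i, j]: relevance of response j to query i (from an NLI-style model).
    p_sim[i, j]: similarity between response j and the correct response for query i.
    Returns, for each query i, the index of the response recycled as a wrong choice.
    """
    # High relevance is rewarded; high similarity to the gold response is penalized.
    weights = np.log(p_rel + eps) + lam * np.log(1.0 - p_sim + eps)
    # A response should not be matched back to its own query.
    np.fill_diagonal(weights, -1e9)
    # linear_sum_assignment minimizes cost, so negate for a maximum-weight matching.
    row_ind, col_ind = linear_sum_assignment(-weights)
    return col_ind  # col_ind[i] is the recycled response for query i

# To draw several negatives per query, repeat the matching; before each new round,
# replace p_sim[i, j] with the maximum similarity between response j and all
# responses already assigned to query i, which keeps the negatives diverse.
```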


5. Recognition To Cognition Networks

We introduce Recognition to Cognition Networks (R2C), a new model for visual commonsense reasoning. To perform well on this task requires a deep understanding of language, vision, and the world. For example, in Figure 5, answering 'Why is [person4] pointing at [person1]?' requires multiple inference steps. First, we ground the meaning of the query and each response, which involves referring to the image for the two people. Second, we contextualize the meaning of the query, response, and image together. This step includes resolving the referent 'he,' and why one might be pointing in a diner. Third, we reason about the interplay of relevant image regions, the query, and the response. In this example, the model must determine the social dynamics between [person1] and [person4]. We formulate our model as three high-level stages: grounding, contextualization, and reasoning, and use standard neural building blocks to implement each component.

Figure 5: High-level overview of our model, R2C. We break the challenge of Visual Commonsense Reasoning into three components: grounding the query and response, contextualizing the response within the context of the query and the entire image, and performing additional reasoning steps on top of this rich representation.

In more detail, recall that a model is given an image, a set of objects o, a query q, and a set of responses r^(i) (of which exactly one is correct). The query q and response choices r^(i) are all expressed in terms of a mixture of natural language and pointing to image regions: notation-wise, we will represent the object tagged by a word w as o_w. If w isn't a detection tag, o_w refers to the entire image boundary. Our model then considers each response r separately, using the following three components:

Grounding The grounding module will learn a joint image-language representation for each token in a sequence. Because both the query and the response contain a mixture of tags and natural language words, we apply the same grounding module for each (allowing it to share parameters). At the core of our grounding module is a bidirectional LSTM [34], which at each position is passed as input a word representation for w_i, as well as visual features for o_{w_i}. We use a CNN to learn object-level features: the visual representation for each region o is RoI-aligned from its bounding region [63, 29]. To additionally encode information about the object's class label ℓ_o, we project an embedding of ℓ_o (along with the object's visual features) into a shared hidden representation. Let the output of the LSTM over all positions be r for the response and q for the query.
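As an illustration only (module names and dimensions are ours, not the released implementation), a PyTorch-style sketch of this grounding stage might look like the following: each token's word vector is fused with the RoI-aligned visual feature and a class-label embedding of the object it points to, and a BiLSTM runs over the fused sequence.

```python
import torch
import torch.nn as nn

class GroundingModule(nn.Module):
    """Sketch of the grounding stage: fuse word, object-visual, and class-label features."""
    def __init__(self, word_dim=768, vis_dim=2048, num_classes=81, hidden=256):
        super().__init__()
        self.class_emb = nn.Embedding(num_classes, 128)
        # Project the visual feature and class embedding into a shared hidden space.
        self.obj_proj = nn.Sequential(nn.Linear(vis_dim + 128, hidden), nn.ReLU())
        self.lstm = nn.LSTM(word_dim + hidden, hidden, bidirectional=True, batch_first=True)

    def forward(self, word_feats, obj_feats, obj_labels):
        # word_feats: (B, T, word_dim) per-token language vectors (e.g. frozen BERT features)
        # obj_feats:  (B, T, vis_dim)  RoI-aligned CNN feature of the region each token points to
        # obj_labels: (B, T)           class index of that region (whole image for plain words)
        obj = self.obj_proj(torch.cat([obj_feats, self.class_emb(obj_labels)], dim=-1))
        out, _ = self.lstm(torch.cat([word_feats, obj], dim=-1))
        return out  # (B, T, 2 * hidden): grounded representation for each token
```

The same module (with shared parameters) is run over the query and over each response, as described above.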

Contextualization Given a grounded representation of the query and response, we use attention mechanisms to contextualize these sentences with respect to each other and the image context. For each position i in the response, we define the attended query representation q̂_i using the following equation:

α_{i,j} = softmax_j(r_i · W · q_j),    q̂_i = Σ_j α_{i,j} q_j.    (2)

To contextualize an answer with the image, including implicitly relevant objects that have not been picked up during the grounding stage, we perform another bilinear attention between the response r and each object o's image features. Let the result of the object attention be ô_i.
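A minimal sketch of such a bilinear attention (parameter names are illustrative) is shown below; the same module can be applied once between the response and the query to give q̂_i, and once between the response and the object features to give ô_i.

```python
import torch
import torch.nn as nn

class BilinearAttention(nn.Module):
    """Attend from each response position over query (or object) features."""
    def __init__(self, dim):
        super().__init__()
        self.W = nn.Linear(dim, dim, bias=False)   # bilinear form r_i . W . q_j

    def forward(self, r, q):
        # r: (B, T_r, dim) grounded response; q: (B, T_q, dim) grounded query or object features
        scores = torch.bmm(self.W(r), q.transpose(1, 2))   # (B, T_r, T_q)
        alpha = torch.softmax(scores, dim=-1)              # attention over query positions
        return torch.bmm(alpha, q)                         # (B, T_r, dim): attended context per token
```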

Reasoning Last, we allow the model to reason over the response, attended query, and objects. We accomplish this using a bidirectional LSTM that is given as context q̂_i, r_i, and ô_i for each position i. For better gradient flow through the network, we concatenate the output of the reasoning LSTM with the question and answer representations for each timestep: the resulting sequence is max-pooled and passed through a multilayer perceptron, which predicts a logit for the query-response compatibility.

Neural architecture and training details For our image features, we use ResNet50 [30]. To obtain strong representations for language, we use BERT representations [15]. BERT is applied over the entire question and answer choice, and we extract a feature vector from the second-to-last layer for each word. We train R2C by minimizing the multi-class cross entropy between the prediction for each response r^(i) and the gold label. See the appendix (Sec E) for detailed training information and hyperparameters. 8
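Putting the pieces together, here is a hypothetical sketch of the reasoning stage and the training objective (names and dimensions are ours): the attended query, grounded response, and attended object features are concatenated per position, passed through a BiLSTM, max-pooled, and scored by an MLP; the four response logits are trained with multi-class cross entropy against the gold index.

```python
import torch
import torch.nn as nn

class ReasoningHead(nn.Module):
    """Sketch of the reasoning stage: BiLSTM over fused features, max-pool, MLP logit."""
    def __init__(self, dim=512, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(3 * dim, hidden, bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden + 2 * dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, q_att, r, o_att):
        # q_att, r, o_att: (B, T, dim) attended query, grounded response, attended objects
        h, _ = self.lstm(torch.cat([q_att, r, o_att], dim=-1))
        # Concatenate the LSTM output with the query/response features for better gradient flow.
        h = torch.cat([h, q_att, r], dim=-1)
        pooled = h.max(dim=1).values                  # max-pool over positions
        return self.mlp(pooled).squeeze(-1)           # (B,): one logit per query-response pair

# Training: stack the logits of the four responses for each question and apply
# nn.CrossEntropyLoss against the index of the correct response.
```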

6. Results

In this section, we evaluate the performance of various models on VCR. Recall that our main evaluation mode is the staged setting (Q→AR). Here, a model must choose the right answer for a question (given four answer choices), and then choose the right rationale for that question and answer (given four rationale choices). If it gets either the answer or the rationale wrong, the entire prediction is wrong. This holistic task decomposes into two sub-tasks wherein we can train individual models: question answering (Q→A) and answer justification (QA→R). Thus, in addition to reporting combined Q→AR performance, we also report Q→A and QA→R.

Task setup A model is presented with a query q, and four response choices r^(i). Like our model, we train the baselines using multi-class cross entropy between the set of responses and the label. Each model is trained separately for question answering and answer justification; we follow the standard train, val, and test splits.

6.1. Baselines

We compare our R2C to several strong language and vision baselines.

Text-only baselines We evaluate the level of visual reasoning needed for the dataset by also evaluating purely text-only models. For each model, we represent q and r^(i) as streams of tokens, with the detection tags replaced by the object name (e.g. chair5 → chair). To minimize the discrepancy between our task and pretrained models, we replace person detection tags with gender-neutral names.

a. BERT [15]: BERT is a recently released NLP model that achieves state-of-the-art performance on many NLP tasks.
b. BERT (response only): We use the same BERT model; however, during fine-tuning and testing the model is only given the response choices r^(i).
c. ESIM+ELMo [10]: ESIM is another high-performing model for sentence-pair classification tasks, particularly when used with ELMo embeddings [57].
d. LSTM+ELMo: An LSTM over ELMo embeddings [57] that scores each response r^(i).

VQA Baselines Additionally, we compare our approach to models developed on the VQA dataset [5]. All models use the same visual backbone as R2C (ResNet 50) as well as text representations (GloVe [56]) that match the original implementations.

e. RevisitedVQA [38]: This model takes as input a query, response, and image features for the entire image, and passes the result through a multilayer perceptron, which has to classify 'yes' or 'no'. 10
f. Bottom-up and Top-down attention (BottomUpTopDown) [4]: This model attends over region proposals given by an object detector. To adapt it to VCR, we pass this model the object regions referenced by the query and response.
g. Multimodal Low-rank Bilinear Attention (MLB) [42]: This model uses Hadamard products to merge the vision and language representations given by a query and each region in the image.
h. Multimodal Tucker Fusion (MUTAN) [6]: This model expresses joint vision-language context in terms of a tensor decomposition, allowing for more expressivity.

We note that BottomUpTopDown, MLB, and MUTAN all treat VQA as a multilabel classification over the top 1000 answers [4, 50]. Because VCR is highly diverse (Supp A), for these models we represent each response r^(i) using a GRU [11]. 11 The output logit for response i is given by the dot product between the final hidden state of the GRU encoding r^(i) and the final representation from the model.

Human performance We asked five different workers on Amazon Mechanical Turk to answer 200 dataset questions from the test set. A different set of five workers were asked to choose rationales for those questions and answers. Predictions were combined using a majority vote.
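The GRU-based adaptation described above is easy to sketch (names below are illustrative): rather than classifying over a fixed answer vocabulary, each response is encoded by a GRU and scored against the VQA model's final fused representation with a dot product.

```python
import torch
import torch.nn as nn

class ResponseScorer(nn.Module):
    """Score a free-form response against a VQA model's fused query-image vector."""
    def __init__(self, emb_dim=300, hidden=512):
        super().__init__()
        self.gru = nn.GRU(emb_dim, hidden, batch_first=True)

    def forward(self, response_emb, fused_rep):
        # response_emb: (B, T, emb_dim) word embeddings (e.g. GloVe) of one candidate response
        # fused_rep:    (B, hidden)     final vision-language representation of the VQA model
        _, h_n = self.gru(response_emb)                 # h_n: (1, B, hidden), final hidden state
        return (h_n.squeeze(0) * fused_rep).sum(-1)     # (B,): dot-product logit for this response
```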

6.2. Results And Ablations

We present our results in Table 1. Of note, standard VQA models struggle on our task. The best such model, in terms of Q→AR accuracy, is MLB, with 17.2% accuracy. Deep text-only models perform much better: most notably, BERT [15] obtains 35.0% accuracy. One possible explanation for this gap in performance is a bottlenecking effect: whereas VQA models are often built around multilabel classification of the top 1000 answers, VCR requires reasoning over two (often long) text spans. Our model, R2C, obtains an additional boost of 9% accuracy over BERT, reaching a final performance of 44%. Still, this figure is nowhere near human performance: 85% on the staged task, so there is significant headroom remaining.

Table 1: Experimental results on VCR. VQA models struggle on both question answering (Q→A) and answer justification (QA→R), possibly due to the complex language and diversity of examples in the dataset. While language-only models perform well, our model R2C obtains a significant performance boost. Still, all models underperform human accuracy at this task. For more up-to-date results, see the leaderboard at visualcommonsense.com/leaderboard.

Ablations We evaluated our model under several ablations to determine which components are most important. Removing the query representation (and query-response contextualization entirely) results in a drop of 21.6 accuracy points in terms of Q→AR performance. Interestingly, this setting allows the model to leverage its image representation more heavily: the text-based response-only models (BERT response only, and LSTM+ELMo) perform barely better than chance. Removing the reasoning module lowers performance by 1.9%, which suggests that it is beneficial, but not critical for performance. The model suffers most when using GloVe representations instead of BERT: a loss of 24%. This suggests that strong textual representations are crucial to VCR performance.

Qualitative results Last, we present qualitative examples in Figure 6. R2C works well for many images: for instance, in the first row, it correctly infers that a bank robbery is happening. Moreover, it picks the right rationale: even though all of the options have something to do with 'banks' and 'robbery,' only c) makes sense. Similarly, analyzing the examples for which R2C chooses the right answer but the wrong rationale allows us to gain more insight into its understanding of the world. In the third row, the model incorrectly believes there is a crib, while assigning less probability mass to the correct rationale: that [person2] is being shown a photo of [person4]'s children, which is why [person2] might say how cute they are.

Figure 6: Qualitative examples from R2C. Correct predictions are highlighted in blue. Incorrect predictions are in red with the correct choices bolded. For more predictions, see visualcommonsense.com/explore.

7. Related Work

Question Answering Visual Question Answering [5] was one of the first large-scale datasets that framed visual understanding as a QA task, with questions about COCO images [49] typically answered with a short phrase. This line of work also includes 'pointing' questions [45, 93] and templated questions with open ended answers [86] . Recent datasets also focus on knowledge-base style content [80, 83] . On the other hand, the answers in VCR are entire sentences, and the knowledge required by our dataset is largely background knowledge about how the world works.

Recent work also includes movie or TV-clip based QA [75, 51, 46] . In these settings, a model is given a video clip, often alongside additional language context such as subtitles, a movie script, or a plot summary. 12 In contrast, VCR features no extra language context besides the question. Moreover, the use of explicit detection tags means that there is no need to perform person identification [66] or linkage with subtitles.

An orthogonal line of work has been on referring expressions: asking to which image region a natural language sentence refers [60, 52, 65, 87, 88, 59, 36, 33]. We explicitly avoid referring expression-style questions by using indexed detection tags (like [person1]).

Last, some work focuses on commonsense phenomena, such as 'what if' and 'why' questions [79, 58] . However, the space of commonsense inferences is often limited by the underlying dataset chosen (synthetic [79] or COCO [58] scenes). In our work, we ask commonsense questions in the context of rich images from movies.

Explainability AI models are often right, but for questionable or vague reasons [7] . This has motivated work in having models provide explanations for their behavior, in the form of a natural language sentence [31, 9, 41] or an attention map [32, 35, 37] . Our rationales combine the best of both of these approaches, as they involve both natural language text as well as references to image regions. Additionally, while it is hard to evaluate the quality of generated model explanations, choosing the right rationale in VCR is a multiple choice task, making evaluation straightforward.

Commonsense Reasoning Our task unifies work involving reasoning about commonsense phenomena, such as physics [54, 84] , social interactions [2, 77, 12, 27] , procedure understanding [91, 3] and predicting what might happen next in a video [74, 17, 92, 78, 18, 64, 85] .

Adversarial Datasets Past work has proposed the idea of creating adversarial datasets, whether by balancing the dataset with respect to priors [25, 28, 62] or switching them at test time [1] . Most relevant to our dataset construction methodology is the idea of Adversarial Filtering [89] . 13 Correct answers are human-written, while wrong answers are chosen from a pool of machine-generated text that is further validated by humans. However, the correct and wrong answers come from fundamentally different sources, which raises the concern that models can cheat by performing authorship identification rather than reasoning over the image. In contrast, in Adversarial Matching, the wrong choices come from the exact same distribution as the right choices, and no human validation is needed.

8. Conclusion

In this paper, we introduced Visual Commonsense Reasoning, along with VCR, a large dataset for the task that was built using Adversarial Matching. We presented R2C, a model for this task, but the challenge of cognition-level visual understanding is far from solved.