
SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference


Abstract

Given a partial description like “she opened the hood of the car,” humans can reason about the situation and anticipate what might come next (”then, she examined the engine”). In this paper, we introduce the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning. We present SWAG, a new dataset with 113k multiple choice questions about a rich spectrum of grounded situations. To address the recurring challenges of the annotation artifacts and human biases found in many existing datasets, we propose Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data. To account for the aggressive adversarial filtering, we use state-of-the-art language models to massively oversample a diverse set of potential counterfactuals. Empirical results demonstrate that while humans can solve the resulting inference problems with high accuracy (88%), various competitive models struggle on our task. We provide comprehensive analysis that indicates significant opportunities for future research.

1 Introduction

When we read a story, we bring to it a large body of implicit knowledge about the physical world. For instance, given the context "on stage, a woman takes a seat at the piano," shown in Table 1 , we can easily infer what the situation might look like: a woman is giving a piano performance, with a crowd watching her. We can furthermore infer her likely next action: she will most likely set her fingers on the piano keys and start playing.

Table 1: Examples from Swag; the correct answer is bolded. Adversarial Filtering ensures that stylistic models find all options equally appealing.

On stage, a woman takes a seat at the piano. She
a) sits on a bench as her sister plays with the doll.
b) smiles with someone as the music plays.
c) is in the crowd, watching the dancers.
d) nervously sets her fingers on the keys.

A girl is going across a set of monkey bars. She
a) jumps up across the monkey bars.
b) struggles onto the monkey bars to grab her head.
c) gets to the end and stands on a wooden plank.
d) jumps up and does a back flip.

The woman is now blow drying the dog. The dog
a) is placed in the kennel next to a woman's feet.
b) washes her face with the shampoo.
c) walks into frame and walks towards the dog.
d) tried to cut her face, so she is trying to do something very close to her face.

This type of natural language inference requires commonsense reasoning, substantially broadening the scope of prior work that focused primarily on linguistic entailment (Chierchia and McConnell-Ginet, 2000). Whereas the dominant entailment paradigm asks if two natural language sentences (the 'premise' and the 'hypothesis') describe the same set of possible worlds (Dagan et al., 2006; Bowman et al., 2015), here we focus on whether a (multiple-choice) ending describes a possible (future) world that can be anticipated from the situation described in the premise, even when it is not strictly entailed. Making such inference necessitates a rich understanding about everyday physical situations, including object affordances (Gibson, 1979) and frame semantics (Baker et al., 1998).

A first step toward grounded commonsense inference with today's deep learning machinery is to create a large-scale dataset. However, recent work has shown that human-written datasets are susceptible to annotation artifacts: unintended stylistic patterns that give out clues for the gold labels (Gururangan et al., 2018; Poliak et al., 2018). As a result, models trained on such datasets with human biases run the risk of over-estimating the actual performance on the underlying task, and are vulnerable to adversarial or out-of-domain examples (Wang et al., 2018; Glockner et al., 2018).

In this paper, we introduce Adversarial Filtering (AF), a new method to automatically detect and reduce stylistic artifacts. We use this method to construct Swag: an adversarial dataset with 113k multiple-choice questions. We start with pairs of temporally adjacent video captions, each with a context and a follow-up event that we know is physically possible. We then use a state-of-the-art language model fine-tuned on this data to massively oversample a diverse set of possible negative sentence endings (or counterfactuals). Next, we filter these candidate endings aggressively and adversarially using a committee of trained models to obtain a population of de-biased endings with similar stylistic features to the real ones. Finally, these filtered counterfactuals are validated by crowd workers to further ensure data quality.

Extensive empirical results demonstrate unique contributions of our dataset, complementing existing datasets for natural language inference (NLI) (Bowman et al., 2015; Williams et al., 2018) and commonsense reasoning (Roemmele et al., 2011; Zhang et al., 2017). First, our dataset poses a new challenge of grounded commonsense inference that is easy for humans (88%) while hard for current state-of-the-art NLI models (<60%). Second, our proposed adversarial filtering methodology allows for cost-effective construction of a large-scale dataset while substantially reducing known annotation artifacts. The generality of adversarial filtering allows it to be applied to build future datasets, ensuring that they serve as reliable benchmarks.

2 Swag: Our new dataset

We introduce a new dataset for studying physically grounded commonsense inference, called Swag. 1 Our task is to predict which event is most likely to occur next in a video. More formally, a model is given a context c = (s, n): a complete sentence s and a noun phrase n that begins a second sentence, as well as a list of possible verb phrase sentence endings V = {v_1, ..., v_4}. See Figure 1 for an example triple (s, n, v_i). The model must then select the most appropriate verb phrase v_î ∈ V.
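For concreteness, a single question can be viewed as the following structure, using the piano example from Table 1 (the field names below are purely illustrative, not the released data format):

```python
instance = {
    "context_sentence": "On stage, a woman takes a seat at the piano.",  # s
    "second_sentence_start": "She",                                      # noun phrase n
    "endings": [                                                         # verb phrases v_1, ..., v_4
        "sits on a bench as her sister plays with the doll.",
        "smiles with someone as the music plays.",
        "is in the crowd, watching the dancers.",
        "nervously sets her fingers on the keys.",
    ],
    "gold_index": 3,  # the ending a human would anticipate (cf. Section 1)
}
```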

Figure 1: Overview of the data collection process. For a pair of sequential video captions, the second caption is split into noun and verb phrases. A language model generates many negative endings, of which a difficult subset are human-annotated.

1 Short for Situations With Adversarial Generations.


Overview Our corpus consists of 113k multiple choice questions (73k training, 20k validation, 20k test) and is derived from pairs of consecutive video captions from ActivityNet Captions (Krishna et al., 2017; Heilbron et al., 2015) and the Large Scale Movie Description Challenge (LSMDC; Rohrbach et al., 2017) . The two datasets are slightly different in nature and allow us to achieve broader coverage: ActivityNet contains 20k YouTube clips containing one of 203 activity types (such as doing gymnastics or playing guitar); LSMDC consists of 128k movie captions (audio descriptions and scripts). For each pair of captions, we use a constituency parser (Stern et al., 2017) to split the second sentence into noun and verb phrases ( Figure 1 ). 2 Each question has a human-verified gold ending and 3 distractors.

3 A Solution To Annotation Artifacts

In this section, we outline the construction of Swag. We seek dataset diversity while minimizing annotation artifacts, conditional stylistic patterns such as length and word-preference biases. For many NLI datasets, these biases have been shown to allow shallow models (e.g., bag-of-words) to obtain artificially high performance.

To avoid introducing easily "gamed" patterns, we present Adversarial Filtering (AF), a generally applicable treatment involving the iterative refinement of a set of assignments to increase the entropy under a chosen model family. We then discuss how we generate counterfactual endings, and finally, the models used for filtering.

Algorithm 1: Adversarial Filtering of negative samples.
while not converged do
    Split the dataset into dummy 'train' and 'test' portions
    Train a model f_θ on the training portion, obtaining parameters θ
    for each test instance i do
        Identify the easy negatives: A_i^easy = {j ∈ A_i : f_θ(x_i^+) > f_θ(x_i,j^-)}
        Replace N_easy easy indices j ∈ A_i^easy with adversarial indices k ∉ A_i satisfying f_θ(x_i,k^-) > f_θ(x_i,j^-)
    end for
end while

3.1 Formal Definition

In this section, we formalize what it means for a dataset to be adversarial. Intuitively, we say that an adversarial dataset for a model f is one on which f will not generalize, even if evaluated on test data from the same distribution. More formally, let our input space be X and the label space be Y. Our trainable classifier f, taking parameters θ, is defined as f_θ : X → R^|Y|. Let our dataset of size N be defined as D = {(x_i, y_i)}_{1≤i≤N}, and let the loss function over the dataset be L(f_θ, D). We say that a dataset is adversarial with respect to f if we expect high empirical error I over all leave-one-out train/test splits (Vapnik, 2000):

I(D, f) = (1/N) · Σ_{i=1}^{N} L(f_{θ_i}, {(x_i, y_i)})    (1)

where

θ_i = argmin_θ L(f_θ, D \ {(x_i, y_i)})    (2)

with regularization terms omitted for simplicity.

3.2 Adversarial Filtering (Af) Algorithm

In this section, we outline an approach for generating an adversarial dataset D, effectively maximizing empirical error I with respect to a family of trainable classifiers f. Without loss of generality, we consider the situation where we have N contexts, each associated with a single positive example (x_i^+, 1) ∈ X × Y, and a large population of context-specific negative examples (x_{i,j}^-, 0) ∈ X × Y, where 1 ≤ j ≤ N^- for each i. For instance, the negative examples could be incorrect relations in knowledge-base completion (Socher et al., 2013), or all words in a dictionary for a single-word cloze task (Zweig and Burges, 2011).

Our goal will be to filter the population of negative examples for each instance i down to a size k ≪ N^-. This will be captured by returning a set of assignments A, where for each instance the assignment is a k-subset A_i ⊆ {1, ..., N^-}. The filtered dataset will then be:

D_AF = { (x_i^+, 1), {(x_{i,j}^-, 0)}_{j ∈ A_i} }_{1 ≤ i ≤ N}    (3)

Unfortunately, optimizing I(D_AF, f) is difficult as A is global and non-differentiable. To address this, we present Algorithm 1. On each iteration, we split the data into dummy 'train' and 'test' splits. We train a model f on the training portion and obtain parameters θ, then use the remaining test portion to reassign the indices of A. For each context, we replace some number of 'easy' negatives in A that f_θ classifies correctly with 'adversarial' negatives outside of A that f_θ misclassifies. This process can be thought of as increasing the overall entropy of the dataset: given a strong model f_θ that is compatible with a random subset of the data, we aim to ensure it cannot generalize to the held-out set. We repeat this for several iterations to reduce the generalization ability of the model family f over arbitrary train/test splits.
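The AF loop can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions (a generic scoring model exposing fit/score methods, and a fixed number of swaps per instance), not the exact implementation used to build Swag:

```python
import random

def adversarial_filter(positives, negatives, make_model, k=5, n_easy=2, n_iters=50):
    """Iteratively refine the assignments A so that a stylistic model cannot
    separate positive endings from the retained negatives (Algorithm 1, sketched).

    positives:  list of N positive examples x_i^+
    negatives:  list of N lists of candidate negatives x_{i,j}^-
    make_model: factory returning an object with fit(X, y) and score(x) methods
    k:          number of negatives retained per instance
    """
    N = len(positives)
    # Start from an arbitrary assignment of k negative indices per instance.
    A = [list(range(k)) for _ in range(N)]

    for _ in range(n_iters):
        # Dummy train/test split over instances.
        order = list(range(N))
        random.shuffle(order)
        train, test = order[: N // 2], order[N // 2:]

        # Train the stylistic model on the training portion only.
        X, y = [], []
        for i in train:
            X.append(positives[i]); y.append(1)
            for j in A[i]:
                X.append(negatives[i][j]); y.append(0)
        model = make_model()
        model.fit(X, y)

        # On the held-out portion, swap 'easy' negatives for adversarial ones.
        for i in test:
            pos_score = model.score(positives[i])
            easy = [j for j in A[i] if model.score(negatives[i][j]) < pos_score]
            outside = [j for j in range(len(negatives[i])) if j not in A[i]]
            for j in easy[:n_easy]:
                harder = [m for m in outside
                          if model.score(negatives[i][m]) > model.score(negatives[i][j])]
                if harder:
                    swap = random.choice(harder)
                    A[i][A[i].index(j)] = swap
                    outside.remove(swap)
    return A
```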

3.3 Generating Candidate Endings

To generate counterfactuals for Swag, we use an LSTM (Hochreiter and Schmidhuber, 1997) language model (LM), conditioned on contexts from video captions. We first pretrain on BookCorpus, then finetune on the video caption datasets. The architecture uses standard best practices and was validated on held-out perplexity of the video caption datasets; details are in the appendix. We use the LM to sample N^- = 1023 unique endings for a partial caption. 3 Importantly, we greedily sample the endings, since beam search decoding biases the generated endings to be of lower perplexity (and thus easily distinguishable from found endings). We find this process gives good counterfactuals: the generated endings tend to use topical words, but often make little sense physically, making them perfect for our task. Further, the generated endings are marked as "gibberish" by humans only 9.1% of the time (Sec 3.5); in that case the ending is filtered out.
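One reading of this beam-free sampling step is drawing each ending token by token from the LM's distribution. The sketch below assumes a placeholder lm interface (init_state, next_token_probs, advance); it is an illustration of the idea, not the actual model code:

```python
import torch

def sample_unique_endings(lm, context_ids, n_endings=1023, max_len=25, eos_id=2):
    """Sample endings token by token from the LM's distribution, conditioned on the
    context, so that their perplexity profile resembles that of the found endings."""
    endings = set()
    while len(endings) < n_endings:
        state = lm.init_state(context_ids)        # assumed API: condition on the context
        tokens = []
        for _ in range(max_len):
            probs = lm.next_token_probs(state)    # assumed API: distribution over the vocab
            token = torch.multinomial(probs, 1).item()
            if token == eos_id:
                break
            tokens.append(token)
            state = lm.advance(state, token)      # assumed API: feed back the sampled token
        endings.add(tuple(tokens))
    return [list(e) for e in endings]
```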

Figure 2: Test accuracy by AF iteration, under the negatives given by A. The accuracy drops from around 60% to close to random chance. For efficiency, the first 100 iterations only use the MLP.

3.4 Stylistic Models For Adversarial Filtering

In creating Swag, we designed the model family f to pick up on low-level stylistic features that we posit should not be predictive of whether an event happens next in a video. These stylistic features are an obvious case of annotation artifacts (Cai et al., 2017; Schwartz et al., 2017). 4 Our final classifier is an ensemble of four stylistic models:

1. A multilayer perceptron (MLP) given LM perplexity features and context/ending lengths.
2. A bag-of-words model that averages the word embeddings of the second sentence as features.
3. A one-layer CNN, with filter sizes ranging from 2-5, over the second sentence.
4. A bidirectional LSTM over the 100 most common words in the second sentence; uncommon words are replaced by their POS tags.

We ensemble the models by concatenating their final representations and passing it through an MLP. On every adversarial iteration, the ensemble is trained jointly to minimize cross-entropy. The accuracies of these models (at each iteration, evaluated on a 20% split of the test dataset before indices of A get remapped) are shown in Figure 2. Performance decreases from 60% to close to random chance; moreover, confusing the perplexity-based MLP is not sufficient to lower performance of the ensemble. Only once the other stylistic models are added does the ensemble accuracy drop substantially, suggesting that our approach is effective at reducing stylistic artifacts.

4 A broad definition of annotation artifacts might include aspects besides lexical/stylistic features: for instance, certain events are less likely semantically regardless of the context (e.g. riding a horse using a hose). For this work, we erred more conservatively and only filtered based on style.
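To illustrate the kind of shallow signal the first stylistic model exploits, here is a rough sketch of a perplexity-and-length classifier; scikit-learn stands in for the actual MLP, and lm_perplexity is an assumed helper rather than real project code:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def stylistic_features(context, ending, lm_perplexity):
    """Low-level features that should not predict whether an event happens next."""
    return np.array([
        lm_perplexity(ending),                   # assumed helper returning a float
        lm_perplexity(context + " " + ending),
        len(context.split()),                    # context length in tokens
        len(ending.split()),                     # ending length in tokens
    ])

# X: one feature row per (context, ending) pair; y: 1 for found endings, 0 for generated ones.
clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500)
# clf.fit(X, y)   # in Swag this model is trained jointly with the other three in the ensemble
```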

Imagine that you are watching a video clip. The clip has a caption, but it is missing the final phrase. Please choose the best 2 caption endings, and classify each as:

• likely, if it completes the caption in a reasonable way;

• unlikely, if it sounds ridiculous or impossible;

• gibberish if it has such serious errors that it doesn't feel like a valid English sentence.

Example: Someone is shown sitting on a fence and talking to the camera while pointing out horses. Someone

• stands in front of a podium. (likely, second best)

• rides a horse using a hose. (unlikely)

• is shown riding a horse. (likely, best)

• , the horse in a plaza field. (gibberish)

Figure 3: Mechanical Turk instructions (abridged).


3.5 Human Verification

The final data-collection step is to have humans verify the data. Workers on Amazon Mechanical Turk were given the caption context, as well as six candidate endings: one found ending and five adversarially-sampled endings. The task was twofold: Turkers ranked the endings independently as likely, unlikely, or gibberish, and selected the best and second best endings (Fig 3) .

We obtained the correct answers to each context in two ways. If a Turker ranks the found ending as either best or second best (73.7% of the time), we add the found ending as a gold example, with negatives from the generations not labelled best or gibberish. Further, if a Turker ranks a generated ending as best, and the found ending as second best, then we have reason to believe that the generation is good. This lets us add an additional training example, consisting of the generated best ending as the gold, and remaining generations as negatives. 5 Examples with ≤3 nongibberish endings were filtered out. 6 We found after 1000 examples that the annotators tended to have high agreement, also generally choosing found endings over generations (see Table 2 ). Thus, we collected the remaining 112k examples with one annotator each, periodically verifying that annotators preferred the found endings.
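The rules above for turning one worker's annotation into training examples can be summarized roughly as follows (a sketch with illustrative field names, not the actual pipeline code):

```python
def merge_annotation(found_ending, generated, ranks, labels):
    """found_ending: the real (found) ending.
    generated: list of generated endings shown to the worker.
    ranks: dict ending -> 'best', 'second', or None.
    labels: dict ending -> 'likely', 'unlikely', or 'gibberish'."""
    examples = []
    if ranks[found_ending] in ('best', 'second'):
        negatives = [e for e in generated
                     if ranks[e] != 'best' and labels[e] != 'gibberish']
        examples.append({'gold': found_ending, 'negatives': negatives})
        # If a generation outranked the found ending, it is likely a good ending too.
        if ranks[found_ending] == 'second':
            best_gen = next((e for e in generated if ranks[e] == 'best'), None)
            if best_gen is not None:
                rest = [e for e in generated if e is not best_gen]
                examples.append({'gold': best_gen, 'negatives': rest})
    return examples
```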

Table 2: Annotators tend to label the found ending as likely and within the top 2 (column 2); in other cases, the example is filtered out. Both label groups have high inter-annotator agreement, in terms of Krippendorff's α and pairwise percent agreement.

4 Experiments

In this section, we evaluate the performance of various NLI models on Swag. Recall that models for our dataset take the following form: given a sentence and a noun phrase as context c = (s, n), as well as a list of possible verb phrase endings V = {v_1, ..., v_4}, a model f_θ must select an ending index î that hopefully matches i_gold:

î = argmax_i f_θ(c, v_i)    (4)

To study the amount of bias in our dataset, we also consider models that take as input just the ending verb phrase v i , or the entire second sentence (n, v i ). For our learned models, we train f by minimizing multi-class cross-entropy. We consider three different types of word representations: 300d GloVe vectors from Common Crawl (Pennington et al., 2014) , 300d Numberbatch vectors retrofitted using ConceptNet relations (Speer et al., 2017) , and 1024d ELMo contextual representations that show improvement on a variety of NLP tasks, including standard NLI (Peters et al., 2018) . We follow the final dataset split (see Section 2) using two training approaches: training on the found data, and the found and highly-ranked generated data. See the appendix for more details.
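Concretely, each learned model assigns a potential to every ending, and we train with a softmax cross-entropy over the four candidates. A minimal PyTorch sketch of this objective (the scores could come from any of the encoders described next):

```python
import torch
import torch.nn.functional as F

def swag_loss(scores, gold_index):
    """scores: (batch, 4) tensor of potentials f_theta(c, v_i); gold_index: (batch,) long tensor.
    Training minimizes multi-class cross-entropy; prediction is the argmax (Eq. 4)."""
    return F.cross_entropy(scores, gold_index)

# prediction: i_hat = scores.argmax(dim=-1)
```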

4.1 Unary Models

The following models predict labels from a single span of text as input; this could be the ending only, the second sentence only, or the full passage.

a. fastText (Joulin et al., 2017): This library models a single span of text as a bag of n-grams, and tries to predict the probability of an ending being correct or incorrect independently. 7

b. Pretrained sentence encoders: We consider two types of pretrained RNN sentence encoders, SkipThoughts and InferSent (Conneau et al., 2017). SkipThoughts was trained by predicting adjacent sentences in book data, whereas InferSent was trained on supervised NLI data. For each second sentence (or just the ending), we feed the encoding into an MLP.

c. LSTM sentence encoder: Given an arbitrary span of text, we run a two-layer BiLSTM over it. The final hidden states are then max-pooled to obtain a fixed-size representation, which is then used to predict the potential for that ending.

4.2 Binary Models

The following models predict labels from two spans of text. We consider two possibilities for these models: using just the second sentence, where the two text spans are n, v_i, or using the context and the second sentence, in which case the spans are s, (n, v_i). The latter case includes many models developed for the NLI task.

d. Dual Bag-of-Words: For this baseline, we treat each sentence as a bag-of-embeddings (c, v_i). We model the probability of picking an ending i using a bilinear model: softmax_i(c W v_i^T). 8

e. Dual pretrained sentence encoders: Here, we obtain representations from SkipThoughts or InferSent for each span, and compute their pairwise compatibility using either 1) a bilinear model or 2) an MLP from their concatenated representations.

f. SNLI inference: Here, we consider two models that do well on SNLI (Bowman et al., 2015): Decomposable Attention and ESIM (Chen et al., 2017). We use pretrained versions of these models (with ELMo embeddings) on SNLI to obtain 3-way entailment, neutral, and contradiction probabilities for each example. We then train a log-linear model using these 3-way probabilities as features.

g. SNLI models (retrained): Here, we train ESIM and Decomposable Attention on our dataset: we simply change the output layer size to 1 (the potential of an ending v_i) with a softmax over i.
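As an example of a pairwise model, the Dual Bag-of-Words bilinear scorer in (d) might look like the sketch below; the shapes and initialization are assumptions, not the exact implementation:

```python
import torch
import torch.nn as nn

class DualBoWBilinear(nn.Module):
    """Scores each ending i as c W v_i^T, where c and v_i are averaged word embeddings."""
    def __init__(self, dim=300):
        super().__init__()
        self.W = nn.Parameter(torch.randn(dim, dim) * 0.01)

    def forward(self, context_emb, ending_embs):
        # context_emb: (batch, dim); ending_embs: (batch, 4, dim)
        scores = torch.einsum('bd,de,bie->bi', context_emb, self.W, ending_embs)
        return scores  # fed to a softmax / cross-entropy over the four endings
```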

4.3 Other Models

We also considered the following models:

h. Length: Although length was used by the adversarial classifier, we want to verify that human validation didn't reintroduce a length bias. For this baseline, we always choose the shortest ending.

i. ConceptNet: As our task requires world knowledge, we tried a rule-based system on top of the ConceptNet knowledge base (Speer et al., 2017). For an ending sentence, we use the spaCy dependency parser to extract the head verb and its dependent object. The ending score is given by the number of ConceptNet causal relations 9 between synonyms of the verb and synonyms of the object.

j. Human performance: To benchmark human performance, five Mechanical Turk workers were asked to answer 100 dataset questions, as did an 'expert' annotator (the first author of this paper). Predictions were combined using a majority vote.

Table 3: Performance of all models in accuracy (%). All models substantially underperform humans, although performance increases as more context is provided (left to right). We optionally train on found endings only, or found and human-validated generated endings (found+gen).
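Returning to the ConceptNet baseline (i), the scorer can be sketched as follows; spaCy is used for the parse, while the causal-relation triples and synonym sets are shown as assumed pre-extracted tables rather than the real ConceptNet API:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
CAUSAL_RELATIONS = {"Causes", "CapableOf", "ReceivesAction", "UsedFor", "HasSubevent"}

def ending_score(ending, causal_edges, synonyms):
    """causal_edges: set of (head, relation, tail) triples (assumed pre-extracted from ConceptNet).
    synonyms: dict mapping a lemma to a set of synonym lemmas (assumed pre-extracted)."""
    doc = nlp(ending)
    verbs = [t for t in doc if t.pos_ == "VERB"]
    if not verbs:
        return 0
    verb = verbs[0]
    objects = [c for c in verb.children if c.dep_ in ("dobj", "obj")]
    if not objects:
        return 0
    obj = objects[0]
    score = 0
    for v in synonyms.get(verb.lemma_, {verb.lemma_}):
        for o in synonyms.get(obj.lemma_, {obj.lemma_}):
            score += sum(1 for (head, rel, tail) in causal_edges
                         if rel in CAUSAL_RELATIONS and {head, tail} == {v, o})
    return score
```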

4.4 Results

We present our results in Table 3. The best model that only uses the ending is the LSTM sequence model with ELMo embeddings, which obtains 43.6%. This model, as with most models studied, greatly improves with more context: by 3.1% when given the initial noun phrase, and by an additional 4% when also given the first sentence.

9 We used the relations 'Causes', 'CapableOf', 'ReceivesAction', 'UsedFor', and 'HasSubevent'. Though their coverage is low (30.4% of questions have an answer with ≥1 causal relation), the more frequent relations in ConceptNet, such as 'IsA', at best only indirectly relate to our task.

Further improvement is gained from models that compute pairwise representations of the inputs. While the simplest such model, Dual-BoW, obtains only 35.1% accuracy, combining InferSent sentence representations gives 40.5% accuracy (InferSent-Bilinear). The best results come from pairwise NLI models: when fully trained on Swag, ESIM+ELMo obtains 59.2% accuracy.

When comparing machine results to human results, we see there exists a lot of headroom. Though there likely is some noise in the task, our results suggest that humans (even untrained) converge to a consensus. Our in-house "expert" annotator is outperformed by an ensemble of 5 Turk workers (with 88% accuracy); thus, the effective upper bound on our dataset is likely even higher.

5.1 Swag Versus Existing NLI Datasets

The past few years have yielded great advances in NLI and representation learning, due to the availability of large datasets like SNLI and MultiNLI (Bowman et al., 2015; Williams et al., 2018).

Figure 4: Top: Distribution of the 40 top verbs in the union of SNLI and Swag. Our dataset shows a greater variety of dynamic verbs, such as “move”, as well as temporal verbs such as “start” and “come.” “Continue” is cut off for SNLI (it has frequency 6 · 10−5). Bottom: CDF for verbs in SNLI and Swag.

With the release of Swag, we hope to continue this trend, particularly as our dataset largely has the same input/output format as other NLI datasets. We observe three key differences between our dataset and others in this space: First, as noted in Section 1, Swag requires a unique type of temporal reasoning. A state-of-the-art NLI model such as ESIM, when bottlenecked through the SNLI notion of entailment (SNLI-ESIM), only obtains 36.1% accuracy. 10 This implies that these datasets necessitate different (and complementary) forms of reasoning.

Second, our use of videos results in wide coverage of dynamic and temporal situations. Compared with SNLI, whose contexts come from Flickr30K (Plummer et al., 2017) image captions, Swag has more active verbs like 'pull' and 'hit,' and fewer static verbs like 'sit' and 'wear' (Figure 4). 11

Third, our dataset suffers from few lexical biases. Whereas fastText, a bag-of-n-grams model, obtains 67.0% accuracy on SNLI versus a 34.3% baseline (Gururangan et al., 2018), fastText obtains only 29.0% accuracy on Swag. 12

5.2 Error Analysis

We sought to quantify how human judgments differ from the best studied model, ESIM+ELMo. We randomly sampled 100 validation questions that ESIM+ELMo answered incorrectly, extracting for each both the gold ending and the model's preferred ending. We asked 5 Amazon Mechanical Turk workers to pick the better ending (they preferred the gold endings 94% of the time) and to select one (or more) multiple-choice reasons explaining why the chosen answer was better. The options, and their frequencies, are outlined in Table 4. The most common reason for the workers preferring the correct answer is situational (52.3% of the time), followed by weirdness (17.5%) and plausibility (14.4%). This suggests that ESIM+ELMo already does a good job at filtering out weird and implausible answers, with the main bottleneck being grounded physical understanding. The ambiguous percentage is also relatively low (12.0%), implying significant headroom.

Table 4: Justifications for ranking the gold answer over a wrong answer chosen by ESIM+ELMo.

Reason | Explanation | Freq.
Situational | The good ending is better in context. | 53.7%
Plausibility | The bad ending is implausible regardless of context. |
Novelty | The bad ending seems redundant; it is entailed by the context. | 1.8%
Weirdness | The bad ending is semantically or grammatically malformed, e.g. 'the man is getting out of the horse.' | 18.1%
Ambiguous | Both endings seem equally likely. | 12.0%

10 The weights of SNLI-ESIM pick up primarily on entailment probability (0.59), as with neutral (0.46), while contradiction is negatively correlated (-0.42).

11 Video data has other language differences; notably, character names in LSMDC were replaced by 'someone'.

12 The most predictive individual words on Swag are infrequent: 'dotted', with P(+|dotted) = 77% and 10.3 counts, and 'similar', with P(−|similar) = 81% and 16.3 counts. (Counts from negative endings were discounted 3x, as there are 3 times as many negative endings as positive endings.)

5.3 Qualitative Examples

Last, we show several qualitative examples in Table 5. Though models can do decently well by identifying complex alignment patterns between the two sentences (e.g. being "up a tree" implies that "tree" is the end phrase), the incorrect model predictions suggest this strategy is insufficient. For instance, answering "An old man rides a small bumper car" requires knowledge about bumper cars and how they differ from regular cars: bumper cars are tiny, don't drive on roads, and don't work in parking lots, eliminating the alternatives. However, this knowledge is difficult to extract from existing corpora: for instance, the ConceptNet entry for Bumper Car has only a single relation: bumper cars are a type of vehicle. Other questions require intuitive physical reasoning: e.g., for "he pours the raw egg batter into the pan," reasoning about what happens next in making an omelet.

Table 5: Example questions answered by the best model, ESIM+ELMo, sorted by model probability. Correct model predictions are in blue, incorrect model predictions are red. The right answers are bolded.

5.4 Where To Go Next?

Our results suggest that Swag is a challenging testbed for NLI models. However, the adversarial models used to filter the dataset are purely stylistic and focus on the second sentence; thus, subtle artifacts still likely remain in our dataset. These patterns are ostensibly picked up by the NLI models (particularly when using ELMo features), but the large gap between machine and human performance suggests that more is required to solve the dataset. As models are developed for commonsense inference, and more broadly as the field of NLP advances, we note that AF can be used again to create a more adversarial version of Swag using better language models and AF models.

6 Related Work

Entailment NLI There has been a long history of NLI benchmarks focusing on linguistic entailment (Cooper et al., 1996; Dagan et al., 2006; Marelli et al., 2014; Bowman et al., 2015; Lai et al., 2017; Williams et al., 2018). Recent NLI datasets in particular have supported learning broadly-applicable sentence representations (Conneau et al., 2017); moreover, models trained on these datasets were used as components for performing better video captioning (Pasunuru and Bansal, 2017), summarization (Pasunuru and Bansal, 2018), and generation (Holtzman et al., 2018), confirming the importance of NLI research. The NLI task requires a variety of commonsense knowledge (LoBue and Yates, 2011), which our work complements. However, previous datasets for NLI have been challenged by unwanted annotation artifacts (Gururangan et al., 2018; Poliak et al., 2018) or scale issues. Our work addresses these challenges by constructing a new NLI benchmark focused on grounded commonsense reasoning, and by introducing an adversarial filtering mechanism that substantially reduces known and easily detectable annotation artifacts.

Commonsense NLI Several datasets have been introduced to study NLI beyond linguistic entailment: for inferring likely causes and endings given a sentence (COPA; Roemmele et al., 2011), for choosing the most sensible ending to a short story (RocStories; Sharma et al., 2018), and for predicting the likelihood of a hypothesis by regressing to an ordinal label (JOCI; Zhang et al., 2017). These datasets are relatively small: 1k examples for COPA and 10k cloze examples for RocStories. 13 JOCI increases the scale by generating the hypotheses using a knowledge graph or a neural model. In contrast to JOCI, where the task was formulated as a regression task on the degree of plausibility of the hypothesis, we frame commonsense inference as a multiple choice question to reduce the potential ambiguity in the labels and to allow for direct comparison between machines and humans. In addition, Swag's use of adversarial filtering increases diversity of situations and counterfactual generation quality. Last, another related task formulation is sentence completion or cloze, where the task is to predict a single word that is removed from a given context (Zweig and Burges, 2011; Paperno et al., 2016). 14 Our work in contrast requires longer textual descriptions to reason about.

Vision datasets Several resources have been introduced to study temporal inference in vision. The Visual Madlibs dataset has 20k image captions about hypothetical next/previous events (Yu et al., 2015); similar to our work, the test portion is multiple-choice, with counterfactual answers retrieved from similar images and verified by humans. The question of 'what will happen next?' has also been studied in photo albums (Huang et al., 2016), videos of team sports (Felsen et al., 2017), and egocentric dog videos (Ehsani et al., 2018). Last, annotation artifacts are also a recurring problem for vision datasets such as Visual Genome (Zellers et al., 2018) and Visual QA (Jabri et al., 2016); recent work was done to create a more challenging VQA dataset by annotating complementary image pairs (Goyal et al., 2016).

Reducing gender/racial bias Prior work has sought to reduce demographic biases in word embeddings (Zhang et al., 2018) as well as in image recognition models (Zhao et al., 2017). Our work has focused on producing a dataset with minimal annotation artifacts, which in turn helps to avoid some gender and racial biases that stem from elicitation. However, it is not perfect in this regard, particularly due to biases in movies (Schofield and Mehr, 2016). Our methodology could potentially be extended to construct datasets free of (possibly intersectional) gender or racial bias.

Physical knowledge Prior work has studied learning grounded knowledge about objects and verbs: from knowledge bases (Li et al., 2016) , syntax parses (Forbes and Choi, 2017) , word embeddings (Lucy and Gauthier, 2017) , and images and dictionary definitions (Zellers and Choi, 2017) . An alternate thread of work has been to learn scripts: high-level representations of event chains (Schank and Abelson, 1975; Chambers and Jurafsky, 2009) . Swag evaluates both of these strands.

14 Prior work on sentence completion filtered negatives with heuristics based on LM perplexities. We initially tried something similar, but found the result to still be gameable.

7 Conclusion

We propose a new challenge of physically situated commonsense inference that broadens the scope of natural language inference (NLI) with commonsense reasoning. To support research toward commonsense NLI, we create a large-scale dataset Swag with 113k multiple-choice questions. Our dataset is constructed using Adversarial Filtering (AF), a new paradigm for robust and cost-effective dataset construction that allows datasets to be constructed at scale while automatically reducing annotation artifacts that can be easily detected by a committee of strong baseline models. Our adversarial filtering paradigm is general, allowing potential applications to other datasets that require human composition of question answer pairs.

2 We filter out sentences with rare tokens (≤3 occurrences), that are short (l ≤ 5), or that lack a verb phrase.

3 To ensure that the LM generates unique endings, we split the data into five validation folds and train five separate LMs, one for each set of training folds. This means that each LM never sees the found endings during training.

5 These two examples share contexts. To prevent biasing the test and validation sets, we didn't perform this procedure on answers from the evaluation sets' context.

6 To be data-efficient, we reannotated filtered-out examples by replacing gibberish endings, as well as generations that outranked the found ending, with candidates from A.

7 The fastText model is trained using binary cross-entropy; at test time we extract the prediction by selecting the ending with the highest positive likelihood under the model.

8 We also tried using an MLP, but got worse results.

13 For RocStories, this was by design to encourage learning from the larger corpus of 98k sensible stories.