
HellaSwag: Can a Machine Really Finish Your Sentence?


Abstract

Recent work by Zellers et al. (2018) introduced a new task of commonsense natural language inference: given an event description such as “A woman sits at a piano,” a machine must select the most likely followup: “She sets her fingers on the keys.” With the introduction of BERT, near human-level performance was reached. Does this mean that machines can perform human level commonsense inference? In this paper, we show that commonsense inference still proves difficult for even state-of-the-art models, by presenting HellaSwag, a new challenge dataset. Though its questions are trivial for humans (>95% accuracy), state-of-the-art models struggle (<48%). We achieve this via Adversarial Filtering (AF), a data collection paradigm wherein a series of discriminators iteratively select an adversarial set of machine-generated wrong answers. AF proves to be surprisingly robust. The key insight is to scale up the length and complexity of the dataset examples towards a critical ‘Goldilocks’ zone wherein generated text is ridiculous to humans, yet often misclassified by state-of-the-art models. Our construction of HellaSwag, and its resulting difficulty, sheds light on the inner workings of deep pretrained models. More broadly, it suggests a new path forward for NLP research, in which benchmarks co-evolve with the evolving state-of-the-art in an adversarial way, so as to present ever-harder challenges.

1 Introduction

Imagine a woman chasing a dog around outside, trying to give it a bath. What might happen next? Humans can read a narrative like this, shown in Figure 1, and connect it to a rich model of the world: the dog is currently dry and not soapy, and it actively doesn't want to be bathed. Thus, one plausible next event is option C: that she'll get the dog wet and it will run away again.

A woman is outside with a bucket and a dog. The dog is running around trying to avoid a bath. She…
A. rinses the bucket off with soap and blow dry the dog's head.
B. uses a hose to keep it from getting soapy.
C. gets the dog wet, then it runs away again.
D. gets into a bath tub with the dog.

Figure 1: Models like BERT struggle to finish the sentences in HellaSwag, even when they come from the same distribution as the training set. While the wrong endings are on-topic, with words that relate to the context, humans consistently judge their meanings to be either incorrect or implausible. For example, option A of the WikiHow passage suggests that a driver should stop at a red light for no more than two seconds.

How to determine who has right of way. Come to a complete halt at a stop sign or red light. At a stop sign, come to a complete halt for about 2 seconds or until vehicles that arrived before you clear the intersection. If you're stopped at a red light, proceed when the light has turned green. …
A. Stop for no more than two seconds, or until the light turns yellow. A red light in front of you indicates that you should stop.
B. After you come to a complete stop, turn off your turn signal. Allow vehicles to move in different directions before moving onto the sidewalk.
C. Stay out of the oncoming traffic. People coming in from behind may elect to stay left or right.
D. If the intersection has a white stripe in your lane, stop before this line. Wait until all traffic has cleared before crossing the intersection.



When the SWAG dataset was first announced (Zellers et al., 2018), this new task of commonsense natural language inference seemed trivial for humans (88%) and yet challenging for then-state-of-the-art models (<60%), including ELMo (Peters et al., 2018). However, BERT (Devlin et al., 2018) soon reached over 86%, almost human-level performance. One news article on this development was headlined "finally, a machine that can finish your sentence." 1 In this paper, we investigate the following question: How well do deep pretrained models, like BERT, perform at commonsense natural language inference (NLI)? Our surprising conclusion is that the underlying task remains unsolved. Indeed, we find that deep models such as BERT do not demonstrate robust commonsense reasoning ability by themselves. Instead, they operate more like rapid surface learners for a particular dataset. Their strong performance on SWAG is dependent on the finetuning process, wherein they largely learn to pick up on dataset-specific distributional biases. When the distribution of language shifts slightly, performance drops drastically, even if the domain remains identical.

We study this question by introducing HellaSwag, 2 a new benchmark for commonsense NLI. We use Adversarial Filtering (AF), a data-collection paradigm in which a series of discriminators is used to select a challenging set of generated wrong answers. AF is surprisingly effective towards this goal: the resulting dataset of 70k problems is easy for humans (95.6% accuracy), yet challenging for machines (<50%). This result holds even when models are given a significant number of training examples, and even when the test data comes from the exact same distribution as the training data. Machine performance slips an additional 5% when evaluated on examples that cover novel concepts from the same domain.

To make this dataset robust to deep pretrained models, we use a trifecta of state-of-the-art generators (Radford et al., 2018), state-of-the-art discriminators (BERT), and high-quality source text. We expand on SWAG's original video-captioning domain by adding WikiHow articles, greatly increasing the context diversity and generation length. Our investigation reveals a Goldilocks zone, roughly three sentences of context and two generated sentences, wherein generations are largely nonsensical, even though state-of-the-art discriminators cannot reliably tell the difference between these generations and the ground truth.

More broadly, our paper presents a case study towards a future of verified progress in NLP, via iterative rounds of building and breaking datasets. If our ultimate goal is to provide reliable benchmarks for challenging tasks, such as commonsense NLI, these benchmarks cannot be static. Instead, they must evolve together with the evolving state-of-the-art.

2 Background

Recent work (Gururangan et al., 2018; Poliak et al., 2018) has found that when humans write the endings to NLI questions, they introduce subtle yet strong class-conditional biases known as annotation artifacts. 3 To address this, Zellers et al. (2018) introduced Adversarial Filtering (AF). An overview is shown in Figure 2. The key idea is to produce a dataset D which is adversarial for any arbitrary split of (D_train, D_test). This requires a generator of negative candidates (i.e., wrong endings that violate human notions about how the world works), which we achieve by using a language model. Potential incorrect-answer candidates were massively oversampled from a language model trained on in-domain data, and then selected using an ensemble of adversaries. The selection process happens iteratively: on each iteration, the dataset is randomly partitioned into D_train and D_test. The ensemble is trained to classify endings as real or generated on D_train; then, AF replaces easy-to-classify generations in D_test. This process continues until the accuracy of these adversaries converges. Last, humans validate the data to remove adversarial endings that seem realistic. Importantly, AF creates a final dataset that is challenging to models regardless of the final dataset split. In Section 4, we will use AF as the underlying workhorse to construct an NLI dataset that is easy for humans, yet challenging for machines. This difficulty persists even when models are provided significant training data, and even when this data comes from the same distribution as the test set. This contrasts with past work on adversarial examples (e.g. Jia and Liang, 2017; Glockner et al., 2018; Belinkov and Bisk, 2018), which considers cases where an out-of-distribution test set is constructed to be adversarial.

Figure 2: An overview of Adversarial Filtering. On each iteration, a new classifier is trained on a dummy training set Dtrain to replace easily-classified negative endings on the dummy test set Dtest with adversarial endings. This process is repeated iteratively, to obtain a challenging dataset regardless of the final split.

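To make the AF loop concrete, here is a minimal sketch in Python. It is not the authors' released implementation: `classify_fn` (the adversary, e.g., an ensemble or BERT-Large) and `sample_endings_fn` (the language-model generator) are hypothetical placeholders, and the convergence test is simplified to a fixed accuracy threshold.

```python
import random

def adversarial_filtering(dataset, classify_fn, sample_endings_fn,
                          max_iters=50, target_acc=0.30, seed=0):
    """Minimal sketch of Adversarial Filtering (AF).

    dataset: list of dicts, each with a 'real' ending and a list of 'fakes'.
    classify_fn(train, test): trains a fresh adversary on `train`, then returns
        a dict {test_index: [fake endings it correctly flagged as machine-written]}.
    sample_endings_fn(item, k): draws k fresh machine-written candidate endings.
    """
    rng = random.Random(seed)
    for _ in range(max_iters):
        # Randomly re-partition into dummy train/test splits each iteration.
        rng.shuffle(dataset)
        mid = len(dataset) // 2
        d_train, d_test = dataset[:mid], dataset[mid:]

        easy = classify_fn(d_train, d_test)
        total_fakes = sum(len(item['fakes']) for item in d_test)
        adversary_acc = sum(len(v) for v in easy.values()) / max(1, total_fakes)
        if adversary_acc <= target_acc:
            break  # the adversary can no longer beat the target accuracy

        # Replace easily-classified generations with freshly sampled ones.
        for idx, caught in easy.items():
            item = d_test[idx]
            for fake in caught:
                item['fakes'].remove(fake)
                item['fakes'].extend(sample_endings_fn(item, 1))
    return dataset
```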

Figure 3: Validation accuracy on SWAG for BERT-Large versus training set size. The baseline (25% accuracy) is random chance. BERT does well given as few as 16 training examples, but requires tens of thousands of examples to approach human performance.

3 Investigating SWAG

In this section, we investigate why SWAG was solved. We focus on BERT, since it is the best known approach at the time of writing. 4

Figure 4: BERT validation accuracy when trained and evaluated under several versions of SWAG, with the new dataset HellaSwag as comparison. We compare:

Ending Only: No context is provided; just the endings.
Shuffled: Endings that are individually tokenized, shuffled, and then detokenized.
Shuffled + Ending Only: No context is provided and each ending is shuffled.

Core to our analysis is investigating how a model trained on Wikipedia and books can be so effectively finetuned for SWAG, a dataset from video captions.

3.1 How Much Innate Knowledge Does BERT Have About SWAG?

We investigate this question by measuring BERT's performance on SWAG while varying the size of the training dataset; results are shown in Figure 3. While the best known ELMo NLI model (ESIM+ELMo; Chen et al., 2017) requires the entire training set to reach 59%, BERT outperforms this given only 64 examples. However, BERT still needs upwards of 16k examples to approach human performance, around which it plateaus.

3.2 What Is Learned During Finetuning?

Figure 4 compares BERT's performance when trained and evaluated on variants of SWAG. Context: BERT's performance only slips 11.9 points (86.7% → 74.8%) when context is omitted (Ending Only), suggesting a bias exists in the endings themselves. 5 If a followup event seems unreasonable absent of context, then there must be something markedly different between the space of human-written and machine-generated endings. Structure: To distinguish word usage from structural patterns, we consider a new scenario,

Shuffled. Here the shared context is provided, but the words in each ending choice are randomly permuted. Surprisingly, this reduces BERT performance by less than 10%. Even though BERT was never exposed to randomly shuffled text during pretraining, it easily adapts to this setting, which suggests that BERT is largely performing lexical reasoning over each (context, answer) pair. Finally, when the context is removed and the words in each ending are shuffled, performance drops to 60.4%. While low, this is still higher than ELMo's performance (<60% from Zellers et al., 2018). As neither context nor structure is needed to discriminate between human- and machine-written endings in a majority of cases, it is likely that systems primarily learn to detect distributional stylistic patterns during finetuning.
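As a rough illustration of the Shuffled probe, the sketch below permutes the tokens of each ending while leaving the context untouched. Whitespace tokenization is a simplification; the exact tokenizer used for the probe is not specified here.

```python
import random

def shuffle_ending(ending, seed=0):
    """Tokenize an ending, permute its tokens, and detokenize (Shuffled probe)."""
    tokens = ending.split()          # simplification: whitespace tokenization
    random.Random(seed).shuffle(tokens)
    return " ".join(tokens)

print(shuffle_ending("gets the dog wet, then it runs away again."))
# The output is word salad to a human reader, yet BERT's accuracy on such
# permuted endings drops by less than 10 points.
```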

3.3 Where Do The Stylistic Biases Come From?

SWAG was constructed via Adversarial Filtering (AF). Endings were generated via a language model, and then selected to fool a discriminator. Understanding why SWAG was solved requires understanding the interplay of AF with SWAG's generators and discriminators.

Zellers et al. (2018) used a two-layer LSTM for generation, with shallow stylistic adversarial filters. 6 This setup was robust against ELMo models, but did the shallow LM in particular produce distributional artifacts that BERT picks up on?

To investigate this, we perform AF using BERT-Large as the discriminator 7 in two settings, comparing generations from Zellers et al. (2018) with those from a finetuned GPT (Radford et al., 2018) .

Strikingly, the results (Figure 5, left) show that the generations used in SWAG are so different from the human-written endings that AF never drops the accuracy to chance; instead, it converges to roughly 75%. On the other hand, GPT's generations are good enough that BERT accuracy drops below 30% over many random subsplits of the data, revealing the importance of the generator.

Figure 5: Adversarial Filtering (AF) results with BERT-Large as the discriminator. Left: AF applied to ActivityNet generations produced by Zellers et al. (2018)'s language model versus OpenAI GPT. While GPT converges at random, the LM used for SWAG converges at 75%. Right: AF applied to WikiHow generations from GPT, while varying the ending length from one to three sentences. They converge to random, ~40%, and ~50%, respectively.

4 HellaSwag

The success of BERT implies that high-quality generators and discriminators are crucial to AF's success. However, it does not imply that the underlying task of commonsense NLI, as opposed to a single dataset, is solved. Evaluating this claim requires building a new evolution of the SWAG dataset, one in which artifacts are removed. In this section, we do just that by introducing HellaSwag.

4.1 ActivityNet Captions

We start by including video captions from the ActivityNet Captions dataset (Krishna et al., 2017) . The original SWAG dataset contains these, along with captions from LSMDC (Rohrbach et al., 2017) , but for HellaSwag we solely used ActivityNet. In addition to temporal descriptions, ActivityNet also provides activity labels for each caption (e.g. jumping rope). We will use these activity labels as additional structure to test generalization ability.

4.2 WikiHow: A New Testbed

We next consider a new and challenging testbed for commonsense reasoning: completing how-to articles from WikiHow, an online how-to manual. We scrape 80k context and follow-up paragraphs from WikiHow, covering topics as diverse as "how to make an origami owl" and "how to survive a bank robbery." Each context has at most three sentences, as do the follow-ups.

AF's effectiveness in this new setting is shown in Figure 5 (right). We consider three settings, corresponding to endings that are one, two, or three sentences long. In all cases, BERT performance begins high (70-90%), but there are enough generations for Adversarial Filtering to lower the final accuracy considerably. While the one-sentence case converges to slightly higher than random (roughly 35%), the two- and three-sentence cases converge higher, at 40% and 50% respectively. Given more context, it becomes easier to classify an ending as machine- or human-written. We compromise and use two-sentence generations. Particularly in the two-sentence case, we find ourselves in a Goldilocks zone wherein generations are challenging for deep models, yet, as we shall soon see, easy for humans.

4.3 Obtaining High Human Agreement

How well can humans distinguish human-written endings from machine generations refined with Adversarial Filtering? In Figure 6, we compare human performance with that of BERT on a random 80%/20% split. We see a contrast between the ActivityNet and WikiHow performance. While ActivityNet starts off harder for BERT (25.5%), it also proves difficult for humans (60%). In contrast, WikiHow starts easier for BERT (41.1%) and humans find the domain almost trivial (93.5%). We hypothesize this discrepancy is due to the lengths of both datasets (Figure 7). WikiHow's two-sentence generations average 41 tokens, versus 13 for ActivityNet. This gives WikiHow generations three times as many opportunities to make a detectable mistake.

Figure 6: For HellaSwag, we ensure high human agreement through several rounds of annotation. By collecting how likely each ending is we can filter false negative endings – machine generations that sound realistic – and replace them with true negatives. On both subdatasets, BERT performance increases during validation, but the gap to human performance remains wide.
Figure 7: Lengths of ActivityNet and WikiHow; the latter with two-sentence generations. WikiHow is much longer, which corresponds to being easier for humans, while taking longer for AF to converge.

To ensure high agreement on ActivityNet, we perform several rounds of human filtering, increasing human performance to 94%. During human validation, crowd workers are given a context and six ending choices, of which one is the true ending and the other five are from AF. On each iteration, we replace machine-written endings that the worker rated as realistic with new samples. In the end, we keep the 25k best ActivityNet contexts (i.e., those with highest agreement among workers 8) and the 45k best WikiHow contexts.
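As a rough illustration (not the actual crowdsourcing pipeline), filtering contexts by worker agreement might look like the sketch below; `worker_votes` and the helper names are hypothetical.

```python
from collections import Counter

def agreement(item, worker_votes):
    """Fraction of workers who picked the true ending among the six shown."""
    votes = Counter(worker_votes[item["id"]])   # e.g. Counter({'real': 4, 'fake_2': 1})
    return votes["real"] / max(1, sum(votes.values()))

def keep_highest_agreement(items, worker_votes, k):
    """Keep the k contexts whose true ending workers agreed on most often
    (25k for ActivityNet and 45k for WikiHow in the paper)."""
    ranked = sorted(items, key=lambda it: agreement(it, worker_votes), reverse=True)
    return ranked[:k]
```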

4.4 Zero-Shot Categories For Evaluation

To evaluate a model's ability to generalize to new situations, we use category labels from WikiHow and ActivityNet to make 'zero-shot' evaluation sets. For each set (validation or test), we craft two subsets: one containing 5k 'in-domain' examples that come from categories seen during training (Figure 8), and another with 5k 'zero-shot' examples from randomly chosen held-out categories. In total, there are 70k dataset examples.
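A minimal sketch of how such a category-level split could be constructed; the field names and the number of held-out categories are illustrative assumptions, not the released splitting code.

```python
import random

def split_by_category(examples, num_held_out=25, seed=0):
    """Hold out whole activity/how-to categories so the zero-shot evaluation
    set shares no category labels with training (illustrative sketch)."""
    categories = sorted({ex["category"] for ex in examples})
    held_out = set(random.Random(seed).sample(categories, num_held_out))

    in_domain = [ex for ex in examples if ex["category"] not in held_out]
    zero_shot = [ex for ex in examples if ex["category"] in held_out]
    return in_domain, zero_shot
```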

Figure 8: Examples on the in-domain validation set of HellaSwag, grouped by category label. Our evaluation setup equally weights performance on categories seen during training as well as out-of-domain.
Table 1: Performance of models, evaluated with accuracy (%). We report results on the full validation and test sets (Overall), as well as results on informative subsets of the data: in-domain versus zero-shot situations, along with performance on the underlying data sources (ActivityNet versus WikiHow). All models substantially underperform humans: the gap is over 45% on in-domain categories, and 50% on zero-shot categories.

5 Results

We evaluate the difficulty of HellaSwag using a variety of strong baselines, with and without massive pretraining. The models share the same format: given a context and an ending, return a logit for that ending. Accordingly, we train our models using a four-way cross-entropy loss, where the objective is to predict the correct ending. In addition to BERT-Large, our comparisons include:
a. OpenAI GPT (Radford et al., 2018): A finetuned 12-layer transformer that was pretrained on the BookCorpus (Zhu et al., 2015).
b. BERT-Base: A smaller version of the BERT model whose architecture size matches GPT.
c. ESIM+ELMo (Chen et al., 2017; Peters et al., 2018): The best-performing ELMo model for NLI, modified slightly so the final output layer is a four-way softmax over endings.
d. LSTM sentence encoder: A randomly initialized two-layer bi-LSTM; the second layer's hidden states are max-pooled and fed into an MLP to predict the logit. We consider three variations: GloVe embeddings, ELMo embeddings, or (frozen) BERT-Base embeddings. 9
e. FastText (Joulin et al., 2017): An off-the-shelf library for bag-of-words text classification. 10
We compare all models to human performance by asking five independent crowd workers to solve the same four-way multiple choice problems; their predictions are combined via majority vote.
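The shared model format described above can be sketched as follows in Python with PyTorch; `score_fn` stands in for whichever encoder plus linear head a particular baseline uses, and is an assumption rather than the authors' exact training code.

```python
import torch
import torch.nn.functional as F

def multiple_choice_loss(score_fn, contexts, endings, labels):
    """Four-way cross-entropy over per-ending logits.

    score_fn(context, ending) -> scalar logit tensor (e.g., a linear layer on a
        pretrained encoder's pooled representation; the encoder is not shown).
    contexts: list of B context strings.
    endings:  list of B lists, each holding 4 candidate ending strings.
    labels:   LongTensor of shape (B,) with the index of the correct ending.
    """
    logits = torch.stack([
        torch.stack([score_fn(ctx, end) for end in ends])
        for ctx, ends in zip(contexts, endings)
    ])                                   # shape: (B, 4)
    return F.cross_entropy(logits, labels)
```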

Our results, shown in Table 1, hint at the difficulty of the dataset: human performance is over 95%, while overall model performance is below 50% for every model. Surprisingly, despite BERT-Large having been used as the adversarial filter, it still performs the strongest at 47.3% overall. By making the dataset adversarial for BERT, it seems to also have become adversarial for every other model. For instance, while ESIM+ELMo obtained 59% accuracy on SWAG, it obtains only 33.3% accuracy on HellaSwag.

In addition to pretraining being critical, so too is end-to-end finetuning. Freezing BERT-Base and adding an LSTM on top lowers its overall performance by 4.3%. This may help explain why models such as ESIM+ELMo struggled on SWAG, as ELMo isn't updated during finetuning.

While BERT is the best model, it still struggles on HellaSwag, and especially so on zero-shot categories. Performance drops roughly 5% on the test fold, which suggests that finetuning is not enough for BERT to learn to generalize to novel activities or how-to categories.

Last, we see that WikiHow is a much harder domain than ActivityNet for machines: 45% BERT-Large performance, versus 96.5% for humans. Curiously, it is on this source dataset that we see the smallest gap between OpenAI GPT and BERT. In fact, OpenAI GPT outperforms BERT on WikiHow, but the reverse is true for ActivityNet. One possibility is that the left-to-right structure of GPT is the right inductive bias for WikiHow; perhaps reasoning bidirectionally over long contexts is too much for a 12-layer transformer to learn.

5.1 SWAG to HellaSwag Transfer

Given the shared goals and partially overlapping domains of SWAG and HellaSwag, it is natural to ask to what extent models can transfer between the two datasets. In Figure 9 we show the results from transfer experiments: models are trained on one dataset and evaluated on the other. 11 The best models are trained on the same dataset that they are evaluated on: training on SWAG and evaluating on HellaSwag lowers performance by 12%; vice versa lowers performance by 15%. The missing domain for HellaSwag models is movie descriptions (LSMDC); still, HellaSwag models obtain 69% accuracy. On the other hand, SWAG models do not generalize at all to their missing domain, WikiHow (28%), suggesting that learning general commonsense reasoning was hardly necessary to solve SWAG.

Figure 9: Transfer experiments from SWAG to HellaSwag and vice versa, evaluated on the validation sets. Overall, a BERT-Large that is trained on SWAG hardly generalizes to HellaSwag: it scores 34.6%.

Category: Shaving (ActivityNet; In-domain)
A bearded man is seen speaking to the camera and making several faces. the man
a) then switches off and shows himself via the washer and dryer rolling down a towel and scrubbing the floor. (0.0%)
b) then rubs and wipes down an individual's face and leads into another man playing another person's flute. (0.0%)
c) is then seen eating food on a ladder while still speaking. (0.0%)
d) then holds up a razor and begins shaving his face. (100.0%)

Category: Sharpening knives (ActivityNet; Zero-Shot)
Two men are in a room and the man with a blue shirt takes out a bench stone and with a little lubricant on the stone takes an knife and explains how to sharpen it. then he
a) uses a sharpener to smooth out the stone using the knife. (100.0%)
b) shows how to cut the bottom with the knife and place a tube on the inner and corner. (0.0%)
c) bends down and grabs the knife and remove the appliance. (0.0%)
d) stops sharpening the knife and takes out some pieces of paper to show how sharp the knife is as he cuts slivers of paper with the knife. (0.0%)

Category: Youth (WikiHow; In-Domain)
How to make up a good excuse for your homework not being finished
Blame technology. One of the easiest and most believable excuses is simply blaming technology. You can say your computer crashed, your printer broke, your internet was down, or any number of problems.
a) Your excuses will hardly seem believable.

5.2 Qualitative Examples

We show several qualitative examples in Figure 10, along with BERT-Large's predictions. BERT does well on some ActivityNet contexts, such as in the first row, where it correctly predicts the ending for a shaving caption. Whereas shaving is in-domain, the second example about sharpening knives is zero-shot. In this context, BERT's answer suggests that one would use a knife to sharpen a stone, rather than vice versa. The last example comes from WikiHow, which appears to be incredibly challenging for BERT. BERT picks answer d, which has more words that match the context of technology (planes, traffic, laptop), but is incoherent. 12

Figure 10: Example questions answered by BERT-Large.
Figure 11: Performance on the WikiHow subset of alternative variations of HellaSwag, where different Adversarial Filters are used (but without human validation). We consider the shallow stylistic adversaries used by Zellers et al. (2018) (Stylistic Ensemble), as well as an LSTM with ELMo embeddings, GPT, BERT-Base, and BERT-Large. For each adversarial filtering model, we record the accuracy of that model before and after AF is used. We also evaluate each alternative dataset using BERT-Large. The results suggest that using a stronger model at test time (over the model used for AF) improves performance, but is not enough to solve the task.

6 Discussion

Our results suggest that HellaSwag is a challenging testbed for state-of-the-art NLI models, even those built on extensive pretraining. The question remains, though: where will the field go next?

6.1 How Easy Might HellaSwag Be For Future Discriminators?

In this paper, we showed the existence of a Goldilocks zone of text complexity, in which generations are nonsensical, but existing state-of-the-art NLP models cannot tell the difference. How hard will the dataset be for future, even more powerful, models? Answering this question is challenging because these models don't exist (or are unavailable) at the time of writing. However, one remedy is to perform an ablation study on the Adversarial Filtering model used, comparing weaker filters with stronger discriminators. We present our results in Figure 11, and find that while weak discriminators (like the stylistic ensemble used to make SWAG) only marginally reduce the accuracy of BERT-Large, increasing the gap between the filter and the final discriminator is not enough to solve the task. For instance, using a discriminator with 3x the parameters of the adversarial filter (BERT-Large vs. BERT-Base) results in 63% machine accuracy.

Figure 12: Estimated pretraining hours required to reach a desired accuracy on HellaSwag. We estimate performance with respect to an RTX 2080 Ti, a modern, fast GPU, and fit a log-linear regression line. An extrapolation suggests that reaching human-level performance on HellaSwag, without algorithmic or computational improvements, would require 10^9 GPU-hours of pretraining (over 100k GPU-years).

6.2 How Well Does Pretraining Scale?

Overall, the current paradigm of pretraining large models on lots of data has made immense progress on NLP benchmarks. Though we expect this trend to continue, it also behooves us to consider its limits. If more compute is indeed the answer for human-level commonsense inference, what would the compute requirements of this hypothetical massive model look like?

We investigate this in Figure 12 by comparing the accuracies of known models on HellaSwag with their computational needs. This estimation is rough: we convert reported TPU runtimes to our benchmark RTX 2080 Ti GPU using the Roofline model (Williams et al., 2009), which focuses primarily on the bottleneck of loading tensors into GPU memory. Extrapolating from an exponential fit suggests that reaching human-level performance on our dataset would require 10^9 GPU-hours, or 100k GPU-years, unless algorithmic improvements are made.
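The extrapolation itself amounts to fitting accuracy against log compute and inverting the fit. The sketch below shows the arithmetic; the (accuracy, GPU-hour) points are hypothetical placeholders, not the paper's measured values.

```python
import numpy as np

# Hypothetical (accuracy, estimated RTX 2080 Ti pretraining hours) points;
# the paper derives its actual points by converting reported TPU runtimes
# with the Roofline model.
acc   = np.array([0.35, 0.41, 0.47])
hours = np.array([250.0, 750.0, 2500.0])

# Fit accuracy as a linear function of log10(hours), then invert the fit to
# ask how many hours human-level accuracy (~0.95) would require.
slope, intercept = np.polyfit(np.log10(hours), acc, deg=1)
log_hours_needed = (0.95 - intercept) / slope
print(f"extrapolated pretraining: ~10^{log_hours_needed:.1f} GPU-hours")
```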

What might these algorithmic improvements look like? These could include architectural advances, better pretraining objectives, and beyond. However, these improvements share the bottleneck of the data source. Answering some HellaSwag questions correctly without reasoning deeply, like knowing that it is a bad idea to stop at a red light for 'at most two seconds', might require an exponential number of samples, due to problems of reporting bias (Gordon and Van Durme, 2013). Alternatively, future models might answer correctly only by picking up on spurious patterns, in which case a new version of the benchmark (using these models as adversaries) would place us in the same position as we are right now.

Put another way, answering HellaSwag questions requires humans to abstract away from language and model world states instead. We postulate that this is what separates solving the task of commonsense NLI from solving a particular dataset. Indeed, we find that existing deep methods often get fooled by lexical false friends. For example, in the WikiHow example from Figure 10, BERT chooses an ending that matches the technology words in the context, rather than matching the deeper topic: using technology as an excuse for not doing homework.

6.3 Towards A Future Of Evolving Benchmarks

What happens when HellaSwag gets solved?

We believe the answer is simple: crowdsource another dataset, with the same exact format, and see where models fail. Indeed, in our work we found this to be straightforward from an algorithmic perspective: by throwing in the best known generator (GPT) and the best known discriminator (BERT-Large), we made a dataset that is adversarial not just to BERT, but to all models we have access to. While this was easy algorithmically, care must be taken from a data curation standpoint. Indeed, we find success exists within a Goldilocks zone: the data source must be complex enough that state-of-the-art generators often make mistakes, while simple enough that discriminators often fail to catch them. This ties the future of SWAG-style benchmarks to progress on language generation: until generation is solved, commonsense NLI will remain unsolved. Even recent promising results on scaling up language models (Radford et al., 2019) find problems in terms of consistency, with the best curated examples requiring 25 random seeds.

7 Conclusion

In this paper, we presented HellaSwag, a new dataset for physically situated commonsense reasoning. By constructing the dataset through adversarial filtering, combined with state-of-the-art models for language generation and discrimination, we produced a dataset that is adversarial to the most robust models available, even when models are evaluated on items from the training distribution. In turn, we provided insight into the inner workings of pretrained models, and suggested a path for NLP progress going forward: towards benchmarks that adversarially co-evolve with evolving state-of-the-art models.

1 A New York Times article at https://nyti.ms/2DycutY.

2 Short for Harder Endings, Longer contexts, and Low-shot Activities for Situations With Adversarial Generations. Dataset and code at https://rowanzellers.com/hellaswag.

3 These biases simply inflate model performance, but past work has also shown that there are unwanted social biases induced when humans write the endings, in terms of gender and race (Rudinger et al., 2015).

4 See the appendix for a discussion of the BERT architecture and hyperparameter settings we used in our experiments.

5 These biases are similar to those in NLI datasets, as found by Gururangan et al. (2018); Poliak et al. (2018).

6 The discriminator was an ensemble that featured a bag-of-words model, a shallow CNN, and a multilayer perceptron operating on language model perplexities.

7 On each iteration, BERT-Large is re-initialized from its pretrained checkpoint, finetuned, and then evaluated in a four-way setting on the dummy test set of held-out data. See Supp A for details of our BERT-Large AF setup.

8 See the appendix for details about how we estimate this.

9 For ELMo and BERT-Base, the model learns scalar weights to combine each internal layer of the encoder.

10 This model is trained with binary cross-entropy loss.

11 Note that the ActivityNet splits are different for each dataset. To avoid skewing the results, we report only on the validation video captions that are not in the training sets of either dataset. The overall accuracy is then a weighted average, where ActivityNet examples are weighted proportionately more. This gives a slight advantage to training on SWAG, as it sees all the ActivityNet categories when training.

12 Among other issues, why would someone suddenly be aware that they are 'flying at high speed on a plane...?'