
WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale


Abstract

The Winograd Schema Challenge (WSC) (Levesque, Davis, and Morgenstern 2011), a benchmark for commonsense reasoning, is a set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations. However, recent advances in neural language models have already reached around 90% accuracy on variants of WSC. This raises an important question whether these models have truly acquired robust commonsense capabilities or whether they rely on spurious biases in the datasets that lead to an overestimation of the true capabilities of machine commonsense. To investigate this question, we introduce WinoGrande, a large-scale dataset of 44k problems, inspired by the original WSC design, but adjusted to improve both the scale and the hardness of the dataset. The key steps of the dataset construction consist of (1) a carefully designed crowdsourcing procedure, followed by (2) systematic bias reduction using a novel AfLite algorithm that generalizes human-detectable word associations to machine-detectable embedding associations. The best state-of-the-art methods on WinoGrande achieve 59.4-79.1%, which are 15-35% below human performance of 94.0%, depending on the amount of the training data allowed. Furthermore, we establish new state-of-the-art results on five related benchmarks - WSC (90.1%), DPR (93.1%), COPA (90.6%), KnowRef (85.6%), and Winogender (97.1%). These results have dual implications: on one hand, they demonstrate the effectiveness of WinoGrande when used as a resource for transfer learning. On the other hand, they raise a concern that we are likely to be overestimating the true capabilities of machine commonsense across all these benchmarks. We emphasize the importance of algorithmic bias reduction in existing and future benchmarks to mitigate such overestimation.

1 Introduction

Commonsense reasoning capability is one of the key differences between human intelligence and modern AI. Most successful modern AI systems rely primarily on statistical patterns without acquiring rich background knowledge about the physical and social world we live in. Thus far, such systems are not robust when given examples that fall outside the data distribution that they were trained on (Gordon and Van Durme, 2013; Davis and Marcus, 2015; Schubert, 2015).

The Winograd Schema Challenge (WSC), proposed by Levesque et al. (2011) as an alternative to the Turing Test (Turing, 1950), was designed to challenge the dominant paradigm of AI systems that rely on statistical patterns without deep understanding of how the world works. Concretely, Levesque et al. (2011) introduced simple pronoun resolution problems that are trivial for humans but hard for machines, crafting the problems so that they cannot easily be solved based on frequent patterns in language. The WSC problems are defined as a pair (called a twin) of questions with two answer choices. Here is an example:

1a. Pete envies Martin because he is successful.
1b. Pete envies Martin although he is successful.
Question: Is he Pete or Martin?
Answers: 1a - Martin, 1b - Pete

These twin questions consist of a pair of nearly identical sentences that include trigger word(s) that flip the correct answer.

Although WSC questions are carefully crafted by experts, recent studies have shown that they are still prone to incidental biases that statistical methods can exploit. These biases are roughly of two types: (a) language-based and (b) dataset-specific biases. Language-based bias, or word association bias (Trichelair et al., 2018), refers to the case where the correct answer aligns with more frequent patterns in natural language and thus can be easily solved by neural language models trained over large corpora (Table 1, examples (3) and (4)).

Table 1: WSC examples: (1)-(3) are from WSC (Levesque et al., 2011) and (4) is from DPR (Rahman and Ng, 2012). Examples marked with ✗ have language-based bias that today's language models can easily detect.

Dataset-specific bias, on the other hand, is the case of annotation artifacts or spurious correlations that several recent studies have reported on crowdsourced datasets (Gururangan et al., 2018; Poliak et al., 2018; Tsuchiya, 2018). Importantly, even when an individual instance of a WSC problem is free of the language-based bias that the original designers of the WSC intended to avoid, a collection of WSC instances can still contain spurious patterns that can be exploited by statistical models. This type of bias gets introduced as problem authors subconsciously repeat similar problem-crafting strategies, which reveal how particular trigger words, sentence structure, or positive/negative sentiment correlate with the correct answers.

We introduce WINOGRANDE, a new collection of WSC problems that are constructed to be robust against both types of biases discussed above. Compared to the original WSC and its variants (§2), WINOGRANDE presents problems that are more challenging by reducing such biases, while also scaling to a significantly larger number of problems (273 to 44k) via crowdsourcing.

Crowdsourcing a large-scale dataset of WSC examples has been considered infeasible, primarily due to the prerequisite knowledge of linguistics and the structural constraints of twin sentences (Trichelair et al., 2018; Talmor et al., 2018). One novelty of our work is that we demonstrate a method to collect WSC problems at scale through crowdsourcing. We show that the crowdsourced examples maintain the characteristics of WSC: they are easy for humans to answer (above 90% accuracy) but very challenging for state-of-the-art deep neural models. Specifically, we introduce a strategic crowdsourcing design to diversify the context of the problems (§3), followed by a variant of the adversarial filtering algorithm (Zellers et al., 2018), AFLITE, for systematically reducing spurious patterns that state-of-the-art statistical approaches can exploit (§4).

While we show that WINOGRANDE is considerably challenging for existing state-of-the-art methods based on pre-trained language models such as BERT (Devlin et al., 2018) (§5), we also show that WINOGRANDE provides powerful transfer learning ability to other existing commonsense benchmarks (§6), reporting new state-of-the-art results across several benchmarks, including the original WSC (Levesque et al., 2011) (72.2% → 77.6%), PDP (70.0% → 75.0%), DPR (Rahman and Ng, 2012) (76.4% → 86.9%), and COPA (Roemmele et al., 2011) (71.2% → 81.0%). On the Winogender dataset, which quantifies the gender bias in a trained model, we show that a model trained on WINOGRANDE has significantly lower bias compared to other rule-based and neural models.

Although a substantial increase of the state-of-the-art over multiple challenging benchmarks is exciting, we cautiously argue that these positive results must be taken with a grain of salt. The results might also indicate the extent to which spurious effects are prevalent in existing datasets, which runs the risk of overestimating the true capabilities of machine intelligence on commonsense reasoning. We leave it as a future research question to determine how much of our improvements indicate a true stride in machine commonsense as opposed to a more effective exploitation of biases in datasets.

2 Existing WSC-Style Datasets

We briefly describe existing WSC-style datasets. Table 2 summarizes them and provides additional statistics about the size, the average token length per sentence, and the size of their vocabulary.

WSC (Levesque et al., 2011) This is the original Winograd Schema Challenge dataset, which consists of 273 problems. The problems were manually crafted by the authors, avoiding word association bias as much as possible (e.g., using the number of search results on Google), although Trichelair et al. (2018) later reported that a subset of the problems still exhibits such bias.

PDP The Pronoun Disambiguation Problems (PDP) dataset is formulated as a multiple choice task, in which a pronoun must be resolved to one of up to 5 (but mostly binary) possible antecedents.

In addition to the datasets above, there are some other WSC-style datasets that have slightly different formats but share a similar spirit with WSC.

COPA (Roemmele et al., 2011) This dataset introduces 1,000 problems that share the same motivation as WSC in terms of evaluating machine commonsense reasoning, but focus instead on script knowledge. Each problem in this dataset is formulated as a binary choice about the cause or effect of a given premise, and is not structurally constrained to twins as in WSC.

Table 2: Existing WSC-style datasets (§2) and relevant statistics (the number of problems, the average sentence length in tokens, and the size of the vocabulary). We propose WINOGRANDE, described in §3 and §4.

Premise: The man broke his toe.
Question: What was the CAUSE of this?
Hypothesis 1: He got a hole in his sock.
Hypothesis 2: He dropped a hammer on his foot.

Winogender (Rudinger et al., 2018) This dataset introduces 720 problems focusing on pronouns whose antecedents are either a person referred to by their occupation (e.g., "the doctor") or a secondary participant (e.g., "a patient"). The goal of this dataset is to uncover gender bias in coreference resolution systems.

WinoBias (Zhao et al., 2018) This is work concurrent with Winogender, aimed at diagnosing gender bias in coreference resolution systems. Although its size is larger than Winogender, WinoBias is evaluated by F-scores (i.e., detecting the span) as a coreference resolution task (Pradhan et al., 2014) instead of by the binary-choice accuracy used in WSC.

SuperGLUE (Wang et al., 2019) SuperGLUE contains multiple datasets for universal benchmarking across different tasks; one of them is a modified version of WSC. We refer to it as SuperGLUE-WSC to differentiate it from the original WSC. SuperGLUE-WSC aggregates the original WSC, PDP, and additional PDP-style examples, and recasts them into True/False binary problems, where a sentence with the target pronoun and an answer candidate is given (e.g., "Pete envies Martin because he is very successful." Q: Does he refer to Martin? A: True). Therefore, the number of problems is roughly double that of WSC and PDP combined, although the size is still relatively small.

3 Crowdsourcing Twins At Scale

The original WSC problems in Levesque et al. (2011) were carefully crafted by experts in the field of knowledge representation and reasoning, who ensured that the problems were trivial for humans yet hard for AI systems. WSC problems have been considered challenging to crowdsource due to the structural constraints of twins and the required linguistic knowledge - but, contrary to this belief, we present an effective approach to creating a large-scale dataset (WINOGRANDE) of WSC problems while maintaining their original properties. Our approach consists of a carefully designed crowdsourcing task followed by a novel adversarial filtering algorithm (§4) that systematically removes biases in the data.

Enhancing Crowd Creativity Creating twin sentences from scratch puts a high cognitive load on crowd workers, who subconsciously resort to writing pairs that are lexically and stylistically repetitive. To encourage creativity and reduce cognitive load, we employed creativity from constraints (Stokes, 2005) - a psychological notion which suggests that appropriate constraints can help structure and drive creativity. In practice, crowd workers are primed by a randomly chosen topic as a suggestive context (details below), while they are asked to follow precise guidelines on the structure of the curated data.

Crowdsourcing Task We collect WINOGRANDE problems via crowdsourcing on Amazon Mechanical Turk (AMT). 2 To prime crowd workers, they were instructed to randomly pick an anchor word(s) from a randomly assigned WikiHow article 3 and to ensure that the twin sentences contain the anchor word, which remarkably improves the diversity of topics in the collected data. Additionally, workers were instructed to keep the twin sentence length between 15 and 30 words while maintaining at least 70% word overlap between a pair of twins. 4 Following the original WSC design, we aimed to collect twins in two different domains - (i) social commonsense: a situation involving two same-gender people with contrasting attributes, emotions, social roles, etc., and (ii) physical commonsense: a context involving two physical objects with contrasting properties, usage, locations, etc. In both cases, workers were instructed to avoid language-based bias (word association) as much as possible. In total, we collected 56k questions (i.e., 28k twins).
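As an illustration only (not the actual AMT pipeline), the structural constraints above could be checked automatically roughly as follows; the whitespace tokenization and overlap definition are assumptions.

```python
# Sketch of an automatic check for the crowdsourcing constraints described above
# (sentence length 15-30 words, >= 70% word overlap between twins, anchor word
# present in both sentences). Illustrative only; details are assumptions.

def satisfies_twin_constraints(sent1: str, sent2: str, anchor: str) -> bool:
    """Return True if a candidate twin pair meets the structural constraints."""
    tokens1, tokens2 = sent1.lower().split(), sent2.lower().split()

    # Both sentences must be between 15 and 30 words long.
    if not (15 <= len(tokens1) <= 30 and 15 <= len(tokens2) <= 30):
        return False

    # The anchor word must appear in both sentences.
    if anchor.lower() not in tokens1 or anchor.lower() not in tokens2:
        return False

    # The twins must share at least 70% of their words.
    overlap = len(set(tokens1) & set(tokens2))
    return overlap / min(len(tokens1), len(tokens2)) >= 0.7
```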

Data Validation

We validated the collected questions, because crowdsourced data often contain noise. Each question is validated by a distinct set of three crowd workers. A question is deemed valid if (1) all three workers choose the correct answer option, (2) all three workers agree that the two answer options are not equally plausible, and (3) the question cannot be answered just by word association when only the local context around the target pronoun is given (e.g., "because it was going so fast." (race car / school bus)). 5 As a result, 90% of the questions (50k) were deemed valid, and we discarded the invalid questions (6k).
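As a small sketch of the first two validation criteria (the field names and data layout below are assumptions, not the actual annotation schema; criterion (3), the word-association check, is run as a separate task over local contexts):

```python
# Sketch of validation criteria (1) and (2); `votes` holds the three workers'
# judgments for one question. Field names are hypothetical.
def is_valid(votes: list[dict], gold_answer: str) -> bool:
    assert len(votes) == 3
    all_correct = all(v["answer"] == gold_answer for v in votes)
    none_ambiguous = all(not v["equally_plausible"] for v in votes)
    return all_correct and none_ambiguous
```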

While our crowdsourcing procedure addresses instance-level biases, it is still possible that the constructed dataset has dataset-specific biases - especially after it has been scaled up. To address this challenge, we propose a method for systematic bias reduction in datasets.

2 Our crowdsourcing interface is available at https://mosaic.allenai.org/projects/winogrande.

3 https://www.wikihow.com/Special:Randomizer
4 All workers met minimum qualifications on AMT: 99% approval rate, 5k approvals, and location in the US, Canada, UK, Australia, or New Zealand. The reward was set to $0.4 per pair of twin sentences.

5 For each sentence validation, workers were paid $0.03.

4 Systematic Data Bias Reduction

Bias from annotation artifacts Several recent studies (Gururangan et al., 2018; Poliak et al., 2018; Tsuchiya, 2018) have reported the presence of annotation artifacts in large-scale (often crowdsourced) datasets. Annotation artifacts are unintentional patterns in the data that leak information about the target label in an undesired way. Machine learning models can exploit such artifacts to solve instances in a dataset by taking a virtual shortcut. While dataset creators can tackle biases that they can identify - e.g., point-wise mutual information (PMI) or conditional probability between a word and an inference class in the Stanford Natural Language Inference (SNLI) corpus (Gururangan et al., 2018; Poliak et al., 2018) - and account for them, these approaches assume that the bias exists at the lexical level. However, this does not rule out other biases that derive from structural patterns. Modern machine learning models are endowed with high capacity and also tend to be opaque (often called black boxes), which makes identifying the source of bias even more challenging. To tackle biases that are hard to observe manually, we propose AFLITE - a lightweight algorithmic solution for data bias reduction.

Light-weight adversarial filtering Our approach builds upon the adversarial filtering (AF) algorithm proposed by Zellers et al. (2018), but makes two key improvements: (1) AFLITE is much more broadly applicable (it does not require over-generation of data instances) and (2) it is considerably more lightweight (it does not require re-training a model at each iteration of AF). Over-generating machine text from a language model for use in test instances runs the risk of distributional bias, where a discriminator can learn to distinguish machine-generated instances from human-generated ones. In addition, AF depends on training a model at each iteration, which incurs extremely high computational cost when the adversary is a model like BERT. Instead of manually identified lexical features, we adopt a dense representation of instances using their pre-computed neural network embeddings. In this work, we use BERT (Devlin et al., 2018) fine-tuned on a small subset of the dataset. Concretely, we use 6k instances (5k for training and 1k for validation) from the dataset (containing 50k instances in total) to fine-tune BERT (referred to as BERT_embed). We use BERT_embed to pre-compute the embeddings for the rest of the instances (44k) as the input for AFLITE. We discard the 6k instances from the final dataset.
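As a rough illustration of the embedding pre-computation step, the sketch below extracts one fixed-size vector per instance with a BERT model, assuming the HuggingFace `transformers` library and [CLS]-token pooling; the checkpoint name and pooling choice are assumptions rather than the exact BERT_embed setup.

```python
# Pre-compute fixed instance embeddings with a (fine-tuned) BERT model.
import numpy as np
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModel.from_pretrained("bert-large-uncased")  # or a checkpoint fine-tuned on the 6k split
model.eval()

@torch.no_grad()
def embed(sentences: list[str]) -> np.ndarray:
    """Return one fixed-size vector per sentence ([CLS] pooling)."""
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    hidden = model(**batch).last_hidden_state          # (batch, seq_len, hidden_dim)
    return hidden[:, 0, :].cpu().numpy()               # embedding of the [CLS] token

X = embed(["The trophy doesn't fit into the brown suitcase because it is too large."])
```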

Next, we use an ensemble of linear classifiers (logistic regressions) trained on random subsets of the data to determine whether the representation used in BERT_embed is strongly indicative of the correct answer option. If so, we discard the corresponding instances and proceed iteratively.

Algorithm 1 provides the implementation of AFLITE. The algorithm takes as input the pre-computed embeddings X and labels y, along with the ensemble size n, the training size m for the classifiers in the ensemble, the filtering cutoff size k, and the filtering threshold τ. At each filtering phase, we train n linear classifiers on different random partitions of the data and collect their predictions on their corresponding validation sets. For each instance, we compute its score as the ratio of correct predictions over the total number of predictions. We rank the instances by their score and remove the top-k instances whose score is above the threshold τ. We repeat this process until we remove fewer than k instances in a filtering phase or fewer than m instances remain. When applying AFLITE to WINOGRANDE, we set m = 15,000, n = 64, k = 500, and τ = 0.75.
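The following is a compact sketch of AFLITE as described above, using scikit-learn logistic regression for the ensemble; the partitioning and tie-breaking details of the authors' actual implementation may differ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def aflite(X, y, n=64, m=15_000, k=500, tau=0.75, seed=0):
    """Iteratively remove instances that an ensemble of linear probes finds too easy."""
    rng = np.random.default_rng(seed)
    keep = np.arange(len(y))                  # indices of instances still in the dataset
    while len(keep) > m:
        correct = np.zeros(len(keep))
        total = np.zeros(len(keep))
        for _ in range(n):
            # Train on a random subset of size m, predict on the held-out rest.
            perm = rng.permutation(len(keep))
            train, val = perm[:m], perm[m:]
            clf = LogisticRegression(max_iter=1000)
            clf.fit(X[keep[train]], y[keep[train]])
            preds = clf.predict(X[keep[val]])
            correct[val] += (preds == y[keep[val]])
            total[val] += 1
        # Predictability score: fraction of ensemble members that got the instance right.
        score = np.divide(correct, total, out=np.zeros_like(correct), where=total > 0)
        # Remove up to k of the highest-scoring instances whose score exceeds tau.
        ranked = np.argsort(-score)
        to_remove = [i for i in ranked[:k] if score[i] > tau]
        keep = np.delete(keep, to_remove)
        if len(to_remove) < k:                # stop once fewer than k instances are removed
            break
    return keep                               # indices of the debiased subset
```

With the settings above (m = 15,000, n = 64, k = 500, τ = 0.75), each phase fits 64 logistic-regression probes on the pre-computed embeddings, which is far cheaper than re-training a model like BERT at every iteration.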

This approach is also reminiscent of recent work in NLP on adversarial learning (Chen and Cardie, 2018; Belinkov and Bisk, 2018; Elazar and Goldberg, 2018). Belinkov et al. (2019) propose an adversarial removal technique for NLI which encourages models to learn representations that are free of hypothesis-only biases. When proposing a new benchmark, however, we cannot enforce that any future model will purposefully avoid learning spurious correlations in the data. In addition, while the hypothesis-only bias is an insightful bias in NLI, we make no assumption about the possible sources of bias in WINOGRANDE. Instead, we adopt a more proactive form of bias reduction by relying on state-of-the-art (statistical) methods to uncover undesirable dataset shortcuts.

Assessment of AFLITE We assess the impact of AFLITE relative to two baselines: random data reduction and PMI-based filtering. In random data reduction, we randomly subsample the dataset to evaluate how a decrease in dataset size affects the bias. In PMI-based filtering, we first compute the difference f_t of PMIs for each twin pair t = (t1, t2) as follows:

f_t(t1, t2) = Σ_{w∈t1} PMI(y; w) − Σ_{w∈t2} PMI(y; w).

Then, we select twins in increasing order of f_t, assuming that higher values of f_t lead to less challenging twin instances. 6 Figure 1 plots BERT pre-computed embeddings whose dimensionality is reduced to 2D (top) and 1D (bottom) using Principal Component Analysis (PCA). We observe that WINOGRANDE_all and the two baselines exhibit distinct components between the two correct answer options (i.e., y ∈ {1, 2}), whereas this distinction disappears in WINOGRANDE_debiased, which implies that AFLITE successfully reduces the spurious correlation (between instances and labels) in the dataset. To quantify the effect, we compute the KL divergence between the samples with different answer options. We find that random data reduction does not reduce the KL divergence (0.66 → 0.65). It is interesting to see that PMI filtering marginally reduces the KL divergence (0.66 → 0.46), although principal component analysis on the PMI-filtered subset still shows a significant separation between the labels. On the other hand, in WINOGRANDE_debiased, AFLITE reduces the KL divergence dramatically (0.66 → 0.02), which suggests that this debiased dataset should be challenging for statistical models that solely rely on spurious correlations.

Figure 1: The effect of debiasing by AFLITE. BERT pre-computed embeddings (with PCA applied for dimensionality reduction) are shown as 2D histograms (top row) and 1D histograms (bottom row) for WINOGRANDE_all, the random samples, the PMI-filtered subset, and the AFLITE-filtered subset. Data points are colored depending on the label (i.e., the answer y is option 1 (blue) or 2 (red)). In the 1D representation, we show the KL divergence between p(d1, y=1) and q(d1, y=2).
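For concreteness, a rough sketch of the PMI-based filtering baseline defined by the formula above is given below; the tokenization, smoothing, and the way each twin member is paired with its label are assumptions, not the exact procedure used in the paper.

```python
import math
from collections import Counter

def pmi_table(sentences, labels, eps=1e-12):
    """PMI(y; w) estimated from word/label co-occurrence counts over the dataset."""
    word_counts, joint_counts, label_counts = Counter(), Counter(), Counter(labels)
    for sent, y in zip(sentences, labels):
        for w in set(sent.lower().split()):
            word_counts[w] += 1
            joint_counts[(y, w)] += 1
    n = len(sentences)
    return {
        (y, w): math.log((c / n + eps) / ((label_counts[y] / n) * (word_counts[w] / n) + eps))
        for (y, w), c in joint_counts.items()
    }

def twin_score(twin, pmi):
    """f_t: summed PMI of the first member's words minus that of the second member's."""
    (sent1, y1), (sent2, y2) = twin
    s1 = sum(pmi.get((y1, w), 0.0) for w in sent1.lower().split())
    s2 = sum(pmi.get((y2, w), 0.0) for w in sent2.lower().split())
    return s1 - s2

# The baseline keeps the twins with the lowest scores, i.e. sorts by twin_score ascending.
```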

Twin sentences | Options (answer)
The rock kept its balance on the mountain but the log tumbled down, because it was better situated for stability. | rock / log
The rock kept its balance on the mountain but the log tumbled down, because it was poorly situated for stability. | rock / log
Nick did not enjoy watching golf as much as Randy because he never played the game. | Nick / Randy
Nick did not enjoy watching golf as much as Randy because he often played the game. | Nick / Randy
The pizza was warmer than the hot dog because it was in the oven for a longer amount of time. | pizza / hot dog
The pizza was warmer than the hot dog because it was in the oven for a shorter amount of time. | pizza / hot dog
Sarah accused Katrina of cheating by looking at her cards, because she kept losing the game. | Sarah / Katrina
Sarah accused Katrina of cheating by looking at her cards, because she kept winning the game. | Sarah / Katrina

What bias has actually been detected by AFLITE? Is the bias really spurious and undesirable with respect to the original WSC's goal? Table 3 presents examples of structural biases (i.e., spurious relations) that AFLITE has detected as dataset-specific bias. We see a structural pattern in the first two twins, where the local context (or sentiment) between the answer option and the target pronoun is highly correlated. In other words, these problems can be easily answered by simply looking at the surrounding context and the polarity of the sentiment (positive or negative). Importantly, this dataset-specific bias is structural rather than at the token level, contrasting with the biases that have been identified in the NLI literature (Gururangan et al., 2018; Poliak et al., 2018), and it is hard to detect these biases using heuristics such as PMI filtering. Instead of depending on such heuristics, AFLITE is able to detect samples that potentially contain such biases algorithmically. After applying the AFLITE algorithm, we obtain a debiased dataset of 25,680 instances, split into training (18,538), development (2,863), and test (4,279) sets.

Table 3: Examples of dataset-specific bias detected by AFLITE.

5.1 Benchmark Models

We evaluate WINOGRANDE_debiased on methods/models that have been effective on the original WSC.

Wino Knowledge Hunting Wino Knowledge Hunting (WKH) is based on an information-retrieval approach, in which the sentence is parsed into a set of queries and the model then looks for evidence for each answer candidate in the search-result snippets. This IR-oriented approach comes from an important line of work in coreference resolution (Kobdani et al., 2011; Ratinov and Roth, 2012; Bansal and Klein, 2012; Zheng et al., 2013; Peng et al., 2015; Sharma et al., 2015).

Ensemble Neural LMs Trinh and Le (2018) is one of the first attempts to apply a neural language model pre-trained on very large corpora (including LM-1-Billion, CommonCrawl, SQuAD, and Gutenberg Books). In this approach, the task is treated as a fill-in-the-blank question with a binary choice. The target pronoun in the sentence is replaced by each answer candidate, and the neural language model provides the likelihood of the two resulting sentences. This simple yet effective approach outperforms previous IR-based methods.

OpenAI-GPT OpenAI-GPT (Radford et al., 2018) is one of the earliest methods that uses large-scale pre-trained neural language modeling. While the first version of OpenAI-GPT did not report its performance on WSC, 7 the updated model (Radford et al., 2019) reports 70.7% on the original WSC.

BERT BERT (Devlin et al., 2018) is another pre-trained neural model which has bidirectional paths and consecutive sentence representations in hidden layers. We use three different BERT-related models: 1) BERT masked-LM (BERT-lm), 2) BERT single fine-tuning (BERT-ft), and 3) BERT sequential fine-tuning (BERT-seqft). For BERT-lm, we use the pre-trained BERT-large model as a language model by comparing the likelihood of each candidate answer. For BERT-ft, we split the sentence into context and option using the candidate answer as the delimiter. The input format becomes [CLS] context [SEP] option [SEP]; e.g., [CLS] The trophy doesn't fit into the brown suitcase because the _ [SEP] _ is too large [SEP] (the blank is filled with either option 1 or 2). For BERT-seqft, we first fine-tune BERT-large on an auxiliary dataset (DPR in our case), and then fine-tune the resulting pre-trained model on the target dataset. We used grid search for hyper-parameter tuning: learning rate {1e-5, 3e-5, 10e-5}, number of epochs {3, 4, 5, 10}, batch size {4, 8, 16}.

Word association baseline Using BERT-seqft, we also run the word-association baseline (local-context-only) to check whether the dataset can be solved by language-based bias. In this baseline, the model is trained with only the local context (w_{t-2:EOS}) surrounding the blank to be filled (w_t). This is analogous to the hypothesis-only baseline in NLI (Poliak et al., 2018), where the task (dataset) does not require the full context to achieve high performance.
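To make the BERT-ft input format concrete, here is a minimal sketch of how one question could be encoded for each answer option, assuming the HuggingFace `transformers` tokenizer; the helper function and the blank-placeholder convention are illustrative, not the authors' preprocessing code.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")

def encode_with_option(sentence: str, option: str, placeholder: str = "_"):
    """Encode '[CLS] context [SEP] option + rest-of-sentence [SEP]' for one answer option."""
    context, rest = sentence.split(placeholder, 1)
    return tokenizer(context.strip(), (option + rest).strip(), return_tensors="pt")

sentence = "The trophy doesn't fit into the brown suitcase because the _ is too large."
candidates = [encode_with_option(sentence, opt) for opt in ("trophy", "suitcase")]
# A classification (or scoring) head over the [CLS] representation then picks
# the more plausible of the two encoded sequences.
```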

In order to see whether AFLITE has an effect on language-based bias, we fine-tune BERT with either randomly selected samples (25k) or debiased samples (25k), subsequently referred to as seqft_random and seqft_debiased, respectively.

Human evaluation In addition to the methods described above, we compute human performance as the majority vote of three crowd workers for each question. We find that AFLITE does not adversely affect the quality of the dataset, as humans are still able to achieve over 90% accuracy, significantly higher than the performance of the best model (62%).

Table 4: Performance of several baseline systems on WINOGRANDE_debiased. The best-performing model (BERT-seqft) is over 28 percentage points below human performance.
WSC (Levesque et al., 2011)
Liu et al. (2016) | 52.8
WKH | 57.1
Ensemble LMs (Trinh and Le, 2018) | 63.8
GPT2 (Radford et al., 2019) | 70.7
BERT-ft (Kocijan et al., 2019) | 72.2
This work | 77.6
Humans (Bender, 2015) | 92.1
Humans* | 96.5

PDP
Liu et al. (2016) | 61.7
Trinh and Le (2018) | 70.0
This work | 75.0
Humans | 90.9
Humans* | 92.5

DPR (Rahman and Ng, 2012)
Rahman and Ng (2012) | 73.0
Peng et al. (2015) | 76.4
This work | 86.9
Humans* | 95.2

COPA (Roemmele et al., 2011)
Gordon et al. (2011) | 65.4
Sasaki et al. (2017) | 76.4
This work | 81.0
Humans (Gordon et al., 2012) | 99.0

Table 5: Accuracy (%) on existing WSC-related tasks. We ran human evaluation with our crowd worker pool (indicated by *).

5.2 Results

Table 4 shows the results. Most baselines achieve only chance-level performance, while the best model, BERT-seqft, achieves 61.6% test-set accuracy. Crowd workers achieve 90.8% test-set accuracy, indicating that WINOGRANDE_debiased is still easy for humans to answer, as desired. The large gap between the performance of the best model and that of humans provides significant scope for improvement in future research. Additionally, the word-association (i.e., local-context) baselines (seqft_random and seqft_debiased) achieve close to chance-level performance, illustrating that WINOGRANDE_debiased cannot be answered from the local context alone. It is interesting to see that there is no performance gap between seqft_random and seqft_debiased. This indicates that the word association bias has already been removed during the data validation process (§3).

6 Using WINOGRANDE as a Resource

WINOGRANDE contains a large number of WSC-style questions. In addition to serving as a benchmark dataset, we use WINOGRANDE_all as a resource: we apply transfer learning by first fine-tuning a model on our dataset and then evaluating its performance on related datasets (WSC, PDP, DPR, COPA, and Winogender). We establish state-of-the-art results across several of these existing benchmark datasets.

Experimental Setup Our model is based on BERT fine-tuned with WINOGRANDE_all, and the hyper-parameters are determined as follows. For WSC, we used PDP as the dev set to choose the best hyper-parameter set, and vice versa (i.e., WSC as the dev set for PDP).

Table 6: Accuracy (%) and gender bias on the Winogender dataset. "Gotcha" indicates whether the target gender pronoun (e.g., she) is the minority in the correct answer option (e.g., doctor). |∆F| and |∆M| show the system's performance gap between "Gotcha" and "non-Gotcha" for each gender (lower is better). The first three baselines are adopted from Rudinger et al. (2018); RULE is Lee et al. (2011), STATS is Durrett and Klein (2013), and NEURAL is Clark and Manning (2016). BERT_X corresponds to the BERT-large model fine-tuned on X, where X is either WSC, DPR, WINOGRANDE_debiased, or WINOGRANDE_all.

Since DPR and COPA provide training sets, we used them as dev sets to determine the hyper-parameter set for evaluation on the test sets. For the hyper-parameter search, we use the same grid-search strategy as in §5. The Winogender dataset provides a test set only, and we use the WINOGRANDE_all dev set as a proxy.

Additional Human Evaluation We also report human performance on WSC, PDP, and DPR to check the quality of our crowd-worker pool as well as to support previous findings. To our knowledge, this is the first work to report human performance on the DPR dataset. 8

Results The results are shown in Table 5 and Table 6. Overall, BERT fine-tuned with WINOGRANDE_all helps improve the accuracy on all the related tasks (Table 5). At first glance, these improvements may not seem surprising because WINOGRANDE_all can be regarded as additional training data for each dataset (particularly WSC, PDP, and DPR). However, the improvement on the COPA dataset (76.4% → 81.0%) is not explained by the same logic, because COPA is not a pronoun resolution task like the Winograd Schema Challenge. This indicates that WINOGRANDE_all can serve as a resource for commonsense knowledge transfer.

Important Implications We consider that while these positive results over multiple challenging benchmarks are highly encouraging, they may need to be taken with a grain of salt. In particular, these results might also indicate the extent to which spurious dataset biases are prevalent in existing datasets, which runs the risk of overestimating the true capabilities of machine intelligence on commonsense reasoning.

Our results and analysis indicate the importance of continued research on debiasing benchmarks and the increasing need for algorithmic approaches to systematic bias reduction, which allow benchmarks to evolve together with the evolving state of the art. We leave it as a future research question to further investigate how much of our improvements are due to dataset biases in the existing benchmarks as opposed to true strides in improving commonsense intelligence.

Diagnostics for Gender Bias Winogender is designed as a diagnostic for checking whether a model (and/or its training corpora) suffers from gender bias. The bias is measured by the difference in accuracy between the cases where the pronoun gender matches the occupation's majority gender ("non-gotcha") and where it does not ("gotcha"). Formally, it is computed as follows:

∆F = Acc(Female, Non-gotcha) − Acc(Female, Gotcha)
∆M = Acc(Male, Non-gotcha) − Acc(Male, Gotcha)

for the female and male cases, respectively.

If ∆F or ∆M is large, it indicates that the model is highly gender-biased, whereas |∆F| = |∆M| = 0 (with high accuracy) is the ideal scenario. In addition, if ∆F or ∆M is largely negative, it implies that the model is biased in the opposite direction.
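A small sketch of this gap metric, with hypothetical record fields (gender, gotcha, correct) standing in for the Winogender annotations:

```python
def accuracy(examples):
    return sum(e["correct"] for e in examples) / len(examples)

def gender_gap(examples, gender):
    """Delta = Acc(non-gotcha) - Acc(gotcha) for the given gender."""
    subset = [e for e in examples if e["gender"] == gender]
    non_gotcha = [e for e in subset if not e["gotcha"]]
    gotcha = [e for e in subset if e["gotcha"]]
    return accuracy(non_gotcha) - accuracy(gotcha)

# delta_f = gender_gap(results, "female")
# delta_m = gender_gap(results, "male")
```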

The results of the gender-bias diagnostics are shown in Table 6. We find that the BERT models trained on WINOGRANDE (BERT_WG-debiased and BERT_WG-full) both demonstrate considerably smaller gender bias (|∆F| and |∆M|) compared to BERT trained on other datasets. It is important to note that the difference comes purely from the training data, not from the model architecture or pre-training. Does the data size correlate with the reduction of the gender gap? This may be true but is not always the case. The gender gap in BERT_WG-debiased (25k) is smaller than that in BERT_WG-full (44k), which indicates a possibility that AFLITE can reduce undesirable gender bias in the dataset in addition to reducing structural biases (§4).

7 Conclusions

We introduce WINOGRANDE, a new collection of WSC problems that is significantly larger than existing variants of the WSC dataset. WINOGRANDE consists of 44k instances, half of which are determined to be adversarial. To create a dataset that is robust against spurious statistical biases, we also present AFLITE, a novel light-weight adversarial filtering algorithm. The resulting dataset is significantly more challenging for existing state-of-the-art models while remaining trivially easy for humans.

Using WINOGRANDE as a resource, we demonstrate effective transfer learning and achieve state-of-the-art results on several WSC-style benchmark datasets. While this is an exciting result, we also discuss the risk of overestimating the performance of the existing state-of-the-art methods on the existing commonsense benchmarks: there is a possibility that they contain spurious statistical patterns (annotation artifacts) that leak information about the target label in an undesirable way.

We advocate for a new perspective on designing benchmarks for measuring progress in AI. Unlike past decades, in which the community constructed a static benchmark dataset to work on for the next decade or two, we propose that future benchmarks should dynamically evolve together with the evolving state of the art.

The data and codebase are available at https://mosaic.allenai.org/projects/winogrande.

6 We also evaluated other variations of PMI filtering, such as the absolute difference (|f|), maximum PMI (= max(max_{w∈t1} PMI(y; w), max_{w∈t2} PMI(y; w))), and second-order PMI(y; w1, w2 ∈ t), but we did not observe a significant difference.

7 Instead, the model was evaluated on the WNLI dataset from the GLUE benchmark, although it did not perform as well as the baseline.

8 We didn't run human evaluation on COPA and Winogender because they have slightly different question formats from WSC, PDP, DPR, and WINOGRANDE.