Abstract
Large neural models have demonstrated human-level performance on language and vision benchmarks, while their performance degrades considerably on adversarial or out-of-distribution samples. This raises the question of whether these models have learned to solve a dataset rather than the underlying task by overfitting to spurious dataset biases. We investigate one recently proposed approach, AFLite, which adversarially filters such dataset biases, as a means to mitigate the prevalent overestimation of machine performance. We provide a theoretical understanding for AFLite, by situating it in the generalized framework for optimum bias reduction. We present extensive supporting evidence that AFLite is broadly applicable for reduction of measurable dataset biases, and that models trained on the filtered datasets yield better generalization to out-of-distribution tasks. Finally, filtering results in a large drop in model performance (e.g., from 92% to 62% for SNLI), while human performance still remains high. Our work thus shows that such filtered datasets can pose new research challenges for robust generalization by serving as upgraded benchmarks.
1. Introduction
Large-scale neural networks have achieved superhuman performance across many popular AI benchmarks, for tasks as diverse as image recognition (ImageNet; Russakovsky et al., 2015), natural language inference (SNLI; Bowman et al., 2015), and question answering (SQuAD; Rajpurkar et al., 2016). However, the performance of such neural models degrades considerably when tested on out-of-distribution or adversarial samples, otherwise known as data "in the wild" (Eykholt et al., 2018; Jia & Liang, 2017). This phenomenon indicates that high performance of the strongest AI models is often confined to specific datasets, implicitly making a closed-world assumption. In contrast, true learning of a task necessitates generalization, or an open-world assumption. A major impediment to generalization is the presence of spurious biases (unintended correlations between input and output) in existing datasets (Torralba & Efros, 2011). Such biases or artifacts are often introduced during data collection (Fouhey et al., 2018) or during human annotation (Gururangan et al., 2018; Poliak et al., 2018; Tsuchiya, 2018; Geva et al., 2019). Not only do dataset biases inevitably bias the models trained on them, but they have also been shown to significantly inflate model performance, leading to an overestimation of the true capabilities of current AI systems (Sakaguchi et al., 2020; Hendrycks et al., 2019).

1 Allen Institute for Artificial Intelligence. 2 Paul G. Allen School of Computer Science, University of Washington. Correspondence to: Ronan Le Bras, Swabha Swayamdipta <{ronanlb,swabhas}@allenai.org>.
Many recent studies have investigated task or dataset specific biases, including language bias in Visual Question Answering (Goyal et al., 2017) , texture bias in ImageNet (Geirhos et al., 2018) , and hypothesis-only reliance in Natural Language Inference (Gururangan et al., 2018) . These studies have yielded similarly domain-specific algorithms to address the found biases. However, the vast majority of these studies follow a top-down framework where the bias reduction algorithms are essentially guided by researchers' intuitions and domain insights on particular types of spurious biases. While promising, such approaches are fundamentally limited by what the algorithm designers can manually recognize and enumerate as unwanted biases.
Our work investigates AFLITE, an alternative bottom-up approach to algorithmic bias reduction. AFLITE was recently proposed by Sakaguchi et al. (2020), albeit very succinctly, to systematically discover and filter any dataset artifact in crowdsourced commonsense problems. AFLITE employs a model-based approach with the goal of removing spurious artifacts in data, including those that humans cannot intuitively recognize but that powerful models exploit. Figure 1 illustrates how AFLITE reduces dataset biases in the ImageNet dataset for object classification.

Figure 1. Example images of the Monarch Butterfly and Chickadee from ImageNet. On the right are images in each category which were removed by AFLITE, and on the left, the ones which were retained after filtering. The heatmaps show pairwise cosine similarity between EfficientNet-B7 features (Tan & Le, 2019). The retained images (left) show significantly greater diversity, such as the cocoon of a butterfly or the non-canonical chickadee poses, which is also reflected in the cosine similarity values. This diversity suggests that the AFLITE-filtered examples present a more accurate benchmark for the task of image classification, as opposed to fitting to particular dataset biases.
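The heatmaps in Figure 1 report pairwise cosine similarity between feature vectors. The computation can be sketched in plain Python as follows; the function names and the toy three-dimensional vectors are illustrative assumptions, standing in for the EfficientNet-B7 features used in the paper.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pairwise_cosine(features):
    """Symmetric matrix of cosine similarities between all pairs of vectors."""
    return [[cosine_similarity(u, v) for v in features] for u in features]

# Toy stand-ins for image features (hypothetical values, not from the paper).
toy_features = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 2.0]]
similarity = pairwise_cosine(toy_features)
```

Lower off-diagonal values indicate a more diverse set of images under the chosen representation, which is how the caption reads the heatmaps.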
This paper presents the first theoretical understanding and comprehensive empirical investigations into AFLITE. More concretely, we make the following four novel contributions.
First, we situate AFLITE in a theoretical framework for optimal bias reduction, and demonstrate that AFLITE provides a practical approximation of AFOPT, the ideal but computationally intractable bias reduction method under this framework ( §2).
Second, we present an extensive suite of experiments that were lacking in the work of Sakaguchi et al. (2020) , to validate whether AFLITE truly removes spurious biases in data as originally assumed. Our baselines and thorough analyses use both synthetic (thus easier to control) datasets ( §3) as well as real datasets. The latter span benchmarks across NLP ( §4) and vision ( §5) tasks: the SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018) datasets for natural language inference, QNLI (Wang et al., 2018a) for question answering, and the ImageNet dataset (Russakovsky et al., 2015) for object recognition.
Third, we demonstrate that models trained on AFLITE-filtered data generalize substantially better to out-of-domain samples, compared to models that are trained on the original biased datasets ( §4, §5). These findings indicate that spurious biases in datasets make benchmarks artificially easier, as models learn to overly rely on these biases instead of learning more transferable features, thereby hurting out-of-domain generalization.
Finally, we show that AFLITE-filtering makes widely used AI benchmarks considerably more challenging. We consistently observe a significant drop in the in-domain performance even for state-of-the-art models on all benchmarks, even though human performance still remains high; this suggests currently reported performance on benchmarks might be inflated. For instance, the best model on SNLI-AFLITE achieves only 63% accuracy, a drop of roughly 30 percentage points compared to its accuracy on the original SNLI. These findings are especially surprising since AFLITE maintains an identical train-test distribution, while also retaining a sizable training set.
In summary, AFLITE-filtered datasets can serve as upgraded benchmarks, posing new research challenges for robust generalization.
2. AFLITE
Large datasets run the risk of prioritizing performance on the data-rich head of the distribution, where examples are plentiful, and discounting the tail. AFLITE seeks to minimize the ability of a model to exploit biases in the head of the distribution, while preserving the inherent complexity of the tail. In this section, we provide a formal framework for studying such bias reduction techniques, revealing that AFLITE can be viewed as a practical approximation of a desirable but computationally intractable optimum bias reduction objective.
Formalization. Let Φ be any feature representation defined over a dataset D = (X, Y ). AFLITE seeks a subset S ⊂ D, |S| ≥ n that is maximally resilient to the features uncovered by Φ, that is, for any identically-distributed train-test split of S, features Φ should not help models generalize to the held-out set.
Let M denote a family of classification models (e.g., logistic regression, SVM, or a particular neural architecture) that can be trained on subsets S of D = (X, Y ) using features Φ(X). We define the representation bias of Φ in S w.r.t M, denoted R(Φ, S, M), as the best possible out-of-sample classification accuracy achievable by models in M when predicting labels Y using features Φ(X). Given a target minimum reduced dataset size n, the goal is to find a subset S ⊂ D of size at least n that minimizes this representation bias in S w.r.t. M:
$$S^{*} = \operatorname*{arg\,min}_{S \subset D,\; |S| \ge n} R(\Phi, S, M) \tag{1}$$
Eq. (1) corresponds to optimum bias reduction, referred to as AFOPT. We formulate R(Φ, S, M) as the expected classification accuracy resulting from the following process. Let $q : 2^{S} \to [0, 1]$ be a probability distribution over subsets $T = (X_T, Y_T)$ of S. The process is to randomly choose T with probability q(T), train a classifier $M_T \in M$ on S \ T, and evaluate its classification accuracy $f_{M_T}(\Phi(X_T), Y_T)$ on T. The resulting accuracy on T itself is a random variable, since the training set S \ T is randomly sampled. We define the expected value of this classification accuracy to be the representation bias:
$$R(\Phi, S, M) \triangleq \mathbb{E}_{T \sim q}\left[ f_{M_T}(\Phi(X_T), Y_T) \right] \tag{2}$$
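The expectation in Eq. (2) can be approximated by sampling a limited number of held-out subsets T. The sketch below is one way to do this in plain Python, assuming q is uniform over subsets of a fixed size and using a nearest-centroid classifier as a simple stand-in for the model family M. The function names, the toy data, and the choice of classifier are illustrative assumptions, not the paper's implementation.

```python
import random

def train_centroids(examples):
    """Fit a nearest-centroid classifier: one mean feature vector per label."""
    sums, counts = {}, {}
    for x, y in examples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def estimate_representation_bias(data, held_out_size, num_samples, rng):
    """Monte Carlo estimate: mean held-out accuracy over random subsets T."""
    total = 0.0
    for _ in range(num_samples):
        held_out = set(rng.sample(range(len(data)), held_out_size))
        train = [ex for i, ex in enumerate(data) if i not in held_out]
        model = train_centroids(train)          # train on S \ T
        correct = sum(predict(model, data[i][0]) == data[i][1] for i in held_out)
        total += correct / held_out_size        # accuracy on T
    return total / num_samples

# Toy data (hypothetical): a single feature that almost determines the label.
rng = random.Random(0)
biased = [([rng.gauss(4.0 * y, 0.5)], y) for y in (0, 1) for _ in range(50)]
bias_estimate = estimate_representation_bias(biased, held_out_size=20,
                                             num_samples=30, rng=rng)
```

An estimate close to 1 on such data signals that the representation makes the labels almost trivially predictable, which is the situation bias reduction targets.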
The expectation in Eq. (2), however, involves a summation over exponentially many choices of T even to compute the representation bias for a single S. This makes optimizing Eq. (1), which involves a search over S, highly intractable. To circumvent this challenge, we refactor R(Φ, S, M) as a sum over instances i ∈ S of the aggregate contribution of i to the representation bias across all T . Importantly, this summation has only |S| terms, allowing more efficient computation. We call this the predictability score p(i) for i: on average, how reliably can label y i be predicted using features Φ(x i ) when a model from M is trained on a randomly chosen training set S \ T not containing i. Instances with high predictability scores are undesirable as their feature representation can be exploited to confidently correctly predict such instances.
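With some abuse of notation, for i ∈ S, let $q(i) \triangleq \sum_{T \ni i} q(T)$ denote the marginal probability of choosing a subset T that contains i. The ratio $q(T)/q(i)$ is then the probability of T conditioned on it containing i. Let $f_{M_T}(\Phi(x_i), y_i)$ be the classification accuracy of $M_T$ on i. Then the expectation in Eq. (2) can be written in terms of p(i) as follows:

$$
\begin{aligned}
\sum_{T \subset S} q(T) \cdot \frac{1}{|T|} \sum_{i \in T} f_{M_T}(\Phi(x_i), y_i)
&= \sum_{T \subset S} \sum_{i \in T} q(T) \cdot \frac{f_{M_T}(\Phi(x_i), y_i)}{|T|} \\
&= \sum_{i \in S} \sum_{T \subset S,\, T \ni i} q(T) \cdot \frac{f_{M_T}(\Phi(x_i), y_i)}{|T|} \\
&= \sum_{i \in S} q(i) \sum_{T \subset S,\, T \ni i} \frac{q(T)}{q(i)} \cdot \frac{f_{M_T}(\Phi(x_i), y_i)}{|T|} \\
&= \sum_{i \in S} q(i) \, \mathbb{E}_{T \subset S,\, T \ni i}\left[ \frac{f_{M_T}(\Phi(x_i), y_i)}{|T|} \right] \\
&= \sum_{i \in S} p(i)
\end{aligned}
$$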
where p(i) is the predictability score of i defined as:
$$p(i) \triangleq q(i) \cdot \mathbb{E}_{T \subset S,\, T \ni i}\left[ \frac{f_{M_T}(\Phi(x_i), y_i)}{|T|} \right] \tag{3}$$
While this refactoring works for any probability distribution q with non-zero support on all instances, for simplicity of exposition, we assume q to be the uniform distribution over all T ⊂ S of a fixed size. This makes both |T| and q(i) fixed constants; in particular,

$$q(i) = \binom{|S|-1}{|T|-1} \Big/ \binom{|S|}{|T|} = \frac{|T|}{|S|}.$$

Substituting this into Eq. (3) cancels the factor |T|, which yields a simplified predictability score $\tilde{p}(i)$ and a factored reformulation of the representation bias from Eq. (2):

$$\tilde{p}(i) \triangleq \frac{1}{|S|} \, \mathbb{E}_{T \subset S,\, T \ni i}\left[ f_{M_T}(\Phi(x_i), y_i) \right] \tag{4}$$

$$R(\Phi, S, M) = \sum_{i \in S} \tilde{p}(i) \tag{5}$$
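The marginal q(i) = |T| / |S| under a uniform distribution over fixed-size subsets can be checked numerically by counting subsets. The short sketch below does so with exact arithmetic; the function name and the example sizes are illustrative.

```python
from fractions import Fraction
from math import comb

def marginal_inclusion_probability(set_size, subset_size):
    """q(i) when T is drawn uniformly from all subsets of S of a fixed size."""
    containing_i = comb(set_size - 1, subset_size - 1)  # subsets containing a fixed i
    all_subsets = comb(set_size, subset_size)           # all subsets of that size
    return Fraction(containing_i, all_subsets)

# Example with |S| = 10 and |T| = 4.
q_i = marginal_inclusion_probability(set_size=10, subset_size=4)
```

Because the result depends only on the two sizes, q(i) is the same constant for every instance, which is what allows it to be dropped when ranking instances by predictability.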
While this refactoring reduces the exponential summation underlying the expectation in Eq. (2) to a linear sum, solving Eq. (1) for optimum bias reduction (AFOPT) remains challenging due to the exponentially many choices of S. However, the refactoring does enable computationally efficient heuristic approximations that start with S = D and iteratively filter out from S the most predictable instances i, as identified by the (simplified) predictability scores $\tilde{p}(i)$ computed over the current candidate for S. In all cases, we use a fixed training set size |S \ T| = t < n. Further, since a larger filtered set is generally desirable, we terminate the filtering process early (i.e., while |S| > n) if the predictability score for every i falls below a pre-specified early stopping threshold τ ∈ [0, 1].
We consider three such heuristic approaches. (A) A simple greedy approach starts with the full set S = D, identifies an i ∈ S that maximizes $\tilde{p}(i)$, removes it from S, and repeats up to |D| − n times. (B) A greedy slicing approach identifies the instances with the k highest predictability scores, removes all of them from S, and repeats the process up to $\frac{|D| - n}{k}$ times. (C) A slice sampling approach, instead of greedily choosing the top k instances, randomly samples k instances with probabilities proportional to their predictability scores. The Gumbel method provides an efficient way to perform such sampling (Gumbel & Lieblein, 1954; Maddison et al., 2014; Kim et al., 2016; Balog et al., 2017; Kool et al., 2019), by independently perturbing each $\tilde{p}(i)$ with a Gumbel random variable and identifying k instances with the highest perturbed predictability scores (cf. Appendix A.1).

Algorithm 1 AFLITE
Input: dataset D = (X, Y), pre-computed representation Φ(X), model family M, target dataset size n, number of random partitions m, training set size t < n, slice size k ≤ n, early-stopping threshold τ
Output: reduced dataset S
  S = D
  while |S| > n do
    // Filtering phase
    forall i ∈ S do
      Initialize multiset of out-of-sample predictions E(i) = ∅
    for iteration j : 1..m do
      Randomly partition S into (T_j, S \ T_j) s.t. |S \ T_j| = t
      Train a classifier L ∈ M on {(Φ(x), y) | (x, y) ∈ S \ T_j} (L is typically a linear classifier)
      forall i = (x, y) ∈ T_j do
        Add the prediction L(Φ(x)) to E(i)
    forall i = (x, y) ∈ S do
      Compute the predictability score p̃(i) = |{ŷ ∈ E(i) s.t. ŷ = y}| / |E(i)|
    Select up to k instances S′ in S with the highest predictability scores subject to p̃(i) ≥ τ
    S = S \ S′
    if |S′| < k then break
  return S
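The slice sampling strategy (C) can be sketched with the Gumbel perturbation described above: add independent standard Gumbel noise to the log of each score and keep the k largest perturbed values. The function below is a minimal plain-Python illustration; the function name, the handling of zero scores, and the example scores are assumptions made for this sketch.

```python
import math
import random

def gumbel_top_k(scores, k, rng):
    """Sample k distinct indices without replacement, with probabilities
    proportional to `scores`, by perturbing log-scores with Gumbel noise."""
    perturbed = []
    for index, score in enumerate(scores):
        if score <= 0.0:
            continue                      # zero-probability items are never drawn
        uniform = rng.random()
        while uniform == 0.0:             # random() lies in [0, 1); avoid log(0)
            uniform = rng.random()
        gumbel = -math.log(-math.log(uniform))   # standard Gumbel(0, 1) sample
        perturbed.append((math.log(score) + gumbel, index))
    perturbed.sort(reverse=True)          # largest perturbed log-score first
    return [index for _, index in perturbed[:k]]

# Hypothetical predictability scores for four instances; pick a slice of two.
rng = random.Random(0)
picks = gumbel_top_k([0.9, 0.5, 0.0, 0.1], k=2, rng=rng)
```

Compared with greedy slicing, this keeps some randomness in which predictable instances are removed, at the cost of one extra random draw per instance.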
All three strategies can be improved further by considering not only the predictability score of the top-k instances but also (via retraining without these instances) how their removal would influence the predictability scores of other instances in the next step. We found our computationally lighter approaches to work well even without the additional overhead of such look-ahead. AFLITE implements the greedy slicing approach, and can thus be viewed as a scalable and practical approximation of (intractable) AFOPT for optimum bias reduction.
Implementation. Algorithm 1 provides an implementation of AFLITE. The algorithm takes as input a dataset D = (X, Y ), a representation Φ(X) we are interested in minimizing the bias in, a model family M (e.g., linear classifiers), a target dataset size n, size m of the support of the expectation in Eq. (4), training set size t for the classifiers, size k of each slice, and an early-stopping filtering threshold τ . Importantly, for efficiency, Φ(X) is provided to AFLITE in the form of pre-computed embeddings for all of X. In practice, to obtain Φ(X), we train a first "warm-up" model on a small fraction of the data based on the learning curve in low-data regime, and do not reuse this data for the rest of our experiments. Moreover, this fraction corresponds to the training size t for AFLITE and it remains unchanged across iterations. We follow the iterative filtering approach, starting with S = D and iteratively removing some instances with the highest predictability scores using the greedy slicing strategy. Slice size k and number of partitions m are determined by the available computation budget.
At each filtering phase, we train models (linear classifiers) on m different random partitions of the data, and collect their predictions on their corresponding test set. For each instance i, we compute its predictability score as the ratio of the number of times its label y_i is predicted correctly, over the total number of predictions for it. We rank the instances according to their predictability score and use the greedy slicing strategy of removing the top-k instances whose score is not less than the early-stopping threshold τ. We repeat this process until fewer than k instances pass the τ threshold in a filtering phase or fewer than n instances remain. Please refer to Table 7 in the appendix for actual values of hyperparameters used in different experiments.
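The filtering loop described above can be sketched in plain Python as follows. This is an illustration that follows the structure of Algorithm 1, not the authors' released code: a nearest-centroid classifier stands in for the linear classifiers, and the function names, toy data, and hyperparameter values are assumptions chosen for the example.

```python
import random

def fit_centroids(examples):
    """Stand-in for a linear classifier: one mean feature vector per label."""
    sums, counts = {}, {}
    for x, y in examples:
        acc = sums.setdefault(y, [0.0] * len(x))
        for d, value in enumerate(x):
            acc[d] += value
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def classify(centroids, x):
    """Predict the label of the nearest centroid (squared Euclidean distance)."""
    return min(centroids,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, centroids[y])))

def aflite(features, labels, n, m, t, k, tau, rng):
    """Greedy-slicing filter; returns the indices of the retained instances."""
    if not t < n:
        raise ValueError("training set size t must be smaller than n")
    kept = list(range(len(features)))
    while len(kept) > n:
        correct = {i: 0 for i in kept}    # correct out-of-sample predictions
        total = {i: 0 for i in kept}      # all out-of-sample predictions
        for _ in range(m):
            shuffled = kept[:]
            rng.shuffle(shuffled)
            train, held_out = shuffled[:t], shuffled[t:]
            model = fit_centroids([(features[i], labels[i]) for i in train])
            for i in held_out:
                total[i] += 1
                correct[i] += int(classify(model, features[i]) == labels[i])
        # Predictability score: fraction of correct out-of-sample predictions.
        score = {i: correct[i] / total[i] for i in kept if total[i] > 0}
        candidates = sorted((i for i in score if score[i] >= tau),
                            key=lambda i: score[i], reverse=True)
        removed = set(candidates[:k])     # slice of the most predictable instances
        kept = [i for i in kept if i not in removed]
        if len(removed) < k:              # too few instances pass the threshold
            break
    return kept

# Toy data (hypothetical): the first 60 instances carry a label-revealing
# feature, the remaining 40 are pure noise with alternating labels.
rng = random.Random(0)
features, labels, is_biased = [], [], []
for index in range(100):
    label = index % 2
    biased = index < 60
    center = (3.0 if label == 1 else -3.0) if biased else 0.0
    features.append([center + rng.gauss(0.0, 0.5)])
    labels.append(label)
    is_biased.append(biased)
retained = aflite(features, labels, n=50, m=20, t=30, k=10, tau=0.75, rng=rng)
```

In this toy setting the share of label-revealing instances among the retained indices should fall below the original 60%, mirroring the intended effect of filtering. Note that instances whose noise feature happens to agree with their label are also predictable and may be removed.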
3. Synthetic Data Experiments
We present experiments to evaluate whether AFLITE successfully removes examples with spurious correlations in a synthetic setting. Our dataset consists of two-dimensional data, arranged in concentric circles, at two different levels of separation, as shown in Figure 2 . As is evident, a linear function is inadequate for separating the two classes; it requires a more complex non-linear model such as a support vector machine (SVM) with a radial basis function (RBF) kernel.
To simulate spurious correlations in the data, we add class-specific artificially constructed features (biases) sampled from two different Gaussian distributions. These features are only added to 75% of the data in each class, while for the rest of the data, we insert random (noise) features. The bias features make the task solvable through a linear function. Furthermore, for the first dataset, with the largest separation, we flipped the labels of some biased samples, making the data slightly adversarial even for the RBF-kernel SVM. Both models can clearly leverage the biases, and demonstrate improved performance over a baseline without biases.

Once we apply AFLITE, as expected, the number of biased samples is reduced considerably, making the task hard once again for the linear model, but still solvable for the nonlinear one. The filtered dataset is shown in the bottom half of Fig. 2, and the captions indicate the performance of a linear and an SVM model (also see Appendix §A.2). For the first dataset, we see that AFLITE removes most of those examples with flipped labels. These results show that AFLITE indeed lowers the performance of models relying on biases by removing samples with spurious correlations from a dataset.

Figure 2. Two sample biased datasets as input to AFLITE (top). Blue and orange indicate two different classes. Only the original two dimensions are shown, not the bias features. For the dataset on the left, with the highest separation, we flip some labels at random, so even an RBF kernel cannot achieve perfect performance. AFLITE makes the data more challenging for the models (bottom). Also see Appendix §A.2 for more details.
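The construction of the synthetic data can be sketched as below: two concentric circles, one per class, with an extra feature that reveals the class for a fraction of the points. The paper does not give the exact parameters, so the radii, noise levels, Gaussian means, and the use of a single extra dimension are all assumptions made for illustration; label flipping is omitted.

```python
import math
import random

def make_biased_circles(num_per_class, radii, bias_means, bias_fraction, rng):
    """Two concentric circles (one per class) plus one extra feature per point.

    For roughly `bias_fraction` of the points the extra feature is drawn from
    a class-specific Gaussian; for the rest it is class-independent noise."""
    points = []
    for label, radius in enumerate(radii):
        for _ in range(num_per_class):
            angle = rng.uniform(0.0, 2.0 * math.pi)
            r = radius + rng.gauss(0.0, 0.05)            # small radial jitter
            is_biased = rng.random() < bias_fraction
            if is_biased:
                extra = rng.gauss(bias_means[label], 0.3)  # class-revealing feature
            else:
                extra = rng.gauss(0.0, 1.0)                # uninformative noise
            x = [r * math.cos(angle), r * math.sin(angle), extra]
            points.append((x, label, is_biased))
    return points

# Hypothetical settings: inner circle is class 0, outer circle is class 1.
circles = make_biased_circles(num_per_class=200, radii=(1.0, 2.0),
                              bias_means=(-2.0, 2.0), bias_fraction=0.75,
                              rng=random.Random(0))
```

With these settings a linear model can separate most points using the third coordinate alone, while the first two coordinates require a nonlinear decision boundary.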
4. NLP Experiments
As our first real-world data evaluation for AFLITE, we consider out-of-domain and in-domain generalization for a variety of language datasets. The primary task we consider is natural language inference (NLI) on the Stanford NLI dataset (SNLI; Bowman et al., 2015). Each instance in the NLI task consists of a premise-hypothesis sentence pair; the task involves predicting whether the hypothesis entails, contradicts, or is neutral to the premise.
Experimental Setup
We use feature representations from RoBERTa-large, φ_RoBERTa (Liu et al., 2019b), a large-scale pretrained masked language model. The representations are extracted from the final layer before the output layer of a model trained on a random 10% sample (warm-up) of the original training set. The resultant filtered NLI dataset, D(φ_RoBERTa), is compared to the original dataset D as well as a randomly subsampled dataset D_182k, with the same sample size as D(φ_RoBERTa), amounting to only a third of the full data D. The same RoBERTa-large architecture is used to train the three NLI models.
4.1. Out-Of-Distribution Generalization
As motivated in Section 1, large-scale architectures often learn to solve datasets rather than the underlying task by overfitting on unintended correlations between input and output in the data. However, this reliance might be hurtful for generalization to out-of-distribution examples, since they may not contain the same biases. We evaluate AFLITE for this criterion on the NLI task. Gururangan et al. (2018), among others, showed the existence of certain annotation artifacts (lexical associations etc.) in SNLI which make the task considerably easier for most current methods. This spurred the development of several out-of-distribution test sets which carefully control for the presence of said artifacts. We evaluate on four such out-of-distribution datasets: HANS (McCoy et al., 2019b), NLI Diagnostics (Wang et al., 2018a), Stress tests (Naik et al., 2018), and Adversarial-NLI (Nie et al., 2019); see Appendix §A.3 for details. Given that these benchmarks are collected independently of the original SNLI task, the biases from SNLI are less likely to carry over; however, these benchmarks might contain their own biases (Liu et al., 2019a). Table 1 shows results on three out of four diagnostic datasets (HANS, NLI-Diagnostics and Stress), where we perform a zero-shot evaluation of the models. Models trained on SNLI-AFLITE consistently exceed or match the performance of the full model on the benchmarks above, up to standard deviation. To control for the size, we compare to a baseline trained on a random subsample of the same size (D_182k). AFLITE models report higher generalization performance, suggesting that the filtered samples are more informative than a random subset. In particular, models trained on SNLI-AFLITE perform substantially better on challenging examples of the HANS benchmark, which targets models purely relying on lexical and syntactic cues.
Table 2 shows results on the adversarial NLI benchmark, which allows for evaluation of transfer capabilities, by finetuning models on each of the three training datasets (Rd1, Rd2 and Rd3). A RoBERTa-large model trained on SNLI-AFLITE surpasses the performance of the model trained on the original SNLI data in all three settings.
4.2. In-Distribution Benchmark Re-Estimation
AFLITE additionally provides a more accurate estimation of the benchmark performance on several tasks. Here we simply lower the AFLITE early-stopping threshold τ in order to filter most biased examples from the data, resulting in a stricter benchmark with 92k train samples.

Table 2. Accuracy on Adversarial NLI using RoBERTa-large models pre-trained on the original SNLI data (D, size 550k) and on AFLITE-filtered data (D(φ_RoBERTa), size 182k). Both models were finetuned on the in-distribution training data for each round (Rd1, Rd2, and Rd3).
SNLI. In addition to RoBERTa-large, we consider here pre-computed embeddings from BERT-large (Devlin et al., 2019) and GloVe (Pennington et al., 2014), resulting in three different feature representations for SNLI: φ_BERT, φ_RoBERTa from RoBERTa-large (Liu et al., 2019b), and Φ_ESIM+GloVe, which uses the ESIM model (Chen et al., 2016) with GloVe embeddings. Table 3 shows the results for SNLI. In all cases, applying AFLITE substantially reduces overall model accuracy, with typical drops of 15-35% depending on the models used for learning the feature representations and those used for evaluation of the filtered dataset. In general, performance is lowest when using the strongest model (RoBERTa) for learning feature representations. The results also highlight the ability of weaker adversaries to produce datasets that are still challenging for much stronger models, with a drop of 13.7% for RoBERTa using Φ_ESIM+GloVe as feature representation.
To control for the reduction in dataset size by filtering, we randomly subsample D, creating D_92k, whose size is approximately equal to that of D(φ_RoBERTa). All models achieve nearly the same performance as on the full dataset, even when trained on just one-fifth of the original data. This result further highlights that current benchmark datasets contain significant redundancy within their instances.
We also include two other baselines, which target known dataset artifacts in NLI. The first baseline uses Point-wise Mutual Information (PMI) between words in a given instance and the target label as its only feature. Hence it captures the extent to which datasets exhibit word-association biases, one particular class of spurious correlations. While this baseline is relatively weaker than other models, its performance still drops by nearly 13% on the D(φ_RoBERTa) dataset. The second baseline is trained only on the hypothesis of an NLI instance (-HypOnly). Such partial input baselines (Gururangan et al., 2018) capture reliance on lexical cues only in the hypothesis, instead of learning a semantic relationship between the hypothesis and premise. The performance of this baseline drops by almost 24% after filtering with RoBERTa. AFLITE, which is agnostic to any particular known bias in the data, results in a drop of about 30% on the same dataset, indicating that it might be capturing a larger class of spurious biases than either of the above baselines.
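The word-label association underlying the PMI baseline can be illustrated with a short computation of PMI(w, c) = log [ p(w, c) / (p(w) p(c)) ] from co-occurrence counts. The sketch below only builds the PMI table; how the baseline turns these values into a classifier feature is not specified here, and the function name and the four toy hypotheses are made up for illustration.

```python
import math
from collections import Counter

def word_label_pmi(examples):
    """PMI for every observed (word, label) pair; words count once per example."""
    word_counts, label_counts, joint_counts = Counter(), Counter(), Counter()
    for words, label in examples:
        label_counts[label] += 1
        for word in set(words):
            word_counts[word] += 1
            joint_counts[(word, label)] += 1
    total = len(examples)
    pmi = {}
    for (word, label), joint in joint_counts.items():
        p_joint = joint / total
        p_word = word_counts[word] / total
        p_label = label_counts[label] / total
        pmi[(word, label)] = math.log(p_joint / (p_word * p_label))
    return pmi

# Toy hypotheses (hypothetical), where "nobody" only occurs with contradiction.
toy_examples = [
    (["nobody", "is", "outside"], "contradiction"),
    (["nobody", "is", "sleeping"], "contradiction"),
    (["a", "person", "is", "outside"], "entailment"),
    (["a", "person", "is", "sleeping"], "neutral"),
]
pmi_table = word_label_pmi(toy_examples)
```

A strongly positive value, as for ("nobody", "contradiction") here, marks a word that is predictive of a label on its own, while a word that appears in every example, such as "is", has a PMI of zero with every label.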
Finally, to demonstrate the value of the iterative, ensemble-based AFLITE algorithm, we compare with a non-iterative, single-model version of AFLITE: using a single model, we filter out the most predictable examples in a single iteration. A RoBERTa-large model trained on this subset (of the same size as D(φ_RoBERTa)) achieves a dev accuracy of 72.1%. Compared with the 62.6% that RoBERTa achieves on D(φ_RoBERTa) (see Table 3), this makes the baseline a sensible yet less effective approach. In particular, this illustrates the need for an iterative procedure involving models trained on multiple partitions of the remaining data in each iteration.
MultiNLI and QNLI. We evaluate performance on another large-scale NLI dataset, multi-genre NLI (MultiNLI; Williams et al., 2018), and on the QNLI dataset (Wang et al., 2018a), which is a sentence-pair classification version of the SQuAD (Rajpurkar et al., 2016) question answering task. Since RoBERTa as adversary led to the lowest performance across the board in SNLI, we only experiment with RoBERTa as adversary for MultiNLI and QNLI. While RoBERTa achieves over 90% on both original datasets, its performance drops to 66.2% for MultiNLI and to 77.7% for QNLI on the filtered datasets. Similarly, partial input baseline performance also decreases substantially on both datasets compared to their performance on the original datasets. Overall, our experiments indicate that AFLITE consistently results in reduced accuracy on the filtered datasets across multiple language benchmark datasets, even after controlling for the size of the training set.

Table 3 shows that human performance on SNLI-AFLITE is lower than that on full SNLI (measured by annotator labels provided in the original SNLI validation data). This indicates that the filtered dataset is somewhat harder even for humans, though to a much lesser degree than for any model. Indeed, removal of examples with spurious correlations could inadvertently lead to removal of genuinely easy examples; this might be a limitation of a model-based bias reduction approach such as AFLITE (see Appendix §A.6 for a qualitative analysis). Future directions for bias reduction techniques should account for unaltered human performance before and after reduction.

Table 5. Top-1 accuracy on ImageNet-A (Hendrycks et al., 2019), an adversarial test set for image classification. The strongest model, EfficientNet-B7, improves by about 2% on out-of-distribution ImageNet-A images when trained on AFLITE-filtered data rather than on a random subsample of the same size.

  Training data    EfficientNet-B5    EfficientNet-B7
  D                16.5               20.6
  D 40%            5.9                8.5
  D(ΦEN-B7)        7.2                10.4
5. Vision Experiments
We evaluate AFLITE on image classification through ImageNet (ILSVRC2012) classification. On ImageNet, we use the state-of-the-art EfficientNet-B7 model (Tan & Le, 2019) as our core feature extractor Φ_EN-B7. The EfficientNet model is learned from scratch on a fixed 20% sample of the ImageNet training set, using RandAugment data augmentation (Cubuk et al., 2019). We then use the 2560-dimensional features extracted by EfficientNet-B7 as the underlying representation for AFLITE to filter the remaining dataset, and stop when the data size is 40% of ImageNet.
Adversarial Image Classification. In Table 5, we report performance of image classification models on ImageNet-A, a dataset with out-of-distribution images (Hendrycks et al., 2019). As shown, all EfficientNet models struggle on this task, even when trained on the entire ImageNet. However, we find that training on AFLITE-filtered data leads to models with greater generalization, in comparison to training on a randomly sampled ImageNet subset of the same size, leading to up to 2% improvement in performance.

Table 6. Results on ImageNet, in Top-1 accuracy (%). We trained on AFLITE-filtered instances (D(ΦEN-B7)), and compare this to an equal-sized but random 40% subsample of ImageNet (D 40%). We report results on the ImageNet validation set before and after filtering with AFLITE. ∆ indicates the difference in accuracy of the full model and the filtered model. Notably, evaluating on ImageNet-AFLITE is much harder, resulting in a drop of nearly 21 percentage points in accuracy for the strongest model.
In-distribution Image Classification. In Table 6, we present ImageNet accuracy across the EfficientNet and ResNet (He et al., 2016) model families before and after filtering with AFLITE. For evaluation, the ImageNet-AFLITE filtered validation set is much harder than the standard validation set (also see Figure 1). While the top performer after filtering is still EfficientNet-B7, its top-1 accuracy drops from 84.4% to 63.5%. A model trained on a randomly sampled subset of the same size suffers a much smaller drop, most likely attributable to the reduction in training data.
Overall, these results suggest that image classification, even within a subset of the closed world of ImageNet, is far from solved. These results echo other findings that suggest that common biases that naturally occur in web-scale image data, such as towards canonical poses (Alcorn et al., 2019) or towards texture rather than shape (Geirhos et al., 2018), are problems for ImageNet-trained classifiers.
6. Related Work
AFLITE is related to the adversarial filtering (AF) algorithm of Zellers et al. (2018), yet distinct in two key ways: it is (i) much more broadly applicable (by not requiring overgeneration of data instances), and (ii) considerably more lightweight (by not requiring re-training a model at each iteration of AF). Variants of this AF approach have recently been used to create other datasets such as HellaSwag (Zellers et al., 2019) and αNLI (Bhagavatula et al., 2019) by iteratively perturbing dataset instances until a target model cannot fit the resulting dataset. While effective, these approaches run into three main pitfalls. First, dataset curators need to explicitly devise a strategy for collecting or generating perturbations of a given instance. Second, the approach runs the risk of distributional bias, where a discriminator can learn to distinguish between machine-generated instances and human-generated ones. Finally, it requires re-training a model at each iteration, which is computationally expensive, especially when using a large model such as BERT as the adversary. In contrast, AFLITE focuses on addressing dataset biases in existing datasets instead of adversarially perturbing instances. AFLITE was earlier proposed by Sakaguchi et al. (2020) to create the WinoGrande dataset. This paper presents more thorough experiments, theoretical justification, and results from generalizing the proposed approach to multiple popular NLP and vision datasets.
Li & Vasconcelos (2019) recently proposed REPAIR, a method to remove representation bias by dataset resampling. The motivation in REPAIR is to learn a probability distribution over the dataset that favors instances that are hard for a given representation. In addition, the implementation of REPAIR relies on in-training classification loss as opposed to out-of-sample generalization accuracy. RESOUND (Li et al., 2018) quantifies the representation biases of datasets. It uses the representation biases to assemble a new K-class dataset with smaller biases by sampling an existing C-class dataset (C > K).
Arjovsky et al. (2019) propose Invariant Risk Minimization as an objective that promotes learning representations of the data which are stable across environments. Instead of learning optimal classifiers, AFLITE aims to remove instances that exhibit artifacts in a dataset. Also related are approaches in He et al. (2019) where specific NLI biases are targeted; we show AFLITE is capable of removing any spurious bias. Data selection methods such as Wang et al. (2018b) aim to filter data to preserve downstream performance; AFLITE is adversarial to this goal.
7. Conclusion
We presented a deep dive into AFLITE, an iterative greedy algorithm that adversarially filters out spurious biases from data for accurate benchmark estimation. We presented a theoretical framework supporting AFLITE, and showed its effectiveness in bias reduction on synthetic and real datasets, providing extensive analyses. We applied AFLITE to four datasets, including widely used benchmarks such as SNLI and ImageNet, and showed that the strongest performance on the resulting filtered dataset drops by 30 points for SNLI and 20 points for ImageNet. We also showed that, on out-of-distribution and adversarial test sets, models trained on the AFLITE-filtered subset generalize better. We hope that dataset creators will employ AFLITE to identify unobservable artifacts before releasing new challenge datasets, for more reliable estimates of task progress on future AI benchmarks. All datasets and code for this work will be made public soon.
A. Appendix
A.1. Slice Sampling Details
The slice sampling approach can be efficiently implemented using what is known as the Gumbel method or Gumbel trick (Gumbel & Lieblein, 1954; Maddison et al., 2014), which uses random perturbations to turn sampling into a simpler problem of optimization. This has recently found success in several probabilistic inference applications (Kim et al., 2016; Jang et al., 2016; Maddison et al., 2016; Balog et al., 2017; Kool et al., 2019). Starting with the log-predictability scores log p(i) for various i, the idea is to perturb them by adding independent random noise γ_i drawn from the standard Gumbel distribution. Interestingly, the maximizer i* of γ_i + log p(i) turns out to be an exact sample drawn from the (unnormalized) distribution defined by p. Note that i* is a random variable since the γ_i are drawn at random. This result can be generalized (Vieira, 2014) for slice sampling: the k highest values of Gumbel-perturbed log-predictability scores correspond to sampling, without replacement, k items from the probability distribution defined by p. The Gumbel method is typically applied to exponentially large combinatorial spaces, where it is challenging to scale up. In our setting, however, the overhead is minimal since the cost of drawing a random γ_i is negligible compared to computing p(i).

Figure 3 shows the effect of AFLITE on four synthetic datasets containing data arranged in concentric circles at four degrees of class separation. This adds two more experiments (shown on the extreme right) exhibiting similar phenomena as those shown in Figure 2. For greater visibility, we have provided the accuracies of the SVM with RBF kernel and logistic regression in Table 8.
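The Gumbel-based slice sampling described in this section can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the authors' implementation: the function name `gumbel_top_k`, its arguments, and the handling of zero scores are assumptions made for the example.

```python
import math
import random


def gumbel_top_k(scores, k, rng=None):
    """Sample up to k distinct indices without replacement, with probability
    proportional to the (unnormalized) `scores`, via the Gumbel top-k trick."""
    rng = rng or random.Random()
    perturbed = []
    for i, p in enumerate(scores):
        if p <= 0:
            continue  # assumption: zero-score items are never sampled
        u = rng.random()
        while u == 0.0:  # guard against log(0)
            u = rng.random()
        # Standard Gumbel noise: -log(-log(U)) with U ~ Uniform(0, 1).
        gamma = -math.log(-math.log(u))
        perturbed.append((gamma + math.log(p), i))
    # The k largest perturbed log-scores form the sample.
    perturbed.sort(reverse=True)
    return [i for _, i in perturbed[:k]]


# Toy predictability scores for four instances (illustrative values only).
scores = [0.9, 0.5, 0.1, 0.0]
sample = gumbel_top_k(scores, k=2, rng=random.Random(0))
```

With k = 1 this reduces to the single-sample case: the index of the largest perturbed score is one draw from the distribution defined by the scores. The only extra cost over computing the scores is one random draw per instance.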
A.2. Results On Synthetic Data Experiments
In summary, a stronger model such as the SVM is more robust to the presence of artifacts than a simple linear classifier. Thus, the implication for real datasets is to move towards models designed for reasoning about a specific task, hence avoiding a dependence on spurious artifacts.
A.3. NLI Out-Of-Distribution Benchmarks
We describe the four out-of-distribution evaluation benchmarks for NLI from Section 4.1 below:
• HANS (McCoy et al., 2019b) contains evaluation examples designed to avoid common structural heuristics (such as word overlap) which could be used by models to correctly predict NLI inputs, without true inferential reasoning.
• NLI Diagnostics (Wang et al., 2018a) evaluates model performance on several fine-grained semantic categories, such as logical reasoning and commonsense knowledge.
• Stress tests for NLI (Naik et al., 2018) are a collection of tests targeting the weaknesses of strong NLI models, to check if these are robust to semantics (competence), irrelevance (distraction) and typos (noise).
• Adversarial NLI (Nie et al., 2019) consists of premises collected from Wikipedia and other news corpora, and human-generated hypotheses, arranged at different tiers of the challenge they present to a model, using a human-and-model-in-the-loop procedure.

Recent work (McCoy et al., 2019a) has observed large variance on out-of-distribution test sets with random seeds. Hence, we report the mean and variance across 5 random seeds in all settings in Table 1. Since Adversarial NLI involves finetuning the model, and not just reporting on a different test set, we skip this step in Table 2.

A.4. Hyperparameters for AFLITE
Table 7 shows hyperparameters used to run AFLITE to obtain filtered subsets for in-distribution benchmark estimation on different datasets. The target dataset size n and the early-stopping filtering threshold τ are interdependent, as the predictability score threshold determines which examples to keep, which in turn influences the size of the dataset, n. For ImageNet, we set n = 640K and do not control for τ. We use much larger values for t and k for ImageNet than in all NLP experiments, where the use of powerful language representations such as RoBERTa allows us to get reasonable performance even with smaller training sets; ImageNet does not offer any such benefits due to pretrained representations.

Table 7. AFLITE hyperparameters used for in-distribution benchmark estimation on different datasets. m denotes the size of the support of the expectation in Eq. (4), t is the training set size for the linear classifiers, k is the size of each slice, and τ is an early-stopping filtering threshold. For ImageNet, we set n = 640K and hence do not need to control for τ. In every other setting, we set τ as above, and hence do not need to control for n.
For all out-of-distribution NLP experiments, we explicitly control for the size of n, as discussed in the corresponding sections in the paper. In these cases, we typically end up using slightly larger n, allowing for the final models to get more exposure to task data which is, to a degree, helpful for out-of-distribution generalization. In ImageNet, we use the same hyperparameters in both sets of experiments.

Figure 3. Four sample biased datasets as input to AFLITE (top). Blue and orange indicate two different classes. Only the original two dimensions are shown, not the bias features. For the leftmost dataset with the highest separation, we flip some labels at random, so even an RBF kernel cannot achieve perfect performance. AFLITE makes the data more challenging for the models (bottom).
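The interplay between the target size n, the slice size k, and the threshold τ discussed in this section can be illustrated with a small sketch. It rests on assumptions: the two stopping rules below (stop once at most n instances remain, or once fewer than k instances reach τ) are one plausible reading of how these hyperparameters interact, and `score_fn` is a hypothetical stand-in for the predictability scores computed from the ensemble of linear classifiers.

```python
def select_slice(scores, k, tau):
    """Return up to k instance ids with the highest predictability scores,
    considering only ids whose score is at least tau."""
    eligible = [i for i, s in scores.items() if s >= tau]
    eligible.sort(key=lambda i: scores[i], reverse=True)
    return eligible[:k]


def filter_dataset(ids, score_fn, n, k, tau):
    """Greedy filtering loop with two (assumed) stopping rules."""
    remaining = set(ids)
    while len(remaining) > n:          # rule 1: target size n reached
        scores = score_fn(remaining)   # hypothetical: id -> predictability
        to_remove = select_slice(scores, k, tau)
        remaining -= set(to_remove)
        if len(to_remove) < k:         # rule 2: early stop triggered by tau
            break
    return remaining


# Toy, fixed predictability scores (illustrative values only).
fixed = {0: 0.95, 1: 0.90, 2: 0.80, 3: 0.60, 4: 0.30, 5: 0.10}
score_fn = lambda remaining: {i: fixed[i] for i in remaining}

# Controlling tau only (n = 0): filtering stops when a slice cannot be filled.
kept_by_tau = filter_dataset(fixed, score_fn, n=0, k=2, tau=0.75)
# Controlling n only (tau = 0): filtering stops at the target size.
kept_by_n = filter_dataset(fixed, score_fn, n=4, k=2, tau=0.0)
```

Setting n = 0 mimics a run where only τ is controlled, and setting τ = 0 mimics a run where only n is controlled. Because instances are removed k at a time, the final size in this sketch can fall slightly below n when the remaining count is not a multiple of k above it.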
A.5. Hyperparameters For NLP Experiments
For all NLP experiments, our implementation is based on the GLUE (Wang et al., 2018a) experiments in the Transformers repository (Wolf et al., 2019) from HuggingFace. We used the Adam optimizer (Kingma & Ba, 2014) for every training setup, with a learning rate of 1e-05 and an epsilon value of 1e-08. We trained for 3 epochs for all *NLI tasks, maintaining a batch size of 92. All of the above hyperparameters were selected using a grid search; we kept other hyperparameters unaltered from the original HuggingFace repository. Each experiment was performed on a single Quadro RTX 8000 GPU.
Results on the synthetic dataset are provided in Table 8. Please refer to Section 3 for a detailed description.

Table 8. Mean Dev accuracy (%) of two models trained on four synthetic datasets before (D) and after (D(Φ)) AFLITE. Standard deviation across 10 runs with randomly chosen seeds is provided as a subscript. The datasets, also shown in Fig. 3, differ in the degree of separation between the two classes. Both models (SVM with an RBF kernel and linear classifier with logistic regression) perform well on the original synthetic dataset, before filtering. The linear classifier performs well on the data because it contains spurious artifacts, making the task artificially easier for it. However, after AFLITE, the linear model, relying mostly on the spurious features, clearly underperforms.

A.6. Qualitative Analysis of SNLI
Table 9 shows some examples removed and retained by AFLITE on the SNLI dataset.
Removed by AFLITE
Premise | Hypothesis | Label
A woman, in a green shirt, preparing to run on a treadmill. | A woman is preparing to sleep on a treadmill. | contradiction
The dog is catching a treat. | The cat is not catching a treat. | contradiction
Three young men are watching a tennis match on a large screen outdoors. | Three young men watching a tennis match on a screen outdoors, because their brother is playing. | neutral
A girl dressed in a pink shirt, jeans, and flip-flops sitting down playing with a lollipop machine. | A funny person in a shirt. | neutral
A man in a green apron smiles behind a food stand. | A man smiles. | entailment
A little girl with a hat sits between a woman's feet in the sand in front of a pair of colorful tents. | The girl is wearing a hat. | entailment

Retained by AFLITE
Premise | Hypothesis | Label
People are throwing tomatoes at each other. | The people are having a food fight. | entailment
A man poses for a photo in front of a Chinese building by jumping. | The man is prepared for his photo. | entailment
An older gentleman speaking at a podium. | A man giving a speech | neutral
A man poses for a photo in front of a Chinese building by jumping. | The man has experience in taking photos. | neutral
People are waiting in line by a food vendor. | People sit and wait for their orders at a nice sit down restaurant. | contradiction
Number 13 kicks a soccer ball towards the goal during children's soccer game. | A player passing the ball in a soccer game. | contradiction

Table 9. Examples from SNLI, removed (top) and retained (bottom) by AFLITE. As is evident, the retained instances are slightly more challenging and capture more nuanced semantics in contrast to the removed instances. Removed instances also exhibit larger word overlap, and many other artifacts found in Gururangan et al. (2018).
A.7. Hyperparameters For ImageNet
We trained our ImageNet models using v3-512 TPU pods. For EfficientNet (Tan & Le, 2019), we used RandAugment data augmentation (Cubuk et al., 2019) with 2 layers and a magnitude of 28, for all model sizes. We trained our models using a batch size of 4096 and a learning rate of 0.128, and kept other hyperparameters the same as in Tan & Le (2019). We trained for 350 epochs for all dataset sizes, so when training on 20% or 40% of ImageNet (or a smaller dataset), we scaled the number of optimization steps accordingly. For ResNet (He et al., 2016), we used a learning rate of 0.1, a batch size of 8192, and trained for 90 epochs.
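As a rough illustration of scaling the number of optimization steps with dataset size at a fixed epoch count, the sketch below computes step counts for the EfficientNet setting (350 epochs, batch size 4096). The helper name `train_steps`, the rounding up of a final partial batch, and the use of the ImageNet-1k training set size (1,281,167 images) as the full dataset are assumptions for the example, not details taken from the training code.

```python
import math


def train_steps(num_examples, epochs=350, batch_size=4096):
    """Total optimizer steps when every epoch covers the whole (sub)set,
    counting a final partial batch as one step (an assumption)."""
    return epochs * math.ceil(num_examples / batch_size)


FULL_IMAGENET = 1_281_167  # ImageNet-1k training images

# Step counts for the full set and for 40% and 20% subsets.
steps = {
    pct: train_steps(FULL_IMAGENET * pct // 100)
    for pct in (100, 40, 20)
}
```

Because the epoch count is held constant, the step count shrinks roughly in proportion to the subset size, so smaller filtered subsets are trained for correspondingly fewer steps.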
1 We will henceforth use biases and artifacts interchangeably.
2 Stands for Lightweight Adversarial Filtering.
We use standard implementations from scikit-learn: https://scikit-learn.org/stable/.
QNLI is stylized as an NLI classification task, where the task is to determine whether or not a sentence contains the answer to a question.