## Abstract

Large neural models have demonstrated human-level performance on language and vision benchmarks, yet their performance degrades considerably on adversarial or out-of-distribution samples. This raises the question of whether these models have learned to solve a dataset, rather than the underlying task, by overfitting to spurious dataset biases. We investigate one recently proposed approach, AFLite, which adversarially filters such dataset biases, as a means to mitigate the prevalent overestimation of machine performance. We provide a theoretical understanding of AFLite by situating it in a generalized framework for optimum bias reduction. We present extensive supporting evidence that AFLite is broadly applicable for reduction of measurable dataset biases, and that models trained on the filtered datasets yield better generalization to out-of-distribution tasks. Finally, filtering results in a large drop in model performance (e.g., from 92% to 62% for SNLI), while human performance remains high. Our work thus shows that such filtered datasets can pose new research challenges for robust generalization, serving as upgraded benchmarks.

## 1. Introduction

Large-scale neural networks have achieved superhuman performance across many popular AI benchmarks, for tasks as diverse as image recognition (ImageNet; Russakovsky et al., 2015), natural language inference (SNLI; Bowman et al., 2015), and question answering (SQuAD; Rajpurkar et al., 2016). However, the performance of such neural models degrades considerably when tested on out-of-distribution or adversarial samples, otherwise known as data "in the wild" (Eykholt et al., 2018; Jia & Liang, 2017). This phenomenon indicates that the high performance of the strongest AI models is often confined to specific datasets, implicitly making a closed-world assumption. In contrast, true learning of a task necessitates generalization, or an open-world assumption. A major impediment to generalization is the presence of spurious biases (unintended correlations between input and output) in existing datasets (Torralba & Efros, 2011). Such biases or artifacts are often introduced during data collection (Fouhey et al., 2018) or during human annotation (Gururangan et al., 2018; Poliak et al., 2018; Tsuchiya, 2018; Geva et al., 2019). Not only do dataset biases inevitably bias the models trained on them, but they have also been shown to significantly inflate model performance, leading to an overestimation of the true capabilities of current AI systems (Sakaguchi et al., 2020; Hendrycks et al., 2019).

*1 Allen Institute for Artificial Intelligence. 2 Paul G. Allen School of Computer Science, University of Washington. Correspondence to: Ronan Le Bras, Swabha Swayamdipta <{ronanlb,swabhas}@allenai.org>.*

Many recent studies have investigated task or dataset specific biases, including language bias in Visual Question Answering (Goyal et al., 2017) , texture bias in ImageNet (Geirhos et al., 2018) , and hypothesis-only reliance in Natural Language Inference (Gururangan et al., 2018) . These studies have yielded similarly domain-specific algorithms to address the found biases. However, the vast majority of these studies follow a top-down framework where the bias reduction algorithms are essentially guided by researchers' intuitions and domain insights on particular types of spurious biases. While promising, such approaches are fundamentally limited by what the algorithm designers can manually recognize and enumerate as unwanted biases.

Our work investigates AFLITE, an alternative bottom-up approach to algorithmic bias reduction. AFLITE was recently proposed by Sakaguchi et al. (2020), albeit very succinctly, to systematically discover and filter any dataset artifact in crowdsourced commonsense problems. AFLITE employs a model-based approach with the goal of removing spurious artifacts in data beyond what humans can intuitively recognize, but those which are exploited by powerful models. Figure 1 illustrates how AFLITE reduces dataset biases in the ImageNet dataset for object classification.

Figure 1. Example images of the Monarch Butterfly and Chickadee from ImageNet. On the right are images in each category which were removed by AFLITE, and on the left, the ones which were retained. The heatmap shows pairwise cosine similarity between EfficientNet-B7 features (Tan & Le, 2019). The retained images (left) show significantly greater diversity, such as the cocoon of a butterfly or the non-canonical chickadee poses, also reflected by the cosine similarity values. This diversity suggests that the AFLITE-filtered examples present a more accurate benchmark for the task of image classification, as opposed to fitting to particular dataset biases.

This paper presents the first theoretical understanding and comprehensive empirical investigations into AFLITE. More concretely, we make the following four novel contributions.

First, we situate AFLITE in a theoretical framework for optimum bias reduction, and demonstrate that AFLITE provides a practical approximation of AFOPT, the ideal but computationally intractable bias reduction method under this framework (§2).

Second, we present an extensive suite of experiments that were lacking in the work of Sakaguchi et al. (2020), to validate whether AFLITE truly removes spurious biases in data as originally assumed. Our baselines and thorough analyses use both synthetic (thus easier to control) datasets (§3) as well as real datasets. The latter span benchmarks across NLP (§4) and vision (§5) tasks: the SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018) datasets for natural language inference, QNLI (Wang et al., 2018a) for question answering, and the ImageNet dataset (Russakovsky et al., 2015) for object recognition.

Third, we demonstrate that models trained on AFLITE-filtered data generalize substantially better to out-of-domain samples, compared to models trained on the original biased datasets (§4, §5). These findings indicate that spurious biases in datasets make benchmarks artificially easier, as models learn to overly rely on these biases instead of learning more transferable features, thereby hurting out-of-domain generalization.

Finally, we show that AFLITE-filtering makes widely used AI benchmarks considerably more challenging. We consistently observe a significant drop in in-domain performance even for state-of-the-art models on all benchmarks, even though human performance remains high; this suggests that currently reported performance on benchmarks might be inflated. For instance, the best model on SNLI-AFLITE achieves only 63% accuracy, a 30% drop compared to its accuracy on the original SNLI. These findings are especially surprising since AFLITE maintains an identical train-test distribution, while also retaining a sizable training set.

In summary, AFLITE-filtered datasets can serve as upgraded benchmarks, posing new research challenges for robust generalization.

## 2. AFLITE

Large datasets run the risk of prioritizing performance on the data-rich head of the distribution, where examples are plentiful, and discounting the tail. AFLITE seeks to minimize the ability of a model to exploit biases in the head of the distribution, while preserving the inherent complexity of the tail. In this section, we provide a formal framework for studying such bias reduction techniques, revealing that AFLITE can be viewed as a practical approximation of a desirable but computationally intractable optimum bias reduction objective.

Formalization. Let Φ be any feature representation defined over a dataset D = (X, Y). AFLITE seeks a subset S ⊂ D, |S| ≥ n, that is maximally resilient to the features uncovered by Φ; that is, for any identically distributed train-test split of S, features Φ should not help models generalize to the held-out set.

Let M denote a family of classification models (e.g., logistic regression, SVM, or a particular neural architecture) that can be trained on subsets S of D = (X, Y) using features Φ(X). We define the representation bias of Φ in S w.r.t. M, denoted R(Φ, S, M), as the best possible out-of-sample classification accuracy achievable by models in M when predicting labels Y using features Φ(X). Given a target minimum reduced dataset size n, the goal is to find a subset S ⊂ D of size at least n that minimizes this representation bias in S w.r.t. M:

$$S^{*} = \mathop{\arg\min}_{S \subset D,\ |S| \geq n} R(\Phi, S, M) \tag{1}$$

Eq. (1) corresponds to optimum bias reduction, referred to as AFOPT. We formulate R(Φ, S, M) as the expected classification accuracy resulting from the following process. Let q : 2^S → [0, 1] be a probability distribution over subsets T = (X_T, Y_T) of S. The process is to randomly choose T with probability q(T), train a classifier M_T ∈ M on S \ T, and evaluate its classification accuracy f_{M_T}(Φ(X_T), Y_T) on T. The resulting accuracy on T is itself a random variable, since the training set S \ T is randomly sampled. We define the expected value of this classification accuracy to be the representation bias:

$$R(\Phi, S, M) = \mathbb{E}_{T \sim q}\!\left[\frac{1}{|T|} \sum_{i \in T} f_{M_T}(\Phi(x_i), y_i)\right] \tag{2}$$
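In practice, the expectation in Eq. (2) can be estimated by Monte Carlo sampling of random train/held-out splits. The following is a minimal sketch, in which a least-squares linear probe stands in for the model family M; the function and variable names are illustrative, not part of the original method:

```python
import numpy as np

def fit_linear(phi, y):
    """Least-squares linear probe: a simple stand-in for the model family M."""
    X = np.hstack([phi, np.ones((len(phi), 1))])           # append a bias column
    w, *_ = np.linalg.lstsq(X, 2.0 * y - 1.0, rcond=None)  # regress labels in {-1, +1}
    return w

def predict_linear(w, phi):
    X = np.hstack([phi, np.ones((len(phi), 1))])
    return (X @ w > 0).astype(int)

def representation_bias(phi, y, t, trials=50, seed=0):
    """Monte Carlo estimate of R(Phi, S, M) in Eq. (2): the expected
    out-of-sample accuracy over random splits with |S \\ T| = t."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(trials):
        perm = rng.permutation(len(y))
        train, held = perm[:t], perm[t:]                   # a random (S \ T, T) split
        w = fit_linear(phi[train], y[train])
        accs.append(np.mean(predict_linear(w, phi[held]) == y[held]))
    return float(np.mean(accs))
```

On features carrying a spurious label-correlated dimension, this estimate approaches 1; on pure noise features, it stays near chance.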

The expectation in Eq. (2), however, involves a summation over exponentially many choices of T even to compute the representation bias for a single S. This makes optimizing Eq. (1), which involves a search over S, highly intractable. To circumvent this challenge, we refactor R(Φ, S, M) as a sum over instances i ∈ S of the aggregate contribution of i to the representation bias across all T. Importantly, this summation has only |S| terms, allowing more efficient computation. We call this the predictability score p(i) for i: on average, how reliably the label y_i can be predicted using features Φ(x_i) when a model from M is trained on a randomly chosen training set S \ T not containing i. Instances with high predictability scores are undesirable, as their feature representation can be exploited to confidently and correctly predict such instances.

With some abuse of notation, for i ∈ S, let q(i) = Σ_{T ∋ i} q(T) denote the marginal probability of choosing a subset T that contains i. The ratio q(T)/q(i) is then the probability of T conditioned on it containing i. Let f_{M_T}(Φ(x_i), y_i) be the classification accuracy of M_T on i. Then the expectation in Eq. (2) can be written in terms of p(i) as follows:

T ⊂S q(T ) • 1 |T | i∈T fM T (Φ(xi), yi) = T ⊂S i∈T q(T ) • fM T (Φ(xi), yi) |T | = i∈S T ⊂S T i q(T ) • fM T (Φ(xi), yi) |T | = i∈S q(i) T ⊂S T i q(T ) q(i) fM T (Φ(xi), yi) |T | = i∈S q(i) ET ⊂S, T i fM T (Φ(xi), yi) |T | = i∈S p(i)

where p(i) is the predictability score of i defined as:

$$p(i) = q(i)\; \mathbb{E}_{\substack{T \subset S \\ T \ni i}}\!\left[\frac{f_{M_T}(\Phi(x_i), y_i)}{|T|}\right] \tag{3}$$

While this refactoring works for any probability distribution q with non-zero support on all instances, for simplicity of exposition we assume q to be the uniform distribution over all T ⊂ S of a fixed size. This makes both |T| and q(i) fixed constants; in particular,

$$q(i) = \binom{|S|-1}{|T|-1} \bigg/ \binom{|S|}{|T|} = \frac{|T|}{|S|}.$$

This yields a simplified predictability score p̄(i) and a factored reformulation of the representation bias from Eq. (2):

$$\bar{p}(i) = \mathbb{E}_{\substack{T \subset S \\ T \ni i}}\!\left[f_{M_T}(\Phi(x_i), y_i)\right] \tag{4}$$

$$R(\Phi, S, M) = \frac{1}{|S|} \sum_{i \in S} \bar{p}(i) \tag{5}$$
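For a small S, the refactoring of the expectation over subsets T into a sum of per-instance predictability scores can be verified numerically by brute-force enumeration. This is an illustrative check using arbitrary stand-in values for the accuracy f_{M_T}(Φ(x_i), y_i), not real classifier accuracies:

```python
import random
from itertools import combinations
from math import comb

S = list(range(5))      # a toy set S with |S| = 5
T_SIZE = 2              # fixed |T|; q is uniform over all such subsets
subsets = list(combinations(S, T_SIZE))
q = 1.0 / len(subsets)  # uniform q(T)

# Arbitrary stand-in values for the accuracy f_{M_T}(Phi(x_i), y_i).
random.seed(0)
f = {(T, i): random.random() for T in subsets for i in T}

# Left-hand side: Eq. (2), the expectation over T of the mean accuracy on T.
lhs = sum(q * sum(f[(T, i)] for i in T) / T_SIZE for T in subsets)

# Right-hand side: the sum over instances of predictability scores p(i), Eq. (3).
def p(i):
    q_i = comb(len(S) - 1, T_SIZE - 1) / comb(len(S), T_SIZE)  # = |T| / |S|
    terms = [f[(T, i)] / T_SIZE for T in subsets if i in T]
    return q_i * sum(terms) / len(terms)  # conditional distribution is uniform

rhs = sum(p(i) for i in S)
assert abs(lhs - rhs) < 1e-12
```

The two quantities agree exactly, confirming that the exponential sum over T collapses to a sum of |S| predictability scores.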

While this refactoring reduces the exponential summation underlying the expectation in Eq. (2) to a linear sum, solving Eq. (1) for optimum bias reduction (AFOPT) remains challenging due to the exponentially many choices of S. However, the refactoring does enable computationally efficient heuristic approximations that start with S = D and iteratively filter out from S the most predictable instances i, as identified by the (simplified) predictability scores p̄(i) computed over the current candidate for S. In all cases, we use a fixed training set size |S \ T| = t < n. Further, since a larger filtered set is generally desirable, we terminate the filtering process early (i.e., while |S| > n) if the predictability score for every i falls below a pre-specified early-stopping threshold τ ∈ [0, 1].

We consider three such heuristic approaches. (A) A simple greedy approach starts with the full set S = D, identifies an i ∈ S that maximizes p̄(i), removes it from S, and repeats up to |D| − n times. (B) A greedy slicing approach identifies the instances with the k highest predictability scores, removes all of them from S, and repeats the process up to (|D| − n)/k times. (C) A slice sampling approach, instead of greedily choosing the top k instances, randomly samples k instances with probabilities proportional to their predictability scores. The Gumbel method provides an efficient way to perform such sampling (Gumbel & Lieblein, 1954; Maddison et al., 2014; Kim et al., 2016; Balog et al., 2017; Kool et al., 2019), by independently perturbing each p̄(i) with a Gumbel random variable and identifying the k instances with the highest perturbed predictability scores (cf. Appendix A.1).

**Algorithm 1** AFLITE
Input: dataset D = (X, Y), pre-computed representation Φ(X), model family M, target dataset size n, number of random partitions m, training set size t < n, slice size k ≤ n, early-stopping threshold τ
Output: reduced dataset S

```
S = D
while |S| > n do
    // Filtering phase
    forall i ∈ S do
        Initialize multiset of out-of-sample predictions E(i) = ∅
    for iteration j : 1..m do
        Randomly partition S into (T_j, S \ T_j) s.t. |S \ T_j| = t
        Train a classifier L ∈ M on {(Φ(x), y) | (x, y) ∈ S \ T_j}
            // L is typically a linear classifier
        forall i = (x, y) ∈ T_j do
            Add the prediction L(Φ(x)) to E(i)
    forall i = (x, y) ∈ S do
        Compute the predictability score p̄(i) = |{ŷ ∈ E(i) s.t. ŷ = y}| / |E(i)|
    Select up to k instances S' in S with the highest predictability scores, subject to p̄(i) ≥ τ
    S = S \ S'
    if |S'| < k then break
return S
```
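The Gumbel perturbation used in the slice sampling approach can be sketched as follows. This is a minimal illustration with our own function name; we perturb log-scores with Gumbel noise, which yields sampling with probability proportional to the scores:

```python
import numpy as np

def gumbel_top_k(scores, k, rng):
    """Sample k distinct indices, with probability proportional to `scores`,
    by perturbing log-scores with Gumbel noise and taking the top k."""
    gumbels = rng.gumbel(size=len(scores))
    perturbed = np.log(scores) + gumbels
    return np.argsort(-perturbed)[:k]
```

For k = 1 this is the classic Gumbel-max trick; repeated draws select high-score indices at a rate matching their normalized scores.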

All three strategies can be improved further by considering not only the predictability score of the top-k instances but also (via retraining without these instances) how their removal would influence the predictability scores of other instances in the next step. We found our computationally lighter approaches to work well even without the additional overhead of such look-ahead. AFLITE implements the greedy slicing approach, and can thus be viewed as a scalable and practical approximation of (intractable) AFOPT for optimum bias reduction.

Implementation. Algorithm 1 provides an implementation of AFLITE. The algorithm takes as input a dataset D = (X, Y), a representation Φ(X) whose bias we seek to minimize, a model family M (e.g., linear classifiers), a target dataset size n, the size m of the support of the expectation in Eq. (4), a training set size t for the classifiers, a slice size k, and an early-stopping filtering threshold τ. Importantly, for efficiency, Φ(X) is provided to AFLITE in the form of pre-computed embeddings for all of X. In practice, to obtain Φ(X), we train a first "warm-up" model on a small fraction of the data, chosen based on the learning curve in the low-data regime, and do not reuse this data for the rest of our experiments. Moreover, this fraction corresponds to the training set size t for AFLITE, which remains unchanged across iterations. We follow the iterative filtering approach, starting with S = D and iteratively removing some instances with the highest predictability scores using the greedy slicing strategy. The slice size k and the number of partitions m are determined by the available computation budget.
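The greedy-slicing loop of Algorithm 1 can be sketched in NumPy as follows. This is an illustrative reimplementation, not the authors' code: a least-squares linear probe stands in for the linear classifier L, and all names and default values are our own:

```python
import numpy as np

def aflite(phi, y, n, m=8, t=None, k=20, tau=0.75, seed=0):
    """Greedy-slicing sketch of Algorithm 1 (AFLITE).

    phi : (N, d) array of pre-computed features Phi(X)
    y   : (N,) array of binary labels in {0, 1}
    n   : target dataset size; m : random partitions per filtering phase;
    t   : training set size (t < n); k : slice size; tau : early-stop threshold.
    """
    rng = np.random.default_rng(seed)
    keep = np.arange(len(y))                     # indices of the current set S
    t = t if t is not None else max(2, n // 2)
    while len(keep) > n:
        correct = np.zeros(len(keep))            # out-of-sample correct counts, E(i)
        seen = np.zeros(len(keep))               # times each i was held out
        for _ in range(m):
            perm = rng.permutation(len(keep))
            train, held = perm[:t], perm[t:]     # partition (S \ T_j, T_j)
            # Linear probe via least squares, standing in for the model family M.
            X = np.hstack([phi[keep[train]], np.ones((t, 1))])
            w, *_ = np.linalg.lstsq(X, 2.0 * y[keep[train]] - 1.0, rcond=None)
            Xh = np.hstack([phi[keep[held]], np.ones((len(held), 1))])
            preds = (Xh @ w > 0).astype(int)
            correct[held] += preds == y[keep[held]]
            seen[held] += 1
        # Predictability score p̄(i) = fraction of correct out-of-sample predictions.
        score = np.divide(correct, seen, out=np.zeros_like(correct), where=seen > 0)
        order = np.argsort(-score)               # most predictable instances first
        slice_idx = [j for j in order[:k] if score[j] >= tau]
        if not slice_idx:                        # every score below tau: stop early
            break
        keep = np.delete(keep, slice_idx)        # greedy slicing: drop the slice
        if len(slice_idx) < k:
            break
    return keep
```

On synthetic data where part of the instances carry a spurious label-correlated feature, this loop preferentially removes exactly those instances, mirroring the intended behavior of the filtering phase.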