# Dataset Cartography: Mapping and Diagnosing Datasets with Training Dynamics

## Authors

## Abstract

Large datasets have become commonplace in NLP research. However, the increased emphasis on data quantity has made it challenging to assess the quality of data. We introduce Data Maps, a model-based tool to characterize and diagnose datasets. We leverage a largely ignored source of information: the behavior of the model on individual instances during training (training dynamics) for building data maps. This yields two intuitive measures for each example, the model's confidence in the true class and the variability of this confidence across epochs, both obtained in a single run of training. Experiments across four datasets show that these model-dependent measures reveal three distinct regions in the data map, each with pronounced characteristics. First, our data maps show the presence of "ambiguous" regions with respect to the model, which contribute the most towards out-of-distribution generalization. Second, the most populous regions in the data are "easy to learn" for the model, and play an important role in model optimization. Finally, data maps uncover a region with instances that the model finds "hard to learn"; these often correspond to labeling errors. Our results indicate that a shift in focus from quantity to quality of data could lead to robust models and improved out-of-distribution generalization.

## 1 Introduction

The creation of large labeled datasets has fueled the advance of AI (Russakovsky et al., 2015; Antol et al., 2015) and NLP in particular (Bowman et al., 2015; Rajpurkar et al., 2016). The common belief is that the more abundant the labeled data, the higher the likelihood of learning diverse phenomena, which in turn leads to models that generalize well. In practice, however, out-of-distribution (OOD) generalization remains a challenge (Yogatama et al., 2019; Linzen, 2020); and, while recent large pretrained language models help, they fail to close this gap (Hendrycks et al., 2020). This urges a closer look at datasets, where not all examples might contribute equally towards learning (Vodrahalli et al., 2018). However, the scale of data can make this assessment challenging. How can we automatically characterize data instances with respect to their role in achieving good performance in- and out-of-distribution? Answering this question may take us a step closer to bridging the gap between dataset collection and broader task objectives.

\* Work done at the Allen Institute for AI.

Figure 1: Data map for the SNLI train set, based on a ROBERTA-large classifier. The top-left corner (low variability, high confidence) corresponds to easy-to-learn examples, the bottom-left corner (low variability, low confidence) corresponds to hard-to-learn examples, and examples on the right (with high variability) are ambiguous; all definitions are with respect to the ROBERTA-large model. The modal group in the data is formed by the easy-to-learn regions. For clarity we only plot 25K random samples from the SNLI train set. Fig. 8b in App. §C shows the same map in greater relief.

Drawing analogies from cartography, we propose to find coordinates for instances within the broader trends of a dataset. We introduce data maps: a model-based tool for contextualizing examples in a dataset. We construct coordinates for data maps by leveraging training dynamics: the behavior of a model as training progresses. We consider the mean and standard deviation of the gold label probabilities, predicted for each example across training epochs; these are referred to as confidence and variability, respectively (§2). Fig. 1 shows the data map for the SNLI dataset (Bowman et al., 2015) constructed using the ROBERTA-large model (Liu et al., 2019). The map reveals three distinct regions in the dataset: a region with instances whose true class probabilities fluctuate frequently during training (high variability), and are hence ambiguous for the model; a region with easy-to-learn instances that the model predicts correctly and consistently (high confidence, low variability); and a region with hard-to-learn instances with low confidence and low variability, many of which we find are mislabeled during annotation. Similar regions are observed across three other datasets: MultiNLI (Williams et al., 2018), WinoGrande (Sakaguchi et al., 2020) and SQuAD (Rajpurkar et al., 2016), with respect to their respective ROBERTA-large classifiers.

We further investigate the above regions by training models exclusively on examples from each region (§3). Training on ambiguous instances promotes generalization to OOD test sets, with little or no effect on in-distribution (ID) performance. Our data maps also reveal that datasets contain a majority of easy-to-learn instances, which are not as critical for ID or OOD performance, but without any such instances, training could fail to converge (§4). In §5, we show that hard-to-learn instances frequently correspond to labeling errors. Lastly, we discuss connections between our measures and uncertainty measures (§6).

Our findings indicate that data maps could serve as effective tools to diagnose large datasets, at the reasonable cost of training a model on them. Locating different regions within the data might pave the way for constructing higher quality datasets, and ultimately models that generalize better. Our code and higher resolution visualizations are publicly available.

## 2 Mapping Datasets With Training Dynamics

Our goal is to construct Data Maps for datasets to help visualize a dataset with respect to a model, as well as understand the contributions of different groups of instances towards that model's learning.

Intuitively, instances that a model always predicts correctly are different from those it almost never does, or those on which it vacillates. For building such maps, each instance in the dataset must be contextualized in the larger set. We consider one contextualization approach, based on statistics arising from the behavior of the training procedure across time, or the "training dynamics". We formally define our notation (§2.1) and describe our data maps (§2.2).

## 2.1 Training Dynamics

Consider a training dataset of size $N$, $\mathcal{D} = \{(x, y^*)_i\}_{i=1}^{N}$, where the $i$-th instance consists of the observation, $x_i$, and its true label under the task, $y_i^*$. Our method assumes a particular model (family) whose parameters are selected to minimize empirical risk using a particular algorithm. We assume the model defines a probability distribution over labels given an observation. We assume a stochastic gradient-based optimization procedure is used, with training instances randomly ordered at each epoch, across $E$ epochs.
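As a rough illustration of this setup, the sketch below trains a toy binary logistic model with per-instance stochastic gradient steps, reshuffling the training instances at each epoch, and records the probability assigned to the true label of every instance at the end of each epoch. The model, the toy data, the learning rate and the names `train_and_log`, `gold_prob` and `toy` are assumptions made for illustration only; the experiments in this paper use ROBERTA-large classifiers rather than anything this small.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def gold_prob(w, b, x, y):
    """Probability the binary logistic model assigns to the true label y (0 or 1)."""
    p1 = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
    return p1 if y == 1 else 1.0 - p1


def train_and_log(data, epochs=5, lr=0.5, seed=0):
    """Return E rows; row e holds the true-label probability of every
    training instance, measured at the end of epoch e."""
    rng = random.Random(seed)
    w, b = [0.0] * len(data[0][0]), 0.0
    log = []
    for _ in range(epochs):
        order = list(range(len(data)))
        rng.shuffle(order)  # instances are randomly ordered at each epoch
        for i in order:
            x, y = data[i]
            p1 = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)
            g = p1 - y  # gradient of the log loss with respect to the logit
            w = [wj - lr * g * xj for wj, xj in zip(w, x)]
            b -= lr * g
        log.append([gold_prob(w, b, x, y) for x, y in data])
    return log


# Hypothetical toy data: (observation, true label) pairs.
toy = [([1.0, 0.0], 1), ([0.9, 0.2], 1), ([0.0, 1.0], 0),
       ([0.1, 0.8], 0), ([1.0, 0.1], 0)]
dynamics = train_and_log(toy, epochs=4)
```

Each column of `dynamics` is the per-epoch trajectory of one instance, which is the raw material for the statistics defined in this section.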

The training dynamics of instance $i$ are defined as statistics calculated across the $E$ epochs. The values of these measures then serve as coordinates in our map. The first measure aims to capture how confidently the learner assigns the true label to the observation, based on its probability distribution. We define confidence as the mean model probability of the true label ($y_i^*$) across epochs:

$$\mu_i = \frac{1}{E} \sum_{e=1}^{E} p_{\theta^{(e)}}(y_i^* \mid x_i)$$

where $p_{\theta^{(e)}}$ denotes the model's probability with parameters $\theta^{(e)}$ at the end of the $e$-th epoch. In some cases we also consider a coarser, and perhaps more intuitive statistic, the fraction of times the model correctly labels $x_i$ across epochs, named correctness; this score only has $1 + E$ possible values. Intuitively, a high-confidence instance is "easier" for the given learner.

Figure 2: Data map for the WinoGrande (Sakaguchi et al., 2020) train set, based on a ROBERTA-large classifier, with the same axes as Fig. 1. Density plots for the three different measures based on training dynamics are shown towards the right. Hard-to-learn regions have lower density in WinoGrande, compared to SNLI, perhaps as a result of a rigorous validation of collected annotations. However, manual errors remain, which we showcase in Tab. 1 as well as in §5. The plot shows only 25K train examples for clarity, and is best viewed enlarged.

Lastly, we also consider variability, which measures the spread of $p_{\theta^{(e)}}(y_i^* \mid x_i)$ across epochs, using the standard deviation:

$$\sigma_i = \sqrt{\frac{\sum_{e=1}^{E} \left( p_{\theta^{(e)}}(y_i^* \mid x_i) - \mu_i \right)^2}{E}}$$

Note that variability also depends on the gold label, $y_i^*$. A given instance to which the model assigns the same label consistently (whether accurately or not) will have low variability; one which the model is indecisive about across training will have high variability.
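The three measures can be computed from the per-epoch true-label probabilities of a single instance, as in the minimal sketch below. The function name `training_dynamics`, the example trajectories, and the shortcut of counting a prediction as correct when the true-label probability exceeds 0.5 (valid only for a two-label task) are illustrative assumptions, not part of the method's definition.

```python
from statistics import fmean, pstdev


def training_dynamics(gold_probs, correct_flags):
    """gold_probs: E true-label probabilities for one instance, one per epoch.
    correct_flags: E booleans, True when the epoch's prediction equals the true label."""
    confidence = fmean(gold_probs)    # mean across epochs
    variability = pstdev(gold_probs)  # population standard deviation (divides by E)
    correctness = sum(correct_flags) / len(correct_flags)
    return confidence, variability, correctness


# Hypothetical trajectories over E = 4 epochs.
examples = {
    "consistently_right": [0.90, 0.95, 0.90, 0.85],
    "vacillating": [0.20, 0.80, 0.30, 0.70],
    "consistently_wrong": [0.10, 0.05, 0.10, 0.15],
}

stats = {}
for name, probs in examples.items():
    # Assumption: two-label task, so the true label is predicted iff its probability > 0.5.
    flags = [p > 0.5 for p in probs]
    stats[name] = training_dynamics(probs, flags)
```

In this toy input the vacillating trajectory has a confidence of 0.5 and a variability of about 0.255, while the other two trajectories share a variability of about 0.035 and differ only in confidence (0.9 versus 0.1).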

Finally, we observe that confidence and variability are fairly stable across different parameter initializations. Training dynamics can be computed at different granularities, such as steps vs. epochs; see App. §A.1.
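One way to check this kind of stability on one's own runs is to correlate the per-instance confidence values obtained from two different initializations. The sketch below uses the Pearson correlation for this purpose; the choice of Pearson correlation, the helper `pearson` and the two confidence vectors are assumptions for illustration and are not taken from the text above.

```python
import math


def pearson(a, b):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Hypothetical confidence values for the same five instances under two random seeds.
conf_seed_a = [0.95, 0.90, 0.15, 0.55, 0.80]
conf_seed_b = [0.93, 0.88, 0.20, 0.50, 0.85]

r = pearson(conf_seed_a, conf_seed_b)
```

A value of `r` close to 1 would indicate that instances keep roughly the same relative confidence across the two runs; the same check applies to variability.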

## 2.2 Data Maps

We construct data maps for four large datasets: WinoGrande (Sakaguchi et al., 2020), a cloze-style task for commonsense reasoning; two NLI datasets, SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018); and QNLI, a sentence-level question answering task derived from SQuAD (Rajpurkar et al., 2016). All data maps are built with models based on ROBERTA-large architectures. Details on the model and datasets can be found in App. §A.2 and §A.3.

Table 1 (excerpt): WinoGrande instances.

| Region | Instance | Option1 | Option2 |
|---|---|---|---|
| easy-to-learn | The man chose to buy the roses instead of the carnations because the _ were more beautiful. | Roses* | Carnations |
| | We enjoyed the meeting tonight but not the play as the _ was rather dull. | | |

Fig. 1 presents the data map for the SNLI dataset. As is evident, the data follows a bell-shaped curve with respect to confidence and variability; correctness further determines discrete regions therein. The vast majority of instances belong to the high confidence and low variability region of the map (Fig. 1, top-left). The model consistently predicts such instances correctly with high confidence; thus, we refer to them as easy-to-learn (for the model). A second, smaller group is formed by instances with low variability and low confidence (Fig. 1, bottom-left corner). Since such instances are seldom predicted correctly during training, we refer to them as hard-to-learn (for the model). The third notable group contains examples with high variability (Fig. 1, right-hand side); the model tends to be indecisive about these instances, such that they may or may not correspond to high confidence or correctness. We refer to such instances as ambiguous (to the model).

Fig. 2 shows the data map for WinoGrande, which exhibits high structural similarity to the SNLI data map (Fig. 1). The most remarkable difference between the maps is in the density of the hard-to-learn region, which is much lower for WinoGrande, as is evident from the histograms. One explanation for this might be that WinoGrande labels were rigorously validated post annotation. App. §C includes data maps for all four datasets, with respect to ROBERTA-large, in greater relief.
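The three regions described in this section can be made concrete with a simple rule on the (confidence, variability) coordinates, sketched below. The regions are characterized only qualitatively here, so the cutoffs `var_cut` and `conf_cut`, the function `assign_region` and the sample points are purely hypothetical choices for illustration, not values prescribed by the method.

```python
from collections import Counter


def assign_region(confidence, variability, var_cut=0.2, conf_cut=0.5):
    """Coarse region label for one instance; both cutoffs are illustrative assumptions."""
    if variability >= var_cut:
        return "ambiguous"      # right-hand side of the map
    if confidence >= conf_cut:
        return "easy-to-learn"  # top-left: low variability, high confidence
    return "hard-to-learn"      # bottom-left: low variability, low confidence


# Hypothetical (confidence, variability) coordinates for five instances.
points = [(0.95, 0.03), (0.90, 0.05), (0.50, 0.25), (0.10, 0.04), (0.85, 0.02)]
regions = [assign_region(c, v) for c, v in points]
counts = Counter(regions)
```

Counting labels in this way gives a quick sense of how populous each region is, for instance to compare the share of hard-to-learn instances across two datasets.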

Different model architectures trained on a given dataset could be effectively compared using data maps, as an alternative to standard quantitative evaluation methods. App. §C includes data maps for WinoGrande (Fig. 9b) and SNLI (Fig. 10 and Fig. 11) based on other (somewhat weaker) architectures. While data maps based on similar architectures have a similar appearance, the regions to which a given instance belongs might vary. Data maps for weaker architectures still display similar regions, but the regions are not as distinct as those in ROBERTA-based data maps.