Do NLP Models Know Numbers? Probing Numeracy in Embeddings

Authors

* Equal contribution; work done while interning at AI2.

Abstract

The ability to understand and work with numbers (numeracy) is critical for many complex reasoning tasks. Currently, most NLP models treat numbers in text in the same way as other tokens---they embed them as distributed vectors. Is this enough to capture numeracy? We begin by investigating the numerical reasoning capabilities of a state-of-the-art question answering model on the DROP dataset. We find this model excels on questions that require numerical reasoning, i.e., it already captures numeracy. To understand how this capability emerges, we probe token embedding methods (e.g., BERT, GloVe) on synthetic list maximum, number decoding, and addition tasks. A surprising degree of numeracy is naturally present in standard embeddings. For example, GloVe and word2vec accurately encode magnitude for numbers up to 1,000. Furthermore, character-level embeddings are even more precise---ELMo captures numeracy the best for all pre-trained methods---but BERT, which uses sub-word units, is less exact.

1 Introduction

Figure 1: We train a probing model to decode a number from its word embedding over a random 80% of the integers from [-500, 500], e.g., “71”→ 71.0. We plot the model’s predictions for all numbers from [-2000, 2000]. The model accurately decodes numbers within the training range (in blue), i.e., pre-trained embeddings like GloVe and BERT capture numeracy. However, the probe fails to extrapolate to larger numbers (in red). The Char-CNN (e) and Char-LSTM (f) are trained jointly with the probing model.

Neural NLP models have become the de-facto standard tool across language understanding tasks, even solving basic reading comprehension and textual entailment datasets (Yu et al., 2018; Devlin et al., 2019). Despite this, existing models are incapable of complex forms of reasoning; in particular, we focus on the ability to reason numerically. Recent datasets such as DROP (Dua et al., 2019), EQUATE, or Mathematics Questions (Saxton et al., 2019) test numerical reasoning; they contain examples which require comparing, sorting, and adding numbers in natural language (e.g., Figure 2).

Figure 2: Three DROP questions that require numerical reasoning; the state-of-the-art NAQANet answers every question correctly. Plausible answer candidates to the questions are underlined and the model's predictions are shown in bold.

The first step in performing numerical reasoning over natural language is numeracy: the ability to understand and work with numbers in either digit or word form (Spithourakis and Riedel, 2018). For example, one must understand that the string "23" represents a bigger value than "twenty-two". Once a number's value is (perhaps implicitly) represented, reasoning algorithms can then process the text, e.g., extracting the list of field goals and computing that list's maximum (first question in Figure 2). Learning to reason numerically over paragraphs with only question-answer supervision appears daunting for end-to-end models; our work seeks to understand if and how "out-of-the-box" neural NLP models already learn this.

We begin by analyzing the state-of-the-art NAQANet model for DROP, testing it on questions that require numerical reasoning (Section 2). To our surprise, the model exhibits excellent numerical reasoning abilities. Amidst reading and comprehending natural language, the model successfully computes list maximums/minimums, extracts superlative entities (argmax reasoning), and compares numerical quantities. For instance, despite NAQANet achieving only 49 F1 on the entire validation set, it scores 89 F1 on numerical comparison questions. We also stress test the model by perturbing the validation paragraphs and find one failure mode: the model struggles to extrapolate to numbers outside its training range.

We are especially intrigued by the model's ability to learn numeracy, i.e., how does the model know the value of a number given its embedding? The model uses standard embeddings (GloVe and a Char-CNN) and receives no direct supervision for number magnitude/ordering. To understand how numeracy emerges, we probe token embedding methods (e.g., BERT, GloVe) using synthetic list maximum, number decoding, and addition tasks (Section 3).

We find that all widely-used pre-trained embeddings, e.g., ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and GloVe (Pennington et al., 2014), capture numeracy: number magnitude is present in the embeddings, even for numbers in the thousands. Among all embeddings, character-level methods exhibit stronger numeracy than word- and sub-word-level methods (e.g., ELMo excels while BERT struggles), and character-level models learned directly on the synthetic tasks are the strongest overall. Finally, we investigate why NAQANet had trouble extrapolating: was it a failure in the model or the embeddings? We repeat our probing tasks and test for model extrapolation, finding that neural models struggle to predict numbers outside the training range.

2 Numeracy Case Study: DROP QA

This section examines the state-of-the-art model for DROP by investigating its accuracy on questions that require numerical reasoning.

2.1 DROP Dataset

DROP is a reading comprehension dataset that tests numerical reasoning operations such as counting, sorting, and addition (Dua et al., 2019). The dataset's input-output format is a superset of SQuAD (Rajpurkar et al., 2016): the answers are paragraph spans, as well as question spans, number answers (e.g., 35), and dates (e.g., 03/01/2014). The only supervision provided is the question-answer pairs, i.e., a model must learn to reason numerically while simultaneously learning to read and comprehend.

Figure 2 (passage excerpt): ". . . JaMarcus Russell completed a 91-yard touchdown pass to rookie wide receiver Chaz Schilens. The Texans would respond with fullback Vonta Leach getting a 1-yard touchdown run, yet the Raiders would answer with kicker Sebastian Janikowski getting a 33-yard and a 21-yard field goal. Houston would tie the game in the second quarter with kicker Kris Brown getting a 53-yard and a 24-yard field goal. Oakland would take the lead in the third quarter with wide receiver Johnnie Lee Higgins catching a 29-yard touchdown pass from Russell, followed up by an 80-yard punt return for a touchdown." Q: How many yards was the longest field goal? A: 53. Q: How long was the shortest touchdown pass? A: 29-yard. Q: Who caught the longest touchdown? A: Chaz Schilens.

2.2 NAQANet Model

Modeling approaches for DROP include both semantic parsing (Krishnamurthy et al., 2017) and reading comprehension (Yu et al., 2018) models. We focus on the latter, specifically on Numerically-augmented QANet (NAQANet), the current state-of-the-art model as of May 21st, 2019 (Dua et al., 2019). The model's core structure closely follows QANet (Yu et al., 2018) except that it contains four output branches, one for each of the four answer types (passage span, question span, count answer, or addition/subtraction of numbers).

Words and numbers are represented as the concatenation of GloVe embeddings and the output of a character-level CNN. The model contains no auxiliary components for representing number magnitude or performing explicit comparisons. We refer readers to Yu et al. (2018) and Dua et al. (2019) for further details.

2.3 Comparative And Superlative Questions

We focus on questions that require numeracy for NAQANet to answer, namely Comparative and Superlative questions (DROP addition, subtraction, and count questions do not require numeracy for NAQANet; see Appendix A). Comparative questions probe a model's understanding of quantities or events that are "larger", "smaller", or "longer" than others. Certain comparative questions ask about "either-or" relations (e.g., first row of Table 1), which test binary comparison. Other comparative questions require more diverse comparative reasoning, such as greater than relationships (e.g., second row of Table 1). Superlative questions ask about the "shortest", "largest", or "biggest" quantity in a passage. When the answer type is a number, superlative questions require finding the maximum or minimum of a list (e.g., third row of Table 1). When the answer type is a span, superlative questions usually require an argmax operation, i.e., one must find the superlative action or quantity and then extract the associated entity (e.g., fourth row of Table 1). We filter the validation set to comparative and superlative questions by writing templates to match words in the question.
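To make the filtering step concrete, the sketch below shows one way such question templates could look. The keyword lists and regular expressions are illustrative assumptions, not the paper's exact templates.

```python
import re

# Hypothetical keyword templates; the paper's exact patterns are not given here.
COMPARATIVE = re.compile(r"\b(longer|shorter|larger|smaller|bigger|more|fewer|higher|lower)\b", re.I)
SUPERLATIVE = re.compile(r"\b(longest|shortest|largest|smallest|biggest|most|least|first|last)\b", re.I)

def question_type(question: str) -> str:
    """Bucket a DROP validation question by surface cues in its wording."""
    if SUPERLATIVE.search(question):
        return "superlative"
    if COMPARATIVE.search(question):
        return "comparative"
    return "other"

print(question_type("How many yards was the longest field goal?"))        # superlative
print(question_type("Which team scored more points in the first half?"))  # comparative
```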

Table 1: We focus on DROP Comparative and Superlative questions which test NAQANet’s numeracy.

2.4 Emergent Numeracy In NAQANet

NAQANet's accuracy on comparative and superlative questions is significantly higher than its average accuracy on the validation set (Table 2; a public NAQANet demo is available at https://demo.allennlp.org/reading-comprehension). NAQANet achieves 89.0 F1 on binary (either-or) comparative questions, approximately 40 F1 points higher than the average validation question and within 7 F1 points of human test performance. The model achieves a lower, but respectable, accuracy on non-binary comparisons. These questions require multiple reasoning steps, e.g., the second question in Table 1 requires (1) extracting all the touchdown distances, (2) finding the distance that is greater than twenty, and (3) selecting the player associated with the touchdown of that distance.

Table 2: NAQANet achieves higher accuracy on questions that require numerical reasoning (Superlative and Comparative) than on standard validation questions. Human performance is reported from Dua et al. (2019).

We divide the superlative questions into questions that have number answers and questions with span answers according to the dataset's provided answer type. NAQANet achieves nearly 70 F1 on superlative questions with number answers, i.e., it can compute list maximums and minimums. The model answers about two-thirds of superlative questions with span answers correctly (66.3 F1), i.e., it can perform argmax reasoning. Figure 2 shows examples of superlative questions answered correctly by NAQANet. The first two questions require computing the maximum/minimum of a list: the model must recognize which digits correspond to field goals and touchdown passes, and then extract the maximum/minimum of the correct list. The third question requires argmax reasoning: the model must first compute the longest touchdown pass and then find the corresponding receiver "Chaz Schilens".

2.5 Stress Testing NAQANet's Numeracy

Just how far does the numeracy of NAQANet go? Here, we stress test the model by automatically modifying DROP validation paragraphs.

We test two phenomena: larger numbers and word-form numbers. For larger numbers, we generate a random positive integer and multiply or add that value to the numbers in each paragraph. For word forms, we replace every digit in the paragraph with its word form (e.g., "75" → "seventy-five"). Since word-form numbers are usually small in magnitude when they occur in DROP, we perform word replacements for integers in the range [0, 100]. We guarantee the ground-truth answer is still valid by only modifying NAQANet's internal representation (Appendix E). Table 3 shows the results for different paragraph modifications. The model exhibits a tiny degradation in performance for small magnitude changes (e.g., NAQANet drops 1.5 F1 overall for Add [1, 20]) but severely struggles on larger changes (e.g., NAQANet drops 35.7 F1 on superlative questions for Multiply [11, 200]). Similar trends hold for word forms: the model exhibits small drops in accuracy when converting small numbers to words (3.9 F1 degradation on Digits to Words [0, 20]) but fails on larger magnitude word forms (21.6 F1 drop over [21, 100]). These results show that NAQANet has a strong understanding of numeracy for numbers in the training range, but the model can fail to extrapolate to other values.
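A minimal sketch of the Add/Multiply perturbation is shown below. It operates on the passage text for readability; as noted above, the actual stress test modifies NAQANet's internal number representations (Appendix E) so that the annotated answers remain valid, and the word-form conversion is omitted here.

```python
import random
import re

def perturb_numbers(passage: str, mode: str, lo: int, hi: int, seed: int = 0) -> str:
    """Add or multiply every integer in a passage by one random constant drawn from [lo, hi]."""
    rng = random.Random(seed)
    k = rng.randint(lo, hi)  # a single constant per passage, as in the stress test

    def replace(match: re.Match) -> str:
        n = int(match.group(0))
        return str(n + k) if mode == "add" else str(n * k)

    return re.sub(r"\d+", replace, passage)

passage = "Kris Brown kicked a 53-yard and a 24-yard field goal."
print(perturb_numbers(passage, mode="multiply", lo=11, hi=200))
```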

Table 3: We stress test NAQANet’s numeracy by manipulating the numbers in the validation paragraphs. Add or Multiply [x, y] indicates adding or multiplying all of the numbers in the passage by a random integer in the range [x, y]. Digits→ Words [x, y] converts all integers in the passage within the range [x, y] to their corresponding word form (e.g., “75”→ “seventy-five”).

2.6 Whence This Behavior?

NAQANet exhibits numerical reasoning capabilities that exceed our expectations. What enables this behavior? Aside from reading and comprehending the passage/question, this kind of numerical reasoning requires two components: numeracy (i.e., representing numbers) and comparison algorithms (i.e., computing the maximum of a list).

Although the natural emergence of comparison algorithms is surprising, previous results show neural models are capable of learning to count and sort synthetic lists of scalar values when given explicit supervision (Weiss et al., 2018; Vinyals et al., 2016) . NAQANet demonstrates that a model can learn comparison algorithms while simultaneously learning to read and comprehend, even with only question-answer supervision.

How, then, does NAQANet know numeracy? The source of numerical information eventually lies in the token embeddings themselves, i.e., the character-level convolutions and GloVe embeddings of the NAQANet model. Therefore, we can understand the source of numeracy by isolating and probing these embeddings.

3 Probing Numeracy Of Embeddings

We use synthetic numerical tasks to probe the numeracy of token embeddings.

3.1 Probing Tasks

We consider three synthetic tasks to evaluate numeracy ( Figure 3 ). Appendix C provides further details on training and evaluation.

Figure 3: Our probing setup. We pass numbers through a pre-trained embedder (e.g., BERT, GloVe) and train a probing model to solve numerical tasks such as finding a list’s maximum, decoding a number, or adding two numbers. If the probing model generalizes to held-out numbers, the pre-trained embeddings must contain numerical information. We provide numbers as either words (shown here), digits (“9”), floats (“9.1”), or negatives (“-9”).

List Maximum Given a list of the embeddings for five numbers, the task is to predict the index of the maximum number. Each list consists of values of similar magnitude in order to evaluate fine-grained comparisons (see Appendix C). As in typical span selection models (Seo et al., 2017) , an LSTM reads the list of token embeddings, and a weight matrix and softmax function assign a probability to each index using the model's hidden state. We use the negative log-likelihood of the maximum number as the loss function.
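A minimal PyTorch sketch of this probe is given below; the hidden size and bidirectionality are illustrative choices rather than the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class ListMaxProbe(nn.Module):
    """LSTM probe that scores each of the five list positions and picks the argmax."""
    def __init__(self, embed_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.scorer = nn.Linear(2 * hidden_dim, 1)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        # embeddings: (batch, 5, embed_dim) -- frozen pre-trained number vectors
        hidden, _ = self.lstm(embeddings)           # (batch, 5, 2 * hidden_dim)
        return self.scorer(hidden).squeeze(-1)      # (batch, 5) logits over list indices

probe = ListMaxProbe(embed_dim=300)
batch = torch.randn(8, 5, 300)                       # stand-in for number embeddings
targets = torch.randint(0, 5, (8,))                  # index of the maximum number in each list
loss = nn.CrossEntropyLoss()(probe(batch), targets)  # negative log-likelihood of the maximum
loss.backward()
```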

Decoding The decoding task probes whether number magnitude is captured (rather than the relative ordering of numbers as in list maximum). Given a number's embedding, the task is to regress to its value, e.g., the embedding for the string "five" has a target of 5.0. We consider a linear regression model and a three-layer fully-connected network with ReLU activations. The models are trained using a mean squared error (MSE) loss.

Addition The addition task requires number manipulation: given the embeddings of two numbers, the task is to predict their sum. Our model concatenates the two token embeddings and feeds the result through a three-layer fully-connected network with ReLU activations, trained using MSE loss. Unlike the decoding task, the model needs to capture number magnitude internally without direct label supervision.
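The decoding and addition probes share the same three-layer fully-connected regressor; a sketch follows, with hidden sizes chosen for illustration.

```python
import torch
import torch.nn as nn

def mlp_probe(in_dim: int, hidden: int = 100) -> nn.Sequential:
    """Three-layer fully connected network with ReLU activations (sizes are illustrative)."""
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, 1),
    )

embed_dim = 300
decoder = mlp_probe(embed_dim)       # decoding: one number embedding -> its value
adder = mlp_probe(2 * embed_dim)     # addition: two concatenated embeddings -> their sum

e1, e2 = torch.randn(16, embed_dim), torch.randn(16, embed_dim)
targets = torch.randn(16, 1)         # stand-in targets; the real targets are the numbers' sums
loss = nn.MSELoss()(adder(torch.cat([e1, e2], dim=-1)), targets)
loss.backward()
```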

Training And Evaluation

We focus on a numerical interpolation setting (we revisit extrapolation in Section 3.4): the model is tested on values that are within the training range. We first pick a range (we vary the range in our experiments) and randomly shuffle the integers over it. We then split 80% of the numbers into a training set and 20% into a test set. We report the mean and standard deviation across five different random shuffles for a particular range, using the exact same shuffles across all embedding methods. Numbers are provided as integers ("75"), single-word form ("seventy-five"), floats ("75.1"), or negatives ("-75"). We consider positive numbers less than 100 for word-form numbers to avoid multiple tokens. We report the classification accuracy for the list maximum task (5 classes), and the Root Mean Squared Error (RMSE) for decoding and addition. Note that larger ranges will naturally amplify the RMSE error.
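The split-and-evaluate procedure can be summarized in a few lines; the sketch below assumes integer inputs and shows how the same five shuffles would be reused for every embedding method.

```python
import math
import random

def interpolation_split(lo: int, hi: int, seed: int, train_frac: float = 0.8):
    """Shuffle the integers in [lo, hi] and split them 80/20 into train/test sets."""
    rng = random.Random(seed)
    numbers = list(range(lo, hi + 1))
    rng.shuffle(numbers)
    cut = int(train_frac * len(numbers))
    return numbers[:cut], numbers[cut:]

def rmse(predictions, targets):
    """Root mean squared error, used for the decoding and addition tasks."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets))

# Five shuffles per range; identical seeds (and hence splits) for every embedding method.
for seed in range(5):
    train, test = interpolation_split(-500, 500, seed)
    assert not set(train) & set(test)
```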

3.2 Embedding Methods

We evaluate various token embedding methods.

Word Vectors We use 300-dimensional GloVe (Pennington et al., 2014) and word2vec vectors (Mikolov et al., 2018). We ensure all values are in-vocabulary for word vectors.

Contextualized Embeddings We use ELMo (Peters et al., 2018) and BERT (Devlin et al., 2019) embeddings. ELMo uses character-level convolutions of size 1-7 with max pooling; BERT represents tokens via sub-word pieces, and we use lowercased BERT-base with 30k pieces. Since our inputs are numbers rather than natural sentences, these language models may exhibit strange behavior; we experimented with extracting the context-independent feature vector immediately following ELMo's character convolutions but found little difference in results.

NAQANet Embeddings

We extract the GloVe embeddings and Char-CNN from the NAQANet model trained on DROP. We also consider an ablation that removes the GloVe embeddings.

Learned Embeddings We use a character-level CNN (Char-CNN) and a character-level LSTM (Char-LSTM). We use left character padding, which greatly improves numeracy for character-level CNNs (details in Appendix B); padding on both the left and right via SAME convolutions also mitigates this issue.
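For concreteness, here is a sketch of a character-level CNN embedder with left padding; the character vocabulary, filter width, and layer sizes are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CharCNNEmbedder(nn.Module):
    """Character-level CNN embedder with left padding (an illustrative sketch)."""
    VOCAB = "0123456789-. abcdefghijklmnopqrstuvwxyz"  # assumed character set

    def __init__(self, char_dim: int = 16, num_filters: int = 100, width: int = 3, max_len: int = 10):
        super().__init__()
        self.max_len = max_len
        self.char_embed = nn.Embedding(len(self.VOCAB) + 1, char_dim, padding_idx=0)  # 0 = pad
        self.conv = nn.Conv1d(char_dim, num_filters, kernel_size=width)

    def encode(self, token: str) -> torch.Tensor:
        ids = [self.VOCAB.index(c) + 1 for c in token.lower()[-self.max_len:]]
        ids = [0] * (self.max_len - len(ids)) + ids          # pad on the LEFT
        return torch.tensor(ids)

    def forward(self, tokens):
        ids = torch.stack([self.encode(t) for t in tokens])  # (batch, max_len)
        chars = self.char_embed(ids).transpose(1, 2)         # (batch, char_dim, max_len)
        return torch.relu(self.conv(chars)).max(dim=-1).values  # max pooling over positions

embedder = CharCNNEmbedder()
vectors = embedder(["718", "72"])  # left padding right-aligns the digits across tokens
print(vectors.shape)               # torch.Size([2, 100])
```

One intuition (our own, not stated in the paper) for why left padding helps is that it right-aligns numbers, so digits with the same place value land in the same columns seen by the convolution filters.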

Untrained Embeddings

We consider two untrained baselines. The first baseline is random token vectors, which trivially fail to generalize (there is no pattern between train and test numbers). These embeddings are useful for measuring the improvement of pre-trained embeddings. We also consider a randomly initialized and untrained Char-CNN and Char-LSTM.

Number's Value As Embedding

The final embedding method is simple: map a number's embedding directly to its value (e.g., "seventy-five" embeds to [75]). We found this strategy performs poorly for large ranges; using a base-10 logarithmic scale improves performance. We report this as Value Embedding in our results.

All pre-trained embeddings (all methods except the Char-CNN and Char-LSTM) are fixed during training. The probing models are trained on the synthetic tasks on top of these embeddings.
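A sketch of the Value Embedding baseline is shown below; the signed-log transform for handling negative numbers is an assumption on our part, as the paper specifies only a base-10 logarithmic scale.

```python
import math

def value_embedding(token: str) -> list:
    """One-dimensional "embedding" that is just the number's value on a signed log10 scale."""
    value = float(token)
    return [math.copysign(math.log10(abs(value) + 1.0), value)]

print(value_embedding("75"))     # [1.8808...]
print(value_embedding("-2000"))  # [-3.3012...]
```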

3.3 Results: Embeddings Capture Numeracy

We find that all pre-trained embeddings contain fine-grained information about number magnitude and order. We first focus on integers (Table 4) .

Table 4: Interpolation with integers (e.g., “18”). All pre-trained embedding methods (e.g., GloVe and ELMo) surprisingly capture numeracy. The probing model is trained on a randomly shuffled 80% of the Integer Range and tested on the remaining 20%. The probing model architecture and train/test splits are equivalent across all embeddings. We show the mean over 5 random shuffles (standard deviation in Appendix D).

Word Vectors Succeed Both word2vec and GloVe significantly outperform the random vector baseline and are among the strongest methods overall. This is particularly surprising given the training methodology for these embeddings, e.g., that a continuous bag of words objective can teach fine-grained number magnitude.

Character-level Methods Dominate Models which use character-level information have a clear advantage over word-level models for encoding numbers. This is reflected in our probing results: character-level CNNs are the best architecture for capturing numeracy. For example, the NAQANet model without GloVe (only using its Char-CNN) and ELMo (which uses a Char-CNN) are the strongest pre-trained methods, and a learned Char-CNN is the strongest method overall. The strength of the character-level convolutions seems to lie in the architectural prior: an untrained Char-CNN is surprisingly competitive. Similar results have been shown for images (Saxe et al., 2011): random CNNs are powerful feature extractors.

Sub-word Models Struggle BERT struggles for large ranges (e.g., 52% accuracy on list maximum for the range [0, 9999]). We suspect this results from sub-word pieces being a poor method to encode digits: two numbers which are similar in value can have very different sub-word divisions.
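This is easy to inspect directly with a sub-word tokenizer. The snippet below, which assumes the HuggingFace transformers package and a downloaded lowercased BERT-base vocabulary, prints the word-piece segmentation of a few nearby numbers.

```python
# Assumes the third-party HuggingFace `transformers` package is installed.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
for number in ["1692", "1693", "2000"]:
    # Numbers that are close in value can be segmented into very different word pieces.
    print(number, tokenizer.tokenize(number))
```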

A Linear Subspace Exists For small ranges on the decoding task (e.g., [0, 99]), a linear model is competitive, i.e., a linear subspace captures number magnitude (Appendix D). For larger ranges (e.g., [0, 999]), the linear model's performance degrades, especially for BERT.

Value Embedding Fails The Value Embedding method fails for large ranges. This is surprising as the embedding directly provides a number's value; thus, the synthetic tasks should be easy to solve. We suspect the failures result from the raw values being too high in magnitude and/or variance for the model; we also experimented with normalizing the values to mean 0 and variance 1, but the logarithmic scale performed better.

However, we had difficulty training models for large ranges, even when using numerous architecture variants (e.g., tiny networks with 10 hidden units and tanh activations) and hyperparameters. Trask et al. (2018) discuss similar problems and ameliorate them using new neural architectures.

Words, Floats, and Negatives are Captured Finally, we probe the embeddings on word-form numbers, floats, and negatives. We observe similar trends for these inputs as integers: pre-trained models exhibit natural numeracy and learned embeddings are strong (Tables 5, 6, and 10). The ordering of the different embedding methods according to performance is also relatively consistent across the different input types. One notable exception is that BERT struggles on floats, which is likely a result of its sub-word pieces. We do not test word2vec and GloVe on floats/negatives because they are out-of-vocabulary.

3.4 Probing Models Struggle To Extrapolate

Thus far, our synthetic experiments evaluate on held-out values within the same range as the training data (i.e., numerical interpolation). In Section 2.5, we found that NAQANet struggles to extrapolate to values outside the training range. Is this an idiosyncrasy of NAQANet or is it a more general problem? We investigate this using a numerical extrapolation setting: we train models on a specific integer range and test them on values greater than the largest training number and smaller than the smallest training number.

Extrapolation for Decoding and Addition For decoding and addition, models struggle to extrapolate. Figure 1 shows the predictions for models trained on 80% of the values from [-500,500] and tested on held-out numbers in the range [-2000, 2000] for six embedding types. The embedding methods fail to extrapolate in different ways, e.g., predictions using word2vec decrease almost monotonically as the input increases, while predictions using BERT are usually near the highest training value. Trask et al. (2018) also observe that models struggle outside the training range; they attribute this to failures in neural models themselves.

Extrapolation for List Maximum For the list maximum task, accuracies are closer to those in the interpolation setting; however, they still fall short.

Table 5: Interpolation with floats (e.g., “18.1”) for list maximum. Pre-trained embeddings capture numeracy for float values. The probing model is trained on a randomly shuffled 80% of the Float Range and tested on the remaining 20%. See the text for details on selecting decimal values. We show the mean alongside the standard deviation over 5 different random shuffles.

Table 7 shows the list maximum accuracies for a probing model trained on the integer range [0, 150] and tested on the ranges [151, 160], [151, 180], and [151, 200]; all methods struggle, especially the token vectors.

Table 6: Interpolation with negatives (e.g., “-18”) on list maximum. Pre-trained embeddings capture numeracy for negative values.
Table 7: Extrapolation on list maximum. The probing model is trained on the integer range [0,150] and evaluated on integers from the Test Range. The probing model struggles to extrapolate when trained on the pre-trained embeddings.
Table 8: Number Decoding interpolation accuracy with linear regression. Linear regression is competitive to the fully connected probe for smaller numbers.
Table 9: Mean and standard deviation for Table 4 (interpolation tasks with integers).
Table 10: Interpolation task accuracies with word form (e.g., “twenty-five”). The model is trained on a randomly shuffled 80% of the Integer Range and tested on the remaining 20%. We show the mean and standard deviation for five random shuffles.

Augmenting Data to Aid Extrapolation Of course, in many real-world tasks it is possible to ameliorate these extrapolation failures by augmenting the training data (i.e., turn extrapolation into interpolation). Here, we apply this idea to aid in training NAQANet for DROP. For each superlative and comparative example, we duplicate the example and modify the numbers in its paragraph using the Add and Multiply techniques described in Section 2.5. Table 11 shows that this data augmentation can improve both interpolation and extrapolation, e.g., the accuracy on superlative questions with large numbers can double.

Table 11: Data augmentation improves NAQANet’s interpolation and extrapolation results. We created the Bigger version of DROP by multiplying numbers in the passage by a random integer from [11, 20] and then adding a random integer from [21, 40]. Scores are shown in EM / F1 format.

4 Discussion And Related Work

An open question is how the training process elicits numeracy for word vectors and contextualized embeddings. Understanding this, perhaps by tracing numeracy back to the training data, is a fruitful direction to explore further (cf. influence functions (Koh and Liang, 2017; Brunet et al., 2019)). More generally, numeracy is one type of emergent knowledge. For instance, embeddings may capture the size of objects (Forbes and Choi, 2017), speed of vehicles, and many other "commonsense" phenomena (Yang et al., 2018). Vendrov et al. (2016) introduce methods to encode the order of such phenomena into embeddings for concepts such as hypernymy; our work and Yang et al. (2018) show that a relative ordering naturally emerges for certain concepts.

Concurrent work also explores numeracy in word vectors. That work's methodology is based on variants of nearest neighbors and cosine distance, whereas we use neural network probing classifiers which can capture highly non-linear dependencies between embeddings. We also explore more powerful embedding methods such as ELMo, BERT, and learned embedding methods.

Probing Models Our probes of numeracy parallel work in understanding the linguistic capabilities (literacy) of neural models (Conneau et al., 2018; Liu et al., 2019). LSTMs can remember sentence length, word order, and which words were present in a sentence (Adi et al., 2017). Khandelwal et al. (2018) show how language models leverage context, while Linzen et al. (2016) demonstrate that language models understand subject-verb agreement.

Numerical Value Prediction

Spithourakis and Riedel (2018) improve the ability of language models to predict numbers, i.e., they go beyond categorical predictions over a fixed-size vocabulary. They focus on improving models; our focus is probing embeddings. Kotnis and García-Durán (2019) predict numerical attributes in knowledge bases, e.g., they develop models that try to predict the population of Paris.

Synthetic Numerical Tasks Similar to our synthetic numerical reasoning tasks, other work considers sorting (Graves et al., 2014) , counting (Weiss et al., 2018) , or decoding tasks (Trask et al., 2018) . They use synthetic tasks as a testbed to prove or design better models, whereas we use synthetic tasks as a probe to understand token embeddings. In developing the Neural Arithmetic Logic Unit, Trask et al. (2018) arrive at similar conclusions regarding extrapolation: neural models have difficulty outputting numerical values outside the training range.

5 Conclusion

How much do NLP models know about numbers? By digging into a surprisingly successful model on a numerical reasoning dataset (DROP), we discover that pre-trained token representations naturally encode numeracy.

We analyze the limits of this numeracy, finding that CNNs are a particularly good prior (and likely the cause of ELMo's superior numeracy compared to BERT) and that it is difficult for neural models to extrapolate beyond the values seen during training. There are still many fruitful areas for future research, including discovering why numeracy naturally emerges in embeddings, and what other properties are similarly emergent.

C Training Details For Probing

We create training/test splits for the addition task in the following manner. We first shuffle and split an integer range, putting 80% into train and 20% into test. Next, we enumerate all possible pairs of numbers in the two respective splits. When using large ranges such as [0, 999], we sub-sample a random 10% of the training and test pairs.

For the list maximum task, we first shuffle and split the data, putting 80% into a training pool of numbers and 20% into a test pool. In initial experiments, we created the lists of five numbers by sampling uniformly over the training/test pool. However, as the random samples will likely be spread out over the range, the numbers are easy to distinguish. We instead create 100,000 training examples and 10,000 test examples in the following manner. We first sample a random integer from the training or test pool. Next, we sample from a Gaussian with mean zero and variance equal to 0.01 times the total size of the range. Finally, we add the random Gaussian sample to the random integer, and round to the nearest value in the pool. This forces the numbers to be nearby. Table 9 shows the mean and standard deviation for the synthetic tasks using five random shuffles.
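A rough sketch of this nearby-list sampling follows. Reading "variance equal to 0.01 times the total size of the range" literally gives the standard deviation used below, and details such as the handling of duplicate values are our assumptions.

```python
import random

def sample_nearby_list(pool, range_size, rng, list_len=5):
    """Sample `list_len` numbers of similar magnitude from a train or test pool."""
    std = (0.01 * range_size) ** 0.5        # variance = 0.01 * range size (assumed reading)
    base = rng.choice(pool)
    values = []
    for _ in range(list_len):
        noisy = base + rng.gauss(0.0, std)
        # Round back to the nearest value that actually appears in the pool.
        values.append(min(pool, key=lambda x: abs(x - noisy)))
    return values

rng = random.Random(0)
train_pool = list(range(0, 1000))
print(sample_nearby_list(train_pool, range_size=1000, rng=rng))
```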

D.3 Float Values

We test floats with one decimal place. We follow the setup for the list maximum task (Appendix C) with a minor modification. For 50% of the training/test lists, we reuse the same integer five times but sample a different random value to put after the decimal point. For example, 50% of the lists are of the form [15.3, 15.6, 15.1, 15.8, 15.2] (the same base integer repeated with different decimals), and 50% are random integers with a random digit after the decimal: [11.7, 16.4, 9.3, 7.9, 13.3]. This forces the model to consider the numbers on both the left and the right of the decimal.

Table 10 presents the results using word-form numbers. We do not use numbers larger than 100 as they consist of multiple words.
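The float-list construction can be sketched as follows; the 50/50 split mirrors the description above, while the exact sampling of decimal digits is an assumption.

```python
import random

def float_list(int_pool, rng, list_len=5):
    """Build a five-number list of one-decimal floats for the list maximum task."""
    if rng.random() < 0.5:
        # Same base integer repeated, with distinct digits after the decimal point.
        base = rng.choice(int_pool)
        return [base + d / 10 for d in rng.sample(range(10), list_len)]
    # Otherwise: distinct base integers, each with a random decimal digit.
    return [b + rng.randrange(10) / 10 for b in rng.sample(int_pool, list_len)]

rng = random.Random(1)
print(float_list(list(range(0, 100)), rng))
```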

E Automatically Modifying DROP Paragraphs

Modifying a DROP paragraph automatically is challenging as it may change the answer to the question if done incorrectly. We also cannot modify the answer because many DROP questions have count answers. To guarantee that the original annotated answer can still be used, we perform the number transformation in the following manner. We keep the original passage text the same, but we modify the model's internal embeddings for the numbers directly. In other words, the model uses exactly the same embeddings for the original text except for the modified numbers. The model then needs to find the correct index (e.g., the index of the correct span, or the index of the correct number) given these modified embeddings.

F Extrapolation With Data Augmentation

For each superlative and comparative example, we modify the numbers in its paragraph using the Add and Multiply techniques mentioned in Section 2.5. We first multiply the paragraph's numbers by a random integer from [1, 10], and then add another random integer from [0, 20]. We train NAQANet on the original paragraph and an additional modified version for all training examples. We use a single additional paragraph for computational efficiency; augmenting the data with more modified paragraphs may further improve results. We test NAQANet on the original validation set, as well as a Bigger validation set. We created the Bigger validation set by multiplying each paragraph's numbers by a random integer from [11, 20] and then adding a random value from [21, 40]. Note that this range is larger than the one used for data augmentation. Table 11 shows the results of NAQANet trained with data augmentation. Data augmentation provides small gains on the original superlative and comparative question subset, and significant improvements on the Bigger version (it doubles the model's F1 score for superlative questions).
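A sketch of the augmentation loop is given below. It rewrites numbers in the passage text for clarity; as with the stress tests, the actual implementation modifies the model's internal number embeddings (Appendix E), and the example-pairing structure here is an illustrative assumption.

```python
import random
import re

def augment(passage: str, rng: random.Random) -> str:
    """Multiply every number by a random integer in [1, 10], then add one in [0, 20]."""
    mult, add = rng.randint(1, 10), rng.randint(0, 20)
    return re.sub(r"\d+", lambda m: str(int(m.group(0)) * mult + add), passage)

rng = random.Random(0)
train_examples = [{"passage": "a 53-yard and a 24-yard field goal", "question": "...", "answer": "..."}]

augmented = []
for example in train_examples:
    augmented.append(example)                           # keep the original example
    copy = dict(example)
    copy["passage"] = augment(example["passage"], rng)  # plus one modified version
    augmented.append(copy)

print(augmented[1]["passage"])
```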
