## Abstract

With models reaching human performance on many popular reading comprehension datasets in recent years, a new dataset, DROP, introduced questions that were expected to present a harder challenge for reading comprehension models. Among these new types of questions were "multi-span questions", questions whose answers consist of several spans from either the paragraph or the question itself. Until now, only one model attempted to tackle multi-span questions as a part of its design. In this work, we suggest a new approach for tackling multi-span questions, based on sequence tagging, which differs from previous approaches for answering span questions. We show that our approach leads to an absolute improvement of 29.7 EM and 15.1 F1 compared to existing state-of-the-art results, while not hurting performance on other question types. Furthermore, we show that our model slightly eclipses the current state-of-the-art results on the entire DROP dataset.

## 1. Introduction

The task of reading comprehension, where systems must understand a single passage of text well enough to answer arbitrary questions about it, has seen significant progress in the last few years. With models reaching human performance on the popular SQuAD dataset (Rajpurkar et al., 2016), and with many of the most popular reading comprehension datasets having been solved (Chen et al., 2016; Devlin et al., 2018), a new dataset, DROP (Dua et al., 2019), was recently published.

DROP aimed to present questions that require more complex reasoning than those of previous datasets, in the hope of pushing the field towards a more comprehensive analysis of paragraphs of text. In addition to questions whose answers are a single continuous span from the paragraph text (questions of a type already included in SQuAD), DROP introduced additional types of questions. Among these new types were questions that require simple numerical reasoning, i.e., questions whose answer is the result of a simple arithmetic expression containing numbers from the passage, and questions whose answers consist of several spans taken from the paragraph or the question itself, which we denote as "multi-span questions".

Of all the existing models that tried to tackle DROP, only one model (Hu et al., 2019) directly targeted multi-span questions in a manner that wasn't just a by-product of the model's overall performance. In this paper, we propose a new method for tackling multi-span questions. Our method takes a different path from that of the aforementioned model. It does not try to generalize the existing approach for tackling single-span questions, but instead attempts to attack this issue with a new, tag-based, approach.

## 2. Related Work

Numerically-aware QANet (NAQANet) (Dua et al., 2019) was the model released with DROP. It uses QANET (Yu et al., 2018), at the time the best-performing published model on SQuAD 1.1 (Rajpurkar et al., 2016) (without data augmentation or pretraining), as the encoder. On top of QANET, NAQANet adds four different output layers, which we refer to as "heads". Each of these heads is designed to tackle a specific question type from DROP, where these types were identified by DROP's authors after the dataset was created. These four heads are: (1) Passage span head, designed for producing answers that consist of a single span from the passage. This head deals with the type of questions already introduced in SQuAD. (2) Question span head, for answers that consist of a single span from the question. (3) Arithmetic head, for answers that require adding or subtracting numbers from the passage. (4) Count head, for answers that require counting and sorting entities from the text. In addition, to determine which head should be used to predict an answer, a 4-way categorical variable, as per the number of heads, is trained. We denote this categorical variable as the "head predictor".

Numerically-aware BERT (NABERT+) (Kinley & Lin, 2019) introduced two main improvements over NAQANET. The first was to replace the QANET encoder with BERT. This change alone resulted in an absolute improvement of more than eight points in both EM and F1 metrics. The second improvement was to the arithmetic head, consisting of the addition of "standard numbers" and "templates". Standard numbers were predefined numbers which were added as additional inputs to the arithmetic head, regardless of their occurrence in the passage. Templates were an attempt to enrich the head's arithmetic capabilities, by adding the ability of doing simple multiplications and divisions between up to three numbers.

MTMSN (Hu et al., 2019) is the first, and only model so far, that specifically tried to tackle the multi-span questions of DROP. Their approach consisted of two parts. The first was to train a dedicated categorical variable to predict the number of spans to extract. The second was to generalize the single-span head method of extracting a span, by utilizing the non-maximum suppression (NMS) algorithm (Rosenfeld & Thurston, 1971) to find the most probable set of non-overlapping spans. The number of spans to extract was determined by the aforementioned categorical variable.

Additionally, MTMSN introduced two other new, non-span-related components. The first was a new "negation" head, meant to deal with questions deemed as requiring logical negation (e.g., "How many percent were not German?") 1. The second was improving the arithmetic head by using beam search to re-rank candidate arithmetic expressions.

## 3. Model

Problem statement. Given a pair $(x^P, x^Q)$ of a passage and a question respectively, both composed of tokens from a vocabulary $V$, we wish to predict an answer $y$. The answer could be either a collection of spans from the input, or a number, presumably arrived at by performing arithmetic reasoning on the input. We want to estimate $p(y; x^P, x^Q)$.

The basic structure of our model is shared with NABERT+, which in turn is shared with that of NAQANET (the model initially released with DROP). Consequently, meticulously presenting every part of our model would very likely prove redundant. As a reasonable compromise, we will introduce the shared parts with more brevity, and will go into greater detail when presenting our contributions.

## 3.1. NABERT+

Assume there are $K$ answer heads in the model, with their weights denoted by $\theta$. For each pair $(x^P, x^Q)$ we assume a latent categorical random variable $z \in \{1, \dots, K\}$ such that the probability of an answer $y$ is

$$p(y; x^P, x^Q, \theta) = \sum_{z=1}^{K} p(z; x^P, x^Q, \theta)\, p(y \mid z; x^P, x^Q, \theta)$$

where each component of the mixture corresponds to an output head such that

$$p(y \mid z; x^P, x^Q, \theta) = \mathrm{head}_z(y, x^P, x^Q, \theta)$$

Note that a head is not always capable of producing the correct answer $y^{gold}$ for each type of question, in which case $p(y^{gold} \mid z; x^P, x^Q, \theta) = 0$. For example, the arithmetic head, whose output is always a single number, cannot possibly produce a correct answer for a multi-span question.
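The mixture over heads can be illustrated numerically. The following is only a minimal sketch under stated assumptions: the helper `answer_log_likelihood`, the head names, and all probabilities are hypothetical and are not taken from the released implementation.

```python
import math

def answer_log_likelihood(head_probs, answer_probs_given_head):
    """Compute log p(y) = log sum_z p(z) * p(y | z).

    Heads that cannot produce the answer contribute p(y | z) = 0.
    Returns None when no head can produce the answer.
    """
    total = sum(
        head_probs[z] * answer_probs_given_head.get(z, 0.0)
        for z in head_probs
    )
    if total == 0.0:
        return None
    return math.log(total)

# Hypothetical values for a multi-span question: only the multi-span
# head assigns non-zero probability to the gold answer.
head_probs = {"passage_span": 0.2, "arithmetic": 0.1, "multi_span": 0.7}
answer_probs = {"passage_span": 0.0, "arithmetic": 0.0, "multi_span": 0.5}
log_p = answer_log_likelihood(head_probs, answer_probs)
```

In this toy case the arithmetic and passage-span heads add nothing to the sum, so the likelihood is determined entirely by the multi-span component.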

For a multi-span question with an answer composed of $l$ spans, denote $y^{gold}_{MS} = \{y^{gold}_1, \dots, y^{gold}_l\}$. NAQANET and NABERT+ had no head capable of outputting correct answers for multi-span questions. Instead of ignoring them in training, both models settled on using "semi-correct answers": each $y^{gold} \in y^{gold}_{MS}$ was considered to be a correct answer (only in training). By deliberately encouraging the model to provide partial answers for multi-span questions, they were able to improve the corresponding F1 score. As our model does have a head with the ability to answer multi-span questions correctly, we didn't provide the aforementioned semi-correct answers to any of the other heads. Otherwise, we would have skewed the predictions of the head predictor and effectively misled the other heads into believing they could predict correct answers for multi-span questions.

## 3.1.1. Heads Shared With NABERT+

Before going over the answer heads, two additional components should be introduced: the summary vectors and the head predictor.

Summary vectors. The summary vectors are two fixed-size learned representations of the question and the passage, which serve as an input for some of the heads. To create the summary vectors, first define $T$ as BERT's output on a $(x^P, x^Q)$ input. Then, let $T^P$ and $T^Q$ be the subsequences of $T$ that correspond to $x^P$ and $x^Q$ respectively. Finally, let $Bdim$ be the dimension of the tokens in $T$ (e.g., 768 for BERT$_{\text{BASE}}$), and let $W^P \in \mathbb{R}^{Bdim}$ and $W^Q \in \mathbb{R}^{Bdim}$ be learned linear layers. Then, the summary vectors are computed as:

$$\alpha^P = \mathrm{softmax}(W^P T^P), \qquad \alpha^Q = \mathrm{softmax}(W^Q T^Q)$$

$$h^P = \alpha^P T^P, \qquad h^Q = \alpha^Q T^Q$$
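The summary-vector computation amounts to attention pooling over token representations. The sketch below uses plain Python lists instead of tensors; the function names and the toy values are assumptions made for illustration only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def summary_vector(token_vectors, weight):
    """Attention-pool token vectors: alpha = softmax(W . T_i), h = sum_i alpha_i * T_i."""
    scores = [sum(w * t for w, t in zip(weight, vec)) for vec in token_vectors]
    alpha = softmax(scores)
    dim = len(weight)
    return [
        sum(a * vec[d] for a, vec in zip(alpha, token_vectors))
        for d in range(dim)
    ]

# Toy example with Bdim = 2 and three passage tokens (made-up values).
T_P = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_P = [0.0, 0.0]  # zero weights yield uniform attention over the tokens
h_P = summary_vector(T_P, W_P)
```

With zero weights the attention is uniform, so `h_P` is simply the mean of the token vectors; a learned `W_P` would instead emphasize the tokens most useful to the downstream heads.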

Head predictor. A learned categorical variable whose number of outcomes equals the number of answer heads in the model. It is used to assign probabilities for using each of the heads in prediction.

$$p^{\mathrm{head}} = \mathrm{softmax}(\mathrm{FFN}_{\mathrm{head}}([h^P; h^Q]))$$

where FFN is a two-layer feed-forward network with ReLU activation.

Passage span. Define $W^S \in \mathbb{R}^{Bdim}$ and $W^E \in \mathbb{R}^{Bdim}$ as learned vectors. Then the probabilities of the start and end positions of a passage span are computed as

$$p^{\text{p\_start}} = \mathrm{softmax}(W^S T^P), \qquad p^{\text{p\_end}} = \mathrm{softmax}(W^E T^P)$$

Question span. The probabilities of the start and end positions of a question span are computed as

$$p^{\text{q\_start}} = \mathrm{softmax}(\mathrm{FFN}([T^Q, e^{|T^Q|} \otimes h^P])), \qquad p^{\text{q\_end}} = \mathrm{softmax}(\mathrm{FFN}([T^Q, e^{|T^Q|} \otimes h^P]))$$

where $e^{|T^Q|} \otimes h^P$ repeats $h^P$ for each component of $T^Q$.

Count. Counting is treated as a multi-class prediction problem with the numbers 0-9 as possible labels. The label probabilities are computed as

$$p^{\mathrm{count}} = \mathrm{softmax}(\mathrm{FFN}_{\mathrm{count}}(h^P))$$

Arithmetic. As in NAQANET, this head obtains all of the numbers from the passage, and assigns a plus, minus or zero ("ignore") to each number. As BERT uses wordpiece tokenization, some numbers are broken up into multiple tokens. Following NABERT+, we chose to represent each number by its first wordpiece. That is, if $N_i$ is the set of tokens corresponding to the $i$-th number, we define a number representation as $h^N_i = N_{i,0}$, the first wordpiece of $N_i$. The selection of the sign for each number is a multi-class prediction problem with options $\{0, +, -\}$, and the probabilities for the signs are given by

$$p^{\mathrm{sign}}_i = \mathrm{softmax}(\mathrm{FFN}_{\mathrm{arithmetic}}(h^N_i))$$
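To make the sign-assignment view concrete, the sketch below enumerates sign assignments that evaluate to a target answer and scores one of them as a product of per-number sign probabilities. It is an illustration under assumptions: the function names and probabilities are hypothetical, the search is restricted to two-number additions and subtractions (as the footnotes state for NAQANET), and results are rounded to 5 decimal places as described in Section 4.1.2.

```python
from itertools import combinations, product

def correct_sign_assignments(numbers, target, ndigits=5):
    """Enumerate sign assignments that use exactly two numbers and hit `target`.

    Each assignment is a tuple with +1, -1 or 0 ("ignore") per number.
    """
    found = []
    for i, j in combinations(range(len(numbers)), 2):
        for si, sj in product((1, -1), repeat=2):
            value = si * numbers[i] + sj * numbers[j]
            if round(value, ndigits) == round(target, ndigits):
                signs = [0] * len(numbers)
                signs[i], signs[j] = si, sj
                found.append(tuple(signs))
    return found

def assignment_probability(signs, sign_probs):
    """Multiply, over all numbers, the probability of the chosen sign."""
    prob = 1.0
    for sign, probs in zip(signs, sign_probs):
        prob *= probs[sign]
    return prob

numbers = [5.8, 6.6, 3.0]
assignments = correct_sign_assignments(numbers, 12.4)

# Hypothetical per-number sign distributions, keyed by +1, -1, 0.
sign_probs = [
    {1: 0.8, -1: 0.1, 0: 0.1},
    {1: 0.7, -1: 0.1, 0: 0.2},
    {1: 0.1, -1: 0.1, 0: 0.8},
]
prob = assignment_probability(assignments[0], sign_probs)
```

Here the only matching assignment adds the first two numbers and ignores the third.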

As for NABERT+'s two additional arithmetic features, we decided on using only the standard numbers, as the benefits from using templates were deemed inconclusive. Note that unlike the single-span heads, which are related to our introduction of a multi-span head, the arithmetic and count heads were not intended to play a significant role in our work. We did not aim to improve results on these types of questions 2, except perhaps as a by-product of improving the general reading comprehension ability of our model.

## 3.2. Multi-Span Head

A subset of questions that wasn't directly dealt with by the base models (NAQANET, NABERT+) is questions whose answer is composed of multiple non-contiguous spans. We suggest a head that is able to deal with both single-span and multi-span questions.

To model an answer which is a collection of spans, the multi-span head uses the BIO tagging format (Ramshaw & Marcus, 1995): B is used to mark the beginning of a span, I is used to mark the inside of a span and O is used to mark tokens not included in a span. In this way, we get a sequence of chunks that can be decoded to a final answer, namely a collection of spans.
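Decoding a BIO tag sequence into spans can be sketched as follows. This is a minimal illustration at the word level; the function name `decode_bio`, the handling of a stray I tag, and the example tokens are assumptions rather than details from the paper.

```python
def decode_bio(tokens, tags):
    """Collect spans from a BIO sequence.

    A span starts at a B tag and extends over the I tags that follow it.
    A stray I with no open span is treated like O (an assumption).
    """
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = ["Alice", "Smith", "and", "Bob", "ran"]
tags = ["B", "I", "O", "B", "O"]
answer = decode_bio(tokens, tags)
```

Because every B opens a new span, two adjacent single-token answers are decoded as separate spans rather than merged.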

As words are broken up by the wordpiece tokenization for BERT, we decided on only considering the representation of the first sub-token of the word to tag, following the NER task in Devlin et al. (2018).

For the $i$-th token of an input, the probability of being assigned a tag $\in \{B, I, O\}$ is computed as

$$p^{\mathrm{tag}}_i = \mathrm{softmax}(\mathrm{FFN}_{\text{multi-span}}(T_i))$$

## 3.3. Objective And Training

To train our model, we try to maximize the log-likelihood of the correct answer $p(y^{gold}; x^P, x^Q, \theta)$ as defined in Section 3.1. If no head is capable of predicting the gold answer, the sample is skipped.

We enumerate over every answer head $z \in \{\mathrm{PS}, \mathrm{QS}, \mathrm{C}, \mathrm{A}, \mathrm{MS}\}$ (Passage Span, Question Span, Count, Arithmetic, Multi-Span) to compute each of the objective's addends:

$$p(z; x^P, x^Q, \theta) = p^{\mathrm{head}}_z, \qquad p(y^{gold} \mid z; x^P, x^Q, \theta) = \mathrm{head}_z(y^{gold}, x^P, x^Q, \theta)$$

Note that we are in a weakly supervised setup: the answer type is not given, and neither is the correct arithmetic expression required for deriving some answers. Therefore, it is possible that y gold could be derived by more than one way, even from the same head, with no indication of which is the "correct" one.

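We use the weakly supervised training method used in NABERT+ and NAQANET. Following Berant et al. (2013), for each head we find all the executions that evaluate to the correct answer and maximize their marginal likelihood 3 4.

For a datapoint $(y, x^P, x^Q)$, let $\chi_z$ be the set of all possible ways to get $y$ for answer head $z \in \{\mathrm{PS}, \mathrm{QS}, \mathrm{C}, \mathrm{A}, \mathrm{MS}\}$. Then, as in NABERT+, we have

$$p(y \mid z = \mathrm{PS}; x^P, x^Q, \theta) = \sum_{(s,e) \in \chi_{\mathrm{PS}}} p^{\text{p\_start}}_s \cdot p^{\text{p\_end}}_e$$

$$p(y \mid z = \mathrm{QS}; x^P, x^Q, \theta) = \sum_{(s,e) \in \chi_{\mathrm{QS}}} p^{\text{q\_start}}_s \cdot p^{\text{q\_end}}_e$$

$$p(y \mid z = \mathrm{C}; x^P, x^Q, \theta) = \begin{cases} p^{\mathrm{count}}_i & 0 \le i \le 9,\ i \in \chi_{\mathrm{C}} \\ 0 & \text{otherwise} \end{cases}$$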

Finally, for the arithmetic head, let $\mu$ be the set of all the standard numbers and the numbers from the passage, and let $\chi_{\mathrm{A}}$ be the set of correct sign assignments to these numbers. Then, we have

$$p(y \mid z = \mathrm{A}; x^P, x^Q, \theta) = \sum_{(\mathrm{sign}_1, \dots, \mathrm{sign}_{|\mu|}) \in \chi_{\mathrm{A}}} \prod_{i=1}^{|\mu|} p^{\mathrm{sign}_i}_i$$

## 3.3.1. Multi-Span Head Training Objective

Denote by $\chi_{\mathrm{MS}}$ the set of correct tag sequences. If the concatenation of a question and a passage is $m$ tokens long, then denote a correct tag sequence as $(\mathrm{tag}_1, \dots, \mathrm{tag}_m)$.

We approximate the likelihood of a tag sequence by assuming independence between the sequence's positions, and multiplying the likelihoods of all the correct tags in the sequence. Then, we have

$$p(y \mid z = \mathrm{MS}; x^P, x^Q, \theta) = \sum_{(\mathrm{tag}_1, \dots, \mathrm{tag}_m) \in \chi_{\mathrm{MS}}} \prod_{i=1}^{m} p^{\mathrm{tag}_i}_i$$

## 3.3.2. Multi-Span Head Correct Tag Sequences

Since a given multi-span answer is a collection of spans, it is required to obtain its matching tag sequences in order to compute the training objective.

In what we consider to be a correct tag sequence, each answer span will be marked at least once. Due to the weakly supervised setup, we consider all the question/passage spans that match the answer spans as being correct. To illustrate, consider the following simple example. Given the text "X Y Z Z" and the correct multi-span answer ["Y", "Z"], there are three correct tag sequences:

O B B B, O B B O, O B O B.
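The enumeration in this example can be reproduced with a short sketch. It works at the word level and assumes that occurrences of different answer spans do not overlap; the function names are hypothetical and not taken from the released code.

```python
from itertools import combinations, product

def find_occurrences(tokens, span):
    """Return the start indices at which `span` occurs in `tokens`."""
    n = len(span)
    return [i for i in range(len(tokens) - n + 1) if tokens[i:i + n] == span]

def nonempty_subsets(items):
    """All non-empty subsets of `items`, as tuples."""
    return [c for r in range(1, len(items) + 1) for c in combinations(items, r)]

def correct_tag_sequences(tokens, answer_spans):
    """All BIO sequences in which every answer span is marked at least once."""
    choices = []
    for span in answer_spans:
        starts = find_occurrences(tokens, span)
        choices.append(
            [[(s, len(span)) for s in subset] for subset in nonempty_subsets(starts)]
        )
    sequences = set()
    for combo in product(*choices):
        tags = ["O"] * len(tokens)
        for marked in combo:
            for start, length in marked:
                tags[start] = "B"
                for k in range(start + 1, start + length):
                    tags[k] = "I"
        sequences.add(" ".join(tags))
    return sorted(sequences)

sequences = correct_tag_sequences(["X", "Y", "Z", "Z"], [["Y"], ["Z"]])
```

For the text "X Y Z Z" this yields the same three sequences listed above: "Y" has one occurrence that must be marked, while "Z" has two occurrences of which any non-empty subset may be marked.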

## 3.3.3. Dealing With Too Many Correct Tag Sequences

The number of correct tag sequences can be expressed by

$$\#\text{ of correct tag sequences} = \prod_{i=1}^{s} \left(2^{\#_i} - 1\right)$$

where $s$ is the number of spans in the answer and $\#_i$ is the number of times the $i$-th span appears in the text.
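A quick calculation shows how fast this count grows. The helper name and the occurrence counts below are made up for illustration.

```python
from math import prod

def num_correct_tag_sequences(occurrence_counts):
    """Product over answer spans of (2 ** occurrences - 1)."""
    return prod(2 ** count - 1 for count in occurrence_counts)

# The "X Y Z Z" example: "Y" appears once and "Z" appears twice.
small = num_correct_tag_sequences([1, 2])

# A hypothetical answer with five spans, each appearing five times.
large = num_correct_tag_sequences([5, 5, 5, 5, 5])
```

The first case gives 3 sequences, while the second already gives 31^5 = 28,629,151, which falls in the range where exhaustive generation becomes impractical.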

For questions with a reasonable number of correct tag sequences, we generate all of them before training starts. However, there is a small group of questions for which the number of such sequences is between 10,000 and 100,000,000, which is too many to generate and train on. In such cases, inspired by Berant et al. (2013), instead of just using an arbitrary subset of the correct sequences, we use beam search to generate the top-k predictions of the model being trained, and then filter out the incorrect sequences. Compared to using an arbitrary subset, using these sequences causes the optimization to be done with respect to answers more compatible with the model. If no correct tag sequences were predicted within the top-k, we use the tag sequence that has all of the answer spans marked.

## 3.4. Tag Sequence Prediction With The Multi-Span Head

Based on the outputs $p^{\mathrm{tag}_i}_i$, we would like to predict the most likely sequence given the BIO constraints. Denote $validSeqs$ as the set of all BIO sequences of length $m$ that are valid according to the rules specified in Section 3.2. The BIO tag sequence to predict is then

$$y^{predicted} = \arg\max_{(\mathrm{tag}_1, \dots, \mathrm{tag}_m) \in validSeqs} \prod_{i=1}^{m} p^{\mathrm{tag}_i}_i$$

We considered the following approaches:

Viterbi Decoding. A natural candidate for getting the most likely sequence is Viterbi decoding (Viterbi, 1967), with transition probabilities learned by a BIO-constrained Conditional Random Field (CRF) (Lafferty et al., 2001). However, further inspection of our sequence's properties reveals that such a computational effort is probably not necessary, as explained in the following paragraphs.

Beam Search. Due to our use of BIO tags and their constraints, observe that past tag predictions only affect future tag predictions from the last B prediction and as long as the best tag to predict is I. Considering the frequency and length of the correct spans in the question and the passage, past positions in the sequence effectively have no influence on future ones beyond a few positions ahead. Together with the fact that at each prediction step there are no more than 3 tags to consider, this means that using beam search to get the most likely sequence is very reasonable and even allows near-optimal results with small beam width values.
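A beam search over per-token tag probabilities can be sketched as below. This is only an illustration: the function name and probabilities are hypothetical, and the single constraint enforced (an I tag may not open the sequence or follow an O) is the standard BIO rule, assumed here to match the rules of Section 3.2.

```python
import math

def bio_beam_search(tag_probs, beam_width=5):
    """Return the best valid BIO sequence found by beam search.

    tag_probs: one dict per token, mapping "B", "I", "O" to probabilities.
    """
    beams = [((), 0.0)]  # (tags so far, accumulated log-probability)
    for probs in tag_probs:
        candidates = []
        for tags, score in beams:
            prev = tags[-1] if tags else "O"
            for tag in ("B", "I", "O"):
                if tag == "I" and prev == "O":
                    continue  # I may only continue a span opened by B
                p = probs[tag]
                if p <= 0.0:
                    continue
                candidates.append((tags + (tag,), score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return list(beams[0][0])

tag_probs = [
    {"B": 0.1, "I": 0.6, "O": 0.3},
    {"B": 0.2, "I": 0.7, "O": 0.1},
]
wide = bio_beam_search(tag_probs, beam_width=5)
narrow = bio_beam_search(tag_probs, beam_width=1)
```

In this toy input the most likely valid sequence starts with a low-probability B so that the following I becomes legal. A width of 5 recovers it, whereas a width of 1 commits to O at the first step and settles for a less likely sequence.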

Greedy Tagging. Notice that greedy tagging does not enforce the BIO constraints. However, since the multi-span head's training objective adheres to the BIO constraints via being given the correct tag sequences, we can expect that even with greedy tagging the predictions will mostly adhere to these constraints as well. In case there are violations, they must be amended post-prediction. Albeit faster, greedy tagging resulted in a small performance hit, as seen in Table 4.
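Greedy tagging with a post-prediction repair step might look like the sketch below. The text does not specify how violations are amended, so the repair rule used here (promoting an I that does not continue a span to B) is purely an assumption, as are the function name and the probabilities.

```python
def greedy_bio_tags(tag_probs):
    """Pick the most probable tag per token, then repair BIO violations."""
    tags = [max(probs, key=probs.get) for probs in tag_probs]
    prev = "O"
    for i, tag in enumerate(tags):
        if tag == "I" and prev == "O":
            tags[i] = "B"  # assumed repair rule: a dangling I becomes B
        prev = tags[i]
    return tags

tag_probs = [
    {"B": 0.1, "I": 0.6, "O": 0.3},
    {"B": 0.2, "I": 0.7, "O": 0.1},
]
greedy = greedy_bio_tags(tag_probs)
```

Each token is handled in constant time, which is what makes this option faster than a search over sequences.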

## 4. Preprocessing

We tokenize the passage, question, and all answer texts using the BERT uncased wordpiece tokenizer from HUGGINGFACE 5. The tokenization resulting from each $(x^P, x^Q)$ input pair is truncated at 512 tokens so it can be fed to BERT as an input. However, before tokenizing the dataset texts, we perform additional preprocessing as listed below.

## 4.1. Simple Preprocessing

## 4.1.1. Improved Textual Parsing

The raw dataset included almost a thousand HTML entities that did not get parsed properly, e.g., " " instead of a simple space. In addition, we fixed some quirks that were introduced by the original Wikipedia parsing method. For example, when encountering a reference to an external source that included a specific page from that reference, the original parser ended up introducing a redundant ":

## 4.1.2. Improved Handling Of Numbers

Although we previously stated that we aren't focusing on improving arithmetic performance, while analyzing the training process we encountered two arithmetic-related issues that could be resolved rather quickly: a precision issue and a number extraction issue. Regarding precision, we noticed that while either generating expressions for the arithmetic head, or using the arithmetic head to predict a numeric answer 7, the value resulting from an arithmetic operation would not always yield the exact result due to floating point precision limitations. For example, 5.8 + 6.6 = 12.3999... instead of 12.4. This issue caused a significant performance hit of about 1.5 points for both F1 and EM and was fixed by simply rounding numbers to 5 decimal places, assuming that no answer requires a greater precision. Regarding number extraction, we noticed that some numeric entities, required in order to produce a correct answer, weren't being extracted from the passage. Examples include ordinals (121st, 189th) and some "per-" units (1,580.7/km², 1050.95/month).
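Both fixes can be illustrated in a few lines. The rounding to 5 decimal places follows the description above, while the regular expression and the function names are hypothetical stand-ins rather than the extraction logic actually used.

```python
import re

def evaluate_expression(signed_numbers, ndigits=5):
    """Sum signed numbers and round the result to avoid float artifacts."""
    return round(sum(signed_numbers), ndigits)

# Assumed pattern: digits with optional thousands separators and decimals.
NUMBER_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")

def extract_number(token):
    """Pull a numeric value out of tokens such as ordinals or per-unit values."""
    match = NUMBER_RE.search(token)
    if match is None:
        return None
    return float(match.group().replace(",", ""))

raw = 5.8 + 6.6                         # not exactly 12.4 in floating point
fixed = evaluate_expression([5.8, 6.6])  # rounded to 5 decimal places
ordinal = extract_number("121st")
per_unit = extract_number("1,580.7/km")
```

Comparing `fixed` rather than `raw` against a gold answer of 12.4 avoids a spurious mismatch, and the extraction helper recovers 121 and 1580.7 from tokens that a plain float parser would reject.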

## 4.2. Using NER For Cleaning Up Multi-Span Questions

The training dataset contains multi-span questions with answers that are clearly incorrect, with examples shown in Table 1. In order to mitigate this, we applied an answer-cleaning technique using a pretrained Named Entity Recognition (NER) model (Peters et al., 2017) in the following manner: (1) Pre-define question prefixes whose answer spans are expected to contain only a specific entity type and filter the matching questions. (2) For a given answer of a filtered question, remove any span that does not contain at least one token of the expected type, where the types are determined by applying the NER model on the passage. For example, if a question starts with "who scored", we expect that any valid span will include a person entity (PER). By applying such rules, we discovered that at least 3% of the multi-span questions in the training dataset included incorrect spans. As our analysis of prefixes wasn't exhaustive, we believe that this method could yield further gains. Table 1 shows a few of the results of our cleaning method, where we perfectly clean the first two questions, and partially clean a third question.
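The two-step cleaning rule can be sketched as follows. The "who scored" prefix and the PER type come from the example above; everything else (the rule table, the function name, the precomputed entity types standing in for real NER output, and the sample question) is hypothetical.

```python
# Hypothetical rule table: question prefix -> expected entity type.
EXPECTED_TYPE_BY_PREFIX = {"who scored": "PER"}

def clean_answer_spans(question, answer_spans, span_entity_types):
    """Drop spans that lack a token of the entity type the prefix implies.

    span_entity_types maps each span to the set of entity types that an
    NER tagger found among its tokens (assumed to be precomputed).
    """
    lowered = question.lower()
    for prefix, expected in EXPECTED_TYPE_BY_PREFIX.items():
        if lowered.startswith(prefix):
            return [
                span for span in answer_spans
                if expected in span_entity_types.get(span, set())
            ]
    return list(answer_spans)  # no rule applies, keep the answer as is

question = "Who scored in the first quarter?"
spans = ["Player A", "the first quarter"]
types = {"Player A": {"PER"}, "the first quarter": set()}
cleaned = clean_answer_spans(question, spans, types)
```

Questions whose prefix matches no rule pass through unchanged, so the rule table only needs to cover prefixes whose expected answer type is unambiguous.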

## 5. Training

The starting point for our implementation was the NABERT+ model 8 , which in turn was based on ALLENAI's NAQANET 9 . Our implementation can be found on GitHub 10 . All three models utilize the allennlp framework 11 .

The pretrained BERT models were supplied by HUGGINGFACE.

For our BASE model we used bert-base-uncased. For our LARGE models we used the standard bert-large-uncased-whole-word-masking and the SQuAD fine-tuned bert-large-uncased-whole-word-masking-finetuned-squad.

Due to limited computational resources, we did not perform any hyperparameter search. We preferred to focus our efforts on the ablation studies, in the hope of gaining further insights into the effect of the components that we ourselves introduced. For ease of performance comparison, we followed NABERT+'s training settings: we used the BERT Adam optimizer from HUGGINGFACE with default settings and a learning rate of 1e-5. The only difference was that we used a batch size of 12. We trained our BASE model for 20 epochs. For the LARGE models we used a batch size of 3 with a learning rate of 5e-6 and trained for 5 epochs, except for the model without the single-span heads, which was trained with a batch size of 2 for 7 epochs. F1 was used as our validation metric. All models were trained on a single GPU with 12-16GB of memory.

Table 1. Examples of faulty answers for multi-span questions in the training dataset, with their perfect clean answers, and answers generated by our cleaning method. (Surviving column headers: Question, Original.)

## 6. Results And Discussion

## 6.1. Performance On DROP's Development Set

Table 2 shows the results on DROP's development set. Compared to our BASE models, our LARGE models exhibit a substantial improvement across all metrics.

## 6.1.1. Comparison To The NABERT+ Baseline

We can see that our BASE model surpasses the NABERT+ baseline in every metric. The major improvement in multi-span performance was expected, as our multi-span head was introduced specifically to tackle this type of question. For the other types, most of the improvement came from better preprocessing. A more detailed discussion can be found in Section 6.3.

## 6.1.2. Comparison To MTMSN

Notice that different BERT$_{\text{LARGE}}$ models were used, so the comparison is less direct. Overall, our LARGE models exhibit similar results to those of MTMSN$_{\text{LARGE}}$.

For multi-span questions we achieve significantly better performance. While a breakdown of metrics was only available for MTMSN$_{\text{LARGE}}$, notice that even when comparing these metrics to our BASE model, we still achieve a 12.2 absolute improvement in EM, and a 2.3 improvement in F1. All that, while keeping in mind that we compare a BASE model to a LARGE model (for reference, note the 8 point improvement between MTMSN$_{\text{BASE}}$ and MTMSN$_{\text{LARGE}}$ in both EM and F1). Our best model, LARGE-SQUAD, exhibits a huge improvement of 29.7 in EM and 15.1 in F1 compared to MTMSN$_{\text{LARGE}}$.

When comparing single-span performance, our best model exhibits slightly better results, but it should be noted that it retains the single-span heads from NABERT+, while MTMSN uses one head to predict both single-span and multi-span answers. For a fairer comparison, we trained our model with the single-span heads removed, so that our multi-span head remained the only head aimed at handling span questions. With this no-single-span-heads setting, while our multi-span performance even improved a bit, our single-span performance suffered a slight drop, ending up trailing by 0.8 in EM and 0.6 in F1 compared to MTMSN. Therefore, it could prove beneficial to analyze the reasons behind the relative advantages of each model (ours and MTMSN), and perhaps to combine them into a more holistic approach to tackling span questions.

## 6.2. Performance On DROP's Test Set

Table 3 shows the results on DROP's test set (https://leaderboard.allenai.org/drop/submissions/public), with our model being the best overall as of the time of writing, and not just on multi-span questions.

## 6.3. Ablation Studies

In order to analyze the effect of each of our changes, we conduct ablation studies on the development set, depicted in Table 4.

1. Not using the simple preprocessing from Section 4.1 resulted in a 2.5 point decrease in both EM and F1. The numeric questions were the most affected, with their performance dropping by 3.5 points. Given that number questions make up about 61% of the dataset, we can deduce that our improved number handling is responsible for about a 2.1 point gain (3.5 × 0.61 ≈ 2.1), while the rest could be attributed to the improved Wikipedia parsing.

2. Although NER span cleaning (Section 4.2) affected only 3% of the multi-span questions, it provided a solid improvement of 5.4 EM in multi-span questions and 1.5 EM in single-span questions. The single-span improvement is probably due to the combination of better multi-span head learning as a result of fixing multi-span questions and the fact that the multi-span head can answer single-span questions as well.

3. Not using the single-span heads results in a slight drop in multi-span performance, and a noticeable drop in single-span performance. However, when performing the same comparison between our LARGE models (see Table 2), this performance gap becomes significantly smaller.

4. As expected, not using the multi-span head causes the multi-span performance to plummet. Note that for this ablation test the single-span heads were permitted to train on multi-span questions.

5. Compared to using greedy decoding in the prediction of multi-span questions, using beam search results in a small improvement. We used a beam width of 5, and didn't perform extensive tuning of the beam width.

## 7. Conclusion

In this work, we introduced a new approach for tackling multi-span questions in reading comprehension datasets. This approach is based on individually tagging each token with a categorical tag, relying on the tokens' contextual representation to bridge the information gap resulting from the tokens being tagged individually.

First, we show that integrating this new approach into an existing model, NABERT+, does not hinder performance on other question types, while substantially improving the results on multi-span questions. Later, we compare our results to the current state-of-the-art on multi-span questions.

We show that our model has a clear advantage in handling multi-span questions, with a 29.7 absolute improvement in EM, and a 15.1 absolute improvement in F1. Furthermore, we show that our model slightly eclipses the current state-of-the-art results on the entire DROP dataset. Finally, we present some ablation studies, analyzing the benefit gained from individual components of our model.

We believe that combining our tag-based approach for handling multi-span questions with current successful techniques for handling single-span questions could prove beneficial in finding better, more holistic ways of tackling span questions in general.

## 8. Future Work

A Different Loss for Multi-span Questions. Currently, for each individual span, we optimize the average likelihood over all its possible tag sequences (see Section 3.3.1). A different approach could be to take into account only the most likely tag sequence rather than every possible one. This could give the model more flexibility during training and the ability to focus on the more "correct" tag sequences.

Explore Utilization of Non-First Wordpiece Sub-Tokens. As mentioned in Section 3.2, we only considered the representation of the first wordpiece sub-token in our model. It would be interesting to see how different approaches to utilizing the other sub-tokens' representations in the tagging task affect the results.

1. This "negation" head seems to at least partially parallel the usage of 100 as a "standard number" in NABERT+'s arithmetic head, although it was not mentioned in the MTMSN paper.
2. However, we did end up improving the arithmetic performance, as briefly mentioned in Section 4.1.2.
3. As in NAQANET, we only search for the addition/subtraction of all two-number combinations.
4. For a small portion of the multi-span questions we didn't use all the executions that evaluate to the correct answer. See Section 3.3.3.
5. https://github.com/huggingface/pytorch-transformers
6. An example of such a passage is history_1302, taken from https://en.wikipedia.org/wiki/Russo-Crimean_Wars.
7. For further explanation of this process, refer to Sections 3.3 and 4.2.1 of Kinley & Lin (2019).
8. https://github.com/raylin1000/drop-bert
9. https://github.com/allenai/allennlp/blob/master/allennlp/models/reading_comprehension/naqanet.py
10. https://github.com/eladsegal/tag-based-multi-span-extraction
11. https://allennlp.org