What's in your Head? Emergent Behaviour in Multi-Task Transformer Models

Authors

Abstract

The primary paradigm for multi-task training in natural language processing is to represent the input with a shared pre-trained language model, and add a small, thin network (head) per task. Given an input, a target head is the head that is selected for outputting the final prediction. In this work, we examine the behaviour of non-target heads, that is, the output of heads when given input that belongs to a different task than the one they were trained for. We find that non-target heads exhibit emergent behaviour, which may either explain the target task, or generalize beyond their original task. For example, in a numerical reasoning task, a span extraction head extracts from the input the arguments to a computation that results in a number generated by a target generative head. In addition, a summarization head that is trained with a target question answering head, outputs query-based summaries when given a question and a context from which the answer is to be extracted. This emergent behaviour suggests that multi-task training leads to nontrivial extrapolation of skills, which can be harnessed for interpretability and generalization.

1 Introduction

The typical framework for training a model in natural language processing to perform multiple tasks is to have a shared pre-trained language model (LM), and add a small, compact neural network, often termed head, on top of the LM, for each task (Clark et al., 2019; Liu et al., 2019b; Nishida et al., 2019; Hu and Singh, 2021) . The heads are trained in a supervised manner, each on labelled data collected for the task it performs (Devlin et al., 2019) . At inference time, the output is read out of a selected target head, while the outputs from the other heads are discarded (Figure 1) .

Figure 1: An illustration of multi-task training with a pre-trained LM. Given an input for one of the tasks, a shared representation is computed with a pre-trained LM (green). The target head outputs the prediction, while the other heads are ignored. In this work, we characterize the behaviour of the non-target head.

What is the nature of predictions made by non-target heads given inputs directed to the target head? One extreme possibility is that the pre-trained LM identifies the underlying task and constructs unrelated representations for each task. In this case, the output of the non-target head might be arbitrary, as the non-target head observes inputs considerably different from those it was trained on. Conversely, the pre-trained LM might create similar representations for all tasks, which can lead to meaningful interactions between the heads.

In this work, we test whether such interactions occur in multi-task transformer models, and whether non-target heads decode useful information given inputs directed to the target head. We show that multi-head training leads to a steering effect, where the target head guides the behaviour of the non-target head, steering it to exhibit emergent behaviour, which can explain the target head's predictions or generalize beyond the task the non-target head was trained for.

We study three multi-head models (see Figure 2), and describe the "steering effect" in each of them.

Figure 2: An overview of the three models analyzed in this work. For each model, the target head, which outputs the model's prediction, is shown on the left (in yellow). The non-target head, shown on the right (in blue), exhibits new behaviour without being trained for this objective.

In a numerical reading comprehension task (Dua et al., 2019), the model is given a question and a paragraph and either uses an extractive head to output an input span, or a generative head to generate a number using arithmetic operations over numbers in the input (Figure 2, left). Treating the extractive head as the non-target head, we observe that it tends to output the arguments to the arithmetic operation performed by the decoder, and that successful argument extraction is correlated with higher performance. Moreover, we perform interventions (Woodward, 2005; Elazar et al., 2021), where we modify the input representation based on the arguments output by the extractive head, and show this leads to predictable changes in the behaviour of the generative head. Thus, we can use the output of the non-target head to improve interpretability.

We observe a similar phenomenon in a multi-hop question answering (QA) model (Yang et al., 2018), where a non-target span extraction head outputs supporting evidence for the answer predicted by a classification head (Figure 2, center). This emerging interpretability is considerably different from methods that explicitly train models to output explanations (Perez et al., 2019; Schuff et al., 2020).

Beyond interpretability, we observe non-trivial extrapolation of skills when performing multi-task training on extractive summarization (Hermann et al., 2015) and question answering (Figure 2, right). Specifically, a head trained for extractive summarization outputs supporting evidence for the answer when given a question and a paragraph, showing that multi-task training steers its behaviour towards query-based summarization. We show this does not happen in the absence of multi-task training.

To summarize, we investigate the behaviour of non-target heads in three multi-task transformer models, and find that without any dedicated training, non-target heads provide explanations for the predictions of target heads, and exhibit capabilities beyond the ones they were trained for.

2 Multi-Head Transformer Models

The prevailing method for training models to perform NLP tasks is to add parameter-thin heads on top of a pre-trained LM, and fine-tune the entire network on labeled examples (Devlin et al., 2019) .

Given a text input with n tokens x = x_1, ..., x_n, the model first computes contextualized representations H = h_1, ..., h_n = LM_θ(x) using the pre-trained LM parameterized by θ. These representations are then fed into the output heads, with each head o estimating the probability p_{ψ_o}(y | H) of the true output y given the encoded input H and the parameters ψ_o of o. The head that produces the final model output, termed the target head, is chosen either deterministically, based on the input task, or predicted by an output head classifier p(o | x); when p(o | x) is deterministic, it can be viewed as an indicator function for the target head. Predictions made by non-target heads are typically ignored.

Training multi-head transformer models is done by marginalizing over the set of output heads O and maximizing the probability

$$p(y \mid x) = \sum_{o \in O} p(o \mid x) \cdot p(y \mid H, \psi_o),$$

where p(y | H, ψ_o) > 0 only if y is in the output space of the head o. For a head o, we denote by S_o the set of training examples (x, y) such that y is in the output space of o, and by S̄_o the remaining training examples. The sets S_o and S̄_o may consist of examples from different tasks (e.g., question answering and text summarization), or of examples from the same task but with different output formats (e.g., yes/no vs. span extraction questions). Our goal is to evaluate the predictions of o on examples from S̄_o, for which another head o′ is the target head, and to examine the relation between these outputs and the predictions of o′.

In the next sections, we show that the predictions of o interact with the predictions of o′. We denote by o_t the target head and by o_s the steered head.
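As a concrete illustration of this setup, the sketch below (ours, not the authors' implementation) wires two thin heads onto a shared encoder; a small randomly initialized transformer stands in for the pre-trained LM, and the head names and dimensions are illustrative assumptions.

```python
# Minimal PyTorch-style sketch of a multi-head model: a shared encoder produces H,
# and every head computes its output, even when it is not the target head.
import torch
import torch.nn as nn

class MultiHeadModel(nn.Module):
    def __init__(self, vocab_size=30522, d=768, num_answer_types=2):
        super().__init__()
        # stand-in for the shared pre-trained LM that produces H = LM_theta(x)
        layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
        self.encoder = nn.Sequential(nn.Embedding(vocab_size, d),
                                     nn.TransformerEncoder(layer, num_layers=2))
        self.span_head = nn.Linear(d, 2)               # per-token start/end logits
        self.type_head = nn.Linear(d, num_answer_types)  # selects the target head / answer type

    def forward(self, input_ids):
        H = self.encoder(input_ids)                    # (batch, n, d) shared representations
        return {
            "span": self.span_head(H),                 # computed for every input,
            "type": self.type_head(H[:, 0]),           # even when it is the non-target head
        }

model = MultiHeadModel()
out = model(torch.randint(0, 30522, (1, 16)))
print(out["span"].shape, out["type"].shape)            # only the target head's output is normally used
```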

3 Overview: Experiments & Findings

This section provides an overview of our experiments and main findings, which are discussed in detail in §4, §5, and §6.

Given a model with a target head o_t and a steered head o_s, our goal is to understand the behaviour of o_s on inputs where o_t provides the prediction. To this end, we focus on head combinations where o_s is expressive enough to explain the outputs of o_t; however, unlike most prior work aiming to explain predictions by examining model outputs (Perez et al., 2019; Schuff et al., 2020), o_s is not explicitly trained for this purpose. Concretely, our analysis covers three settings, illustrated in Figure 2 and summarized in Table 1.

Table 1: A summary of the main findings in each of the settings investigated in this work.

The first setting (Figure 2 left, and §4) considers a model with generative and extractive heads, trained on the DROP dataset for numerical reasoning over text (Dua et al., 2019) . Surprisingly, we observe that the arguments for the arithmetic computation required for the generative head to generate its answer often emerge in the outputs of the extractive head. The second setting (Figure 2 middle, and §5) considers a model with a classification head outputting 'yes'/'no' answers, and a span extraction head, trained on the HOTPOTQA dataset (Yang et al., 2018) for multi-hop reasoning. The predictions of the extractive head once again provide explanations in the form of supporting facts from the input context. The last setting (Figure 2 right, and §6) considers a model with two extractive heads, one for span extraction and another for (sentence-level) extractive summarization. Each head is trained on a different dataset, HOTPOTQA for span extraction and CNN/DAILYMAIL (See et al., 2017) for summarization. We find that the summarization head tends to extract the supporting facts given inputs from HOTPOTQA, effectively acting as a query-based summarization model.

We now present the above settings in turn; Table 1 summarizes the main results. We denote by FFNN^(l)_{m×n} a feed-forward neural network with l layers that maps inputs of dimension m to dimension n.
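For concreteness, the notation can be read as in the following sketch (our illustration; the paper does not specify hidden widths or activations, so the choices below are assumptions):

```python
# FFNN^{(l)}_{m x n}: l linear layers mapping dimension m to dimension n
# (hidden layers kept at width m, ReLU between layers -- both assumptions).
import torch.nn as nn

def ffnn(l: int, m: int, n: int) -> nn.Sequential:
    dims = [m] * l + [n]
    layers = []
    for i in range(l):
        layers.append(nn.Linear(dims[i], dims[i + 1]))
        if i < l - 1:
            layers.append(nn.ReLU())
    return nn.Sequential(*layers)

# e.g. FFNN^{(1)}_{d x 3}: the multi-span (BIO) head used in Section 4, with d = 768
bio_head = ffnn(1, 768, 3)
```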

4 Setting 1: Emerging Computation Arguments In Span Extraction

We start by examining a combination of generative and extractive heads (Figure 2 , left), and analyze the spans extracted from the input when the generative head is selected to output the final answer.

4.1 Experimental Setting

Model We take GENBERT (Geva et al., 2020), a BERT-base model fine-tuned for numerical reasoning, and use it to initialize a variant called MSEGENBERT, in which the single-span extraction head is replaced by the multi-span extraction (MSE) head introduced by Segal et al. (2020), which allows extracting multiple spans from the input. This is important for supporting extraction of more than one argument. MSEGENBERT has three output heads. The first is the multi-span head, which takes H ∈ R^{d×n} and uses the BIO scheme (Ramshaw and Marcus, 1995) to classify each token in the input as the beginning of (B), inside of (I), or outside of (O) an answer span:

$$o_{mse} := \mathrm{FFNN}^{(1)}_{d\times 3}(H) \in \mathbb{R}^{3\times n}.$$

The second head is the generative head, o_gen, a standard transformer decoder (Vaswani et al., 2017) initialized from BERT-base, which is tied to the encoder and performs cross-attention over H (Geva et al., 2020). Last, a classification head takes the representation h_CLS of the CLS token and selects the target head (o_mse or o_gen):

$$o_{type} := \mathrm{FFNN}^{(1)}_{d\times 2}(h_{\mathrm{CLS}}) \in \mathbb{R}^{2}.$$

Implementation details are in Appendix A.1.
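To illustrate how the multi-span head's per-token BIO predictions become extracted spans, here is a minimal decoding sketch (ours, not the MSE implementation of Segal et al. (2020)):

```python
# Turn per-token B/I/O labels into extracted text spans.
def bio_to_spans(tokens, tags):
    """tokens: list of strings; tags: list of 'B'/'I'/'O' labels of the same length."""
    spans, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B":                 # a new span starts
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif tag == "I" and current:   # continue the open span
            current.append(tok)
        else:                          # 'O' (or an 'I' with no open span) closes it
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

# e.g. two computation arguments emerging as separate spans:
print(bio_to_spans(["52", "than", "31"], ["B", "O", "B"]))  # -> ['52', '31']
```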

Data

We fine-tune MSEGENBERT on DROP (Dua et al., 2019), a dataset for numerical reasoning over paragraphs, consisting of passage-question-answer triplets where answers are either spans from the input or numbers that are not in the input. Importantly, o_mse is trained only on span questions, as its outputs are restricted to tokens from the input. Moreover, less than 5% of DROP examples have a multi-span answer.

Table 2: Evaluation results of MSEGENBERT. Except for the last two columns, all scores are on the annotated sample of 400 examples from DROP. DROP F1 and DROP span F1 were computed over the entire development set and over the development examples with span answers, respectively. Correlation values are between recall and F1 scores. The parameter l refers to the number of linear layers in o_mse.

To evaluate the outputs of o mse on questions where the answer is a number that is not in the input, we use crowdsourcing to annotate 400 such examples from the development set. Each example was annotated with the arguments to the computation that are in the passage and are required to compute the answer. An example annotation is provided in Table 3 . On average, there are 1.95 arguments annotated per question.

Table 3. Not extracted; please refer to original document.

Evaluation metrics Given the list P of spans extracted by o_mse and the list of annotated arguments G, we define the following metrics for evaluation.¹ We measure argument recall as the fraction of arguments in G that are also in P, |P ∩ G| / |G|, and report the average recall over the dataset as well as the proportion of questions with a perfect recall of 1.0 (first column in Table 2). Similarly, we measure precision as the fraction of extracted spans in P that are also in G, |P ∩ G| / |P|, and report the average precision over the dataset.
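A minimal sketch of these two metrics (our illustration; it assumes predicted spans and annotated arguments have already been normalized to comparable strings, and omits the numeric-substring matching described in the footnote):

```python
# Argument recall |P ∩ G| / |G| and precision |P ∩ G| / |P| for a single question.
def recall_precision(P: set, G: set):
    inter = P & G
    recall = len(inter) / len(G) if G else 0.0
    precision = len(inter) / len(P) if P else 0.0
    return recall, precision

r, p = recall_precision({"52", "31", "1923"}, {"52", "31"})
print(r, p)   # 1.0 (perfect recall), 0.666...
```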


4.2 Results

Table 2 presents the results. Comparing MSEGENBERT to MSEBERT, which was trained without o_gen, only on span extraction examples, we observe that multi-task training substantially changes the behaviour of the extractive head. First, MSEGENBERT dramatically improves the extraction of computation arguments: recall increases from 0.2 to 0.56, precision from 0.32 to 0.6, and the fraction of questions with perfect recall reaches 0.41. The number of extracted spans also goes up to 2.1, despite the fact that most span extraction questions have a single-span answer. The performance of MSEBERT on span questions is similar to MSEGENBERT, showing that the difference is not explained by performance degradation. We also measure the number of extracted spans on out-of-distribution examples of math word problems, and observe similar patterns (for brevity, the details are in Appendix B).

Moreover, model performance, which depends on o_gen, is correlated with the recall of the spans extracted by o_mse. The Spearman correlation between model F1 and argument recall for MSEGENBERT is 0.351 (Table 2) and statistically significant (p-value = 5.6e−13), showing that when the computation arguments are covered, performance is higher.

These findings illustrate that multi-task training leads to emergent behaviour in the extractive head, which outputs the computation arguments for the output of the generative head. We now provide a more fine-grained analysis.

Distribution of extracted spans Figure 3 shows the number of spans extracted by MSEGENBERT compared to the number of annotated arguments. On average, MSEGENBERT extracts 2.12 spans per example, which is similar to the 1.95 arguments marked by annotators. However, MSEGENBERT tends to over-predict one span compared to the annotated examples; the average ratio |P| / |G| is 1.2, indicating good correlation at the single-example level. Table 4 shows example outputs of o_mse next to the annotated arguments for the same questions.

Figure 3: The number of spans extracted by MSEGENBERT (o_mse) vs. the number of annotated arguments for the same questions.
Table 4: Example outputs by o_mse on DROP in comparison to the annotated computation arguments.

Parameter sharing across heads We conjecture that the steering effect occurs when the heads are strongly tied, with most of their parameters shared. To examine this, we increase the capacity of the FFNN in o_mse from l = 1 layer to l = 2 and l = 4 layers, and also experiment with a decoder whose parameters, unlike in GENBERT, are not tied to the encoder.

We find (Table 2) that reducing the dependence between the heads also diminishes the steering effect. While the models still tend to extract computation arguments, with much higher recall and precision than MSEBERT, they output 1.2 spans on average, which is similar to the distribution they were trained on. This leads to higher precision, but much lower recall and fewer cases of perfect recall. Overall model performance is not affected by changing the capacity of the heads.

4.3 Influence Of Extracted Spans On Generation

The outputs of o_mse and o_gen are correlated, but can we control the output of o_gen by modifying the values of span tokens extracted by o_mse? To perform such an intervention, we change the cross-attention mechanism in MSEGENBERT's decoder. Typically, the keys and values are both the encoder representations H. To modify the values read by the decoder, we change the value matrix to H_{i↔j}:

$$\mathrm{MultiHeadAttention}(Q,\, H,\, H_{i\leftrightarrow j}),$$

where in H_{i↔j} the positions of the representations h_i and h_j are swapped (illustrated in Figure 4). Thus, when the decoder attends to the i-th token, it will read the value of the j-th token and vice versa. We choose the tokens i, j to swap based on the output of o_mse. Specifically, for every input token x_k that is a digit,² we compute the probability p^B_k assigned by o_mse to it being the beginning of an output span. Then, we choose the position i = argmax_k p^B_k, and the position j as a random position of a digit token. As a baseline, we employ the same procedure, but swap the positions i, j of the two digit tokens with the highest outside (O) probabilities.

Figure 4: Illustration of our intervention method, where two value vectors vi,vj are swapped in cross-attention.
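The intervention itself only requires swapping two rows of the value matrix before cross-attention; a minimal sketch (ours, not the original implementation) is given below.

```python
# Swap two value vectors before the decoder's cross-attention, so that attending to
# position i reads the value of position j (and vice versa); the keys stay untouched.
import torch

def swap_values(H: torch.Tensor, i: int, j: int) -> torch.Tensor:
    """H: (n, d) encoder representations; returns a copy with rows i and j swapped."""
    H_swapped = H.clone()
    H_swapped[[i, j]] = H[[j, i]]
    return H_swapped

# cross-attention then uses keys = H but values = H_swapped:
#   MultiHeadAttention(Q, K=H, V=swap_values(H, i, j))
```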

We focus on questions where MSEGENBERT predicted a numeric output. Table 5 reports in how many cases each intervention changed the model prediction (made by o_gen). For 40.6% of the questions, the model prediction was altered by the intervention on the highest-probability B token, compared to only 0.03% (2 examples) for the baseline intervention. This shows that selecting the token based on o_mse affects whether the swap leads to a change.

Table 5: Intervention results on o_gen outputs. Percentage out of 6,051 DROP development examples for which swapping B (or O) tokens changed (or did not change) the output of MSEGENBERT.

More interestingly, we test whether we can predict the change in the output of o_gen from the two digits that were swapped. Let d and d′ be the values of the swapped digits, and let n and n′ be the numeric outputs generated by o_gen before and after the swap. We check whether |n − n′| = |d − d′| · 10^c for some integer c. For example, if we swap the digits '7' and '9', we expect the output to change by 2, 20, 0.2, etc. We find that in 543 out of 2,460 cases (22.1%) the change in the model output is indeed predictable under the non-baseline intervention, which is much higher than random guessing, which would yield 10%. Last, measuring model accuracy on the original examples, we observe that exact-match performance is 76% when the change is predictable, but only 69% when it is not. This suggests that interventions lead to predictable changes with higher probability when the model is correct.
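The predictability test amounts to checking whether the output shift matches the digit difference up to a power of ten; a small sketch (ours, with an assumed range of exponents) follows.

```python
# Check whether |n - n'| equals |d - d'| * 10^c for some integer c (range of c assumed).
def predictable_change(n, n_prime, d, d_prime, exponents=range(-3, 4)):
    diff = abs(n - n_prime)
    return any(abs(diff - abs(d - d_prime) * 10 ** c) < 1e-9 for c in exponents)

print(predictable_change(45, 43, 7, 9))   # swapped 7 and 9, output changed by 2 -> True
```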

Overall, our findings show that the spans extracted by o_mse affect the output of o_gen, while spans that o_mse marks as irrelevant do not. Moreover, the (relative) predictability of the output after swapping shows that the model performs the same computation, but with a different argument.

5 Setting 2: Emerging Supporting Facts In Span Extraction

We now consider the case of combining an extractive head and a classification head (Fig. 2, middle) .

5.1 Experimental Setting

Model We use the BERT-base READER model introduced by Asai et al. (2020), which has two output heads. The first is a single-span extraction head, which predicts for each token the probabilities of being the start and the end position of the answer span:

$$o_{sse} := \mathrm{FFNN}^{(1)}_{d\times 2}(H).$$

The second is a classification head that predicts the answer type: yes, no, span, or no-answer:

$$o_{type} := \mathrm{FFNN}^{(1)}_{d\times 4}(h_{\mathrm{CLS}}).$$

Implementation details are in Appendix A.2.

Data We fine-tune the READER model on the gold paragraphs of HOTPOTQA (Yang et al., 2018), a dataset for multi-hop question answering. Specifically, we feed the model question-context pairs and let it predict the answer type with o_type. If o_type predicts span, the output of o_sse is taken as the final prediction; otherwise, the output of o_type is used. Therefore, o_sse is trained only on examples whose answer is a span.

Examples in HOTPOTQA are annotated with supporting facts, which are sentences from the context that provide evidence for the final answer. We use the supporting facts to evaluate the outputs of o_sse as explanations for questions where the gold answer is yes or no.

Table 6: Example questions from HOTPOTQA and the extracted spans by the READER model.
Table 7: Evaluation results of READER. F1 scores were computed over the development set of HOTPOTQA, and the rest of the scores on the development subset of yes/no questions, using k = 5. The parameter l refers to the number of linear layers in each of o_sse and o_type.

Evaluation metrics Let F be the set of annotated supporting facts per question and P be the top-k output spans of o_sse, ordered by decreasing probability. We define Recall@k as the proportion of supporting facts covered by the top-k predicted spans, where a fact is considered covered if a predicted span lies within the supporting-fact sentence and is not a single stop word (see Table 6).³ We use k = 5 and report the fraction of questions where Recall@5 is 1 (Table 7, first column), to measure the cases where o_sse covers all relevant sentences within the first few predicted spans.

Additionally, we introduce an InverseMRR metric, based on MRR, as a proxy for precision. We take the rank r of the first predicted span in P that is not within a supporting fact from F, and use 1 − 1/r as the measure (e.g., if the rank of the first non-overlapping span is 3, the reciprocal is 1/3 and the InverseMRR is 2/3). If the first predicted span is not in a supporting fact, the InverseMRR is 0; if all spans for k = 5 overlap with supporting facts, the InverseMRR is 1.
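To make Recall@k and InverseMRR concrete, here is a small sketch (our illustration, not the evaluation code); it treats supporting facts and predicted spans as plain strings and omits the single-stop-word filter.

```python
# Recall@k: fraction of supporting facts covered by a top-k predicted span.
# InverseMRR: 1 - 1/r for the rank r of the first span outside any supporting fact.
def recall_at_k(facts, spans, k=5):
    top = spans[:k]
    covered = sum(1 for f in facts if any(s in f for s in top))
    return covered / len(facts) if facts else 0.0

def inverse_mrr(facts, spans, k=5):
    for rank, s in enumerate(spans[:k], start=1):
        if not any(s in f for f in facts):   # first span not within a supporting fact
            return 1.0 - 1.0 / rank
    return 1.0                               # all top-k spans overlap supporting facts

facts = ["Scott Derrickson is an American director.", "Ed Wood was an American filmmaker."]
spans = ["American director", "Ed Wood", "Liverpool"]
print(recall_at_k(facts, spans), inverse_mrr(facts, spans))   # 1.0 0.666...
```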

5.2 Results

Results are presented in Table 7. Comparing READER and READER only sse, the Recall@5 and InverseMRR scores are substantially higher with multi-task training, with increases of 10.9% and 4.5%, respectively, showing again that multi-task training is the key factor behind the emerging explanations. Example questions with the spans extracted by READER are provided in Table 6. As in §4, adding an additional layer to o_sse (READER l=2) decreases the frequency of questions with perfect Recall@5 (0.605 → 0.568). This shows again that reducing the dependency between the heads also reduces the steering effect. Notably, the performance on HOTPOTQA is similar across the different models, with only a slight deterioration when training only the extraction head (o_sse). This is expected, as READER only sse is not trained on yes/no questions, which make up a small fraction of HOTPOTQA.

6 Setting 3: Emerging Query-Based Summaries

In §4 and §5, we considered models with output heads trained on examples from the same data distribution. Would the steering effect occur when output heads are trained on different datasets? We now consider a model trained to summarize text and answer multi-hop questions (Figure 2, right) .

6.1 Experimental Setting

Model We create a model called READERSUM as follows:

We take the READER model from §5, and add the classification head presented by Liu and Lapata (2019) , that summarizes an input text by selecting sentences from it. Sentence selection is done by inserting a special [CLS] token before each sentence and training a summarization head to predict a score for each such [CLS] token from the representation h CLS :

$$o_{sum} := \mathrm{FFNN}^{(1)}_{d\times 1}(h_{\mathrm{CLS}}).$$

The sentences are ranked by their scores, and the three highest-scoring sentences are taken as the summary (we use top-3 since choosing the first three sentences of a document is a standard baseline in extractive summarization (Nallapati et al., 2017; Liu and Lapata, 2019)). Implementation details are in Appendix A.3.
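Sentence selection with o_sum can be sketched as follows (our illustration; sentence boundaries and the score tensor are assumed to be given):

```python
# Score the [CLS] token in front of each sentence and keep the three highest-scoring
# sentences as the extractive summary, preserving their original order.
import torch

def top3_summary(sentences, cls_scores: torch.Tensor):
    """sentences: list of n strings; cls_scores: (n,) scores from o_sum."""
    top = torch.topk(cls_scores, k=min(3, len(sentences))).indices.tolist()
    return [sentences[i] for i in sorted(top)]

print(top3_summary(["s1", "s2", "s3", "s4"], torch.tensor([0.1, 0.9, 0.3, 0.7])))
```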

Data The QA heads (o_sse and o_type) are trained on HOTPOTQA, while the summarization head is trained on the CNN/DAILYMAIL dataset for extractive summarization (Hermann et al., 2015). We use the supporting facts from HOTPOTQA to evaluate the outputs of o_sum as explanations for predictions of the QA heads.

Evaluation metrics Annotated supporting facts and the summary are both defined as sets of sentences from the input context. Therefore, given the set T of sentences extracted by o_sum (|T| = 3) and the set of supporting facts F, we compute the Recall@3 of T against F.

6.2 Results

Results are summarized in Table 8 . When given HOTPOTQA examples, READERSUM extracts summaries that cover a large fraction of the supporting facts (0.79 Recall@3). This is much higher compared to a model that is trained only on the extractive summarization task (READERSUM only sum with 0.69 Recall@3). Results on CNN/DAILYMAIL show that this behaviour in READERSUM does not stem from an overall improvement in extractive summarization, as READERSUM performance is slightly lower compared to READERSUM only sum .

Table 8: Evaluation results of READERSUM. Recall@3 and F1 scores were computed over the development set of HOTPOTQA, and ROUGE over CNN/DAILYMAIL.

Both READERSUM and READERSUM only sum also achieve substantially better Recall@3 than two further baselines: extracting three random sentences from the context (RANDOM, 0.53 Recall@3) and taking the first three sentences of the context (LEAD3, 0.6 Recall@3).

Overall, the results show that multi-head training endows o_sum with an emergent behaviour of query-based summarization, which we examine next. Example summaries extracted by READERSUM for HOTPOTQA are provided in Appendix C.

Influence Of Questions On Predicted Summaries

To better understand the influence of the question on the behaviour of o_sum, we run READERSUM on examples from HOTPOTQA while zeroing/masking out the question; in other words, o_sum observes only the context sentences. As shown in Table 8 (READERSUM masked), masking the question leads to a substantial decrease of 13 Recall@3 points compared to the same model without masking (0.79 → 0.66).

Because our model inserts a [CLS] token before every sentence, including the question (which never appears in the summary), we can rank the question sentence based on its o_sum score. Computing the rank distribution of question sentences (Figure 5), we see that the distributions for READERSUM only sum and READERSUM are significantly different (Wilcoxon signed-rank test, p-value 0.001), and that the question is ranked higher by READERSUM. This shows that the summarization head puts higher emphasis on the question in the multi-head setup. Overall, these results provide evidence that multi-head training pushes o_sum to perform query-based summarization on inputs from HOTPOTQA.

Figure 5: Distribution over ranks of HOTPOTQA question sentences, based on the scores predicted by o_sum, for READERSUM and READERSUM only sum.

7 Related Work

Transformer models with multiple output heads have been widely employed in previous works (Hu and Singh, 2021; Segal et al., 2020; Hu et al., 2019; Clark et al., 2019) . To the best of our knowledge, this is the first work that analyzes the outputs of the non-target heads.

Previous work used additional output heads to generate explanations for model predictions (Perez et al., 2019; Schuff et al., 2020). Specifically, recent work has explored the use of summarization modules for explainable QA (Nishida et al., 2019; Deng et al., 2020). In the context of summarization, Xu and Lapata (2020) leveraged QA resources for training query-based summarization models. Hierarchies between NLP tasks have also been explored in multi-task models not based on transformers (Søgaard and Goldberg, 2016; Hashimoto et al., 2017; Swayamdipta et al., 2018). In contrast to previous work, the models considered in this work were not trained to perform the desired behaviour; instead, explanations and generalized behaviour emerged from training the model on multiple tasks. A related line of research has focused on developing probes, which are supervised network modules that predict properties from model representations (Conneau et al., 2018; van Aken et al., 2019; Tenney et al., 2019; Liu et al., 2019a). A key challenge with probes is determining whether the information exists in the representation or is learned during probing (Hewitt and Liang, 2019; Tamkin et al., 2020; Talmor et al., 2020). Unlike probes, steered heads are trained in parallel to target heads rather than on top of a fixed model fine-tuned on a target task. Moreover, steered heads are not designed to decode specific properties from representations; rather, their behaviour naturally extends beyond their training objective.

Our findings also relate to explainability methods that highlight parts from the input via the model's attention (Wiegreffe and Pinter, 2019) , and extract rationales through unsupervised training (Lei et al., 2016) . The emerging explanations we observe are based on the predictions of a head trained to perform some task rather than on internal representations.

8 Conclusion

In this work, we show that training multiple heads on top of a pre-trained language model creates a steering effect, where the target head influences the behaviour of another head, steering it towards capabilities beyond its training objective. In three multi-task settings, we find that without any dedicated training, the steered head often outputs explanations for the model predictions. Moreover, modifying the model representation based on the outputs of the steered head can lead to predictable changes in the target head predictions. Our findings provide evidence for extrapolation of skills as a consequence of multi-task training, opening the door to new research directions in interpretability and generalization.

B Distribution Of Extracted Spans On Math-Word-Problems

To further test the emergent behaviour of MSEGENBERT (§4), we compare the number of spans extracted on an out-of-distribution sample by MSEGENBERT and by MSEGENBERT only mse, which was trained without the decoder head (o_gen). Specifically, we run the models on MAWPS (Koncel-Kedziorski et al., 2016), a collection of small-scale math word problem datasets. The results, shown in Figure 6, demonstrate the generalized behaviour of o_mse, which learns to extract multiple spans when trained jointly with the decoder.

Figure 6: The proportion of examples per number of spans extracted by o_mse, for MSEGENBERT trained with and without o_gen.

Question: Who is Bruce Spizer an expert on, known as the most influential act of the rock era? ("The Beatles") Context: The Beatles were an English rock band formed in Liverpool in 1960. With members John Lennon, Paul McCartney, George Harrison and Ringo Starr, they became widely regarded as the foremost and most influential act of the rock era. Rooted in skiffle, beat and 1950s rock and roll, the Beatles later experimented with several musical styles, ranging from pop ballads and Indian music to psychedelia and hard rock, often incorporating classical elements and unconventional recording techniques in innovative ways. In 1963 their enormous popularity first emerged as "Beatlemania", and as the group's music grew in sophistication in subsequent years, led by primary songwriters Lennon and McCartney, they came to be perceived as an embodiment of the ideals shared by the counterculture of the 1960s. David "Bruce" Spizer (born July 2, 1955) is a tax attorney in New Orleans, Louisiana, who is also recognized as an expert on The Beatles. He has published eight books, and is frequently quoted as an authority on the history of the band and its recordings.

C Example Emergent Query-Based Summaries

Examples are provided in Tables 9, 10, 11, and 12.

Table 9. Not extracted; please refer to original document.

Question: Which Eminem album included vocals from a singer who had an album titled "Unapologetic"? ("The Marshall Mathers LP 2") Context: "Numb" is a song by Barbadian singer Rihanna from her seventh studio album "Unapologetic" (2012). It features guest vocals by American rapper Eminem, making it the pair's third collaboration since the two official versions of "Love the Way You Lie". Following the album's release, "Numb" charted on multiple charts worldwide including in Canada, the United Kingdom and the United States. "The Monster" is a song by American rapper Eminem, featuring guest vocals from Barbadian singer Rihanna, taken from Eminem's album "The Marshall Mathers LP 2" (2013). The song was written by Eminem, Jon Bellion, and Bebe Rexha, with production handled by Frequency. "The Monster" marks the fourth collaboration between Eminem and Rihanna, following "Love the Way You Lie", its sequel "Love the Way You Lie (Part II)" (2010), and "Numb" (2012). "The Monster" was released on October 29, 2013, as the fourth single from the album. The song's lyrics present Rihanna coming to grips with her inner demons, while Eminem ponders the negative effects of his fame. Table 10 : Example (2) input from HOTPOTQA and the predicted summary by READERSUM. The summary is marked in bold.

Table 10. Not extracted; please refer to original document.

Question: Are both Dictyosperma, and Huernia described as a genus? ("yes") Context: The genus Huernia (family Apocynaceae, subfamily Asclepiadoideae) consists of stem succulents from Eastern and Southern Africa, first described as a genus in 1810. The flowers are five-lobed, usually somewhat more funnel-or bell-shaped than in the closely related genus "Stapelia", and often striped vividly in contrasting colours or tones, some glossy, others matt and wrinkled depending on the species concerned. To pollinate, the flowers attract flies by emitting a scent similar to that of carrion. The genus is considered close to the genera "Stapelia" and "Hoodia". The name is in honour of Justin Heurnius (1587-1652) a Dutch missionary who is reputed to have been the first collector of South African Cape plants. His name was actually mis-spelt by the collector. Dictyosperma is a monotypic genus of flowering plant in the palm family found in the Mascarene Islands in the Indian Ocean (Mauritius, Reunion and Rodrigues). The sole species, Dictyosperma album, is widely cultivated in the tropics but has been farmed to near extinction in its native habitat. It is commonly called princess palm or hurricane palm, the latter owing to its ability to withstand strong winds by easily shedding leaves. It is closely related to, and resembles, palms in the "Archontophoenix" genus. The genus is named from two Greek words meaning "net" and "seed" and the epithet is Latin for "white", the common color of the crownshaft at the top of the trunk.

For arguments and spans which include a number as well as text, only the numeric substrings were considered when performing the comparison between P and G.

GENBERT uses digit tokenization for numbers.

We do not define coverage as the fraction of tokens in a supporting fact that the span covers, because supporting facts are at the sentence level, and often most of the tokens in the supporting fact are irrelevant to the answer.

https://github.com/ag1988/injecting_numeracy/
⁶ https://github.com/eladsegal/tag-based-multi-span-extraction
⁷ https://github.com/AkariAsai/learning_to_retrieve_reasoning_paths
⁸ https://github.com/nlpyang/BertSum
model trained with a learning rate of 5e−5 and batch size 8 for 1 epoch.

Table 11. Not extracted; please refer to original document.
Table 12: Example (4) input from HOTPOTQA and the predicted summary by READERSUM. The summary is marked in bold.