UnifiedQA: Crossing Format Boundaries With a Single QA System

Authors

  • Daniel Khashabi
  • Sewon Min
  • Tushar Khot
  • Ashish Sabharwal
  • Oyvind Tafjord
  • P. Clark
  • Hannaneh Hajishirzi
  • Findings of EMNLP, 2020

Abstract

Question answering (QA) tasks have been posed using a variety of formats, such as extractive span selection, multiple choice, etc. This has led to format-specialized models, and even to an implicit division in the QA community. We argue that such boundaries are artificial and perhaps unnecessary, given that the reasoning abilities we seek to teach are not governed by the format. As evidence, we use the latest advances in language modeling to build a single pre-trained QA model, UNIFIEDQA, that performs well across 19 QA datasets spanning 4 diverse formats. UNIFIEDQA performs on par with 8 different models that were trained on the individual datasets themselves. Even when faced with 12 unseen datasets of observed formats, UNIFIEDQA performs surprisingly well, showing strong generalization from its out-of-format training data. Finally, simply fine-tuning this pre-trained QA model into specialized models results in a new state of the art on 10 factoid and commonsense question answering datasets, establishing UNIFIEDQA as a strong starting point for building QA systems.

1 Introduction

Question answering is a common tool for assessing how well computers can understand language and reason with it. To this end, the NLP community has introduced several distinct QA formats, with four popular formats illustrated in Figure 1. These formats differ not only in how the question is presented but also in their implicit assumptions: for instance, that the expected answer is either "yes" or "no", or that there is always a unique answer span in the associated paragraph (as opposed to multiple or no spans). These differences have motivated their study in silos, often encoding the QA format and its assumptions into the model architecture itself. Efforts to exploit multiple datasets remain largely restricted to a single format. For example, Clark et al. (2019c) limit consideration to multiple-choice datasets, while others (e.g., Talmor and Berant, 2019) focus their generalization study on extractive span-prediction models. To the best of our knowledge, no single QA system targets, not to mention excels at, all of these formats.

Figure 1: Four formats (color-coded throughout the paper) commonly used for posing questions and answering them: Extractive (EX), Abstractive (AB), Multiple-Choice (MC), and Yes/No (YN). Sample dataset names are shown in square brackets. We study generalization and transfer across these formats.

This raises the question: Can QA models learn linguistic reasoning abilities that generalize across formats? Our intuition is simple: while question format and relevant knowledge may vary across QA datasets, the underlying linguistic understanding and reasoning abilities are largely common. A multiple-choice model may, therefore, benefit from training on an extractive-answer dataset.

Figure 2: Properties of various QA datasets included in this study: 4 extractive (EX), 3 abstractive (AB), 7 multiplechoice (MC), and 3 yes/no (YN). ‘idk’ denotes ‘I don’t know’ or unanswerable questions. Regents denotes both 4th and 8th grade datasets. BoolQ represents both the original dataset and its contrast-sets extension BoolQ-CS; similarly for ROPES, Quoref, and DROP.

Building upon this intuition, we present a single pre-trained QA system, named UnifiedQA, that exploits information across 4 different QA formats to achieve surprisingly strong performance across the 17 datasets listed in Figure 2.

Our work is enabled by recent progress in text-to-text neural architectures (Raffel et al., 2019; Radford et al., 2019). This paradigm conceptually unifies many NLP models that formerly had task-specific designs. While there have been hopes of, and a few attempts at, using this paradigm to achieve strong generalization and transfer across tasks, success so far has been limited. Most approaches fine-tuned a different set of parameters for each end task (Raffel et al., 2019; Radford et al., 2019), and when a single model has been built for different NLP tasks (Raffel et al., 2019), it has underperformed the standard pretraining plus fine-tuning setup, while also requiring explicit task-specific prefixes.

In contrast, by narrowing the scope to tasks that stay within the boundaries of QA, we are able to demonstrate that the text-to-text paradigm can, in fact, be quite powerful for multi-task learning across QA formats. We find that out-of-format training can lead to a single pre-trained QA model that can be applied as-is to different QA tasks, takes in natural text inputs without explicitly specifying a task-specific prefix, generalizes better to other unseen datasets, and with further fine-tuning can achieve state-of-the-art results on many QA tasks.

Contributions. This work advocates for a unified view of different QA formats, and for building format-agnostic QA systems. To support this view, we present UnifiedQA, a single pre-trained QA system that works well on and generalizes to datasets with different formats ( §6.2), while performing on par with state-of-the-art dedicated systems tailored to each dataset ( §6.1). Additionally, fine-tuning UnifiedQA into specialized systems sets a new state of the art for 6 datasets ( §6.3), establishing it as a powerful starting point for QA research.

Our findings demonstrate that crossing QA format boundaries is not only qualitatively desirable but also quantitatively beneficial.

2 Related Work

Several QA efforts have studied generalization across datasets of a single format. For instance, in MultiQA, Talmor and Berant (2019) study generalization and transfer, but only across extractive span selection datasets. Further, while they show strong leave-one-out style results, they find a single system performs substantially worse than one tuned to each dataset. In ORB, Dua et al. (2019a) propose a multi-dataset evaluation benchmark spanning extractive and abstractive formats. However, that study is limited to an evaluation of systems, falling short of addressing how to build such generalizable models. Similarly, the MRQA shared task (Fisch et al., 2019) focuses on span-prediction datasets. Unlike all these efforts, our goal is to investigate transfer and generalization across different QA formats, as well as to build a single system that does this well.

Exploiting commonality across machine learning tasks has a rich history studied under transfer learning (Caruana, 1997; Clark et al., 2019b). McCann et al. (2018) and Keskar et al. (2019) study transfer among various NLP tasks by casting them into a single QA format, an elegant transfer-learning approach that is nonetheless orthogonal to the goal of this work. As noted earlier, Raffel et al. (2019) investigate transfer between several diverse NLP tasks (machine translation, summarization, etc.). Their key contribution is a text-to-text framework, and a powerful model called T5, that makes it easier to mix multiple tasks by encoding both inputs and outputs as text. They rely on textual prefixes to explicitly define the task corresponding to each input instance. While we build upon their framework, we narrow our focus to variations of QA. This allows us to achieve strong results while avoiding reliance on any format-specific prefixes. Our models learn to infer the format of each input question based on its content (e.g., whether the phrasing of the question demands a yes/no answer). Moreover, we are able to demonstrate generalization across QA tasks, which prior work failed to achieve, presumably due to its focus on too broad a set of NLP tasks.

3 UnifiedQA: Multi-Format Training

Suppose we would like to train a unified QA model that can operate over k question-answering formats F_1, F_2, ..., F_k. For each format F_i, suppose we have k_i datasets D^i_1, D^i_2, ..., D^i_{k_i}, where D^i_j = (T^i_j, E^i_j) includes both training and evaluation examples. In some cases, the training set T^i_j may be empty, or we may want to ignore it in order to treat D^i_j as an 'unseen', evaluation-only dataset and assess a model's generalization to it.

We use the text-to-text paradigm to convert each training question q in format F_i into a plain-text input representation enc_i(q). This conversion uses a natural encoding process, described shortly (§3.1), for four common QA formats, and is easily extensible to other formats as well. We follow a simple approach of creating a mixed training pool consisting of all available training instances:

T = ∪_{i=1..k} ∪_{j=1..k_i} { enc_i(q) | q ∈ T^i_j }

Training batches are drawn from this pooled data, T, by including each q ∈ T^i_j with probability proportional to 1/|T^i_j|. Each batch thus, on average, contains the same number of instances from each training set, regardless of its size. As we will see in the experiments section, our multi-format mixing approach works surprisingly well. It clearly highlights the value of training on out-of-format data and confirms our intuition that there are strong ties across QA formats in terms of the underlying reasoning abilities.²

² A more sophisticated teaching curriculum (Sachan and Xing, 2016) or approaches such as model distillation and teacher annealing (Clark et al., 2019b) are likely to further improve the performance of the resulting unified model, bolstering the strength of our advocacy for a unified view of all QA formats. We leave their exploration to future work.

Our unified question-answering system is based on the T5 framework (Raffel et al., 2019), a recent text-to-text transformer model. For all experiments, we use token limits of 512 and 100 for the input and output sequences, respectively. All models are trained for 100k steps, on top of the 1M steps of pretraining of the T5 model. We first define a unifying encoding of instances across various formats (§3.1). We then introduce UnifiedQA (§3.2), a QA system trained on datasets in multiple formats, which attains new state-of-the-art results on 7 datasets and generalizes to unseen datasets.
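To make the batch construction described above concrete, here is a minimal sketch of the sampling scheme. It is our own illustration rather than the authors' training code, and the dataset names and placeholder examples are hypothetical.

```python
import random

def sample_batch(training_sets, batch_size):
    """Draw a batch from the pooled data T such that, in expectation, each
    training set contributes equally, regardless of its size.

    `training_sets` maps a dataset name to a list of already-encoded
    (input_text, target_text) pairs. Choosing a dataset uniformly at random
    and then an instance uniformly within it selects each q in T_ij with
    probability proportional to 1/|T_ij|, matching the scheme above."""
    names = list(training_sets)
    batch = []
    for _ in range(batch_size):
        name = random.choice(names)                        # uniform over datasets
        batch.append(random.choice(training_sets[name]))   # uniform within the dataset
    return batch

# Hypothetical usage with two tiny placeholder "datasets":
pool = {
    "squad": [("question \\n paragraph", "answer span")] * 1000,
    "boolq": [("question \\n paragraph", "yes")] * 10,
}
print(sample_batch(pool, batch_size=4))
```

With these placeholder sizes, each batch still draws, in expectation, half of its instances from each dataset, even though one pool is 100 times larger.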

3.1 Text-To-Text Encoding

We convert each of our target datasets into a text-in/text-out format (Raffel et al., 2019; Radford et al., 2019). The question always comes first, followed by some additional information (a context paragraph, candidate answers, or both). We use "\n" separators between different parts of the input. This yields a human-readable encoding without making it overly specific to a particular format.

The unified model explored in this work incorporates the following four common questionanswering formats:

Extractive (EX) questions Q include a context C (typically a paragraph) and require models to extract the answer as a substring of the context. In some datasets, 'unanswerable' can be the correct response.

Abstractive (AB) questions Q require models to produce answers that are often not mere substrings of the provided context paragraph C.

Multiple-choice (MC) questions Q come with a set of candidate answers {A i }, of which generally exactly one is correct. In some cases, they also include a context paragraph C.

Yes/No (YN) questions Q expect a 'yes' or 'no' answer as the response, and may include a context paragraph C.

Further details of these formats and specific datasets within them are deferred to Section 4.1. Table 1 provides examples of the natural input and output encoding for each of these formats. Importantly, both input and output representations are raw text. There is no explicit information regarding a question being an MC question or having exactly four candidate answers. The expectation is that a model will learn to infer such notions from the raw text training data, just like humans are able to do.

Table 1: Example text-to-text encoding of instances. The rows below show example inputs; the output in each case is the answer as plain text (e.g., "No" for the YN example).

  • What does a drink from narcissus's spring cause the drinker to do? \n Mercury has awakened Echo, who weeps for Narcissus, and states that a drink from Narcissus's spring causes the drinkers to ``Grow dotingly enamored of themselves.'' ...
  • (ARC-challenge, MC) What does photosynthesis produce that helps plants grow? \n (A) water (B) oxygen (C) protein (D) sugar
  • (MCTest, MC) Who was Billy? \n (A) The skinny kid (B) A teacher (C) A little kid (D) The big kid \n Billy was like a king on the school yard. A king without a queen. He was the biggest kid in our grade, so he made all the rules during recess. ...
  • (YN) Was America the first country to have a president? \n (President) The first usage of the word president to denote the highest official in a government was during the Commonwealth of England ...

Specifically, MC questions without any context paragraph are encoded as question \n (A) c1 (B) c2 . . ., where c1, c2, . . . are the candidate answers (see the example from the ARC dataset in Table 1). If the question includes a context paragraph, it is appended after the candidate answers: question \n (A) c1 (B) c2 . . . \n paragraph, as shown in the example from the MCTest dataset in Table 1. Questions in the other three formats (EX, AB, and YN) are encoded simply as question \n paragraph.


To re-emphasize, unlike prior work (Raffel et al., 2019), we do not specify any task-, dataset-, or format-specific prefixes in the input representation. Whether the answer should be extracted or abstracted, and whether from the provided context paragraph or candidate answers (or the fact that these even are candidate answers) is expected to be inferred by the system.
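The encoding itself is straightforward to implement. Below is a minimal sketch of the input-side conventions described above; the function name and arguments are our own, and the sketch writes the separator as the literal two characters "\n", as shown in Table 1.

```python
import string

def encode_input(question, candidates=None, paragraph=None):
    """Encode one QA instance as plain text: the question first, then the
    '(A) c1 (B) c2 ...' candidate list for MC questions (if any), then the
    context paragraph (if any), joined by the separator described above."""
    parts = [question]
    if candidates:
        labels = string.ascii_uppercase  # (A), (B), (C), ...
        parts.append(" ".join(f"({labels[i]}) {c}" for i, c in enumerate(candidates)))
    if paragraph:
        parts.append(paragraph)
    return " \\n ".join(parts)

# MC question without a context paragraph (cf. the ARC example in Table 1):
print(encode_input("What does photosynthesis produce that helps plants grow?",
                   candidates=["water", "oxygen", "protein", "sugar"]))
# EX/AB/YN questions reduce to: question \n paragraph
print(encode_input("Was America the first country to have a president?",
                   paragraph="(President) The first usage of the word president ..."))
```

Note that nothing in this representation signals the dataset, the format, or the number of candidates; that information must be inferred from the text itself.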

3.2 UnifiedQA: The Pre-Trained Model

The specific pre-trained QA model we provide and use in all our experiments is trained on representative datasets for each of the 4 formats discussed earlier. We empirically chose the following 9 datasets for training UnifiedQA, based on their effectiveness in our pilot study (details deferred to Section 5) assessing which datasets are most valuable for out-of-format training:

  • EX: SQuAD 1.1, SQuAD 2.0
  • AB: NarrativeQA
  • MC: RACE, Regents, ARC, OBQA, MCTest
  • YN: BoolQ

One can obviously use other combinations of formats and datasets to create variants of our UnifiedQA model, or extend it as future datasets become available or new formats are introduced.

Unless otherwise noted, we use the largest available T5 model (11B parameters) as the starting point for training UnifiedQA. Similar to pre-trained language models, the resulting pre-trained QA model can be used as a starting point for fine-tuning on other QA datasets.
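Because UnifiedQA is an ordinary T5 checkpoint, using it, either as-is for inference or as the initialization for further fine-tuning, looks the same as using any other text-to-text model. The sketch below assumes the Hugging Face transformers library and a checkpoint identifier along the lines of allenai/unifiedqa-t5-large; the exact name and available sizes depend on the released artifacts.

```python
from transformers import T5ForConditionalGeneration, T5Tokenizer

# Assumed checkpoint identifier; substitute whichever UnifiedQA release you use.
MODEL_NAME = "allenai/unifiedqa-t5-large"

tokenizer = T5Tokenizer.from_pretrained(MODEL_NAME)
model = T5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def answer(encoded_question, max_new_tokens=100):
    """Generate an answer for a question already encoded as in Section 3.1."""
    input_ids = tokenizer(encoded_question, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

print(answer("which is the most conductive? \\n (A) wood (B) plastic (C) iron"))
```

Fine-tuning on a new QA dataset then proceeds exactly as it would for a vanilla T5 checkpoint, only starting from these weights.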

4.1 Datasets

We selected 17 existing datasets that target various formats, as well as various complex linguistic phenomena. Figure 2 shows several properties of these datasets (whether a dataset comes with a paragraph, whether the paragraph explicitly contains the answer, whether candidate answers are part of the input, etc.). Most importantly, they are grouped into the formats/categories described below. Table 2 gives summary statistics of these datasets.

Table 2: Dataset Statistics. CQA, OBQA, and NQA refer to CommonsenseQA, OpenBookQA, and NarrativeQA, respectively. The CI column shows the 95% confidence interval for the evaluation set as a percentage, around a mean score of 50%. Input and output representation lengths are measured in the number of tokens and averaged across the dataset.

Extractive QA (EX). All the datasets in this format require models to extract the answer to a given question as a substring from a context paragraph. SQuAD 1.1 (Rajpurkar et al., 2016) contains questions about Wikipedia paragraphs. A later version of this dataset, SQuAD 2 (Rajpurkar et al., 2018), includes unanswerable questions, which empirically makes the task much harder. For our evaluation, we use the development sets of SQuAD 1.1 and SQuAD 2. NewsQA (Trischler et al., 2017) focuses on paraphrased questions requiring predicate-argument structure understanding, collected from CNN/DailyMail news articles. Quoref (Dasigi et al., 2019) contains questions that require coreference resolution in Wikipedia articles and can even have disjoint spans as answers. ROPES (Lin et al., 2019) centers around situation understanding, where the model must understand the causes and effects implicit in the given situation.

Abstractive QA (AB). All the datasets in this format require models to produce answers that are often not mere substrings of the given context paragraph. NarrativeQA (Kociský et al., 2018) focuses on understanding various events that happen in a given movie plot, based on summaries of movie adaptations from various web resources. Many of the answers do not have high overlap with the context. DROP (Dua et al., 2019b) contains questions that involve rudimentary mathematical skills (such as counting, addition, subtraction, maximum, minimum, etc.) and questions that query multiple parts of the paragraph. The answer can be either a number or a date inferred from the paragraph, or several spans from the context paragraph.

Multiple-choice QA (MC). All the datasets in this format contain questions that come with candidate answers. MCTest (Richardson et al., 2013) contains questions about simple, fictional stories. RACE (Lai et al., 2017) is a challenging set of English comprehension multiple-choice exams given in Chinese middle and high schools. OpenBookQA (Mihaylov et al., 2018), ARC, Regents Science Exams (Clark et al., 2016), and QASC (Khot et al., 2019) are different MC tests focusing on elementary/high-school-style science exams. A slightly different dataset within this format is CommonsenseQA, which is geared towards activity/concept commonsense questions. Other than MCTest and RACE, these datasets do not come with accompanying paragraphs. On such datasets, a retrieval system is occasionally used to supplement each question with a relevant retrieved context paragraph. For most of this work, we keep the questions as is with no additional retrieval (unless otherwise mentioned), except in §6.3, where we use IR to obtain numbers comparable to earlier work. Another source of variability among these datasets is the number of candidate answers. While many datasets have four candidates (see Figure 2), others have more. Later, in §6.2, we will see that our approach generalizes to datasets with a different number of candidates, even when that number is not seen during training.

Yes/No QA (YN). All the datasets in this format contain questions that can be answered with yes or no. One can think of these as multiple-choice questions with 2 candidates; however, they are usually treated differently. The datasets we use are BoolQ (Clark et al., 2019a), a version of this dataset with natural perturbations, BoolQ-NP (Khashabi et al., 2020), and the subset of MultiRC (Khashabi et al., 2018) that has binary (yes/no) answers.

Contrast sets. Additionally, we use contrast sets (Gardner et al., 2020) corresponding to several of our datasets (indicated by a "CS" tag): BoolQ-CS, ROPES-CS, Quoref-CS, and DROP-CS. These evaluation sets are expert-generated perturbations that deviate from the patterns common in the original dataset.

4.2 Evaluation Metrics For Textual Output

We evaluate each dataset using the metric most often used for it in prior work. Specifically, for the EX format, we use the F1 score of the extracted span relative to the gold label. For the AB format, we use the ROUGE-L metric (Lin et al., 2006; Min et al., 2019; Nishida et al., 2019). For the MC format, we match the generated text with the closest answer candidate (by token overlap) and measure how often this is the correct answer. For the YN format, we follow Clark et al. (2019a) to measure whether the generated output matches the correct 'yes' or 'no' label. In rare cases where the output is longer than one word (e.g., 'yes it is'), we check that it contains the correct label but not the incorrect one.
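As an illustration, the MC metric (matching the generated text to the closest candidate by token overlap) can be approximated with a few lines of code; this is our own sketch, not the authors' exact scoring script.

```python
def closest_candidate(generated, candidates):
    """Return the candidate whose (case-insensitive) token overlap with the
    generated text is largest, approximating the MC matching described above."""
    generated_tokens = set(generated.lower().split())
    return max(candidates,
               key=lambda c: len(generated_tokens & set(c.lower().split())))

def mc_accuracy(predictions, candidate_lists, gold_answers):
    """Fraction of questions where the matched candidate equals the gold answer."""
    correct = sum(closest_candidate(pred, cands) == gold
                  for pred, cands, gold in zip(predictions, candidate_lists, gold_answers))
    return correct / len(gold_answers)

# Hypothetical example (cf. the MCTest instance in Table 1):
print(mc_accuracy(
    predictions=["the big kid"],
    candidate_lists=[["The skinny kid", "A teacher", "A little kid", "The big kid"]],
    gold_answers=["The big kid"]))  # -> 1.0
```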

5 Pilot Study: Can Out-Of-Format Training Help?

We first answer the question: Is the broad idea of benefiting from out-of-format training even viable? For instance, is our intuition correct that an MC dataset can, in practice, benefit from training on an EX dataset? Before discussing our main experimental results, we briefly report on a pilot study that assesses the following basic question: Given a training set T^i_1 (the anchor dataset) of QA format F_i, is there an out-of-format training set T^j_1 of format F_j such that training jointly on T^i_1 ∪ T^j_1 improves performance relative to training only on T^i_1? To this end, we evaluate both on the matching evaluation set E^i_1 as well as on 'unseen' data E^i_2, E^i_3, . . . of the same format. The results are summarized in Table 3. The two rows in each individual table correspond to training on T^i_1 (the anchor dataset) and on T^i_1 ∪ X, where X is an out-of-format dataset corresponding to T^j_1 above. The columns represent various evaluation sets of format F_i. For each column, 'X = . . .' at the very bottom indicates the out-of-format dataset X that was most helpful in improving performance on the evaluation set in that column.³ Consider, for example, the case of the anchor set T^i_1 being BoolQ and the evaluation set being NP-BoolQ, both of format YN. Here, including out-of-format training data X = SQuAD 2 boosts performance from 51% to as much as 59%. The gain in other cases is often not this extreme. Nevertheless, across all anchor and evaluation datasets, we generally observe that there is at least one out-of-format training set whose inclusion improves performance.

³ Appendix A.3 reports extended results, including the performance with various choices of X.

Table 3: Pilot study showing that out-of-format training can help improve performance. Each table compares training on just the anchor dataset (e.g., BoolQ in the top-left table) with training also on an out-of-format dataset denoted ‘X’. Evaluation is on the anchor dataset as well as unseen datasets of that format. The last row identifies the out-of-format dataset that helped most on each evaluation dataset. All results are based on the “small” size T5 model. Color denotes QA format (see Table 2).

This pilot study thus provides a proof of concept that out-of-format training can indeed help a QA model in nearly every case. Of course, this study only shows the existence of such an out-of-format dataset, rather than providing a single unified model. Nevertheless, it helps identify representative training sets from each format that were most helpful. As noted earlier, we used this empirical data to guide which training sets to include when building UnifiedQA in Section 3.2.

6 Experimental Results

We now discuss our main experimental results, evaluating our proposed UnifiedQA system on seen (used for training the system) and unseen datasets.

6.1 UnifiedQA vs. 9 Dedicated Models

Is UnifiedQA, a single pre-trained multi-format QA system, as good as dedicated systems trained for individual datasets? We emphasize that the answer to this question is not as simple as it may seem, since earlier works have observed that a system addressing multiple tasks often underperforms a focused system (Raffel et al., 2019). Figure 3 summarizes the results of the relevant experiment. As can be observed from the figure, UnifiedQA performs almost as well as the best single-dataset experts. In some cases, UnifiedQA performs even better than the single-dataset experts (e.g., on OBQA or NQA).

Figure 3: UnifiedQA is on-par with, and often outperforms, 11 different equally-sized T5-based systems tailored to individual datasets. The figure contains separate models for each of the two subsets of the ARC and Regents datasets.
Table 4: Generalization to unseen datasets: Multi-format training (UnifiedQA) often outperforms models trained the same way but solely on other in-format datasets (e.g., UnifiedQA[EX], which is trained on all extractive training sets of UnifiedQA). When averaged across all evaluation datasets (last column), UnifiedQA shows strong generalization performance across all formats. Notably, the "Previous best" models (last row) were trained on the target dataset's training data, but are even then outperformed by UnifiedQA (which has never seen these datasets during training) on the YN tasks.

On average (last column), UnifiedQA does considerably better than the dataset/format-specific systems. In conclusion, UnifiedQA offers flexibility across multiple QA formats while compromising almost nothing compared to dataset-specific experts.

6.2 Generalization To Unseen Datasets

The question we want to explore here is whether UnifiedQA generalizes well to other, unseen datasets. Table 4 summarizes the results of experiments in which we evaluate various models on datasets that were not used to train them. The first few rows of the table show T5 models trained on individual datasets, followed by UnifiedQA. For completeness, we include the highest previous scores for each dataset; one must be careful when reading these numbers, as the previous best numbers follow the fully supervised protocol (for NewsQA (Zhang et al., 2020), Quoref (Dasigi et al., 2019), DROP (Dua et al., 2019b), ROPES (Lin et al., 2019), QASC (Khot et al., 2019), CommonsenseQA (Zhu et al., 2020), and the x-CS datasets (Gardner et al., 2020)).

The key observations are: (1) On average (last column), UnifiedQA shows much stronger generalization across a wide range of datasets. (2) On 5 (out of 12) datasets, UnifiedQA generalizes better than any single-dataset expert. For example, while the system is trained on multiple-choice questions with 4 candidate answers, it works well on datasets with more than 4 candidate answers (QASC and CommonsenseQA have 8 and 5 candidate answers per question, respectively). (3) Single-dataset experts generalize better only when the source and target datasets are very similar (for instance, SQuAD and Quoref).

6.3 State-Of-The-Art Via Simple Fine-Tuning

Fine-tuning of pre-trained language models has become the standard paradigm for building dataset-specific state-of-the-art systems. The question we address here is whether there is value in using UnifiedQA as a starting point for fine-tuning, as opposed to a vanilla language model that has not seen other QA datasets before.

To address this question, we fine-tune both UnifiedQA and T5 on several datasets. Table 5 summarizes the results of these experiments. The last two rows of the table show the performance of UnifiedQA and T5, both fine-tuned for the target task. The fine-tuning process involves selecting the best checkpoint on the dev set and evaluating on the test set.

Table 5: Simply fine-tuning UnifiedQA (last row) results in new state-of-the-art performance on 6 datasets. Further, it consistently improves upon fine-tuned T5 (2nd last row) by a margin ranging from 1% for CommonsenseQA (CQA) to as much as 13% for ARC-challenge. ‘(w/ IR)’ denotes relevant information is retrieved and appended as context sentences in the input encoding. Datasets marked with * are used in UnifiedQA’s original training.

The columns indicate evaluation on the test set corresponding to the data used for training. For several multiple-choice datasets that do not come with evidence paragraphs, we include two variants: one uses the questions as is, and another uses paragraphs fetched via an Information Retrieval (IR) system as additional evidence, indicated with "w/ IR" tags.

Table 6: The results of a leave-one-out ablation. The first row indicates the performance of UnifiedQA on each dataset it was trained on. The rest of the rows exclude one dataset at a time. The rows are sorted based on the last column: the dataset with the biggest contribution appears first. The red highlights indicate the top 3 performance drops in each column.

We use the same IR sentences as the baselines on these datasets: the Aristo corpus for the ARC and OBQA datasets (Clark et al., 2019c), and 2-step IR for QASC (Khot et al., 2019). Additionally, we show the best published scores on each dataset: ALBERT (Lan et al., 2019) on RACE, RoBERTa (Clark et al., 2019c) on OBQA and ARC, KF+SIR (Banerjee and Baral, 2020) on OBQA and QASC, and FreeLB+RoBERTa (Zhu et al., 2020) on ARC-easy and CommonsenseQA.

As can be seen, fine-tuning UnifiedQA consistently dominates fine-tuning T5, as well as the best previous scores on each dataset. Intuitively, since UnifiedQA has already seen a variety of QA formats, it is better positioned to achieve high scores after a modest amount of fine-tuning, compared to fine-tuning a vanilla T5. This could be especially effective when a user has limited training data for a target QA task.

6.4 Ablation: Training Set Contributions

In this experiment, we seek to better understand the contribution of each dataset to UnifiedQA through a leave-one-out ablation.

We take the system from §3.2 and evaluate the model when individual datasets are dropped from the union. Table 6 compares the performance of UnifiedQA trained on all the default datasets (the first row) with ablated versions that exclude one dataset at a time. The rows are sorted by the last column: the dataset with the biggest contribution appears first. The top few datasets have the highest contributions: BoolQ, SQuAD 2.0, OBQA, and NQA are the top four contributing datasets, each with a different format. SQuAD 1.1 has the least importance in the union, since it is mostly covered by SQuAD 2.0.

The conclusion here is that to build an effective variant of UnifiedQA, one can use a relatively small number of datasets, so long as it is trained on representative members of each format.

7 Conclusion

The question-answering community has fruitfully explored the design of strong models, but while staying within the boundaries of individual QA formats. We argued that such boundaries are artificial and can even limit the performance of systems, because the desired reasoning abilities being taught and probed are not tied to specific formats. Training data in one format should, in principle, help QA systems perform better even on questions in another format.

With this intuition in mind, we presented UnifiedQA, a single pre-trained QA system based on the T5 text-to-text language model, seeking to bring unification across four common QA formats. We showed that even with its simple multi-format training methodology, UnifiedQA achieves performance on par with nearly a dozen dataset-specific expert models (§6.1), while also generalizing well to many unseen datasets of seen formats (§6.2). At the same time, we demonstrated that UnifiedQA is a strong starting point for building QA systems: it can achieve state-of-the-art performance by simply fine-tuning on target datasets (§6.3).

We hope this effort will inspire a future line of work in the QA and NLP communities, moving towards more general and broader system designs. We leave extensions of UnifiedQA to other formats such as to direct-answer questions (Kwiatkowski et al., 2019) as a promising avenue for future work.

A Appendices

A.1 UnifiedQA: Different Sizes

For completeness, we also show the scores of UnifiedQA models of different sizes on each dataset. For these systems, each row is a single system.

Table 7: UnifiedQA of different sizes on our datasets.

A.2 Comparison With The Dedicated Models: Extended Results

Here we summarize an extension of the results in §6.1. Table 8 summarizes the results of the relevant experiment. The top portion of the table contains evaluations of T5 models fine-tuned for individual datasets, followed by UnifiedQA. As can be observed from the table, UnifiedQA performs almost as well as the best single-dataset experts. In some cases, UnifiedQA performs even better than the single-dataset experts (e.g., on OBQA or NQA). On average (last column), UnifiedQA does considerably better than the dataset/format-specific systems. In conclusion, UnifiedQA offers flexibility across multiple QA formats while compromising almost nothing compared to dataset-specific experts.

Table 8: UnifiedQA is on-par with systems tailored to individual datasets (the diagonal cells vs the last row) while functioning across a wide range of datasets (the last column).

A.3 Pairwise Mixing: Extended Results

Here we summarize an extension of the results in §5. The question addressed here is whether there is value in mixing datasets with different formats. We evaluated this by adding one dataset of a different format to four different datasets (one for each format). The results are summarized in Table 9. The goal of each sub-table is to measure the within-format generalization one can gain via out-of-format training. Each sub-table has an anchor dataset, indicated in the first column; for example, in the first sub-table the anchor dataset is SQuAD. The rows of each sub-table combine datasets of other formats with the anchor dataset (e.g., SQuAD + RACE). The columns contain evaluations on datasets with the same format as the anchor dataset. For example, in the first sub-table, the evaluation is done on SQuAD 1.1/2.0, NewsQA, and Quoref, which have the same format as SQuAD 1.1, the anchor dataset.

Table 9: Pairwise mixing of formats: mixing with QA datasets of different formats helps.

The results show that one can achieve gains for question answering in a certain format by incorporating resources in other formats. In the first two sub-tables, we see that NQA (AB) and OBQA (MC) help a SQuAD model generalize better to other EX datasets. In the third sub-table, where the anchor dataset is NQA (AB), EX datasets help an NQA model generalize better to other AB datasets. In the 4th and 5th sub-tables, EX and AB datasets help RACE and OBQA (MC) models generalize better to other MC datasets. Similarly, in the final sub-table, out-of-format datasets help the YN anchor (BoolQ) generalize better to other YN datasets.
