
GooAQ: Open Question Answering with Diverse Answer Types

Authors

  • Daniel Khashabi
  • Amos Ng
  • Tushar Khot
  • Ashish Sabharwal
  • Hanna Hajishirzi
  • Chris Callison-Burch
  • ArXiv
  • 2021

Abstract

While day-to-day questions come with a variety of answer types, the current question answering (QA) literature has failed to adequately address the answer diversity of questions. To this end, we present GOOAQ, a large-scale dataset with a variety of answer types. This dataset contains over 5 million questions and 3 million answers collected from Google. GOOAQ questions are collected semi-automatically from the Google search engine using its autocomplete feature. This results in naturalistic questions of practical interest that are nonetheless short and expressed using simple language. GOOAQ answers are mined from Google's responses to our collected questions, specifically from the answer boxes in the search results. This yields a rich space of answer types, containing both textual answers (short and long) as well as more structured ones such as collections. We benchmark T5 models on GOOAQ and observe that: (a) in line with recent work, LMs' strong performance on GOOAQ's short-answer questions heavily benefits from annotated data; however, (b) their quality in generating coherent and accurate responses for questions requiring long responses (such as 'how' and 'why' questions) is less reliant on observing annotated data and is mainly supported by their pre-training. We release GOOAQ to facilitate further research on improving QA with diverse response types.1

1 Introduction

Figure 1: Examples from GOOAQ showing the different types of questions considered in this study. Each input is a natural language question, mapped to textual answer(s). The questions/answers come with answer types, which are inferred from the meta information of the search results.

Research in "open" question answering (also referred to as open-response, open-domain, or direct answer QA) has resulted in numerous datasets and powerful models for answering questions without a specified context, by using background knowledge either stored in the QA model or retrieved from large corpora or knowledge bases. Open QA datasets, however, focus mainly on short responses (Berant et al., 2013; Joshi et al., 2017; Lee et al., 2019; Lewis et al., 2021; Bhakthavatsalam et al., 2021) , and so do the models designed for them Lewis et al., 2020b; Min et al., 2021; Lewis et al., 2021) . Further, the short responses considered typically inquire about people, dates, and counts (e.g., 65% of Natural Questions (Kwiatkowski et al., 2019) begin with 'who', 'when', or 'how many'; cf. Fig 2) .

Figure 2: The distribution of common bigrams in questions of GOOAQ (a,b,c) vs. Natural-Questions (d).

In contrast, many of the everyday questions that humans deal with and pose to search engines have a more diverse set of responses, as illustrated in Fig. 1 . Their answer can be a multi-sentence description (a snippet) (e.g., 'what is' or 'can you' questions), a collection of items such as ingredients ('what are', 'things to') or of steps towards a goal such as unlocking a phone ('how to'), etc. Even when the answer is short, it can have richer types, e.g., unit conversion, time zone conversion, or various kinds of knowledge look-up ('how much', 'when is', etc.). Such answer type diversity is not represented in any existing dataset.

We propose GOOAQ, the first open QA benchmark containing questions with all of the above answer types within a unified dataset, collected using the same, coherent process. Specifically, GOOAQ contains 3 million questions with short, snippet, or collection answers, such as the ones shown in Fig. 1 .

GOOAQ questions are automatically mined from Google's search-autocomplete feature and thus, we hypothesize, represent popular queries of real-world interest. Such questions also trigger 'answer boxes' in the search results, containing responses deemed best by Google, which we extract and use as the gold answer.

Having a variety of answer types within a single, coherent, open QA benchmark enables a quantitative study of the inherent differences across them. Specifically, we use GOOAQ to ask whether models for different answer types:

(Q1) benefit similarly from pre-training?
(Q2) benefit similarly from labeled data?
(Q3) benefit similarly from larger models?

We explore these questions in the context of generative pre-trained language models (LMs) (Lewis et al., 2020a) as self-contained reasoners, without explicit access to additional retrieved information. In particular, we benchmark the powerful T5 language model on GOOAQ, with both automatic metrics and human evaluation (§4).

Figure 3: Comparison of question length distributions in GOOAQ and Natural Questions.
Figure 4: Crowdsourcing interface used for our human evaluation.

To study (Q1), we train models (separately for each response type) with little labeled data (2k questions), mainly to convey the nature of the task. While LMs struggle in this setting on short response questions, as expected, they perform surprisingly well at generating snippet and collection responses (e.g., humans prefer T5-11B's response to Google's response for 30% of such questions; Fig. 5, bottom-right plots). We hypothesize this is because response fluency and coherence carry a much higher weight in such questions, and these factors benefit remarkably from the LM pre-training objective. On (Q2), we observe the opposite trend: short response questions benefit consistently from increasing amounts of supervised (labeled) data, whereas both snippet and collection response questions show minimal gains. On (Q3), larger models, as expected, are more effective in all cases, but the gains are much more pronounced for snippet and collection response generation (20+%) than for short responses (5-10%), under human evaluation.

Figure 5: Evaluation of T5 (small, 11B) models on different sub-tasks of GOOAQ via automatic metrics (top) and human judgments (bottom). For human evaluation, 50% is the threshold at which the model output and the ground-truth responses are indistinguishable. The short-answer sub-task (Tshort; left) has relatively low performance when supervised with 2k instances, but it benefits more than the long-answer sub-tasks (Tsnippet & Tcollection) from the availability of more labeled data. In contrast, the long-answer sub-tasks benefit very little from more labeled data. Additionally, we observe that the gap between the two systems is bigger under human evaluation than under automatic evaluation, especially on the long response tasks (middle & right).

We hope GOOAQ will facilitate research towards improving models to answer snippet and collection response questions. While the largest models we consider achieve surprisingly high scores on these questions, they still lag behind gold responses in both automated and human evaluations. Importantly, since more annotated data provides little benefit on such questions, reaching human parity will require rethinking how we build better models.

We find GOOAQ to be a valuable resource for training models. E.g., T5 trained on our snippet and collection questions shows strong zero-shot generalization to ELI5 (Fan et al., 2019), a long-answer dataset, achieving a score within X% of the state-of-the-art models trained using ELI5 data.

Contributions. (a) We present GOOAQ, a collection of 3 million question-answer pairs with a diverse set of answers, along with a crowdsourced assessment of its quality. (b) We benchmark state-of-the-art models on GOOAQ, in terms of both automatic and human judgments, and observe remarkable differences in how models behave w.r.t. various response types. (c) We demonstrate that GOOAQ is also a valuable model training resource by showing strong zero-shot generalization to ELI5 (Fan et al., 2019). We hope this dataset will spur further research into open QA under diverse response types.

2 Related Work

A closely related work is the Natural-Questions (NQ) dataset (Kwiatkowski et al., 2019; Lee et al., 2019), which contains common questions written by Google users. While our questions (extracted via autocomplete) likely approximate questions commonly asked on Google, our dataset represents a different and wider distribution of questions (§3.2), likely because it encompasses different classes of answers, particularly snippet and collection responses. Specifically, while NQ is dominated by 'who', 'when', and 'how many' questions (cf. Fig. 2(d)), GOOAQ has notably few 'who' questions and a substantial portion of questions starting with 'how to', 'what is', 'what does', 'can you'.

One notable QA dataset with long-form responses is ELI5 (Fan et al., 2019; Krishna et al., 2021) , containing questions/answers mined from Reddit forums. In contrast, GOOAQ is collected differently and is several orders of magnitude larger than ELI5. Empirically, we show that models trained on GOOAQ transfer surprisingly well to ELI5 ( §5.3), indicating GOOAQ's broad coverage.

It is worth highlighting that there is much precedent for using search engines to create resources for the analysis of AI systems. Search engines harness colossal amounts of click information to help them effectively map input queries to a massive collection of information available in their index (Brin and Page, 1998; Joachims, 2002; Joachims et al., 2017). Although academic researchers do not have direct access to information collected from the users of search engines, the data gathered from search results can act as a proxy for it and all the complex engineering behind it. In particular, the GOOAQ dataset used in this study is likely not the output of a single QA system at Google; rather, we hypothesize, this data is produced by a complex combination of many systems, various forms of user feedback, as well as expert annotation/verification of highly popular responses.

3 GOOAQ Dataset

We start by describing how GOOAQ was collected, followed by key dataset statistics and quality assessment.

3.1 Dataset Construction

Constructing this dataset involved two main steps: extracting questions from search auto-complete, and extracting answers from answer boxes.

3.1.1 Query Extraction

To extract a rich yet natural set of questions, we use Google auto-completion.2 A similar strategy was also used by Berant et al. (2013), albeit in the context of a slightly different study. We start with a seed set of question terms (e.g., "who", "where", etc.; the complete list is in Appendix A). We bootstrap from this set by repeatedly querying prefixes of previously extracted questions, in order to discover longer and richer sets of questions. Questions extracted from the autocomplete algorithm in this way are highly reflective of popular questions posed by users of Google. We filter out any questions shorter than 5 tokens, as they are often incomplete. This process yielded ∼5M questions, collected over a span of 6 months. The average question length is about 8 tokens.
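The query extraction described above can be approximated with a simple breadth-first bootstrapping loop. The sketch below is a hypothetical re-implementation rather than the authors' released code: the autocomplete endpoint is the one from footnote 2, and its JSON response format (suggestions in the second array element) is an assumption.

```python
import json
import time
from collections import deque
from urllib.parse import quote
from urllib.request import urlopen

# Subset of the seed terms listed in Appendix A.
SEED_TERMS = ["who", "what", "when", "where", "why", "how"]

def autocomplete(prefix):
    """Return autocomplete suggestions for a prefix (response format assumed)."""
    url = "http://google.com/complete/search?client=chrome&q=" + quote(prefix)
    with urlopen(url) as resp:
        payload = json.loads(resp.read().decode("utf-8"))
    return payload[1]  # assumed: the second element holds the suggestion list

def bootstrap_questions(max_questions=1000):
    questions, queue = set(), deque(SEED_TERMS)
    while queue and len(questions) < max_questions:
        prefix = queue.popleft()
        for suggestion in autocomplete(prefix):
            words = suggestion.split()
            if suggestion in questions or len(words) < 5:
                continue  # skip seen or too-short (likely incomplete) questions
            questions.add(suggestion)
            # Re-query word-level prefixes of the new question later,
            # to discover longer and richer completions.
            for i in range(2, len(words) + 1):
                queue.append(" ".join(words[:i]))
        time.sleep(1)  # throttle requests
    return questions
```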

3.1.2 Answer Extraction

To mine answers to our collected questions, we rely on the Google answer boxes shown at the top of the search results when the questions are issued to Google. There are a variety of answer boxes. The most common kind involves highlighted sentences (extracted from various websites) that contain the answer to a given question. These form the snippet and collection answers in GOOAQ. In some cases, the answer box shows the answer directly, possibly in addition to the textual snippet. These form the short answers in GOOAQ.3 We first scrape the search results for all of our questions (collected as described in §3.1.1). This is the main extraction bottleneck, which was done over a span of 2 months. Subsequently, we extract answer strings from the HTML content of the search results. Answer types are also inferred at this stage, based on the HTML tags around the answer.
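As a rough illustration of this extraction step, the sketch below parses a saved search-result page and infers the answer type from the surrounding HTML. The CSS selectors (`div.featured-snippet`, `div.direct-answer`) are purely illustrative placeholders; the real answer-box markup is not documented here and changes over time.

```python
# Minimal sketch of answer extraction from scraped search-result pages,
# assuming the pages have already been saved to disk.
from bs4 import BeautifulSoup  # third-party: beautifulsoup4

def extract_answer(html):
    soup = BeautifulSoup(html, "html.parser")

    # Featured-snippet block (hypothetical selector): long textual answers.
    snippet = soup.select_one("div.featured-snippet")
    if snippet is not None:
        # Bulleted/numbered content inside the box is treated as a collection.
        items = [li.get_text(" ", strip=True) for li in snippet.select("li")]
        if items:
            return {"type": "collection", "answer": items}
        return {"type": "snippet", "answer": snippet.get_text(" ", strip=True)}

    # Direct-answer block (hypothetical selector): short answers such as
    # unit conversions, time-zone conversions, or knowledge look-ups.
    short = soup.select_one("div.direct-answer")
    if short is not None:
        return {"type": "short", "answer": short.get_text(" ", strip=True)}

    return None  # no answer box found for this question
```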

After the answer extraction step, we have all the necessary information to create a question in GOOAQ, such as the examples in Fig. 1.

3.2 GOOAQ Statistics

Table 1 summarizes various statistics about GOOAQ, broken down by question/answer type. Of the 5M collected questions, about half resulted in successful answer extraction from answer boxes. The largest category is 'snippet' answers, with over 2.7M responses (examples shown in the left-most column of Fig. 1). The other major category is 'collection' answers, with 329k questions (examples shown in the right-most column of Fig. 1).

To better understand the content of GOOAQ, we present several distributions from the data. Fig. 3 shows the length distribution of GOOAQ questions and that of Natural-Questions (Kwiatkowski et al., 2019). While the vast majority of NQ questions contain 8-10 tokens, GOOAQ questions have a somewhat wider length span. To better understand the type of questions, Fig. 2 shows the distribution of the most frequent opening bigrams of the questions. Among the short answer questions, the majority are information-seeking questions about counts ('how many'), places ('where is'), values ('how much'), and people ('who is'). The comparison with NQ (Fig. 2, right) highlights that GOOAQ represents a different and wider class of questions. Specifically, NQ has many 'who', 'when', and 'how many' questions, while GOOAQ dominantly contains 'how' and 'what' questions, which typically require explanatory responses.
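The opening-bigram statistics discussed above (Fig. 2) can be reproduced from the raw question strings with a few lines of code; the snippet below is a straightforward sketch of that counting, not the authors' analysis script.

```python
from collections import Counter

def opening_bigram_distribution(questions, top_k=15):
    """Return the top_k opening bigrams and their relative frequencies."""
    counts = Counter()
    for q in questions:
        tokens = q.lower().split()
        if len(tokens) >= 2:
            counts[" ".join(tokens[:2])] += 1
    total = sum(counts.values())
    return [(bigram, n / total) for bigram, n in counts.most_common(top_k)]

# Example usage:
# dist = opening_bigram_distribution(["how to unlock an iphone", "who is the ceo of google"])
```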

Table 1: GOOAQ statistics

3.3 Quality Assessment of GOOAQ

We perform a crowdsourcing experiment to assess the quality of the extracted questions and their answers. We use Amazon Mechanical Turk (AMT) to annotate about 2.5k randomly selected question-answer pairs. The annotators were asked to annotate (1) whether a given question makes sense and, if so, (2) whether the provided answer is clear and complete.

Since our task is focused on English, we required workers to be based in a country with a predominantly native-English-speaking population (e.g., USA, Canada, UK, and Australia) and to have completed at least 5000 HITs with a ≥ 99% assignment approval rate. Additionally, we used a qualification test with half a dozen questions, all of which had to be answered correctly by our annotators. To prevent biased judgments, we also asked the annotators to avoid using Google search (which is what we used to mine GOOAQ) when annotating the quality of the shown instances.

We compute aggregate measurements for (1) the average rating of questions and (2) the average rating of answer quality, among valid questions.

As can be seen in the results in Table 2, only a very small percentage of the questions were judged invalid, and among the valid questions, a substantial portion were deemed to have clear and complete answers.

4 Experimental Setup

GOOAQ naturally forms a dataset for the task of open QA, where the input is a question and the output is the answer to that question. Unlike the reading comprehension setting, the context for answering the question is not provided as part of the input. Further, we consider the so-called 'closed-book' setup, where the model (e.g., a language model) is expected to use background knowledge stored within it, without access to any additional explicit information retrieval mechanism.

Setup. We split GOOAQ into three sub-tasks: (Tshort) short-response questions (cf. footnote 3), (Tsnippet) snippet-response questions, and (Tcollection) collection-response questions. We train and evaluate models for each of these sub-tasks separately. We define them as separate sub-tasks since, by merely reading a question, it might not be clear whether its response should be short, a snippet, or a collection.

Data splits. For each sub-task, we randomly sample test and dev sets such that each evaluation split contains at least 500 instances from each question type. We experiment with variable training data sizes to better understand the value of labeled data. Prior work has shown that leakage from training data into evaluation sets often results in unrealistically high scores (Lewis et al., 2020b). To minimize this issue, we create training splits by selecting the instances most dissimilar to our evaluation splits. The similarity of each training instance is computed as the maximum amount of token overlap with any instance in the test/dev set (computed over both questions and answers). Using the most dissimilar subset of the training instances, we create training splits of the following sizes: 2k, 20k, and 200k. For Tsnippet, we also have a 2M training set since this sub-task has more data.
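The dissimilarity-based selection of training instances can be sketched as follows; the instance fields and function names are illustrative, and the exact token-overlap measure used by the authors may differ (e.g., in normalization).

```python
def token_set(instance):
    """Tokens of an instance, computed over both question and answer."""
    return set((instance["question"] + " " + instance["answer"]).lower().split())

def select_dissimilar_training(train_pool, eval_instances, size):
    """Keep the `size` training instances with the smallest maximum
    token overlap against any evaluation instance."""
    eval_sets = [token_set(x) for x in eval_instances]

    def max_overlap(instance):
        toks = token_set(instance)
        return max(len(toks & ev) for ev in eval_sets)

    ranked = sorted(train_pool, key=max_overlap)  # least similar first
    return ranked[:size]
```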

Models. For our evaluation, we use the T5 model, a recent text-to-text framework that has achieved state-of-the-art results on a variety of tasks, including open QA. We use two model sizes that capture the two extremes: the smallest model ('small') and the largest model ('11B'). Both models were trained for 20k steps on the training splits, dumping checkpoints every 2k steps (with 196,608 tokens per batch on v3-128 TPUs) with the default hyperparameters. We select the checkpoint with the highest score on the 'dev' set and report its corresponding 'test' score.
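For readers without TPU access, a rough equivalent of this fine-tuning setup can be put together with Hugging Face Transformers. Note this is a stand-in sketch, not the paper's original T5 training code; the learning rate, batching, and input/output formatting here are assumptions.

```python
import torch
from transformers import T5ForConditionalGeneration, T5TokenizerFast

tokenizer = T5TokenizerFast.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

def training_step(questions, answers):
    """One gradient step on a batch of (question, answer) string pairs."""
    model.train()
    inputs = tokenizer(questions, padding=True, truncation=True, return_tensors="pt")
    labels = tokenizer(answers, padding=True, truncation=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding in the loss
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()

# Example usage on a single pair:
# training_step(["how to unlock an iphone?"], ["Press and hold the side button ..."])
```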

Automatic evaluation. We use the ROUGE-L metric (Lin, 2004) since, despite its shortcomings, it is a common metric for assessing the quality of text generation models. The results of the automatic evaluation for each sub-task are shown in the top row of Fig. 5. For short answer questions, we show the automatic evaluation of each sub-type (unit conversion, time conversion, etc.) in Fig. 6.
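As a concrete reference, this metric can be computed with the `rouge_score` package as below; the paper does not state which ROUGE implementation or settings (e.g., stemming) were used, so those choices are assumptions.

```python
from rouge_score import rouge_scorer

def average_rouge_l(predictions, references):
    """Average ROUGE-L F1 (in percent) between predictions and gold answers."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [
        scorer.score(ref, pred)["rougeL"].fmeasure
        for pred, ref in zip(predictions, references)
    ]
    return 100.0 * sum(scores) / len(scores)
```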

Figure 6: Automatic evaluation of T5 (small: top, 11B: bottom) models on the different types of questions included in the short-answer sub-task (Tshort). 'Unit-conversion' questions benefit the most from more labeled data, while 'knowledge' lookup questions are the opposite.

Human evaluation. Automatic evaluation is well-known to be sub-optimal for text generation models, especially for longer text. Hence, in addition to the automatic metric, we also perform human evaluation of the generated responses, using the judgments of AMT crowdworkers. Specifically, we ask the crowdworkers to indicate their preferred answer (between the gold answer and the model prediction). The annotation interface is shown in Fig. 4; it is essentially the same template used for the quality assessment of the dataset (§3.3), except that here the crowdworkers are shown a pair of responses for each question (the gold answer and the one generated by the model), turning the task into a comparative one. Before annotating each instance, we remind the annotators to avoid using Google. We then ask them to check whether the provided question is clear and makes sense. Upon indicating 'yes' to question quality, they are shown two answers labeled 'A' and 'B' (one being Google's answer and one generated by our models, ordered randomly). We ask them to indicate the answer they prefer (if any).

Each question is annotated by 5 independent annotators, and their labels are aggregated via a majority vote. If the majority prefers the model prediction, we assign 1 credit to the prediction; otherwise, the model receives 0 credit. To compute an overall accuracy score for a given model, we average the instance-level scores, after discarding questions marked as invalid ('this question makes no sense').

The resulting human-evaluation metric indicates the percentage of the cases where model predictions were preferred over ground-truth answers. In this evaluation, 50% is the margin where the annotators are not able to distinguish the model's responses from the ground-truth responses (Google's answers) in any meaningful way. The results of human evaluation are shown in the bottom row of Fig. 5 . 4
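The aggregation just described amounts to a filtered majority vote; the sketch below shows one way to compute the reported percentage, with label names chosen purely for illustration.

```python
from collections import Counter

def human_eval_score(annotations):
    """`annotations` maps a question id to a list of 5 labels in
    {"model", "gold", "invalid"} (label names are illustrative)."""
    credits = []
    for labels in annotations.values():
        votes = Counter(labels)
        if votes["invalid"] > len(labels) / 2:
            continue  # drop questions judged to make no sense
        credits.append(1 if votes["model"] > votes["gold"] else 0)
    # Percentage of valid questions where the model's answer was preferred.
    return 100.0 * sum(credits) / len(credits)
```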

5 Empirical Findings

We start with our main findings, then discuss an error analysis, and finally show extrinsic value of GOOAQ as a model training resource.

5.1 Main Results

Model pre-training is surprisingly effective on the snippet and collection answer sub-tasks. Both automatic and human evaluations of these two classes of questions (Fig. 5; middle & right) demonstrate that the T5-11B model is surprisingly effective at answering them in a few-shot setting, with only 2k training examples in each case. For example, crowdworkers even prefer the model's answer over Google's in 30% of the cases (50% would be a tie). This is in contrast with short answer questions, where the model's accuracy is only around 10% and crowdworkers prefer Google's answers in about 90% of the cases.

To understand this observation, one needs to put into perspective several factors that are at play. First, short answer questions typically ask for encyclopedic knowledge and, therefore, the correctness of the answers matters the most. In snippet and collection questions, we suspect the coherence of the response carries a heavier weight. This is partly due to the nature of the collected questions, which typically refer to broad notions that can be responded to in a variety of ways. For example, the snippet response to the question how many calories burned 30 minutes crossfit? (Fig. 1) could refer to a range of calorie consumption, depend on the choice of activity during CrossFit, or vary by the attributes of the person working out. All of these responses would be equally correct.

Labeled data is more helpful for short answer questions. Based again on both the automatic and human evaluations (Fig. 5; left), the performance of both the small and 11B parameter models on the short response questions quickly improves as we increase the amount of training data, especially beyond 20k. This is in contrast with snippet and collection questions, where even 200k labeled instances don't appear to help much, indicating that for these question types, model pre-training carries most of the weight.

The breakdown of the automatic evaluation for different types of short response questions is shown in Fig. 6. As expected, certain niche question types (such as 'unit-conversion') benefit the most from labeled data. On the other hand, open-ended question types (such as 'knowledge' lookup) benefit less from more labeled data.

Human evaluation accentuates the gap between the 'small' and '11B' models, especially on snippet and collection response questions. This is visually evident from the gap between the blue and red curves in the bottom row vs. the top row of Fig. 5 . This is compatible with recent work of Min et al. (2021) , who also observed that the gap between two reasonably different systems is bigger when using human evaluation. We hypothesize this is due to the crudeness of automatic evaluation metrics, and an indication of human evaluation's necessity to distinguish between nuanced differences among generated responses. What is perhaps more interesting (and not evident from prior work) is that the gap between automatic and human evaluation is larger for the snippet and collection questions than short answer questions, especially for the T5-small model. This is, at least partly, due to the inaccuracy of automatic metrics in evaluating long text.

Few-shot '11B' models achieve high performance, but not yet comparable with the gold annotations. As mentioned earlier, our human evaluation measures the comparative quality of the model predictions and the ground truth responses (Google's answers). Hence, a value of 50% in this evaluation is an indication that the predictions are on par with (i.e., indistinguishable from) the ground-truth responses.

As can be seen in the bottom row of Fig. 5, the T5-11B model comes quite close to Google's answers but is still not quite on par with them. We hope this gap will encourage further research in building stronger models, especially for the snippet and collection answer questions, where more labeled data doesn't appear to be a promising way to increase accuracy.

5.2 Error Analysis

To gain an intuition about the mistakes made by the models, we conducted a small-scale error analysis of model predictions. For each model, we (one of the authors) annotated predictions and labeled them with the following error categories, inspired by existing evaluations of text summarization (Chaganty et al., 2018): incompleteness, indicating the lack of expected substance in the prediction; redundancy, indicating repeated content; hallucination, indicating the existence of made-up statements; and incoherence, indicating the existence of grammatical errors. The results of our error analysis are summarized in Fig. 7. As expected, the 'small' model has higher error percentages across all categories. It suffers particularly from a lot of redundancy and incompleteness. Overall, both models have very little incoherence, which is to be expected given the strong pre-training of these language models.

Figure 7: Error distribution for the two models

5.3 Extrinsic Utility Of Gooaq

We next assess the value of GOOAQ as a model training resource. In particular, we train our models on questions from GOOAQ and evaluate them in a zero-shot way on ELI5 (Fan et al., 2019), a relatively recent dataset with long-answer questions extracted from Reddit posts.

Table 2: Summary of GOOAQ quality evaluation by crowdworkers. According to human ratings, a very small percentage of the questions are invalid (first column). Among the valid questions, a substantial portion are deemed to have valid answers.

Our evaluation, summarized in Table 3 , shows that both our small and 11B T5 models trained on GOOAQ's snippet-answer subset perform quite well (21.8 and 22.9, resp.) when evaluated on ELI5 in a zero-shot setting. They are even better than the same architectures trained with ELI5's own training data (19.0 and 22.7, resp.) and on par with the state-of-the-art models that use retrieval engines (23.4).

Table 3: Evaluation of our models on ELI5 dataset. Results indicated with * are reported from prior work (Krishna et al., 2021). T5 fine-tuned on GOOAQ performs well on ELI5, another long-answer dataset.

We hypothesize that despite the fact that GOOAQ is collected differently than ELI5, a notable portion of ELI5 questions are covered by GOOAQ, indicating its good coverage of common questions posed by ordinary users.
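The zero-shot transfer evaluation can be approximated as below, reusing the `average_rouge_l` helper sketched in Section 4. The Hugging Face dataset identifier, split name, and field names for ELI5 are assumptions, as is the choice to score against the first reference answer.

```python
from datasets import load_dataset

def zero_shot_eli5_eval(model, tokenizer, max_examples=500):
    """Generate answers for ELI5 questions with a GOOAQ-fine-tuned model
    and score them with ROUGE-L (no ELI5 training involved)."""
    eli5 = load_dataset("eli5", split="validation_eli5")  # assumed dataset id/split
    predictions, references = [], []
    for ex in eli5.select(range(max_examples)):
        inputs = tokenizer(ex["title"], return_tensors="pt", truncation=True)
        output_ids = model.generate(**inputs, max_length=200)
        predictions.append(tokenizer.decode(output_ids[0], skip_special_tokens=True))
        references.append(ex["answers"]["text"][0])  # first reference answer
    return average_rouge_l(predictions, references)
```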

6 Discussion

Knowledge leakage in the evaluation. One recent finding in the field is about knowledge leakage between train and evaluation sets (Lewis et al., 2020b; Emami et al., 2020) . Similar concerns have motivated our careful train/evaluation splits ( §4) and experiments with varying training set sizes. Nevertheless, we found it challenging to define (and assess) the amount of leakage from the training data to evaluation. We welcome such studies on GOOAQ.

Are we mimicking Google's QA? A reader might question the value of this work by noting that the website from which GOOAQ was mined had likely also used a QA system to begin with.

In other words, are we trying to simply reverse-engineer Google's internal QA system? While we (the authors) are not aware of how Google's answer box system works, we suspect that it is much more than a single QA system built using current AI technology. The system, besides incorporating one or more QA models, likely makes heavy use of implicit user feedback (e.g., information contained in billions of clicks, the structure of web links, etc.), in addition to explicit feedback from users and possibly some expert curation of answers to common questions. Thus, the data in Google's answer boxes likely captures a variety of signals that contribute towards its high quality. We believe aiming for a 'standard' NLP QA system that's on par with Google QA is therefore a challenging and worthwhile goal.

Replicating human evaluation. One challenge with respect to the progress on the long-form QA task is the evaluation of responses. To facilitate future work on GOOAQ and replicability of our human evaluations, we have released the templates we have used to crowdsource human judgements. There are also active proposals for streamlining the evaluation of text generation tasks which we might adopt.

Scope of our conclusions. It is worth emphasizing that one must be careful in taking these conclusions out of the context of this study (i.e., the dataset at hand, the models, the evaluation metrics used, etc.). While we hypothesize that these findings are relatively general, it might be possible to come up with a different long-form QA dataset on which the same baselines show a wildly different behavior.

7 Conclusion

We studied open question-answering under diverse response types. For this purpose, we collected GOOAQ, a very large collection of QA pairs mined from Google, with a variety of (short and long) answer types, all of which are collected using a unified, coherent process, enabling a cross-type comparison. The questions are collected from the auto-complete system, which likely reflects a natural distribution of questions asked by users.

We benchmarked two variants of a state-of-the-art self-contained text generation model (T5, with no retrieval) on the three different sub-tasks of GOOAQ (short, snippet, and collection), using both automatic and human evaluations. Our analysis brings out the distinct behavior of LMs on long and short response questions. For example, while short response models benefit heavily from more labeled data, the surprisingly strong performance of long response models is driven mostly by their pre-training. We also demonstrate that GOOAQ is a valuable model training resource, via high zero-shot performance on an extrinsic task, ELI5.

We hope our analysis and the released data will benefit the understanding of and further development of QA systems for dealing with diverse response types.

A Query Terms Used for Query Extraction

The list of terms used for bootstrapping questions: "who", "whom", "whose", "what", "which", "when", "where", "why", "how", "should", "would", "wouldn't", "can", "can't", "will", "won't", "aren't", "do", "does", "has", "have", "am", "are", "is", "shouldn't", "isn't", "could", "couldn't", "does", "don't", "must", "may", "ought".

B Error Analysis

To gain better intuition about the mistakes made by the models, we conducted a small-scale error analysis of the model predictions. For each model, we (one of the authors) annotated 30 predictions and labeled them with the following error categories, inspired by existing evaluations of summarization (Chaganty et al., 2018):

• Incompleteness: the prediction makes some relevant points but is not as detailed as expected.
• Redundancy: the prediction has repeated content.
• Hallucination: the prediction makes up new facts.
• Incoherence: the prediction lacks fluency and coherence.

Here are examples of such errors in model predictions:

Question: how much does it cost to get court married?
Google ans: Your application must be notarized. Please include at least one phone number where you can be reached during the day. Enclose a check or money order for $69 payable to King County. The check or money order must include your address.
Predicted ans: The cost of a court marriage is between $40,000 and $50,000. The cost of a court marriage is between $40,000 and $50,000. The cost of a court marriage is between $40,000 and $50,000.
Explanation: In this example, the model has made up numbers that do not seem reasonable (hence, a 'hallucination' error). Additionally, it repeats the same point (hence, a 'redundancy' error).

Question: who went to the superbowl last year?
Google ans: Here's your answer: The Eagles won Super Bowl LII on Feb. 4, 2018, beating the Patriots 41-33 to win the league championship.
Predicted ans: The Super Bowl was a national championship game between the New York Giants and the New York Giants. The team was a member of the New York Giants, who starred in the Super Bowl in the 1970s.
Explanation: In this example, there is repetition in the model's prediction (hence, a 'redundancy' error).

Question: what courses do you need to be a kinesiologist?
Google ans: To become a kinesiologist, you typically need either a bachelor's or a master's degree in Kinesiology or a related field, such as Human Kinetics, Exercise Science, or Exercise Physiology.
Predicted ans: To become a kinesiologist, you will need to complete a four-year degree in kinesiology, biology, chemistry, chemistry, physics.
Explanation: In this example, the model provides some relevant points, but the response is not as detailed as the gold response (hence, an 'incompleteness' error).

The results of our error analysis are shown in Fig. 7. As expected, the 'small' model has higher error rates across all categories. It particularly suffers from a lot of 'redundancy' and 'incompleteness'. Overall, both models exhibit very little 'incoherence', mainly because this aspect is directly addressed during model pre-training.

https://github.com/allenai/gooaq

http://google.com/complete/search?client=chrome&q=...

We define 'short' response questions to be the union of 'knowledge', 'unit-conversion', 'time-conversion', and short answers from the 'snippet' responses (cf.Fig. 1).

The templates used for human evaluation are available at the URL in Footnote 1.

Figure 8: More examples from GOOAQ. Instances of questions with the same type share background colors.