
A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers


Abstract

Readers of academic research papers often read with the goal of answering specific questions. Question Answering systems that can answer those questions can make consumption of the content much more efficient. However, building such tools requires data that reflect the difficulty of the task arising from complex reasoning about claims made in multiple parts of a paper. In contrast, existing information-seeking question answering datasets usually contain questions about generic factoid-type information. We therefore present Qasper, a dataset of 5049 questions over 1585 Natural Language Processing papers. Each question is written by an NLP practitioner who read only the title and abstract of the corresponding paper, and the question seeks information present in the full text. The questions are then answered by a separate set of NLP practitioners who also provide supporting evidence to answers. We find that existing models that do well on other QA tasks do not perform well on answering these questions, underperforming humans by at least 27 F1 points when answering them from entire papers, motivating further research in document-grounded, information-seeking QA, which our dataset is designed to facilitate.

1 Introduction

Machines built to assist humans who engage with texts to seek information ought to be designed with an awareness of the information need. Abstractly, the human's need should define the lens through which the system views the text in order to find desired information. Existing information-seeking machine reading datasets (e.g., Kwiatkowski et al., 2019) have led to significant progress in reading at scale (e.g., Asai et al., 2020; Guu et al., 2020). However, most of those benchmarks focus on an "open domain" setting where the questions are not anchored in any particular user context. The result is an emphasis on generic factoid questions, rather than the full range of information needs people have.

Figure 1: An example instance taken from QASPER. A question about the paper is written after reading only the title and the abstract. To arrive at the answer, one finds relevant evidence, which can be spread across multiple paragraphs. In this example, to answer the question about "baselines", the reader must realize from evidence from Sections 3 and 4 that "context documents" come pre-ranked in the dataset and the paper's "baselines" select from these "context documents." (The figure reproduces the title and abstract of the paper "Quasar: Datasets for Question Answering by Search and Reading", the question "Which retrieval system was used for the baselines?", evidence paragraphs from Sections 3, 3.2, and 4.4 of that paper, and the answer "The dataset comes with a ranked set of relevant documents. Hence the baselines do not use a retrieval system.")

We present QASPER, 1 an information-seeking question answering (QA) dataset over academic research papers. Each question is written as a follow-up to the title and abstract of a particular paper, and the answer, if present, is identified in the rest of the paper, along with the evidence required to arrive at it. This setup results in questions requiring more complex document-level reasoning than prior datasets, because (i) abstracts provide rich prompts for questions that can be asked as follow-up and (ii) academic research papers naturally trigger questions from their target readers that require supporting or refuting claims. This evidence may be spread across the paper, including tables and figures, often resulting in complex entailment problems. The example in Figure 1 illustrates one such case where we need to retrieve information from paragraphs in three different sections to answer the question.

QASPER contains 5,049 questions over 1,585 natural language processing (NLP) papers, asked by regular readers of NLP papers, and answered by a separate set of NLP practitioners. Each paper has an average of 3.2 questions, up to a maximum of 12 questions for a single paper. In addition to providing answers when the questions are answerable, the annotators were asked to select text, tables, or figures as evidence required for answering the questions. 55.5% of the questions require evidence from multiple paragraphs in the paper and 13% require tables or figures. To the best of our knowledge, QASPER is the first QA dataset in the academic research domain focusing on entire papers, and not just abstracts.

To quantify the difficulty of the tasks in QASPER, we apply state-of-the-art document-level Transformer (Vaswani et al., 2017) models to the tasks of selecting evidence and generating answers, and show that the best model performance lags behind humans by 27 F1 points at answering questions from entire papers and 32 F1 points at selecting the paragraphs that provide evidence to answer the questions, indicating that these are both unsolved problems. Additionally, we experiment with oracles that answer questions from gold evidence and find that better pretraining and domain adaptation might be helpful.

2 Building The Qasper Dataset

We now describe our process for constructing the dataset. We began with a set of open-access NLP papers, recruited NLP practitioners who are regular readers of research papers, and designed two different data collection interfaces: one for collecting follow-up questions given titles and abstracts, and another for obtaining evidence and answers to those questions.

2.1 Papers

We filtered S2ORC (Lo et al., 2020), 2 a collection of machine-readable full text for open-access papers, to (i) those from arXiv with an associated LaTeX source file, 3 and (ii) those in the computational linguistics domain. 4 We limited our domain to computational linguistics to ensure high quality, as we have access to realistic users through our research network; broader domain collection is left to future work and should be enabled by the proof of concept of our protocols given in this paper. We used the S2ORC parser (which normalizes multi-file LaTeX sources and resolves comments and macros) to convert LaTeX markup to full text while preserving section and paragraph breaks and math equations. We supplemented the paper text with extracted images of figures and tables associated with their captions; these were crawled from Semantic Scholar. 5 The result of this process was a collection of 18K full-text papers for annotation.
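For concreteness, the following is a minimal sketch of such a filtering step over S2ORC metadata stored as gzipped JSON lines. The field names used here (`arxiv_id`, `acl_id`, `has_latex_parse`, `arxiv_categories`) are assumptions about the metadata schema rather than documented keys, and the real pipeline additionally runs the S2ORC LaTeX parser described above.

```python
import gzip
import json


def select_candidate_papers(metadata_path):
    """Filter S2ORC metadata (gzipped JSON lines) down to arXiv papers with
    LaTeX sources that belong to the computational linguistics domain.

    NOTE: the field names below are assumptions about the metadata schema,
    not guaranteed keys; adjust them to the S2ORC release you are using.
    """
    selected = []
    with gzip.open(metadata_path, "rt") as f:
        for line in f:
            paper = json.loads(line)
            # (i) arXiv paper with an associated LaTeX source
            has_latex_source = bool(paper.get("arxiv_id")) and bool(paper.get("has_latex_parse"))
            # (ii) computational linguistics: cs.CL arXiv category or an ACL Anthology id
            in_cl_domain = "cs.CL" in (paper.get("arxiv_categories") or []) or bool(paper.get("acl_id"))
            if has_latex_source and in_cl_domain:
                selected.append(paper["paper_id"])
    return selected
```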

2.2 Decoupled Data Collection

To ensure that our questions are realistic, we decoupled the question-writing and question-answering phases. For both tasks we recruited graduate students studying NLP and freelancers practicing NLP through professional networks and Upwork. 6 All the workers were regular readers of NLP papers, and were paid US$25 per hour on average ($20-$40 based on experience). We paid them on a per-hour basis and not a per-question basis to prioritize data quality over quantity. A total of 25 workers wrote questions while 51 answered them.

Questions To ensure that annotators were actually interested in the papers they were reading, we provided them with a lightweight search interface to find papers of interest from the aforementioned collection. The interface supports manual queries; examples of queries annotators used include general (e.g., "computer vision") or specific (e.g., "question answering", "information extraction") areas of study, specific tasks (e.g., "language identification"), entities (e.g., "bert", "transformers") or concepts (e.g., "commonsense", "interpretability"), and domain specifications (e.g., "medical", "wikipedia"). Annotators also had the option to not enter any search queries; in this case, they were shown random papers. Annotators were shown only the titles and abstracts of relevant papers and asked to write any number of questions they had about each paper. They were instructed to write only questions that are not answerable from the title and abstract but are expected to be answered somewhere in the paper. Annotators also provided basic information about their expertise in NLP and how familiar they already were with the paper for which they asked questions. Most workers (about 70%) had some experience in NLP, with 20% having more than five years of experience. A vast majority (94%) of the abstracts were seen by the question writers for the first time.

Answers Annotators were randomly assigned papers along with all the questions written for each paper. They were shown the paper title, abstract, question, full text, and all associated figures and tables to answer the questions. After reading these, annotators were asked to:

• Make a binary decision as to whether the question is answerable given the paper.

• If the question is answerable, select the minimal set of evidence snippets that contains the answer to the question. This could be (possibly discontiguous) paragraphs from the text and/or figures or tables. Annotators were asked to prioritize text over figures and tables, unless the information required was present only in figures or tables. When multiple paragraphs could serve as evidence, annotators were asked to first prioritize evidence that adequately answered the question, and then paragraphs that occurred earlier in the text.

• If the question is answerable, also provide a concise answer to the question. Annotators were additionally asked to indicate whether their concise answer was (i) extracted from the evidence, (ii) "yes" or "no", or (iii) abstractively written.

Annotators were allowed to skip any questions they did not feel comfortable answering. Since the answering task is significantly more complex than the question-writing task, we designed interactive tutorials and qualification exams for the workers on this task using CrowdAQ (Ning et al., 2020). Workers who scored well were invited to work on the task. If the test performance indicated that workers did not have sufficient NLP knowledge or were not used to reading papers, we did not let them work on the task. In cases where workers misunderstood the task but had sufficient background knowledge, we provided additional training before letting them work on the task.

3 Qasper Analysis

Question types We first analyze whether our annotation setup results in questions that are anchored in the context of the papers. To answer this question, we manually 7 categorized a set of 200 questions as being applicable to most papers in the domain (general) vs. being applicable only to the paper that the question is written about (specific). Table 1 shows that most of the questions (67%) are specific to the papers they are written about. This result indicates the advantage of viewing the QASPER task as a question answering problem rather than an information extraction problem, since a fixed schema would not be able to handle the long tail of paper-specific information needs.

Table 1: Examples of questions (top), answers (middle), and evidence (bottom) sampled from QASPER. Percentages are relative frequencies of the corresponding type over all examples in QASPER. The percentages for evidence types sum to more than 100% because 446 answers have both Table/Figure and Text evidence.

Answer types As shown in Table 1, most of the answers in the dataset are extractive. The average length of the extractive answers is 14.4 words (including all spans), and that of the abstractive answers is 15.6 words.

Evidence types Evidence can include one or more paragraphs from the paper, a figure, a table, or a combination of these. Table 1 shows the distribution of these types. Among the answerable questions with text-only evidence, 55.5% of the answers have multi-paragraph evidence (Figure 1 is one example). Unanswerable questions do not have any evidence. A small fraction (3.0%) of the answerable questions have no evidence; these are cases where the answer is "No" and the evidence is the absence of any mention of something specific. The last question in Table 4 is one example of such a case.

Table 4: Model performance on the QASPER test set on answering questions given gold evidence. We do not show performance on Yes/No and Unanswerable types because they can be trivially predicted to a large extent from the absence of gold evidence.

Distribution Of Evidence Paragraphs

We perform an analysis to identify the main sections of a paper that contain textual evidence. We assign each evidence paragraph to its containing top-level 8 section, and perform some section name normalization. We find that among the frequently used section names such as "Experiments" and "Introduction," no single section name contained a majority of evidence spans, indicating that the distribution of evidence over sections in the paper was more or less uniform.

Inter-annotator agreement 44% of the questions in QASPER have multiple annotated answers. On average, each question is answered by 1.6 annotators (up to a maximum of 6 annotators for the same question). Using these multiple annotations, we compute some measures of agreement between annotators. First, we found that there is a high level of agreement (90%) regarding answerability of questions. Second, we find that annotators agreed on the type of the evidence (text vs. figure) in 84.0% of the cases. Papers often provide the same information both in tables and text, and agreement over the evidence types could be a consequence of our clear annotation guidelines regarding selecting evidence.

Correctness To estimate the correctness of the answer annotations in QASPER, we manually analyzed 100 randomly sampled questions with multiple answer annotations (averaging 2.73 answers per question). We found that 207 of the 273 answers (75.8%) were correct. 98% of the questions had at least one correct answer, and 77% had most of their answers correct.

4 Modeling Qasper

This section explains the task, evaluation metrics, and a model addressing QASPER tasks.

4.1 Task Setup

We formally define the QASPER tasks as follows: given a paper and a question about it, the primary task is to determine whether the question is answerable and to output a predicted answer, which can be one or more spans from the full text of the paper, yes, no, or other free-form text. A system built for this will be evaluated based on the correctness of the predicted answer measured against the reference answers.

Since QASPER also provides labeled evidence for all questions, a system may also use the auxiliary supervision provided by the evidence. One such auxiliary task is to predict the evidence required to answer the question. The inputs are the same as those of the primary task, but the outputs are expected to be one or more paragraphs in the full text, figures, or tables, and they are evaluated against the labeled evidence spans.

Evaluation metrics As an automatic proxy for the correctness of all types of answers, we use the span-level F1 measure proposed by Rajpurkar et al. (2016). We convert answers that consist of multiple selected spans into single comma-separated strings. For questions with multiple reference answers, we compute the max span-F1 of the prediction over all the references. We evaluate the performance of a system on the auxiliary task by computing an F1 score over the set of paragraphs, figures, and tables chosen by the system against the reference evidence, again taking the max when there are multiple references. We refer to these metrics as Answer-F1 and Evidence-F1, respectively.
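For illustration, here is a minimal sketch of these two metrics; it is not the authors' released evaluation script, the helper names are ours, and the token-level normalization follows the SQuAD-style convention of Rajpurkar et al. (2016).

```python
import collections
import re
import string


def normalize(text):
    """Lowercase, strip punctuation and articles, and collapse whitespace (SQuAD-style)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def token_f1(prediction, reference):
    """Token-level F1 between a predicted and a reference answer string."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    common = collections.Counter(pred_tokens) & collections.Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def answer_f1(prediction_spans, references):
    """Answer-F1: join multi-span predictions with commas, take the max over references."""
    prediction = ", ".join(prediction_spans)
    return max(token_f1(prediction, ref) for ref in references)


def evidence_f1(predicted_paragraphs, reference_evidence_sets):
    """Evidence-F1: set-level F1 over selected paragraphs/figures/tables, max over references."""
    pred = set(predicted_paragraphs)
    best = 0.0
    for ref in reference_evidence_sets:
        ref = set(ref)
        if not pred and not ref:
            best = max(best, 1.0)
            continue
        overlap = len(pred & ref)
        if overlap == 0:
            continue
        p, r = overlap / len(pred), overlap / len(ref)
        best = max(best, 2 * p * r / (p + r))
    return best
```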

Data Splits

We split the dataset into train, validation, and test sets, so that each paper appears in only one of them. Our analysis of correctness of annotations presented in Section 3 indicates a high likelihood (98%) of evaluating against a correct reference when evaluation is aggregated over multiple references. Hence we ensure that most of the questions in validation and test sets have multiple references (98% in test, and 74% in validation). This resulted in 2,593, 1,005, and 1,451 questions in the three sets, respectively.
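A paper-level split of this kind could be implemented roughly as below; the fractions, seed, and field name `paper_id` are illustrative assumptions rather than the procedure used to produce the released splits.

```python
import random
from collections import defaultdict


def split_by_paper(questions, train_frac=0.5, dev_frac=0.2, seed=0):
    """Group questions by paper id and assign whole papers to train/dev/test,
    so that no paper contributes questions to more than one split."""
    by_paper = defaultdict(list)
    for q in questions:  # each q is assumed to be a dict with a "paper_id" key
        by_paper[q["paper_id"]].append(q)

    paper_ids = sorted(by_paper)
    random.Random(seed).shuffle(paper_ids)

    n = len(paper_ids)
    cut1, cut2 = int(train_frac * n), int((train_frac + dev_frac) * n)
    split_ids = {"train": paper_ids[:cut1], "dev": paper_ids[cut1:cut2], "test": paper_ids[cut2:]}
    return {name: [q for pid in ids for q in by_paper[pid]] for name, ids in split_ids.items()}
```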

Estimating human performance To estimate an upper bound on model performance given our data splits and metrics, we assess the performance of the workers when evaluated against each other using the same metrics on a sample of the test set. Since model performance is evaluated by aggregating over multiple references, we consider a subset of the test set containing questions with at least three references (40% of the test set), evaluate each reference against the remaining ones, and compute an average over all such combinations. This procedure estimates human performance to be 60.9 Answer-F1 and 71.6 Evidence-F1. Note that given the disagreements among the workers estimated in Section 3, this is a lower bound on human performance for two reasons: first, because only two annotations are used to compute the metric, while systems are evaluated against all three; and second, because the annotators are NLP practitioners, not expert researchers, and it is likely that an expert would score higher. Hence we report these numbers, along with a breakdown over answer types in Table 2 and Table 3, as human performance lower bounds.
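The leave-one-out estimate described here could be computed roughly as follows, reusing the `token_f1` helper from the metric sketch above; this is an illustrative approximation, not the authors' exact script.

```python
def human_answer_f1(reference_answers_per_question):
    """Estimate human Answer-F1: for each question with >= 3 reference answers,
    score each reference against the remaining ones and average the results."""
    scores = []
    for references in reference_answers_per_question:  # list of answer strings per question
        if len(references) < 3:
            continue  # restrict to questions with at least three references
        for i, held_out in enumerate(references):
            others = references[:i] + references[i + 1:]
            scores.append(max(token_f1(held_out, ref) for ref in others))
    return sum(scores) / len(scores)
```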

Table 2: LED-base and lower-bound human performance on answering questions in QASPER, measured in Answer-F1. The top three rows are heuristic baselines that try to predict answers without encoding entire papers. "w/ scaff." refers to the inclusion of the evidence selection scaffold during training.
Table 3: Model and lower-bound human performance on selecting evidence for questions in QASPER.

4.2 Qasper Model

We base our model on pretrained Transformer (Vaswani et al., 2017) models, which currently produce state-of-the-art results on a majority of QA tasks. 9 Recall that QASPER introduces two main modeling challenges: different answer types and long input documents.

First, QASPER includes a variety of answer types, including extractive, abstractive, yes/no, and unanswerable questions, which means a typical span-selection BERT-based QA model is not sufficient to support all these answer types. We address this by converting all answer types into a single task: generating answer text (Raffel et al., 2020; Khashabi et al., 2020). 10 This is a sequence-to-sequence formulation that requires an encoder-decoder Transformer model in which the encoder reads the question and the document and the decoder generates the answer text.

Second, research papers are much longer than the typical 512 or 1024 token limit of most BERT-like models, so we need a Transformer model that can process long inputs. We use the Longformer-Encoder-Decoder (LED; Beltagy et al., 2020), an encoder-decoder Transformer model that can efficiently process input sequences thousands of tokens long. With LED's support for an input sequence length of 16K tokens, we can encode 99% of the paper full texts in the QASPER dataset without truncation.

Longformer-Encoder-Decoder (LED) LED (Beltagy et al., 2020) is a variant of the original Transformer encoder-decoder model that replaces the Transformer's full self-attention in the encoder with the efficient local+global attention pattern of Longformer. This allows each token to attend to only its local window and a pre-specified set of global locations of interest, thereby scaling self-attention computation linearly with the input size (as opposed to quadratically with full context self-attention). LED has a similar architecture to BART (Lewis et al., 2020) in terms of number of layers and hidden state sizes, with the distinction that it has a larger position embeddings matrix, allowing it to process inputs of up to 16K tokens long (up from 1K tokens in the original BART model). In practice, LED's parameters are initialized from a pretrained BART model, and LED copies BART's position embeddings 16 times to fill the entire 16K position embeddings matrix. For all experiments we use the LED-base sized model, which uses BART-base weights.
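As a rough sketch of that initialization trick, the snippet below tiles a short learned position-embedding matrix to cover a longer range; the function is generic PyTorch and is not taken from the LED conversion code itself.

```python
import torch


def extend_position_embeddings(short_pos_emb: torch.Tensor, target_len: int) -> torch.Tensor:
    """Tile a short learned position-embedding matrix (e.g., BART's ~1K positions)
    so it covers target_len positions (e.g., 16K for LED's encoder)."""
    short_len, hidden_size = short_pos_emb.shape
    n_copies = -(-target_len // short_len)  # ceiling division
    return short_pos_emb.repeat(n_copies, 1)[:target_len].clone()
```

With roughly 1K source positions and a 16K target, this makes the 16 copies described above.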

Input and Output Encoding For the input, we follow the Longformer QA models (Beltagy et al., 2020) and encode the question and context in one concatenated string with "global attention" over all the question tokens. For the output, all answer types are encoded as single strings. The string is the text of the abstractive answer, a comma separated concatenation of the extractive spans, "Yes", "No", or "Unanswerable".
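A minimal sketch of this encoding with the Hugging Face LED implementation is shown below; the `allenai/led-base-16384` checkpoint and the way question tokens are located are our assumptions, not the authors' exact code.

```python
import torch
from transformers import LEDForConditionalGeneration, LEDTokenizer

tokenizer = LEDTokenizer.from_pretrained("allenai/led-base-16384")
model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")

question = "Which retrieval system was used for the baselines?"
paper_text = "..."  # full text of the paper, with section and paragraph breaks preserved

# Question and context are concatenated into a single long input sequence.
inputs = tokenizer(question, paper_text, return_tensors="pt",
                   truncation=True, max_length=16384)

# Global attention over all question tokens; all other tokens use local attention.
num_question_tokens = len(tokenizer(question)["input_ids"])
global_attention_mask = torch.zeros_like(inputs["input_ids"])
global_attention_mask[:, :num_question_tokens] = 1

# The decoder emits a single answer string: extractive span(s) joined by commas,
# an abstractive answer, "Yes", "No", or "Unanswerable".
answer_ids = model.generate(inputs["input_ids"],
                            attention_mask=inputs["attention_mask"],
                            global_attention_mask=global_attention_mask,
                            max_length=64)
print(tokenizer.decode(answer_ids[0], skip_special_tokens=True))
```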

Evidence extraction To support extracting evidence paragraphs, we prepend each paragraph with a token and add a classification head over these tokens on LED's encoder side. We also add Longformer's global attention over these tokens to facilitate direct information flow across the paragraphs. We then train LED using both loss functions (teacher-forced text generation and paragraph classification) in a multi-task training setup. For answer generation, we use a cross-entropy loss over the vocabulary. For evidence paragraph extraction, we use a cross-entropy loss with binary 0/1 gold labels for evidence/non-evidence paragraphs. To account for class imbalance, we use loss scaling with weights proportional to the ratio of positive to negative gold paragraphs in the batch, which we found to be crucial for the model to train. One benefit of multi-task training of evidence extraction along with answer selection is that the tasks can benefit each other (see Section 5.2).
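A sketch of how such a joint objective could be assembled is shown below; it assumes per-paragraph logits coming from the classification head over the prepended paragraph tokens, and it uses a weighted binary cross-entropy as an approximation of the loss scaling described above rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F


def multitask_loss(lm_logits, answer_labels, evidence_logits, evidence_labels,
                   evidence_weight=1.0):
    """Combine answer-generation and evidence-classification losses.

    lm_logits:        (batch, answer_len, vocab) decoder logits for the answer text
    answer_labels:    (batch, answer_len) gold answer token ids (teacher forcing)
    evidence_logits:  (batch, num_paragraphs) scores from the paragraph-token head
    evidence_labels:  (batch, num_paragraphs) 1 for evidence paragraphs, else 0
    """
    # Standard cross-entropy over the vocabulary for answer generation.
    answer_loss = F.cross_entropy(lm_logits.reshape(-1, lm_logits.size(-1)),
                                  answer_labels.reshape(-1))

    # Class-imbalance weighting: upweight the (rare) positive evidence paragraphs
    # in proportion to the negative/positive ratio in the batch.
    num_pos = evidence_labels.sum().clamp(min=1)
    num_neg = (1 - evidence_labels).sum().clamp(min=1)
    evidence_loss = F.binary_cross_entropy_with_logits(
        evidence_logits, evidence_labels.float(), pos_weight=num_neg / num_pos)

    return answer_loss + evidence_weight * evidence_loss
```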

5 Experiments

We evaluate model performance on question answering and evidence selection tasks, and compare them to estimated lower bounds on human performance. These human performance estimates are calculated by comparing the answers of questions for which we have multiple human annotations. For each question, we choose one annotation as if it were a prediction, and evaluate it against the rest of the annotations, and consider as human performance the average over all annotations chosen as predictions. We restrict our experiments to the subset of questions in QASPER that can be answered from text in the paper, ignoring those that require figures or tables as evidence (13% of the dataset; see Section 3) to avoid having to deal with multimodal inputs. We leave multimodal question answering to future work.

5.1 Training Details

We train all models using the Adam optimizer (Kingma and Ba, 2014) and a triangular learning rate scheduler (Howard and Ruder, 2018) with 10% warmup. To determine the number of epochs, peak learning rate, and batch size, we performed manual hyperparameter search on a subset of the training data. We searched over {1, 3, 5} epochs with learning rates {1e-5, 3e-5, 5e-5, 9e-5}, and found that smaller batch sizes generally work better than larger ones. Our final configuration was 10 epochs, a peak learning rate of 5e-5, and a batch size of 2, which we used for all reported experimental settings. When handling full text, we use gradient checkpointing (Chen et al., 2016) to reduce memory consumption. We run our experiments on a single RTX 8000 GPU, and each experiment takes 30-60 minutes per epoch.
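The configuration above could be reproduced roughly as follows; `get_linear_schedule_with_warmup` (linear warmup then linear decay) stands in for the triangular scheduler, and the data loader is assumed to yield already-tokenized QASPER batches.

```python
import torch
from transformers import LEDForConditionalGeneration, get_linear_schedule_with_warmup

model = LEDForConditionalGeneration.from_pretrained("allenai/led-base-16384")
model.gradient_checkpointing_enable()  # trade compute for memory when encoding full papers

num_epochs, peak_lr = 10, 5e-5
total_steps = num_epochs * len(train_loader)  # train_loader: assumed DataLoader with batch size 2

optimizer = torch.optim.Adam(model.parameters(), lr=peak_lr)
# Linear warmup over the first 10% of steps followed by linear decay approximates
# a triangular schedule peaking at peak_lr.
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=int(0.1 * total_steps), num_training_steps=total_steps)

model.train()
for epoch in range(num_epochs):
    for batch in train_loader:
        loss = model(**batch).loss  # batch contains input_ids, attention_mask, labels, etc.
        loss.backward()
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
```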

Question Answering

Table 2 shows the overall performance of the LED-base model 11 on question answering, as well as the performance breakdown on the different answer types. The table also compares LED-base variants when the input is heuristically limited to smaller parts of the paper (i.e., no context, abstract, introduction). We generally observe that using more context improves performance. Specifically, as row 5 shows, encoding the entire context results in a significant overall performance improvement (∆ = +9.5) over the best heuristic ("introduction"). This signifies the importance of encoding the entire paper. Comparing rows 4 and 5, we observe that using evidence prediction as a multi-task scaffolding objective helps, improving the results by ∆ = +0.8 points.

Evidence selection Table 3 illustrates the evidence selection performance of the LED-large and LED-base models compared with simpler baselines. We observe that LED variants outperform the simple TF-IDF baseline but there still remains a large gap to human performance.

Varying amounts of training Figure 2 shows the learning curve that measures the validation Answer-F 1 and Evidence-F 1 of the LED-base variants based on training data size. The learning curve suggests that performance has not reached a plateau, and future data collection could be useful.

Figure 2: Learning curves showing Answer-F1 and Evidence-F1 on the dev. set while varying training data size.

Answer prediction from gold evidence To better isolate the question answering (as opposed to evidence selection) task performance, we perform oracle experiments where models are given the gold evidence. For these experiments, we are able to use larger (T5-large; Raffel et al., 2020) or better task-adapted pretrained models (UnifiedQA-large; Khashabi et al., 2020) , which perform significantly better in the oracle setting. We did not use them in the non-oracle setting, however, as Longformer versions of these models are not available, and LED's ability to handle the full document without the need for a pipelined retrieval system was more important. These experiments show that (1) the human lower bound is in fact a lower bound, as large models exceed it for span answers in this setting;

(2) the majority of the large headroom in the non-oracle setting can be closed with better evidence selection; and (3) research into making large pretrained models able to better scale to long documents would be beneficial.

Error analysis To gain insight into the model's errors, we sample 67 test questions with predicted Answer-F1 scores below 0.10 from the LED model trained with evidence prediction scaffolding. We remove four cases in which the predicted answers are actually correct. Examining the gold answers of the remaining 63, we find 31 are extractive, 24 are abstractive, 3 are "yes", 3 are "no", and 2 are unanswerable. We observe that LED often predicts shorter spans than the gold answers (9.5 words shorter than their gold counterparts, on average). Focusing only on the 55 questions with either extractive or abstractive gold answers, we manually categorize error types in Table 5.

Table 5: Error analysis of our best model (LED from row 5 from Table 2) on 55 test examples with low F1 score (excluding those with “yes,” “no,” or “unanswerable” gold answers). “Quotations” denote extractive gold answers. We note Lacks domain knowledge errors are not always solved by better entity type resolution (see †).

6 Related Work

Information-Verifying QA A large body of work on question answering follows the information-verifying paradigm, where the writer of the question already knows its answer, and the questions are written solely for evaluating the knowledge or understanding capabilities of machines. Some examples include SQuAD (Rajpurkar et al., 2016), TriviaQA (Joshi et al., 2017), NarrativeQA (Kočiský et al., 2018), WikiHop (Welbl et al., 2018), HotpotQA (Yang et al., 2018), CoQA (Reddy et al., 2019), DROP (Dua et al., 2019), and QUOREF. Most datasets for QA on academic research papers also fall within the information-verifying paradigm, as they automatically construct QA examples using extracted entities and relations and structured knowledge resources, like DrugBank. Some examples include emrQA (Pampari et al., 2018), BioRead (Pappas et al., 2018), BioMRC (Pappas et al., 2020), and MedHop (Welbl et al., 2018). While these datasets enabled significant progress in machine comprehension, they include biases in questions that may not reflect real-world settings (Kwiatkowski et al., 2019).

Information-Seeking QA in General Domain Recognizing this challenge, others have followed an information-seeking paradigm, where the writer of questions is genuinely interested in finding the answer to the question, or at least does not have access to the answer. Examples of such datasets include WikiQA (Yang et al., 2015), NewsQA (Trischler et al., 2017), MsMarco (Campos et al., 2016), QuAC (Choi et al., 2018), Natural Questions (Kwiatkowski et al., 2019), TyDiQA, and IIRC (Ferguson et al., 2020). Unlike QASPER, Natural Questions and TyDiQA 12 questions are not grounded in any contexts, and the associated documents are linked to the questions after they are written. In contrast, QASPER's questions are real follow-up questions about a paper that a reader with appropriate domain expertise would have after reading the title and the abstract. This priming lets readers ask detailed questions that are specific to the papers in context and that require a deeper understanding of those contexts, like those shown in Figure 1 and Table 1. QuAC used a similar data collection method but with a focus on entities, a constraint that QASPER does not impose.

Domain-Specific Information-Seeking QA Some work has been done on information-seeking QA over academic research papers. PubmedQA (Jin et al., 2019) derives Yes/No/Maybe questions from PubMed paper titles, answered from the conclusion sections of the corresponding abstracts. BioAsq benchmarks (Balikas et al., 2013; Nentidis et al., 2018; Krallinger et al., 2020) focus on open-domain QA over PubMed abstracts. Like QASPER, BioAsq answers can take different forms (e.g., yes/no, extracted span(s)). QASPER differs from BioAsq in that its questions are grounded in a single paper of interest. Furthermore, QASPER uses the paper full text, not just the abstract. To the best of our knowledge, QASPER is the first information-seeking QA dataset in a computer science domain, while most prior work using academic research papers has been in biomedicine. Furthermore, with over 5K annotated questions, QASPER is also larger than other comparable human-annotated QA datasets: PubmedQA and BioAsq contain 1K and 3.2K questions, respectively. Finally, QASPER poses a challenging full document-level task, while other related datasets are abstract-level. Beyond the domain of academic research, realistic QA datasets have also been built in the privacy policy domain (Ravichander et al., 2019; Ahmad et al., 2020). These tasks are similar to our evidence selection task.

7 Conclusion

We presented QASPER, an information-seeking QA dataset over NLP research papers. With natural questions asked as follow-up to titles and abstracts, the task presented by QASPER requires evidence from multiple paragraphs and/or figures and tables within the full text of the papers. Our empirical results show plenty of room for improvement when compared to the estimated human performance, and suggest that QASPER could serve as a test-bed for evaluating document-grounded QA research.

Ethical Considerations

We present a new dataset that uses papers authored by other researchers. To adhere to copyright, we have restricted ourselves to arXiv papers released under a CC-BY-* license, as identified via Unpaywall, which was used in the S2ORC (Lo et al., 2020) dataset construction. Due to our choice to use arXiv as the source of papers, QASPER is almost entirely an English-language dataset, and QA systems built on QASPER would not be expected to work well on non-English language research papers.

We have determined the amount we paid the annotators to be well above the minimum wage in our local area. While we do collect information about annotators' background in NLP and their familiarity with the papers they are annotating, we have not collected personally identifiable information without their permission except for payment purposes, and we do not include any such information in the released dataset.

Footnotes

2. We accessed both S2ORC release versions 20190928 and 20200705v1.
3. LaTeX allows us to avoid quality issues with PDF parsing.
4. We chose papers either tagged with the cs.CL arXiv category or published with an ACL Anthology identifier.
5. http://semanticscholar.org
6. https://www.upwork.com/
7. Two domain experts independently judged these, and achieved a Cohen's κ of 0.94.
8. S2ORC provides section hierarchy derived from LaTeX source.
9. https://paperswithcode.com/task/question-answering
10. We tried a model that predicts the answer type, then based on the type uses a different head to predict the corresponding answer. This model performed much worse than the proposed seq2seq formulation.
11. We trained an LED-large model as well, but it performed much worse than the base model on the QA task.
12. TyDiQA uses short snippets to prime annotators to write questions of interest, but the annotation process does not require workers to write questions grounded in those snippets.