QASC: A Dataset for Question Answering via Sentence Composition

Tushar Khot
Peter Clark
Michal Guerquin
Peter A. Jansen
Ashish Sabharwal
AAAI
2020

Abstract

Composing knowledge from multiple pieces of text is a key challenge in multi-hop question answering. We present a multi-hop reasoning dataset, Question Answering via Sentence Composition (QASC), that requires retrieving facts from a large corpus and composing them to answer a multiple-choice question. QASC is the first dataset to offer two desirable properties: (a) the facts to be composed are annotated in a large corpus, and (b) the decomposition into these facts is not evident from the question itself. The latter makes retrieval challenging as the system must introduce new concepts or relations in order to discover potential decompositions. Further, the reasoning model must then learn to identify valid compositions of these retrieved facts using common-sense reasoning. To help address these challenges, we provide annotation for supporting facts as well as their composition. Guided by these annotations, we present a two-step approach to mitigate the retrieval challenges. We use other multiple-choice datasets as additional training data to strengthen the reasoning model. Our proposed approach improves over current state-of-the-art language models by 11% (absolute). The reasoning and retrieval problems, however, remain unsolved as this model still lags by 20% behind human performance.

1 Introduction

Several multi-hop question-answering (QA) datasets have been proposed to promote research on multi-sentence machine comprehension. On one hand, many of these datasets (Mihaylov et al. 2018; Clark et al. 2018; Welbl, Stenetorp, and Riedel 2018; Talmor and Berant 2018) do not come annotated with sentences or documents that can be combined to produce an answer. Models must thus learn to reason without direct supervision. On the other hand, datasets that come with such annotations involve either single-document questions (Khashabi et al. 2018a) leading to a strong focus on coreference resolution and entity tracking, or multi-document questions (Yang et al. 2018) whose decomposition into simpler single-hop queries is often evident from the question itself.

We propose a novel dataset, Question Answering via Sentence Composition (QASC; pronounced kask), of 9,980 multi-hop multiple-choice questions (MCQs) where simple syntactic cues are insufficient to determine how to decompose the question into simpler queries. Fig. 1 gives an example, where the question is answered by decomposing its main relation "harnessed for" (in $f_C$) into a similar relation "used for" (in $f_L$) and a newly introduced relation "produces" (in $f_S$), and then composing these back to infer $f_C$. While the question in Figure 1 can be answered by composing the two facts $f_S$ and $f_L$, that this is the case is unclear based solely on the question. This property of relation decomposition not being evident from reading the question pushes reasoning models towards focusing on learning to compose new pieces of knowledge, a key challenge in language understanding. Further, $f_L$ has no overlap with the question, making it difficult to retrieve it in the first place.

Question: Differential heating of air can be harnessed for what? (A) electricity production (B) erosion prevention (C) transfer of electrons (D) reduce acidity of food . . .

Annotated facts:

$f_S$: Differential heating of air produces wind.

$f_L$: Wind is used for producing electricity.

Composed fact $f_C$: Differential heating of air can be harnessed for electricity production.

Figure 1: A sample 8-way multiple choice QASC question. Training data includes the associated facts $f_S$ and $f_L$ shown above, as well as their composition $f_C$. The term wind connects $f_S$ and $f_L$, but appears neither in $f_C$ nor in the question. Further, decomposing the question relation "harnessed for" into $f_S$ and $f_L$ requires introducing the new relation "produces" in $f_S$. The question can be answered by using broad knowledge to compose these facts together and infer $f_C$. (Colors are used for illustration purposes only.)

Let's contrast this with an alternative question formulation: "What can something produced by differential heating of air be used for?" Although awkwardly phrased, this variation is easy to syntactically decompose into two simpler queries, as well as to identify what knowledge to retrieve. In fact, multi-hop questions in many existing datasets (Yang et al. 2018; Talmor and Berant 2018) often follow this syntactically decomposable pattern, with questions such as: "Which government position was held by the lead actress of X?"

Table 1: QASC has several desirable properties not simultaneously present in any single existing multihop QA dataset. Here "available" indicates that the dataset comes with a corpus that is guaranteed to contain supporting facts, while "annotated" indicates that these supporting facts are additionally annotated.

All questions in QASC are human-authored, obtained via a multi-step crowdsourcing process (Section 3). To better enable development of both the reasoning and retrieval models, we also provide the pair of facts that were composed to create the question. We use these annotations to develop a novel two-step retrieval technique that uses question-relevant facts to guide a second retrieval step. To make the dataset difficult for fine-tuned language models using our proposed retrieval (Section 5), we further augment the answer choices in our dataset via a multi-adversary distractor choice selection method (Section 6) that does not rely on computationally expensive multiple iterations of adversarial filtering (Zellers et al. 2018).

In summary, we make the following contributions: (1) a dataset, QASC, of 9,980 8-way multiple-choice questions from elementary and middle school level science, with a focus on fact composition; (2) a pair of facts $f_S$, $f_L$ from associated corpora annotated for each question, along with a composed fact $f_C$ entailed by $f_S$ and $f_L$, which can be viewed as a form of multi-sentence entailment dataset; (3) a novel two-step information retrieval approach designed for multi-hop QA that improves the recall of gold facts (by 43 pts) and QA accuracy (by 14 pts); and (4) an efficient multi-model adversarial answer choice selection approach.

QASC is challenging for current large pre-trained language models (Peters et al. 2018a; Devlin et al. 2019), which exhibit a gap of 20% (absolute) to a human baseline of 93%, even when massively fine-tuned on 100K external QA examples in addition to QASC and provided with relevant knowledge using our proposed two-step retrieval. Table 1 summarizes how QASC compares with several existing datasets along five key dimensions (discussed below), which we believe are necessary for effectively developing retrieval and reasoning models for knowledge composition.

2 Comparison With Existing Datasets

Existing datasets for the science domain require different reasoning techniques for each question 2018) . The dataset most similar to our work is OpenBookQA (Mihaylov et al. 2018), which comes with multiple-choice questions and a book of core science facts used as the seed for question generation. Each question requires combining the seed core fact with additional knowledge. However, it is unclear how many additional facts are needed, or whether these facts can even be retrieved from any existing knowledge sources. QASC, on the other hand, explicitly identifies two facts deemed (by crowd workers) to be sufficient to answer a question. These facts exist in an associated corpus and are provided for model development.

MultiRC (Khashabi et al. 2018a) uses passages to create multi-hop questions. However, MultiRC and other single-passage datasets (Mishra et al. 2018; Weston et al. 2015) have a stronger emphasis on passage discourse and entity tracking, rather than relation composition.

Multi-hop datasets from the Web domain use complex questions that bridge multiple sentences. We discuss four such datasets. (a) WikiHop (Welbl, Stenetorp, and Riedel 2018) contains questions in the tuple form (e, r, ?) based on edges in a knowledge graph. However, WikiHop lacks questions with natural text or annotations on the passages that could be used to answer these questions. (b) ComplexWebQuestions (Talmor and Berant 2018) was derived by converting multi-hop paths in a knowledge base into a text query. By construction, the questions can be decomposed into simpler queries corresponding to knowledge graph edges in the path. (c) HotpotQA (Yang et al. 2018) contains a mix of multi-hop questions authored by crowd workers using a pair of Wikipedia pages. While these questions were authored in a similar way, due to their domain and task setup, they also end up being more amenable to decomposition. (d) A recent dataset, DROP (Dua et al. 2019), requires discrete reasoning over text (such as counting or addition). Its focus is on performing discrete (e.g., mathematical) operations on extracted pieces of information, unlike our proposed sentence composition task.

Many systems answer science questions by composing multiple facts from semi-structured and unstructured knowledge sources (Khashabi et al. 2016; Khot, Sabharwal, and Clark 2017; Jansen et al. 2017; Khashabi et al. 2018b). However, these often require careful manual tuning due to the large variety of reasoning techniques needed for these questions (Boratko et al. 2018) and the large number of facts that often must be composed together (Jansen 2018; Jansen et al. 2016). By limiting QASC to require exactly 2 hops (thereby avoiding semantic drift issues with longer paths; Fried et al. 2015; Khashabi et al. 2019) and explicitly annotating these hops, we hope to constrain the problem enough so as to enable the development of supervised models for identifying and composing relevant knowledge.

2.1 Implicit Relation Decomposition

As mentioned earlier, a key challenge in QASC is that syntactic cues in the question are insufficient to determine how one should decompose the question relation, $r_Q$, into two sub-relations, $r_S$ and $r_L$, corresponding to the associated facts $f_S$ and $f_L$. At an abstract level, 2-hop questions in QASC generally exhibit the following form:

$$Q \triangleq r_Q(x_q, z_a^?)$$

$$r_S^?(x_q, y^?) \wedge r_L^?(y^?, z_a^?) \Rightarrow r_Q(x_q, z_a^?)$$

where terms with a '?' superscript represent unknowns: the decomposed relations $r_S$ and $r_L$ as well as the bridge concept $y$. (The answer to the question, $z_a^?$, is an obvious unknown.) To assess whether relation $r_Q$ holds between some concept $x_q$ in the question and some concept $z_a$ in an answer candidate, one must come up with the missing or implicit relations and bridge concept. In our previous example, $r_Q$ = "harnessed for", $x_q$ = "Differential heating of air", $y$ = "wind", $r_S$ = "produces", and $r_L$ = "used for".
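As a worked instance of this form, the running example from Figure 1 can be written out as below. The hyphenated relation names are informal labels taken from the text, not a fixed vocabulary, and `\text` assumes the `amsmath` package.

```latex
\[
\underbrace{\text{produces}}_{r_S}\bigl(\underbrace{\text{differential heating of air}}_{x_q},\ \underbrace{\text{wind}}_{y}\bigr)
\;\wedge\;
\underbrace{\text{used-for}}_{r_L}\bigl(\text{wind},\ \underbrace{\text{electricity production}}_{z_a}\bigr)
\]
\[
\Rightarrow\;
\underbrace{\text{harnessed-for}}_{r_Q}\bigl(\text{differential heating of air},\ \text{electricity production}\bigr)
\]
```

Only $x_q$ and $r_Q$ appear in the question; the bridge concept and both sub-relations have to be supplied by the model.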

In contrast, syntactically decomposable questions in many existing datasets often spell out both $r_S$ and $r_L$:

$$Q \triangleq r_S(x_q, y^?) \wedge r_L(y^?, z_a^?)$$

The example from the introduction, "Which government position was held by the lead actress of X?", could be stated in this notation as: $\text{lead-actress}(X, y^?) \wedge \text{held-govt-posn}(y^?, z_a^?)$. This difference in how the question is presented in QASC makes it challenging to both retrieve relevant facts and reason with them via knowledge composition. This difficulty is further compounded by the property that a single relation $r_Q$ can often be decomposed in multiple ways into $r_S$ and $r_L$. We defer a discussion of this aspect to later, when describing QASC examples in Table 3.

Table 2: Examples of questions generated via the crowd-sourcing process along with the facts used to create each question.

Table 3: These examples of sentence compositions result in the same composed relation, causes, but via two different composition rules: located + causes ⇒ causes and causes + causes ⇒ causes. These rules are not evident from the composed fact, requiring a model reasoning about the composed fact to learn the various possible decompositions of causes.

3 Multihop Question Collection

Figure 2 gives an overall view of the crowdsourcing process. The process is designed such that each question in QASC is produced by composing two facts from an existing text corpus. Rather than creating compositional questions from scratch or using a specific pair of facts, we provide workers with only one seed fact $f_S$ as the starting point. They are then given the creative freedom to find other relevant facts that could be composed with this seed fact. This allows workers to find other facts that compose naturally with $f_S$ and thereby prevents complex questions that describe the composition explicitly.

Figure 2: Crowd-sourcing questions using the seed corpus $F_S$ and the full corpus $F_L$.

Once crowd-workers identify a relevant fact $f_L \in F_L$ that can be composed with $f_S$, they create a new composed fact $f_C$ and use it to create a multiple-choice question. To ensure that the composed facts and questions are consistent with our instructions, we introduce automated checks to catch any inadvertent mistakes. E.g., we require that at least one intermediate entity (marked in red in subsequent sections) must be dropped to create $f_C$. We also ensure that the intermediate entity wasn't re-introduced in the question.

These questions are next evaluated against baseline systems to ensure hardness, i.e., at least one of the incorrect answer choices had to be preferred over the correct choice by one of two QA systems (IR or BERT; described next), with a bonus incentive if both systems were distracted.
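The hardness check above can be sketched as a small function over per-choice scores. This is a minimal sketch, not the crowdsourcing pipeline itself: the function names, the verdict strings, and the example scores are all hypothetical, and each system is assumed to expose one numeric score per answer choice.

```python
from typing import Dict, List

def is_distracted(scores: Dict[str, float], correct: str) -> bool:
    """True if some incorrect choice scores strictly higher than the correct one."""
    return any(score > scores[correct]
               for choice, score in scores.items() if choice != correct)

def hardness_check(per_system_scores: List[Dict[str, float]], correct: str) -> str:
    """Return 'reject', 'accept', or 'accept+bonus' for one candidate question."""
    distracted = sum(is_distracted(s, correct) for s in per_system_scores)
    if distracted == 0:
        return "reject"          # no baseline was fooled: question is too easy
    if distracted == len(per_system_scores):
        return "accept+bonus"    # every baseline was fooled
    return "accept"

# Made-up scores for a 4-way question whose correct choice is "A".
ir_scores = {"A": 0.2, "B": 0.5, "C": 0.1, "D": 0.1}
bert_scores = {"A": 0.9, "B": 0.05, "C": 0.03, "D": 0.02}
verdict = hardness_check([ir_scores, bert_scores], correct="A")
```

Here only the IR scores prefer a wrong choice, so the question would be kept without the bonus.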

3.1 Input Facts

We noticed that the quality of the seed facts can have a strong correlation with the quality of the question. So we created a small set of 928 good quality seed facts $F_S$ from clean knowledge resources (more details in Appendix B.1). To ensure that the workers are then able to find any potentially composable fact, we used a large web corpus of 17M cleaned-up facts $F_L$ (more details in Appendix B.2). Note that the distinction between $F_S$ and $F_L$ is only used to create the dataset; the models use the fact corpus $F_{\text{QASC}} = F_S \cup F_L$. We believe that this is a more general setting, since a special "book" of core facts may not always be available.

3.2 Baseline QA Systems

Our first baseline is the IR system (Clark et al. 2016) designed for science QA with its associated corpora. It retrieves sentences for each question and answer choice from the associated corpora, and returns the answer choice with the highest scoring sentence (based on the retrieval score).

Our second baseline uses the language model BERT of Devlin et al. (2019). We follow their QA approach for the multiple-choice situation inference task SWAG (Zellers et al. 2018). We refer to this model as BERT-MCQ in subsequent sections (model details in Appendix C). We used the bert-large-uncased model and fine-tuned it sequentially on two datasets: (1) RACE (Lai et al. 2017) with context; (2) SCI questions (ARC-Challenge+Easy + OpenBookQA (Mihaylov et al. 2018) + Regents 12th Grade Exams).

3.3 Question Validation

We validated these questions by having 5 crowdworkers answer them. Any question answered incorrectly or considered unanswerable by at least 2 workers was dropped, reducing the collection to 7,660 questions. The accuracy of the IR and BERT models used in Step 4 was 32.25% and 38.73%, respectively, on this reduced subset. By design, every question has the desirable property of being annotated with two sentences from $F_{\text{QASC}}$ that can be composed to answer it.

We next analyze the retrieval and reasoning challenges associated with these questions. Based on these analyses, we will propose a new baseline model for multi-hop questions that substantially outperforms existing models on this task. We use this improved model to adversarially select additional distractor choices to produce the final QASC dataset.

4 Challenges

Table 2 shows sample crowd-sourced questions along with the associated facts. Consider the first question: "What can trigger immune response?". One way to answer it is to first retrieve the two annotated facts (or similar facts) from the corpus. But the first fact, like many other facts in the corpus, overlaps only with the words in the answer "transplanted organs" and not with the question, making retrieval challenging. Even if the right facts are retrieved, the QA model would have to know how to compose the "found on" relation in the first fact with the "trigger" relation in the second fact. Unlike previous datasets (Yang et al. 2018; Talmor and Berant 2018), the relations to be composed are not explicitly mentioned in the question, making reasoning also challenging. We next discuss these two issues in detail.

4.1 Retrieval Challenges

We analyze the retrieval challenges associated with finding the two supporting facts associated with each question. Note that, unlike OpenBookQA, we consider the more general setting of retrieving relevant facts from a single large corpus $F_{\text{QASC}} = F_S \cup F_L$ instead of assuming the availability of a separate small book of facts (i.e., $F_S$).

Standard IR approaches for QA retrieve facts using question + answer as their IR query (Clark et al. 2016; Khot, Sabharwal, and Clark 2017; Khashabi et al. 2018b). While this can be effective for lookup questions, it is likely to miss important facts needed for multi-hop questions. In 96% of our crowd-sourced questions, at least one of the two annotated facts had an overlap of fewer than 3 tokens (ignoring stop words) with this question + answer query, making it difficult to retrieve such facts. Note that our annotated facts form one possible pair that could be used to answer the question. While retrieving these specific facts isn't necessary, these crowd-authored questions are generally expected to have a similar overlap level to other relevant facts in our corpus.
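The overlap statistic can be illustrated on the immune-response question from Table 2. This is a minimal sketch: the stop-word list is a tiny illustrative assumption (the text does not specify one), no stemming is applied, and the function names are hypothetical.

```python
import re

# Tiny illustrative stop-word list (an assumption; the paper does not specify one).
STOP_WORDS = {"a", "an", "the", "what", "can", "are", "is", "on", "of", "to", "in"}

def content_tokens(text: str) -> set:
    """Lower-cased alphabetic tokens with stop words removed (no stemming)."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS}

def overlap(fact: str, question: str, answer: str) -> int:
    """Number of distinct non-stop-word tokens shared by a fact and the q + a query."""
    return len(content_tokens(fact) & content_tokens(question + " " + answer))

question = "What can trigger immune response?"
answer = "transplanted organs"
fact_1 = "Antigens are found on transplanted organs"
fact_2 = "Antigens trigger immune response"

overlap_1 = overlap(fact_1, question, answer)  # shares only the two answer words
overlap_2 = overlap(fact_2, question, answer)  # shares the three question words
```

Under these assumptions the first fact shares just two tokens with the query, which puts this question among those with a fact below the 3-token threshold.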

Neural retrieval methods that use distributional representations can help mitigate the brittleness of word overlap measures, but also vastly open up the space of possibly relevant sentences. We hope that our annotated facts will be useful for training better neural retrieval approaches for multi-hop reasoning in future work. In this work, we focused on a modified non-neural IR approach that exploits the intermediate concepts not mentioned in the question (red words in our examples), which is explained in Section 5.1.

4.2 Reasoning Challenges

As described earlier, we collected these questions to require compositional reasoning where the relations to be composed are not obvious from the question. To verify this, we analyzed 50 questions from our final dataset and identified the key relations in $f_S$, $f_L$, and the question, referred to as $r_S$, $r_L$, and $r_Q$, respectively (see examples in Table 3). Seven of the 50 questions could be answered using only one fact, and 4 of them didn't use either of the two facts. We analyzed the remaining 39 questions to categorize the associated reasoning challenges. In only 2 questions, the two relations needed to answer the question were explicitly mentioned in the question itself. In comparison, the composition questions in HotpotQA had both the relations mentioned in 47 out of 50 dev questions in our analysis.

Since there are a large number of lexical relations, we focus on 16 semantic relations in our analysis such as causes, performs, etc. These relations were defined based on previous analyses on science datasets (Clark et al. 2014; Jansen et al. 2016; Khashabi et al. 2016). We found 25 unique relation composition rules (i.e., $r_S(X, Y), r_L(Y, Z) \Rightarrow r_Q(X, Z)$). On average, we found every query relation $r_Q$ had 1.6 unique relation compositions. Table 3 illustrates two different relation compositions that lead to the same causes query relation. As a result, models for QASC have a strong incentive to learn various possible compositions that lead to the same semantic relation, as well as extract them from text.
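The "compositions per query relation" statistic can be sketched as a simple grouping over annotated rules. This is only an illustration of the bookkeeping, assuming rules are stored as ($r_S$, $r_L$, $r_Q$) triples; the two rules used are the ones named for Table 3, and the function name is hypothetical.

```python
from collections import defaultdict

# (r_S, r_L, r_Q) triples; these two are the rules named for Table 3.
rules = [
    ("located", "causes", "causes"),
    ("causes", "causes", "causes"),
]

def compositions_per_query_relation(rules):
    """Map each query relation r_Q to its number of distinct (r_S, r_L) decompositions."""
    by_query_relation = defaultdict(set)
    for r_s, r_l, r_q in rules:
        by_query_relation[r_q].add((r_s, r_l))
    return {r_q: len(pairs) for r_q, pairs in by_query_relation.items()}

counts = compositions_per_query_relation(rules)
average = sum(counts.values()) / len(counts)
```

On the full set of 25 annotated rules the same computation would yield the reported average; with only the two Table 3 rules, the single query relation causes has two decompositions.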

5 Question Answering Model

We now discuss our proposed two-step retrieval method and how it substantially boosts the performance of BERT-based QA models on crowd-sourced questions. This will motivate the need for adversarial choice generation.

5.1 Retrieval: Two-Step IR

Consider the first question in Table 2. An IR approach that uses the standard q + a query is unlikely to find the first fact since many irrelevant facts would also have the same overlapping words, "transplanted organs". However, it is likely to retrieve facts similar to the second fact, i.e., "Antigens trigger immune response". If we could recognize antigen as an important intermediate entity that would lead to the answer, we could then query for sentences connecting this intermediate entity ("antigens") to the answer ("transplanted organs"), which is then likely to find the first fact ("antigens are found on transplanted organs"). One potential way to identify such an intermediate concept is to consider the new entities introduced in the first retrieved fact that are absent from the question, i.e., $f_1 \setminus q$.

Based on this intuition, we present a simple but effective two-step IR baseline for multi-hop QA: (1) Retrieve K (=20 for efficiency) facts $F_1$ based on the query $Q = q + a$; (2) For each $f_1 \in F_1$, retrieve L (=4 to promote diversity) facts $F_2$, each of which contains at least one word from $Q \setminus f_1$ and from $f_1 \setminus Q$; (3) Filter out $\{f_1, f_2\}$ pairs that do not contain any word from q or a; (4) Select top M unique facts from $\{f_1, f_2\}$ pairs sorted by the sum of their individual IR score.

Each retrieval query is run against an ElasticSearch index built over $F_{\text{QASC}}$ with retrieved sentences filtered to reduce noise (Clark et al. 2018). We use the set-difference between the stemmed, non-stopword tokens in q + a and $f_1$ to identify the intermediate entity. Generally, we are interested in finding facts that connect new concepts introduced in the first fact (i.e., $f_1 \setminus Q$) to concepts not yet covered in question+answer (i.e., $Q \setminus f_1$). Training a model on our annotations or essential terms (Khashabi et al. 2017) could help better identify these concepts. Recently, Khot, Sabharwal, and Clark (2019) proposed a span-prediction model to identify such intermediate entities for OpenBookQA questions. Their approach, however, assumes a gold fact is known.
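The four steps can be sketched end to end on a toy corpus. This is a minimal sketch under several assumptions, not the ElasticSearch setup: relevance is a plain token-overlap count, there is no stemming, the stop-word list is made up, the second-step query is assumed to be the union of the two set differences, and step (3) is read as requiring each pair to overlap with both q and a. All function names are hypothetical.

```python
import re

# Illustrative stop-word list; the actual list used with the index is not given in the text.
STOP = {"a", "an", "the", "what", "can", "are", "is", "on", "of", "to", "in"}

def toks(text):
    """Lower-cased, non-stop-word tokens (no stemming in this sketch)."""
    return frozenset(t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP)

def retrieve(query, corpus, k, must_hit=()):
    """Top-k facts by a toy score: the number of tokens shared with the query.

    Every token set in must_hit has to share at least one token with the fact.
    """
    hits = []
    for fact in corpus:
        fact_t = toks(fact)
        score = len(query & fact_t)
        if score > 0 and all(fact_t & required for required in must_hit):
            hits.append((score, fact))
    hits.sort(key=lambda h: (-h[0], h[1]))  # best score first, ties broken by text
    return hits[:k]

def two_step_ir(q, a, corpus, K=20, L=4, M=10):
    q_t, a_t = toks(q), toks(a)
    Q = q_t | a_t
    pairs = []
    for s1, f1 in retrieve(Q, corpus, K):                # step 1: query with q + a
        f1_t = toks(f1)
        new, uncovered = f1_t - Q, Q - f1_t              # f1 \ Q and Q \ f1
        if not new or not uncovered:
            continue
        # Step 2: second-hop facts must touch both token sets.
        for s2, f2 in retrieve(new | uncovered, corpus, L, must_hit=(new, uncovered)):
            pair_t = f1_t | toks(f2)
            if pair_t & q_t and pair_t & a_t:            # step 3 (assumed reading)
                pairs.append((s1 + s2, f1, f2))
    pairs.sort(key=lambda p: -p[0])                      # step 4: rank pairs by summed score
    facts = []
    for _, f1, f2 in pairs:
        for fact in (f1, f2):
            if fact not in facts:
                facts.append(fact)
    return facts[:M]

corpus = [
    "Antigens trigger immune response",
    "Antigens are found on transplanted organs",
    "Transplanted organs save lives",
    "Plants need sunlight to grow",
]
top_facts = two_step_ir("What can trigger immune response?", "transplanted organs", corpus)
```

In this toy run, "antigens" is the only new token in the first-hop fact, so the second step surfaces the "found on" fact while the distracting "save lives" fact, which matches the answer words but no intermediate concept, is never paired.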

The single-step retrieval approach (using only $f_1$ but still requiring overlap with q and a) has an overall recall of only 2.9% (i.e., both $f_S$ and $f_L$ were in the top 10 sentences for 2.9% of the questions). The two-step approach, on the other hand, has a recall of 45.4%, a 15X improvement (also limited to top M=10 sentences). Even if we relax the recall metric to finding $f_S$ or $f_L$, the single-step approach underperforms the two-step retrieval by roughly 30 points (42.0% vs. 71.3%). We will show in the next section that this improved recall also translates to improved QA scores. This shows the value of our two-step approach as well as the associated annotations: progress on the retrieval sub-task enabled by our fact-level annotations can lead to progress on the QA task.
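The two recall variants used here (both gold facts retrieved versus at least one) can be computed as in the sketch below. The function name and input layout are assumptions: one ranked list of retrieved sentences per question, aligned with one ($f_S$, $f_L$) gold pair per question.

```python
def gold_recall(retrieved_per_question, gold_per_question, m=10):
    """Return (both, either): the fraction of questions whose top-m retrieved
    sentences contain both gold facts, and at least one gold fact."""
    both = either = 0
    for retrieved, (f_s, f_l) in zip(retrieved_per_question, gold_per_question):
        top = set(retrieved[:m])
        hits = (f_s in top) + (f_l in top)
        both += hits == 2
        either += hits >= 1
    n = len(gold_per_question)
    return both / n, either / n

# Toy data: three questions with made-up sentence ids.
retrieved = [["a", "b", "z"], ["c", "y"], ["x"]]
gold = [("a", "b"), ("c", "d"), ("e", "f")]
both_recall, either_recall = gold_recall(retrieved, gold)
```

The strict variant corresponds to the 2.9% and 45.4% figures, and the relaxed variant to the 42.0% and 71.3% figures.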

5.2 Reasoning: BERT Models

We primarily use BERT models fine-tuned on other QA datasets and with retrieved sentences as context, similar to prior state-of-the-art models on MCQ datasets (Sun et al. 2018; Pan et al. 2019). There is a large space of possible configurations to build such a QA model (e.g., fine-tuning datasets, corpora) which we will explore later in our experimental comparisons. For simplicity, the next few sections will focus on one particular model: the bert-large-cased model fine-tuned on the RACE + SCI questions (with retrieved context) and then fine-tuned on our dataset with single-step/two-step retrieval. For consistency, we use the same hyper-parameter sweep in all fine-tuning experiments (cf. Appendix G).

5.3 Results On Crowd-Sourced Questions

To enable fine-tuning models, we split the questions into 5962/825/873 questions in train/dev/test folds, respectively. To limit memorization, any two questions using the same seed fact, $f_S$, were always put in the same fold. Since multiple facts can cover similar topics, we further ensure that similar facts are also in the same fold. (See Appendix D for details.)
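The seed-fact constraint can be sketched as a group-wise split. This is a hedged illustration rather than the procedure from Appendix D: the greedy assignment, the fold fractions, and the function name are assumptions, and the additional grouping of similar facts is not modeled.

```python
import random
from collections import defaultdict

def split_by_seed_fact(questions, fractions=(0.8, 0.1, 0.1), seed=0):
    """questions: list of (question_id, seed_fact) pairs.

    Returns one list of question ids per fold, never splitting the questions
    that share a seed fact across folds.
    """
    groups = defaultdict(list)
    for question_id, seed_fact in questions:
        groups[seed_fact].append(question_id)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    targets = [fraction * len(questions) for fraction in fractions]
    folds = [[] for _ in fractions]
    for key in keys:
        # Put the whole group in the fold that is furthest below its target size.
        i = max(range(len(folds)), key=lambda j: targets[j] - len(folds[j]))
        folds[i].extend(groups[key])
    return folds

# Toy data: 50 questions written from 7 hypothetical seed facts.
toy_questions = [(i, "fact%d" % (i % 7)) for i in range(50)]
train, dev, test = split_by_seed_fact(toy_questions)
```

Because whole groups are assigned at once, fold sizes only approximate the requested fractions, which is the price of keeping each seed fact in a single fold.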

While these crowd-sourced questions were challenging for the baseline QA models (by design), models fine-tuned on this dataset perform much better. The BERT baseline that scored 38.7% on the crowd-sourced questions now scores 63.3% on the dev set after fine-tuning. Even the basic single-step retrieval context can improve over this baseline score by 14.9% (score: 78.2%) and our proposed two-step retrieval improves it even further by 8.2% (score: 86.4%). This shows that the distractor choices selected by the crowd workers were not as challenging once the model is provided with the right context. This can also be seen in the incorrect answer choices selected by them in Table 2, where they used words such as "Pain" that are associated with words in the question but may not have a plausible reasoning chain. To make this dataset more challenging for these models, we next introduce adversarial distractor choices.

Figure 3: Generating QASC questions using adversarial choice selection.

6 Adversarial Choice Generation

To make the crowdsourced dataset challenging for fine-tuned language models, we use model-guided adversarial choice generation to expand each crowdsourced question into an 8-way question. Importantly, the human-authored body of the question is left intact (only the choices are augmented), to avoid a system mechanically reverse-engineering how a question was generated.

Previous approaches to adversarially create a hard dataset have focused on iteratively making a dataset harder by sampling harder choices and training stronger models (Zellers et al. 2018; 2019a). While this strategy has been effective, it involves multiple iterations of model training that can be prohibitively expensive with large LMs. In some cases (Zellers et al. 2018; 2019b), they need a generative model such as GPT-2 (Radford et al. 2019) to produce the distractor choices. We, on the other hand, have a simpler setup where we train only a few models and do not require a model to generate the distractor choices.

6.1 Distractor Options

To create the space of distractors, we follow Zellers et al. (2019a) and use correct answer choices from other questions. This ensures that a model won't be able to predict the correct answer purely based on the answer choices (one of the issues with OpenBookQA). To reduce the chances of a correct answer being added to the set of distractors, we pick them from the most dissimilar questions. We further filter these choices down to ∼30 distractor choices per question by removing the easy distractors based on the fine-tuned BERT baseline model. Further implementation details are provided in Appendix E.

This approach of generating distractors has an additional benefit: we can recover the questions that were rejected earlier for having multiple valid answers (in § 3.3). We add back 2,774 of the 3,361 rejected questions that (a) had at least one worker select the right answer, and (b) were deemed unanswerable by at most two workers. We, however, ignore all crowdsourced distractors for these questions since they were considered potentially correct answers in the validation task. We use the adversarial distractor selection process (to be described shortly) to add the remaining 7 answer choices.

Table 4: Results of the BERT-MCQ model on the adversarial dataset using bert-large-cased model and pre-trained on RACE + SCI questions.

Dev Accuracy                       Single-step retr.   Two-step retr.
Original Dataset (4-way)           78.2                86.4
Random Distractors (8-way)         74.9                83.3
Adversarial Distractors (8-way)    61.7                72.9

To ensure a clean evaluation set, we use another crowdsourcing task where we ask 3 annotators to identify all possible valid answers from the candidate distractors for the dev and test sets. We filter out answer choices in the distractor set that were considered valid by at least one annotator. Additionally, we filter out low-quality questions where more than four distractor choices were marked valid or the correct answer was not included in the selection. This dropped 20% of the dev and test set questions and finally resulted in train/dev/test sets of size 8134/926/920 questions with an average of 30/26.9/26.1 answer choices (including the correct one) per question.

6.2 Multi-Adversary Choice Selection

We first explain our approach, assuming access to K models for multiple-choice QA. Given the number of datasets and models proposed for this task, this is not an unreasonable assumption. In this work, we use K BERT models, but the approach is applicable to any QA model.

Our approach aims to select a diverse set of answers that are challenging for different models. As described above, we first create ∼30 distractor options, D for each question. We then sort these distractor options based on their relative difficulty for these models, defined as the number of models fooled by this distractor:

k I m k (q, d i ) > m k (q, a)

where $m_k(q, c_i)$ is the $k$-th model's score for the question $q$ and choice $c_i$. In case of ties, we then sort these distractors based on the difference between the scores of the distractor choice and the correct answer: $\sum_k \left( m_k(q, d_i) - m_k(q, a) \right)$. 8 We used BERT-MCQ models that were fine-tuned on the RACE + SCI dataset as described in the previous section. We additionally fine-tune these models on the training questions with random answer choices added from the space of distractors to make each question an 8-way multiple-choice question. This ensures that our models have seen answer choices from both the human-authored and algorithmically selected space of distractors. Drawing inspiration from bootstrapping (Breiman 1996), we create two such datasets with randomly selected distractors from $D$ and use the models fine-tuned on these datasets as $m_k$ (i.e., $K = 2$). There is a large space of possible models and scoring functions that may be explored, 9 but we found this simple approach to be effective at identifying good distractors. This process of generating the adversarial dataset is depicted in Figure 3.

Table 5: QASC scores for previous state-of-the-art models on multi-hop Science MCQ (OBQA), and BERT models with different corpora, retrieval approaches and additional fine-tuning. While the simpler models only show a small increase relative to random guessing, BERT can achieve up to 67% accuracy by fine-tuning on QASC and using the two-step retrieval. Using the BERT models pre-trained with whole-word masking and first fine-tuning on four relevant MCQ datasets (RACE and SCI(3)) improves the score to 73.2%, leaving a gap of over 19.8% to the human baseline of 93%. ARC refers to the corpus of 14M sentences from Clark et al. 2018, BERT-LC indicates 'bert-large-cased' and BERT-LC[WM] indicates whole-word masking.
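The distractor ranking described in this section can be illustrated with a short Python sketch. It is not the authors' code: the function name `rank_distractors` and the data layout are hypothetical, and the primary sort key (the number of models that score a distractor above the correct answer) is an assumption, since that criterion is defined before this passage. The tie-break is the summed score difference between the distractor and the correct answer.

```python
def rank_distractors(distractor_scores, answer_scores):
    """Order candidate distractors from most to least adversarial.

    distractor_scores: dict mapping a distractor d_i to [m_1(q, d_i), ..., m_K(q, d_i)]
    answer_scores:     [m_1(q, a), ..., m_K(q, a)] for the correct answer a
    """
    def sort_key(distractor):
        scores = distractor_scores[distractor]
        # ASSUMED primary key: how many models prefer the distractor to the answer.
        fooled = sum(s_d > s_a for s_d, s_a in zip(scores, answer_scores))
        # Tie-break from the text: sum over k of (m_k(q, d_i) - m_k(q, a)).
        margin = sum(s_d - s_a for s_d, s_a in zip(scores, answer_scores))
        return (fooled, margin)

    return sorted(distractor_scores, key=sort_key, reverse=True)


# Toy scores from K = 2 hypothetical models (normalized probabilities).
answer = [0.5, 0.4]
candidates = {"x": [0.6, 0.1], "y": [0.45, 0.45], "z": [0.7, 0.6], "w": [0.1, 0.1]}
ranking = rank_distractors(candidates, answer)
```

Here "z" fools both toy models and ranks first, while "y" and "x" each fool one model and are separated only by the tie-break margin.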

6.3 Evaluating Dataset Difficulty

We select the top-scoring distractors using the two BERT-MCQ models such that each question is converted into an 8-way MCQ (including the correct answer and human-authored valid distractors). To verify that this results in challenging questions, we again evaluate using the BERT-MCQ models with two different kinds of retrieval. Table 4 compares the difficulty of the adversarial dataset to the original dataset and the dataset with random distractors (used for fine-tuning BERT-MCQ models). The original 4-way MCQ dataset was almost solved by the two-step retrieval approach, and increasing it to 8-way with random distractors had almost no impact on the scores. But our adversarial choices drop the scores of the BERT model given context from either of the retrieval approaches.

6.4 QASC Dataset

The final dataset contains 9,980 questions split into [8134|926|920] questions in the [train|dev|test] folds. Each question is annotated with two facts that can be used to answer the question. These facts are present in a corpus of 17M sentences (also provided). The questions are similar to the examples in Table 2 but expanded to an 8-way MCQ and with shuffled answer choices. E.g., the second example there was changed to "What forms caverns by seeping through rock and dissolving limestone? (A) pure oxygen (B) Something with a head, thorax, and abdomen (C) basic building blocks of life (D) carbon dioxide in groundwater (E) magma in groundwater (F) oxygen in groundwater (G) At the peak of a mountain (H) underground systems". Table 6 gives a summary of QASC statistics, and Table 7 in the Appendix provides additional examples.

9 For example, we evaluated the impact of increasing K, but didn't notice any change in the fine-tuned model's score.

Table 6: QASC dataset statistics

Table 7: Example of questions, facts and composed facts in QASC. In the first question, the facts can be composed through the intermediate entity “antigen” to conclude that transplanted organs can trigger an immune response.

7 Experiments

While we used large pre-trained language models first fine-tuned on other QA datasets (∼100K examples) to ensure that QASC is challenging, we also evaluated BERT models without any additional fine-tuning, as well as several other models developed for multiple-choice science questions in OpenBookQA, as briefly described in Appendix F. All models were fine-tuned on the QASC dataset.

As shown in Table 5, OpenBookQA models that had close to state-of-the-art results on OpenBookQA perform close to the random baseline on QASC. Since these mostly rely on statistical correlations between questions and across choices, 10 this shows that this dataset doesn't have any easy shortcuts that can be exploited by these models.

Second, we evaluate BERT models with different corpora and retrieval. We show that our two-step approach always outperforms the single-step retrieval, even when given a larger corpus. Interestingly, when we compare the two single-step retrieval models, the larger corpus outperforms the smaller corpus, presumably because it increases the chances of having a single fact that answers the question. On the other hand, the smaller corpus is better for the two-step retrieval approach, as larger and noisier corpora are more likely to lead a two-step search astray.

Finally, to compute the current gap to human performance, we consider a recent state-of-the-art model on multiple leaderboards: AristoBertV7, which uses the BERT model trained with whole-word masking, 11 is fine-tuned on the RACE + SCI questions, and retrieves knowledge from a very large corpus. Our two-step retrieval based model outperforms this model and improves even further with more fine-tuning. Replacing the pre-trained bert-large-cased model with the whole-word masking based model further improves the score by 4.7%, but there is still a gap of ∼20% to the human score of 93% on this dataset.

8 Conclusion

We present QASC, the first QA dataset for multi-hop reasoning beyond a single paragraph where two facts needed to answer a question are annotated for training, but questions cannot be easily syntactically decomposed into these facts. Instead, models must learn to retrieve and compose candidate pieces of knowledge. QASC is generated via a crowdsourcing process, and further enhanced via multi-adversary distractor choice selection. State-of-the-art BERT models, even with massive fine-tuning on over 100K questions from previous relevant datasets and using our proposed two-step retrieval, leave a large margin to human performance levels, thus making QASC a new challenge for the community.

A.1 Quality Checking

Each step in the crowdsourcing process was guarded with a check implemented on the server-side. To illustrate how this worked, consider a worker eventually arriving at the following 4-way MC question:

Term        Value
$f_S$       "pesticides cause pollution"
$f_L$       "Air pollution harms animals"
$f_C$       "pesticides can harm animals"
Question    "What can harm animals?"
Answer A    "pesticides"
Choice B    "manure"
Choice C    "grain"
Choice D    "hay"

Searching $F_L$. The worker was presented with $f_S$ "pesticides cause pollution" and asked to search through $F_L$ for a candidate $f_L$, which was compared to $f_S$ by a quality checker to make sure it was appropriate. For example, searching for "pollution harms animals" would surface a candidate $f_L$ "Tigers are fierce and harmful animals", which isn't admissible because it lacks overlap in significant words with $f_S$. On the other hand, the candidate $f_L$ "Air pollution harms animals" is admissible by this constraint.

Combining facts. When combining $f_S$ with $f_L$, the worker had to create a novel composition that omits significant words shared by $f_S$ and $f_L$. For example, "pollutants can harm animals" is inadmissible because it mentions "pollutants", which appears in both $f_S$ and $f_L$. On the other hand, "pesticides can harm animals" is admissible by this constraint. After composing $f_C$, the worker had to highlight words in common between $f_S$ and $f_C$, between $f_L$ and $f_C$, and between $f_S$ and $f_L$. The purpose of this exercise was to emphasize the linkages between the facts; if these linkages were difficult or impossible to form, the worker was advised to step back and compose a new $f_C$ or choose a new $f_L$ outright.

Creating a question and answer. After the worker proposed a question and answer, a quality check was performed to ensure both $f_S$ and $f_L$ were required to answer it. For example, given the question "What can harm animals" and answer "pollution", the quality check would judge it as inadmissible because of the bridge word "pollution" shared by $f_S$ and $f_L$ that was previously removed when forming $f_C$. The answer "pesticides" does meet this criterion and is admissible.

Creating distractor choices. To evaluate proposed distractor choices, two QA systems were presented with the question, answer and the choice. The distractor choice was considered a distraction if the system preferred the choice over the correct answer, as measured by the internal mechanisms of that system. One of the systems was based on information retrieval (IR), and the other was based on BERT; these were playfully named Irene AI and Bertram AI in the interface.
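The word-overlap checks described for the first three steps can be sketched as follows. This is an illustration under stated assumptions, not the server-side implementation: the stop-word list, the toy stemmer `crude_stem`, and all function names are hypothetical stand-ins for whatever the real quality checker used to decide which words are "significant" and when two words match.

```python
import re

# Hypothetical stop-word list; the real checker's list is not given in the text.
STOP_WORDS = {"a", "an", "the", "is", "are", "and", "or", "can", "what", "of", "to", "in"}


def crude_stem(word):
    """Toy stemmer (an assumption): drop a plural 's', then keep a 6-letter prefix."""
    if word.endswith("s") and len(word) > 3:
        word = word[:-1]
    return word[:6]


def significant(text):
    """Set of stemmed, stop-word-filtered tokens in `text`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {crude_stem(t) for t in tokens if t not in STOP_WORDS}


def fact_is_admissible(f_s, f_l):
    """A candidate f_L must share at least one significant word with f_S."""
    return bool(significant(f_s) & significant(f_l))


def omits_bridge(f_s, f_l, text):
    """A composed fact (or an answer) must not mention words shared by f_S and f_L."""
    bridge = significant(f_s) & significant(f_l)
    return not (bridge & significant(text))


f_s = "pesticides cause pollution"
f_l = "Air pollution harms animals"
```

With these toy definitions, the examples from the text behave as described: "Tigers are fierce and harmful animals" is rejected as a candidate $f_L$, and "pollutants can harm animals" is rejected as a composition because it reuses the bridge word.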

A.2 Crowdsourcing Task Architecture

Creating each item in this dataset seemed overwhelming for crowd workers, and it wasn't clear how to compensate them for the entire work. Initially we presented two tasks for independent workers. First, we collected a question and answer that could be answered by combining a science fact with one more general fact. Second, we collected distractor choices for this question that fooled our systems. However, the quality of submissions for both tasks was low. For the first task, we believe that the simple checklist accompanying the task was insufficient to prevent unreasonable submissions. For the second task, we believe the workers were not in the mindset of the first worker, so the distractor choices were too far removed from the context of the question and its facts. Ultimately we decided to combine the two simpler tasks into one longer and more complex task. Additional server-side quality checks gave instant feedback in each section, and numerous examples helped workers intuit what was desired. We believe this allowed workers to develop the fluency needed to provide many submissions of a consistent quality.

The crowd work was administered through Amazon Mechanical Turk (https://www.mturk.com). Participating workers needed to be in the United States, hold a Master's qualification, and have submitted thousands of HITs with most of them approved. The work spanned several batches, each having its results checked before starting the next. See Table 8 for collection statistics.

Table 8: Questions collected. Batches A-F required workers to have completed 10,000 HITs and had 99% of them accepted, while Batches G and H relaxed qualifications and required only 5,000 HITs with 95% accepted.

A.3 User Interface

Figures 4, 5, 6, and 7 at the end of this document show the interface presented to crowd workers. Each step included guidance from external quality checkers and QA systems, so the workers could make progress only if specific criteria were satisfied.

Starting with the 928 seed facts in $F_S$, this process led to 11,021 distinct human-generated questions. Workers were compensated US$1.25 for each question, and rewarded a bonus of US$0.25 for each question that distracted both QA systems. On average workers were paid US$15/hour. We next describe the baseline systems used in our process.

B.1 Seed Corpus Details

To obtain the set of seed facts $F_S$, we start with two medium-sized corpora of grade-school-level science facts: the WorldTree corpus (Jansen et al. 2018) and a collection of facts from the CK-12 Foundation. 12 Since the WorldTree corpus contains only facts covering elementary science, we used their annotation protocol to expand it to middle-school science. We then manually selected facts from these three sources that are amenable to creating 2-hop questions. 13 The resulting corpus $F_S$ contains a total of 928 facts: 356 facts from WorldTree, 123 from our middle-school extension, and 449 from CK-12.

Table 7 (excerpt):
Fact: Organisms that live in marine biomes must be adapted to the salt in the water.
Fact: Estuaries display characteristics of both marine and freshwater biomes.
Composed fact: Organisms that live in estuaries must be adapted to the salt in the water.

B.2 Large Corpus Details

The corpus of 73M web documents (281GB) originated from Clark et al. (2016). Processing this corpus involved using spaCy 14 to segment documents into sentences, a Python implementation of Google's langdetect 15 to identify English-language sentences, ftfy 16 to correct Unicode encoding problems, and custom heuristics to exclude sentences with artifacts of web scraping such as HTML, CSS and JavaScript markup, runs of numbers originating from tables, email addresses, URLs, page navigation fragments, etc. The result of this process was 17M sentences (1GB).
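As an illustration of the last step only (the custom heuristics), the following Python sketch rejects sentences that contain typical web-scraping artifacts. The regular expressions and the name `looks_like_clean_sentence` are hypothetical; the actual heuristics are not specified here, and the sentence segmentation, language detection and Unicode repair steps (spaCy, langdetect, ftfy) are deliberately left out.

```python
import re

# Hypothetical patterns approximating the artifact classes named in the text.
ARTIFACT_PATTERNS = [
    re.compile(r"<[^>]+>"),                          # HTML tags
    re.compile(r"https?://\S+|www\.\S+"),            # URLs
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),         # email addresses
    re.compile(r"(?:\d+(?:[.,]\d+)?\s+){3,}\d+"),    # runs of four or more numbers
    re.compile(r"\{[^{}]*[:;][^{}]*\}"),             # CSS/JavaScript-like brace blocks
]


def looks_like_clean_sentence(sentence):
    """Return True if none of the artifact patterns match the sentence."""
    return not any(p.search(sentence) for p in ARTIFACT_PATTERNS)


samples = [
    "Air pollution also harms plants and animals.",
    "Click <a href='x'>here</a> to continue.",
    "Contact us at info@example.org today.",
    "12 45 78 90 33",
    "body { color: red; }",
]
kept = [s for s in samples if looks_like_clean_sentence(s)]
```

A filter of this kind trades recall for precision: a sentence with a single number, such as "Water boils at 100 degrees.", still passes, while table-like runs of numbers are dropped.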

B.3 Other Corpora

Both the IR scorer and the Regent's subset of the Science questions use the default science corpus used in prior work (Clark et al. 2016) containing 280 GB of plain text extracted from Web pages and 80,000 sentences from domain-targeted sources. For the OpenBookQA questions, we used the associated openbook facts and used the ARC corpus for the ARC questions.

C BERT-MCQ Model

Given question $q$ and an answer choice $c_i$, we create

[CLS] $q$ [SEP] $c_i$ [SEP]

as the input to the model, with $q$ being assigned to segment 0 and $c_i$ to segment 1. 17 The model learns a linear layer to project the representation of the [CLS] token to a score for each choice $c_i$. We normalize the scores across all answer choices using softmax and train the model using the cross-entropy loss. When a context/passage is available, we add the passage to the 0th segment, i.e., given a retrieved passage $p_i$, we provide

[CLS] $p_i$ $q$ [SEP] $c_i$ [SEP]

as the input to the BERT model.
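The input layout and the scoring objective can be made concrete with a small, dependency-free Python sketch. It builds the input string for one choice and applies softmax and cross-entropy to a list of per-choice scores. The function names are hypothetical, the scores stand in for the output of the linear layer over the [CLS] representation, and no actual BERT model or tokenizer is involved.

```python
import math


def build_input(question, choice, passage=None):
    """Lay out segment 0 (optional passage + question) and segment 1 (choice)."""
    segment_0 = f"{passage} {question}" if passage else question
    return f"[CLS] {segment_0} [SEP] {choice} [SEP]"


def softmax(scores):
    """Normalize per-choice scores into a probability distribution."""
    shift = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, gold_index):
    """Negative log-probability of the correct choice."""
    return -math.log(softmax(scores)[gold_index])


example = build_input("What can harm animals?", "pesticides",
                      passage="pesticides cause pollution")
```

Each of the answer choices yields one such input, and the softmax is taken across the resulting scores for a single question.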

D Dataset Split

We use the seed fact $f_S$ to compute the similarity between two questions. We use the tf-idf weighted overlap score (not normalized by length) between the two seed facts to compute this similarity score. Creating the train/dev/test split can then be viewed as finding the optimal split in this graph of connected seed facts. Optimality is defined as finding a split such that the similarity between any pair of seed facts in different folds is low. Additionally, the train/dev/test splits should contain about 80/10/10% of the questions. Note that each seed fact can have a variable number of associated questions. Given these constraints, we defined a small Integer Linear Program to find the train/dev/test split. We defined a variable $n_{ij}$ for each seed fact $f_i$ being a part of each fold $t_j$, where only one of the variables per seed fact can be set to one ($\sum_j n_{ij} = 1$). We defined an edge variable $e_{ik}$ between every pair of seed facts that is set to one if the two facts $f_i$ and $f_k$ are in different folds. The objective function, to be minimized, is computed as the sum of the similarities between the facts in different folds, i.e., $\sum_{i,k} e_{ik}\,\mathrm{sim}(f_i, f_k)$. Additionally, the train/dev/test folds were constrained to have 78/11/11% of the questions (additional dev/test questions to account for the validation filtering step) with a slack of 1% (e.g., for $t_j$ = train, $0.77 \times Q \le \sum_i n_{ij} q_i \le 0.79 \times Q$, where $Q$ is the total number of questions and $q_i$ is the number of questions using $f_i$ as the seed fact). To make the program more efficient, we ignore edges with a low tf-idf similarity (threshold set to 10 in our experiments to ensure completion within an hour using GLPK).
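To make the objective and constraints concrete, the sketch below solves a toy instance by exhaustive search. This is a substitute technique chosen for illustration: it enumerates all fold assignments rather than formulating an ILP, so it is only feasible for a handful of seed facts, and it does not use GLPK. The function name `best_split` and the toy similarities are hypothetical.

```python
from itertools import product


def best_split(sim, questions_per_fact, targets, slack):
    """Find the fold assignment minimizing cross-fold similarity.

    sim:                dict {(i, k): similarity} over pairs of seed facts
    questions_per_fact: q_i, the number of questions built on seed fact i
    targets:            desired fraction of questions in each fold
    slack:              allowed deviation from each target, as a fraction of Q
    """
    total = sum(questions_per_fact)  # Q
    n_folds = len(targets)
    best = None
    for assign in product(range(n_folds), repeat=len(questions_per_fact)):
        sizes = [0] * n_folds
        for fact, fold in enumerate(assign):
            sizes[fold] += questions_per_fact[fact]
        # Size constraint: (target - slack) * Q <= fold size <= (target + slack) * Q
        if any(abs(sizes[j] - targets[j] * total) > slack * total for j in range(n_folds)):
            continue
        # Objective: sum of sim(f_i, f_k) over pairs placed in different folds
        cost = sum(s for (i, k), s in sim.items() if assign[i] != assign[k])
        if best is None or cost < best[0]:
            best = (cost, assign)
    return best


# Four toy seed facts, one question each; folds are (train, dev, test).
toy_sim = {(0, 1): 5.0, (2, 3): 0.5, (0, 2): 1.0, (1, 3): 1.0}
cost, assignment = best_split(toy_sim, [1, 1, 1, 1], targets=(0.5, 0.25, 0.25), slack=0.01)
```

In the toy instance the two most similar facts (0 and 1) end up together in the train fold, which is the behavior the objective is designed to encourage.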

E Distractor Options

We create the candidate distractors from the correct answers of questions from the same fold. To ensure that most of these distractors are incorrect answers for the question, we pick the correct answers from questions (within the same fold) most dissimilar to this question. Rather than relying on the question's surface form (which does not capture topical similarity), we use the underlying source facts used to create the question. Intuitively, questions based on the fact "Metals conduct electricity" are more likely to have similar correct answers (metal conductors) compared to questions that use the word "metal". Further, we restrict the answer choices to those that are similar in length to the correct answer, to ensure that a model cannot rely on text length to identify the correct answer.

We reverse sort the questions based on the word-overlap similarity between the two source facts and then use the correct answer choices from the top 300 most-dissimilar questions. We only consider answer candidates that have at most two additional/fewer tokens and at most 50% additional/fewer characters compared to the correct answer. We next use the BERT baseline model used for the dataset creation, fine-tuned on random 8-way MCQ, and evaluate it on all of the distractor answer choices. We pick the top 30 most distracting choices (i.e., highest scoring choices) to create the final set of distractors.
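The length filter and the final top-30 selection can be sketched as below. The sketch assumes whitespace tokenization and that the character limit is measured relative to the length of the correct answer; both are assumptions, as are the function names. The model scores are supplied by the caller and stand in for the BERT baseline's scores.

```python
def similar_length(candidate, answer, max_token_diff=2, max_char_frac=0.5):
    """Keep candidates within 2 tokens and 50% of the characters of the answer."""
    token_diff = abs(len(candidate.split()) - len(answer.split()))
    char_diff = abs(len(candidate) - len(answer))
    return token_diff <= max_token_diff and char_diff <= max_char_frac * len(answer)


def pick_distractors(scored_candidates, answer, n=30):
    """Return the n highest-scoring candidates that pass the length filter.

    scored_candidates: list of (candidate_text, model_score) pairs
    """
    eligible = [(text, score) for text, score in scored_candidates
                if similar_length(text, answer)]
    eligible.sort(key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in eligible[:n]]


answer = "carbon dioxide in groundwater"
pool = [("oxygen in groundwater", 0.2), ("hay", 0.9),
        ("magma in groundwater", 0.4), ("pure oxygen", 0.1)]
chosen = pick_distractors(pool, answer, n=2)
```

In this toy pool, "hay" has the highest score but is dropped for being three tokens shorter than the answer, and "pure oxygen" fails the character limit.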

F OpenBookQA Baseline Models

We use three baseline models used by Mihaylov et al. (2018) for the OpenBookQA dataset:

• Odd-one-out: Answers the question based on just the choices by identifying the most dissimilar answer
• ESIM Question-Choice + Elmo: Uses the ESIM (Chen et al. 2017)

Each retrieval query is run against an ElasticSearch 18 index built over $F_{QASC}$, with retrieved sentences filtered to reduce noise as described by Clark et al. (2018). The retrieval time of this approach scales linearly with the breadth of our search defined by $K$ (set to 20 in our experiments). We set $L$ to a small value (4 in our experiments) to promote diversity in our results and not let a single high-scoring fact from $F_1$ overwhelm the results.

18 https://www.elastic.co

G.2 BERT Models

For BERT-Large models, we were only able to fit sequences of length 184 tokens in memory, especially when running on 8-way MC questions. We fixed the learning rate to 1e-5 as it generally performed the best on our datasets. We only varied the number of training epochs: {4, 5} and effective batch sizes: {16, 32}. We chose this small hyper-parameter sweep to ensure that each model was fine-tuned using the same hyper-parameter sweep while not being prohibitively expensive. Each model was selected based on the best validation set accuracy. We used the HuggingFace implementation 19 of the BERT models within the AllenNLP (Gardner et al. 2017) repository in all our experiments. On publication, we will release the code and models to enable reproducibility of our results.
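The sweep amounts to four configurations per model. A minimal sketch of the selection loop follows; `train_and_validate` is a hypothetical callable, supplied by the caller, that fine-tunes a model with the given configuration and returns its validation accuracy. The stand-in scorer below exists only to make the example runnable and does not reflect any reported result.

```python
from itertools import product

LEARNING_RATE = 1e-5
EPOCHS = (4, 5)
BATCH_SIZES = (16, 32)


def sweep_configs():
    """Enumerate the 2 x 2 grid of training configurations."""
    return [{"lr": LEARNING_RATE, "epochs": e, "batch_size": b}
            for e, b in product(EPOCHS, BATCH_SIZES)]


def select_best(train_and_validate):
    """Pick the configuration with the highest validation accuracy."""
    return max(sweep_configs(), key=train_and_validate)


def fake_validation_accuracy(config):
    """Stand-in scorer for illustration only; not a real training run."""
    return 0.01 * config["epochs"] + (0.02 if config["batch_size"] == 16 else 0.0)


best = select_best(fake_validation_accuracy)
```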

H Overlap Statistics

                              % Questions
min. sim(fact, q+a) < 2       48.6
min. sim(fact, q+a) < 3       82.5
min. sim(fact, q+a) < 4       96.3

                              Avg. tokens
sim(q+a, $f_S$)               3.17
sim(q+a, $f_L$)               1.98

Table 9: Overlap statistics computed over the crowd-sourced questions. In the first three rows, we compute the % of questions for which at least one fact has fewer than k tokens overlapping with q+a. In the last two rows, we compute the average number of tokens that overlap with q+a from each fact. We use stop-word filtered stemmed tokens for these calculations.

Creating Multiple-Choice Questions To Fool An AI

We are building a question-answering system to answer simple questions about elementary science, as part of an artificial intelligence (AI) research project. We are planning to eventually make the system available as a free, open source resource on the internet. In particular, we want the computer to answer questions that require two facts to be combined together (i.e., reasoning), and not just simple questions that a sentence on the Web can answer.

To test our system, we need a collection of "hard" (for the computer) questions, i.e., questions that combine two facts. The HIT here is to write a test question that requires CHAINING two facts (a science fact and some other fact) to be combined.

For this HIT, it's important to understand what a "chain" is. Chains are when two facts connect together to produce a third fact. Some examples of chains are:

pesticides cause pollution + pollution can harm animals → pesticides can harm animals
a solar panel converts sunlight into electricity + sunlight comes from the sun → a solar panel converts energy from the sun into electricity
running requires a lot of energy + doing a marathon involves lots of running → doing a marathon requires a lot of energy

There should be some part of the first and second fact which is not mentioned in the third fact. Above, these would be the red words "pollution", "sunlight", and "running" respectively.

Step-by-step example of this HIT

1. You will be given a science fact (Fact 1), e.g., pesticides cause pollution
2. Search for a second fact (Fact 2) that you can combine with it, e.g., pollution can harm animals
3. Write down the combination, e.g., pesticides can harm animals
4. Highlight:
in green, words that Fact 1 and the Combined Fact share
in blue, words that Fact 2 and the Combined Fact share
in red, words that Fact 1 and Fact 2 share

For example: pesticides cause pollution (Fact 1) + pollution can harm animals (Fact 2) → pesticides can harm animals (Combined Fact)

This highlighting checks that you have a valid chain.

When you highlight words, they don't have to match exactly, e.g., survival and survive match.
For each color, you only need to highlight one word that overlaps, but feel free to highlight more.
The combined fact doesn't need to just use words from Fact 1 and Fact 2; you can use other words too and vary the wording so the combined fact sounds fluent.
There may be several ways of making a question out of your combined question; you can choose! It's typically just a simple rewording of the combined fact.
The answer options can be a word, phrase, or even a sentence.

5. Flip Your Combined

Thank you for your help! You rock!

Figure 4: MTurk HIT Instructions. Additional examples were revealed with a button.

Part 1: Find a fact to combine with a science fact

Here is your science fact ("Fact 1"): pesticides cause pollution Now, use the search box below to search for a second fact ("Fact 2") which can be combined with it. This fact should have at least one word in common with Fact 1 above so it can be combined, so use at least one Fact 1 word in your search query.

Enter your search words:

Now select a Fact 2 that you can combine with Fact 1 (or search again). It will be checked automatically.

As a reminder, you are looking for a Fact 2 like those bolded below, which you can combine with Fact 1 to form a new fact in

In your search query, use at least one word from Fact 1 so that Fact 1 and Fact 2 can be combined.
Search queries with two or more words work best.
Imagine what a possible Fact 2 might be, and see if you can find it, or something similar to it, in the text corpus.
It's okay if your selected sentence also includes some irrelevant information, providing it has the information you want to combine with Fact 1.

About Results:

We search through sentences gathered during a web crawl with minimal filtering. Some things will not look like facts, or may be offensive. Please ignore these.

Quality Check

Here's an automated check of your chosen Fact 2 as it relates to Fact 1:

Evaluation of Fact 2: Air pollution also harms plants and animals.

Good job! You selected a Fact 2 that overlaps with Fact 1 on these significant words: "pollution"

1 Questions, annotated facts, and corpora are available at https://github.com/allenai/qasc.

2 http://www.nysedregents.org/livingenvironment

3 The scores are not 0% as crowdworkers were not required to distract both systems for every question.

4 See Table 9 in Appendix H for more details.

5 https://www.elastic.co

6 Experiments section contains numbers for other QA models.

7 We use the same single-step retrieval as used by other BERT-based systems on ARC and OpenBookQA leaderboards.

8 Since we use normalized probabilities as model scores, we do not normalize them here.

10 Their knowledge-based models do not scale to our corpus of 17M sentences.

12 https://www.ck12.org

13 Note that while this is a subjective decision, our main goal was to get a reasonable set of seed facts for this task.

14 https://spacy.io/

15 https://pypi.org/project/spacy-langdetect/

16 https://github.com/LuminosoInsight/python-ftfy

17 We assume familiarity with BERT's notation such as [CLS], [SEP], uncased models, and masking (Devlin et al. 2019).

19 https://github.com/huggingface/pytorch-pretrained-BERT

Figure 5. Not extracted; please refer to original document.

Figure 6: MTurk HIT. Combining $f_S$ and $f_L$ to make $f_C$.

Figure 7: MTurk HIT. Creating question, answer and distractor choices.