
Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering


Abstract

We present a new kind of question answering dataset, OpenBookQA, modeled after open book exams for assessing human understanding of a subject. The open book that comes with our questions is a set of 1326 elementary level science facts. Roughly 6000 questions probe an understanding of these facts and their application to novel situations. This requires combining an open book fact (e.g., metals conduct electricity) with broad common knowledge (e.g., a suit of armor is made of metal) obtained from other sources. While existing QA datasets over documents or knowledge bases, being generally self-contained, focus on linguistic understanding, OpenBookQA probes a deeper understanding of both the topic—in the context of common knowledge—and the language it is expressed in. Human performance on OpenBookQA is close to 92%, but many state-of-the-art pre-trained QA methods perform surprisingly poorly, worse than several simple neural baselines we develop. Our oracle experiments designed to circumvent the knowledge retrieval bottleneck demonstrate the value of both the open book and additional facts. We leave it as a challenge to solve the retrieval problem in this multi-hop setting and to close the large gap to human performance.

1 Introduction

Open book exams are a common mechanism for assessing human understanding of a subject, where test takers are allowed free access to a relevant book, study guide, or class notes when answering questions. In this context, the goal is not to evaluate memorization but a deeper understanding of the material and its application to new situations (Jenkins, 1995; Landsberger, 1996). The application, in turn, often requires combining a fact in the book (e.g., metals conduct electricity) with additional common knowledge the test taker is expected to have acquired by this stage (e.g., a suit of armor is made of metal).

Question: Which of these would let the most heat travel through? A) a new pair of jeans. B) a steel spoon in a cafeteria. C) a cotton candy at a store. D) a calvin klein cotton hat.

Science Fact: Metal is a thermal conductor.

Common Knowledge: Steel is made of metal. Heat travels through a thermal conductor.

Motivated by this setting, we present a new kind of question answering dataset, OpenBookQA, 1 that consists of two parts: Q, a set of 5957 multiple-choice questions, and F, a set of 1326 diverse facts about elementary level science. F has three key characteristics of an 'open book': (a) it forms the basis for generating Q; (b) it has been deemed central to scientific explanations (Jansen et al., 2018); and (c) by itself, F is generally insufficient to answer questions in Q. Faced with a question q ∈ Q, a student or system S is expected to retrieve a relevant fact f ∈ F and appeal to their own common knowledge, K_S, when applying f to answer q. Figure 1 provides an example. Here, "metals are thermal conductors" is a core scientific fact available in F. One way to apply this fact to decide whether a steel spoon would let the most heat travel through is to appeal to the common knowledge that steel is metallic and that heat travels through thermal conductors. In general, the expected common knowledge is relatively simple (taxonomic facts, definitions, object properties, etc.); the difficulty lies in identifying it and meaningfully combining it with a core fact from F to answer the question.

Figure 1: An example for a question with a given set of choices and supporting facts.

OpenBookQA questions are challenging as they require multi-hop reasoning with partial context provided by F. Specifically, unlike existing datasets for reading comprehension (RC), answering questions on the back of a textbook (TQA), 2 and question answering over structured knowledge bases (KBQA), the open book F that comes with OpenBookQA is not self-contained. A successful system must therefore go beyond the typical challenges such as paraphrase matching and coreference resolution, without benefiting from the canonicalized and complete information in KBQA. Generating interesting open book questions is a difficult task. We used a multi-stage process: starting with F, we used crowd-sourcing to generate (noisy) questions based on F that probe novel situations, an automatic filter to ensure hardness for retrieval- and association-based systems, a crowd filter to ensure answerability by a lay person, and an expert filter to ensure higher quality in the Dev and Test sets.

We evaluate a number of existing QA systems for science (without retraining) on OpenBookQA, finding that they perform surprisingly close to the random guessing baseline of 25%. Human performance, on the other hand, is close to 92%. 3 Motivated by recent findings on the gameability of NLP datasets (Gururangan et al., 2018), we also develop and evaluate simple, attention-based, neural baselines, including a plausible answer detector (which ignores the question text completely) and an odd-one-out solver. These highlight the inevitable human bias in any crowdsourced dataset, increasing performance on OpenBookQA to 48%.

Building upon a recent neural model for incorporating external knowledge in the story cloze setting (Mihaylov and Frank, 2018), we propose a knowledge-aware neural baseline that can utilize both the open book F and common knowledge retrieved from sources such as ConceptNet (Speer et al., 2017). While retrieving the most useful pieces of knowledge remains an open challenge, our 'oracle' experiments, which use the fact f from which a question q was generated and the question author's interpretation of the additional knowledge k needed for q, provide valuable insight into the nature of this dataset: facts from the open book F are valuable (5% improvement) but not sufficient. Using both f and k increases the accuracy to 76%, but this is still far from human-level performance, suggesting the need for non-trivial reasoning to combine these facts.

To encourage further research on this new task, for each Train and Dev question q, OpenBookQA also includes f as an intermediate supervision signal, which may be viewed as a partial explanation for q. We leave closing the large gap to human performance as a challenge for the NLP community.

2 Related Work

By construction, answering OpenBookQA questions requires (i) some base science facts from a provided 'open book', (ii) broader understanding about the world (common or commonsense knowledge), and (iii) an ability to combine these facts (reasoning). This setup differs from several existing QA tasks, as summarized below.

Reading Comprehension (RC) datasets have been proposed as benchmarks to evaluate the ability of systems to understand a document by answering factoid-style questions over this document. These datasets have taken various forms: multiple-choice (Richardson et al., 2013), cloze-style (Hermann et al., 2015; Onishi et al., 2016; Hill et al., 2016), and span prediction (Rajpurkar et al., 2016; Trischler et al., 2017; Joshi et al., 2017). However, analysis (Chen et al., 2016; Sugawara et al., 2017) of these datasets has shown that many of the questions can be solved with context token matching (Chen et al., 2017a; Weissenborn et al., 2017) or relatively simple paraphrasing.

To focus on the more challenging problem of reasoning across sentences, new datasets have been proposed for multi-step RC. QAngaroo (Welbl et al., 2018) uses a knowledge base to identify entity pairs (s, o) with a known relation r that is also supported by a multi-hop path in a set of documents. It uses structured tuple queries (s, r, ?) and takes all the documents along the path as the input passage. NarrativeQA (Kociský et al., 2017) is an RC dataset that has been shown to require iterative reasoning about the narrative of a story. Similar to OpenBookQA, its questions were generated to ensure that the answer is not a direct match or paraphrase that can be retrieved with an IR approach. Most recently, Khashabi et al. (2018) proposed MultiRC, a multiple-choice RC dataset that is designed to require multi-sentence reasoning and can have multiple correct answers. Again, like most RC datasets, it is self-contained.

Tasks with external knowledge. While many of the RC datasets could benefit from commonsense or background knowledge, they are designed to be self-contained, i.e., solvable by the document context alone. Datasets such as the Story Cloze Test (Mostafazadeh et al., 2016) , MCScript, 4 and ProPara (Mishra et al., 2018) do require additional domain knowledge about everyday events, scripts, and processes, respectively. However, these datasets need domain-specific modeling of events, whereas OpenBookQA appeals to broad common knowledge cutting across a variety of types and topics. Stasaski and Hearst (2017) explore the creation of multi-hop questions and propose generating stronger distractors for the multiple-choice setting. Their work, however, starts with structured knowledge, specifically a Biology ontology.

Lastly, many Science Question Answering datasets have been released that need broad external knowledge to answer the questions. However, these questions are not associated with a core set of facts, i.e., an "open book" used to define them. As a result, the questions vary widely in style and complexity. In contrast, OpenBookQA focuses on a more well-defined subset of science QA, appealing to one core fact from the open book and one (or a few) relatively simple, commonly known supporting facts.

3 OpenBookQA Dataset

The OpenBookQA dataset consists of about 6,000 4-way multiple-choice questions, each associated with one core fact from a "book" F of 1326 such facts, and an auxiliary set K of about 6000 additional facts. The questions were created via a multi-stage crowdsourcing and partial expert filtering process, discussed in Section 3.1.

The small "book" F consists of recurring science themes and principles, each of which can be (and here is) instantiated into multiple questions.

For F, we use a subset of the WorldTree corpus which Jansen et al. (2018) have analyzed for sufficiency for elementary level science. The subset we use is taken from the 2287 WorldTree facts that were marked as "central" by the original authors in at least one explanation. We further filter them down to 1326 that appear general enough to be applicable to multiple situations.

OpenBookQA additionally requires broad common knowledge, which is expected to come from large corpora, such as ConceptNet, Wikipedia, or a corpus with 14M science-related sentences used by some existing baselines. The crowdsourcing process below also asks workers to mark a second fact, k, needed for each question q, in addition to f . These second facts, unfortunately, were often incomplete, over-complete, or only distantly related to q. We thus include in OpenBookQA the set K of such second facts only as auxiliary data for optional use. We emphasize that K should not be viewed as 'gold' additional facts, or as a substitute for broad common knowledge.

3.1 Crowdsourcing Process

The overall question generation and filtering pipeline is summarized in Figure 2. Given the "book" F of core facts, the process proceeds as follows, starting with an empty question set Q and an empty 'second facts' set K:

Figure 2: OpenBookQA question generation pipeline

1. A crowd-worker 5 w is shown a random science fact f from the set F.

2. w is asked to think of a second common fact, k, that may be combined with f to derive a new, valid assertion s.

3. w then converts s into a question-answer pair and extends this into a 4-way multiple-choice question by adding 3 incorrect answer choices, q_mc = (q, {c_1, c_2, c_3, c_4}), where one of the c_i's is the unique correct answer.

4. The system verifies that q_mc passes basic checks such as uniformity of answer choices. 6 (A sketch of these checks appears after this list.)

5. w then feeds the multiple-choice question q_mc to an information retrieval solver (Clark et al., 2016); if the solver answers q_mc correctly, w is asked to revise the question until the solver can no longer answer it.

6. Question q_mc is then shown to 5 new crowd-workers, who are asked to answer it.

7. If at least 4 out of 5 workers answer q mc correctly, it is deemed answerable and the process continues. If not, q mc is discarded.

8. The answer choices of q_mc are randomly shuffled to avoid unintended bias. 7

9. q_mc is associated with f as the core science fact and added to the question set Q. k is added to the set K of additional (noisy) facts.
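To make the automatic filter in step 4 concrete, here is a minimal sketch based solely on the criteria listed in footnote 6. The function name and the decision to scan both the question and the choices for negation words are our assumptions, not part of the released pipeline.

```python
NEGATION_WORDS = {
    "no", "none", "not", "isn't", "doesn't", "aren't", "don't", "won't",
    "except", "can't", "shouldn't", "wouldn't", "couldn't", "mustn't",
}

def passes_basic_checks(question, choices):
    """Basic checks from step 4, per footnote 6: exactly 4 choices, no
    negation words, and uniform choice length (all choices have at most
    3 words, or all have at least 4)."""
    if len(choices) != 4:
        return False
    tokens = question.lower().split() + [
        w for c in choices for w in c.lower().split()]
    if any(t.strip(".,?") in NEGATION_WORDS for t in tokens):
        return False
    lengths = [len(c.split()) for c in choices]
    return all(n <= 3 for n in lengths) or all(n >= 4 for n in lengths)
```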

The Dev and Test splits were further filtered by an in-house expert to ensure higher quality.

3.2 Human Performance

To assess human accuracy on this dataset, we consider the following model: each question q ∈ Q has some (unknown) human accuracy p_q, defined as the probability that a random human subject, chosen uniformly from a large pool H, would answer q correctly. Thus, we can think of this as defining a Bernoulli random variable, X_q ∼ B(p_q), whose mean is the (unknown) p_q. The average human accuracy on Q under this model is:

H(Q) = (1/|Q|) Σ_{q∈Q} p_q

where {p_q | q ∈ Q} are unknown. With H as the set of crowd-workers (cf. Footnote 5), Step 6 of the above question generation process is equivalent to obtaining 5 independent samples X_{q,i}, i ∈ I, |I| = 5, from B(p_q). We must, however, be careful when using this data to estimate p_q, as the same 5 samples were used to decide whether q makes it into the question set Q or not. For instance, if we had kept only those questions that all 5 workers answered correctly, it would clearly be inaccurate to claim that the human accuracy on Q is 100%. Nevertheless, it is possible to re-use the judgments from Step 6 to approximate H(Q) with high confidence, without posing the questions to new workers.

Intuitively, if all questions in Q were difficult to answer (i.e., all p_q were small), it would be unlikely that all |Q| questions would pass the test in Step 6. We can use the contrapositive of this observation to conclude that p_q, on average, must have been high for q ∈ Q.

Formally, aggregating across all questions gives the following empirical estimate of H(Q):

Ĥ(Q) = (1/|Q|) Σ_{q∈Q} (1/|I|) Σ_{i∈I} X_{q,i} = (1/(|Q||I|)) Σ_{q∈Q, i∈I} X_{q,i}

For the analysis, we assume all samples X_{q,i} are independent, i.e., every answer is obtained independently. 8 An application of Hoeffding's Inequality (Hoeffding, 1963) shows that Ĥ(Q) converges to H(Q) very rapidly as n = |Q||I| grows; specifically, Ĥ(Q) ≤ H(Q) + t with probability at least 1 − exp(−2nt^2), and similarly for Ĥ(Q) ≥ H(Q) − t. In our Dev and Test sets, where |Q| = 500 and |I| = 5, this translates into H(Q) being at least Ĥ(Q) − 3% with probability over 98.8%, and at least Ĥ(Q) − 2.5% with probability 95.6%; we report the former as our conservative estimate of human performance.
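As a sanity check on these numbers, the following minimal snippet (ours, not part of the paper's tooling) evaluates the Hoeffding bound 1 − exp(−2nt^2) for the Dev/Test setting:

```python
import math

def confidence_for_margin(num_questions, num_annotators, t):
    """Probability (via Hoeffding's inequality) that the true average human
    accuracy H(Q) is at least the empirical estimate minus t, given
    n = |Q| * |I| independent annotations."""
    n = num_questions * num_annotators
    return 1.0 - math.exp(-2.0 * n * t ** 2)

# Dev/Test setting from the paper: |Q| = 500, |I| = 5.
print(confidence_for_margin(500, 5, 0.03))   # ~0.989 -> "over 98.8%"
print(confidence_for_margin(500, 5, 0.025))  # ~0.956 -> "95.6%"
```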

3.3 Question Set Analysis

OpenBookQA consists of 5957 questions, with 4957/500/500 in the Train/Dev/Test splits. 9 Table 1 summarizes some statistics about the full dataset. Each question has exactly four answer choices and one associated fact used in the creation process. We report the average length of questions, candidate choices, and associated facts, as well as how often the longest/shortest choice is the correct one.

We analyzed 100 questions in the Train set to capture the kind of common knowledge and reasoning needed. For each, we wrote down the additional common knowledge needed to answer the question, beyond the original science fact. In 21% of the cases, the crowdsourced question actually tests for a fact that does not require the original science fact. For example, the question "On a rainy day the clouds are (A) low (B) white (C) small (D) gray" was written based on the science fact "clouds produce rain" but can be answered without it. We ignore such questions in our analysis. For the remaining questions, we categorized the additional facts into five high-level categories (and collapsed the remaining facts into a catch-all OTHERS category), based on previous approaches to similar science questions (Jansen et al., 2016):

1. ISA: Taxonomic facts such as isa(…, living thing), isa(granite, rock).
2. PROPERTY: Properties of objects such as madeof(belt buckle, metal), has(mammals, four legs), contains(lemon juice, citric acid).
3. DEFINITION: Definitions of objects that may be based on their appearance (tape is a plastic with markings), working mechanism (telescope is a device that uses mirrors to view objects), etc.
4. CAUSAL: Causal facts such as causes(adding lemon juice to milk, milk to break down).
5. BASIC: General scientific facts that do not fit the above, e.g., squirrels eat nuts for food.

Table 2 presents the proportions of these facts in our analyzed question set. For each type of fact, we calculate the percentage of questions that need at least one such fact (shown as % Questions). We also calculate the overall percentage of each fact type across all the common knowledge facts (shown as % Facts). Most of our questions need simple facts such as isa knowledge and properties of objects, further confirming the need for simple reasoning with common knowledge. Apart from these five major categories, the catch-all OTHERS category contains commonsense facts (e.g., it is dark at night), world knowledge (e.g., Japan is often hit by earthquakes), and lexical rewrites 10 (e.g., ad infinitum means over and over).

Table 1: Statistics for full OpenBookQA dataset. Parenthetical numbers next to each average are the max.
Table 2: Percentage of questions and facts for the five most common type of additional facts. Note that % Questions does not add up to 100% since we count the percentage of questions where at least one such fact is needed.

Most of our questions need simple facts that should be easily retrievable from any knowledge base or textual corpus. On average, each question needed 1.16 additional facts, ignoring any linguistic variations. Despite the simplicity of the knowledge needed for these questions, as we show empirically, most baseline approaches achieve a relatively low score on this dataset (even when the core fact is provided). We claim that this is because the reasoning needed to answer these questions is non-trivial. Table 3 shows a few questions with the associated facts and the high-level reasoning needed to answer them. Assuming a model can extract the described relations (e.g., defn, contains), the QA system still needs to be able to chain these facts together, identify the resulting relation, and verify its expression for each choice. In the extreme case (as shown in the last example), even though only one additional fact is needed to answer the question, the system must apply the core "general" science fact to a "specific" situation.

Table 3: Example training questions (with their correct choices marked) along with the facts and reasoning needed. In the last example, the science fact states that lhs="source of light becomes closer" implies rhs="source will appear brighter". Grounding this rule with the common-knowledge fact produces a new rule: "As the headlights of the car come closer, the headlights will appear brighter".

4 Baseline Models

We evaluate the performance of several baseline systems on the Dev and Test subsets of OpenBookQA. For each question, a solver receives 1 point towards its score if it chooses the correct answer, and 1/k if it reports a k-way tie that includes the correct answer. The "Guess All" baseline, which always outputs a 4-way tie, thus achieves a score of 25%, the same as the expected performance of a uniform random baseline.
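A minimal sketch of this tie-aware scoring rule; the function names are illustrative and not taken from an official evaluation script.

```python
def question_score(predicted_choices, correct_choice):
    """1 point for a unique correct prediction, 1/k for a k-way tie that
    includes the correct answer, 0 otherwise."""
    if correct_choice in predicted_choices:
        return 1.0 / len(predicted_choices)
    return 0.0

def solver_score(predictions, gold):
    """Average per-question score, as a percentage."""
    return 100.0 * sum(
        question_score(p, g) for p, g in zip(predictions, gold)) / len(gold)

# "Guess All" always reports a 4-way tie, so it scores exactly 25%.
guess_all = [{"A", "B", "C", "D"}] * 4
print(solver_score(guess_all, ["A", "B", "C", "D"]))  # 25.0
```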

4.1 No Training, External Knowledge Only

Since OpenBookQA is a set of elementary level science questions, one natural baseline category is existing systems that have proven effective on elementary- and middle-school level science exams. These pre-trained systems, however, rely only on their background knowledge and do not take the set F of core facts into account. Further, their knowledge sources and retrieval mechanisms are close to those used by the IR solver, which, by design, is guaranteed to fail on OpenBookQA. These two aspects place a natural limit on the effectiveness of these solvers on OpenBookQA, despite their excellent fit for the domain of multiple-choice science questions. We consider four such solvers.

PMI uses pointwise mutual information (PMI) to score each answer choice using statistics based on a corpus of 280 GB of plain text. It extracts unigrams, bigrams, trigrams, and skip-bigrams from the question q and each answer choice c i . Each answer choice is scored based on the average PMI across all pairs of question and answer n-grams.
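The scoring idea can be sketched as follows; `count`, `cooccur`, and `total` stand in for corpus statistics that the real solver computes over its 280 GB corpus (which also uses trigrams and skip-bigrams), so this illustrates the formula rather than the actual system.

```python
import math
from itertools import product

def ngrams(text, n):
    toks = text.lower().split()
    return [" ".join(toks[i:i + n]) for i in range(len(toks) - n + 1)]

def pmi(x, y, count, cooccur, total):
    """PMI(x, y) = log[ p(x, y) / (p(x) p(y)) ]. `count` and `cooccur` are
    hypothetical corpus-statistics lookups; return 0 if any count is missing."""
    px, py, pxy = count(x) / total, count(y) / total, cooccur(x, y) / total
    if not (px and py and pxy):
        return 0.0
    return math.log(pxy / (px * py))

def pmi_score(question, choice, count, cooccur, total):
    """Average PMI over all (question n-gram, choice n-gram) pairs."""
    q_grams = ngrams(question, 1) + ngrams(question, 2)
    c_grams = ngrams(choice, 1) + ngrams(choice, 2)
    pairs = list(product(q_grams, c_grams))
    return sum(pmi(x, y, count, cooccur, total)
               for x, y in pairs) / max(len(pairs), 1)
```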

TableILP is an Integer Linear Programming (ILP) based reasoning system designed for science questions. It operates over semi-structured relational tables of knowledge and scores each answer choice based on the optimal (as defined by the ILP objective) "support graph" connecting the question to that answer through table rows. Its small set of knowledge tables, however, often results in missing knowledge, leaving TableILP unable to answer 24% of the OpenBookQA questions at all.

TupleInference (Khot et al., 2017) , also an ILP-based QA system, uses Open IE tuples (Banko et al., 2007) as its semi-structured representation. It builds these subject-verb-object tuples on-the-fly by retrieving text for each question from a large corpus. It then defines an ILP program to combine evidence from multiple tuples.

DGEM is a neural entailment model that also uses Open IE to produce a semi-structured representation. We use its adaptation to multiple-choice question answering, which works as follows: (1) convert q and each c_i into a hypothesis h_i, and each retrieved fact into a premise p_j; and (2) return the answer choice with the highest entailment score, arg max_i max_j e(p_j, h_i).
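A minimal sketch of this entailment-based adaptation; `entailment_score` is a placeholder for any entailment model (e.g., DGEM), and the naive hypothesis construction is our assumption, not the system's actual question-to-hypothesis conversion.

```python
def answer_with_entailment(question, choices, premises, entailment_score):
    """Pick the choice whose hypothesis is best supported by any retrieved
    premise: arg max over choices of max over premises of e(p, h)."""
    def hypothesis(q, c):
        # Naive hypothesis construction, for illustration only.
        return q.rstrip("?") + " " + c

    best_choice, best_score = None, float("-inf")
    for c in choices:
        h = hypothesis(question, c)
        support = max(entailment_score(p, h) for p in premises)
        if support > best_score:
            best_choice, best_score = c, support
    return best_choice
```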

4.2 No Training; F And Extr. Knowledge

We also consider providing the set F of core facts to two existing solvers: the IR solver (to assess how far simple word overlap can get) and the TupleInference solver.

4.3 Trained Models, No Knowledge

We consider several neural baseline models that are trained on the Train set of OpenBookQA. For ease of explanation, we first define the notation used in our models. For a given question q_mc = (q, {c_1, c_2, c_3, c_4}), we define the set of token sequences S = {q, c_1, c_2, c_3, c_4}. For each token sequence s ∈ S, w^s_j is the j-th token and e^s_j = Emb(w^s_j) is its embedding. We use n_s for the number of tokens in s and d for the dimensionality of the embeddings. 11 We model multiple-choice QA as multi-class classification: given q_mc, predict one of four class labels L = {1, 2, 3, 4}, where the true label is the correct answer index.

Embeddings + Similarities as Features. We first experiment with a simple logistic regression model (Mihaylov and Nakov, 2016; Mihaylov and Frank, 2016, 2017) that uses centroid vectors r^{emb}_s of the word embeddings of the tokens in s, together with the cosine similarity r^{cos}_{q,c_i} between the question and each answer choice:

r^{emb}_s = (1/n_s) Σ_{j=1}^{n_s} e^s_j ∈ R^d

r^{cos}_{q,c_i} = cos(r^{emb}_q, r^{emb}_{c_i}) ∈ R^1

For each training instance, we build a feature representation f by concatenating these vectors and train an L2-regularized logistic regression classifier:

f = [r^{emb}_q; r^{emb}_{c_{1..4}}; r^{cos}_{q,c_{1..4}}] ∈ R^{5d+4}
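A minimal numpy sketch of this feature construction, assuming a pre-loaded token-to-vector mapping such as GloVe; the resulting vector can then be fed to any off-the-shelf logistic regression implementation.

```python
import numpy as np

def centroid(tokens, emb, d=300):
    """Average the word vectors of `tokens`; `emb` maps token -> np.ndarray
    of size d (e.g., GloVe). Unknown tokens are skipped."""
    vecs = [emb[t] for t in tokens if t in emb]
    return np.mean(vecs, axis=0) if vecs else np.zeros(d)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

def features(question_tokens, choice_token_lists, emb, d=300):
    """f = [r_emb_q ; r_emb_c1..4 ; r_cos_q,c1..4] in R^(5d+4)."""
    r_q = centroid(question_tokens, emb, d)
    r_cs = [centroid(c, emb, d) for c in choice_token_lists]
    cos_feats = [cosine(r_q, r_c) for r_c in r_cs]
    return np.concatenate([r_q, *r_cs, cos_feats])
```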

BiLSTM Max-Out Baselines. As a simple neural baseline, we adapt the BiLSTM max-out model (Conneau et al., 2017) to our QA task. That is, we first encode the question tokens and choice tokens w^s_{1..n_s} independently with a bi-directional context encoder (LSTM), obtaining contextual (ctx) representations h^{ctx}_{s_{1..n_s}} = BiLSTM(e^s_{1..n_s}) ∈ R^{n_s × 2h}. Next, we perform an element-wise max over the encoded representations h^{ctx}_{s_{1..n_s}} to construct a single vector:

r^{ctx}_s = max(h^{ctx}_{s_{1..n_s}}) ∈ R^{2h}    (1)
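A minimal PyTorch sketch of this encoder; the hidden size is illustrative rather than the paper's exact setting.

```python
import torch
import torch.nn as nn

class BiLSTMMaxout(nn.Module):
    """Encode a token sequence with a BiLSTM and max-pool over time (Eq. 1):
    r_ctx_s is the element-wise max over h_ctx_s_1..n_s, a 2h-dim vector."""
    def __init__(self, d=300, h=128):
        super().__init__()
        self.lstm = nn.LSTM(d, h, batch_first=True, bidirectional=True)

    def forward(self, emb):           # emb: (batch, n_s, d)
        ctx, _ = self.lstm(emb)       # (batch, n_s, 2h)
        return ctx.max(dim=1).values  # (batch, 2h)

enc = BiLSTMMaxout()
r_ctx = enc(torch.randn(2, 7, 300))   # two 7-token sequences -> (2, 256)
```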

Given the contextual representations for each token sequence, we experiment with three configurations for using these representations for QA:

(a) Plausible Answer Detector. This baseline goes to the extreme of completely ignoring q and instead learns how plausible it is for c_i to be the correct answer to some question in this domain. This captures the fact that certain choices, like 'a magical place' or 'flying cats', are highly unlikely to be the correct answer to a science question without negation (which is the case for OpenBookQA).

We implement the plausible answer detector using a choice-only model that scores each choice as

α_{c_i} = W_c^T r^{ctx}_{c_i} ∈ R^1,

where W_c ∈ R^{2h} is a weight vector optimized during training and i ∈ {1..4} indexes the choices. The answer is then obtained from the choice scores α_{c_{1..4}} as arg max(softmax(α_{c_{1..4}})), where softmax(α_{c_i}) = exp(α_{c_i}) / Σ_{j=1}^{4} exp(α_{c_j}) as usual.
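A minimal sketch of the choice-only scorer, reusing the BiLSTM max-out encoder from the previous sketch; shapes and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

class PlausibleAnswerDetector(nn.Module):
    """Choice-only scorer: alpha_ci = W_c^T r_ctx_ci, ignoring the question.
    `encoder` is any module mapping token embeddings to a 2h-dim vector,
    e.g., the BiLSTMMaxout encoder sketched above."""
    def __init__(self, encoder, two_h=256):
        super().__init__()
        self.encoder = encoder
        self.w_c = nn.Linear(two_h, 1, bias=False)  # weight vector W_c

    def forward(self, choice_embs):  # list of 4 tensors (batch, n_ci, d)
        scores = torch.cat(
            [self.w_c(self.encoder(c)) for c in choice_embs], dim=1)
        return torch.softmax(scores, dim=1)  # (batch, 4); argmax = answer

detector = PlausibleAnswerDetector(BiLSTMMaxout())
probs = detector([torch.randn(2, 5, 300) for _ in range(4)])  # (2, 4)
```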

(b) Odd-One-Out Solver. It considers all 4 answer options jointly and selects the one that is least similar to the others. This captures bias in human authored questions arising from the fact that creating good quality incorrect answers is difficult. Workers generally start with the correct answer, and then come up with three incorrect ones. The latter often tend to be homogeneous or share other common properties (e.g., non-scientific terms) uncharacteristic of the correct answer.

We implement this using a choice-to-choices attention model. For each choice c_i, we calculate its attention to every other choice, α_{c_i,c_j}, sum these values to obtain the attention of c_i to the rest of the choices, α_{c_i2c_rest}, and return the choice with the lowest sum. The attention is computed as α_{c_i,c_j} = Att(r^{ctx}_{c_i}, r^{ctx}_{c_j}), where

Att(u, v) = W^T [u; v; u • v; |u − v|] ∈ R^1

is a linear attention function and W ∈ R^{8h} is a weight vector. We then compute α_{c_i2c_rest} = Σ_{j=1, j≠i}^{4} α_{c_i,c_j} and select the answer with index a_{c2c_rest} = arg min(softmax(α_{c_{1..4}2c_rest})).
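A minimal sketch of the choice-to-choices attention; the module layout is our assumption, but it follows the Att form and the arg-min selection described above.

```python
import torch
import torch.nn as nn

class ChoiceToChoicesAttention(nn.Module):
    """Att(u, v) = W^T [u; v; u*v; |u - v|]; the odd-one-out solver sums each
    choice's attention to the other three and picks the least similar one."""
    def __init__(self, two_h=256):
        super().__init__()
        self.w = nn.Linear(4 * two_h, 1, bias=False)  # W in R^{8h}

    def att(self, u, v):
        return self.w(torch.cat([u, v, u * v, (u - v).abs()], dim=-1))

    def forward(self, r_choices):  # r_choices: (batch, 4, 2h) encoded choices
        batch, k, _ = r_choices.shape
        sums = []
        for i in range(k):
            others = [self.att(r_choices[:, i], r_choices[:, j])
                      for j in range(k) if j != i]
            sums.append(torch.stack(others, dim=0).sum(dim=0))  # (batch, 1)
        scores = torch.cat(sums, dim=1)                         # (batch, 4)
        return torch.softmax(scores, dim=1).argmin(dim=1)  # least similar
```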

(c) Question Match. This solver tries to predict which choice best matches the question, without relying on external knowledge. To achieve that, we compute an attention score α_{q,c_i} between q and each choice c_i as α_{q,c_i} = Att(r^{ctx}_q, r^{ctx}_{c_i}), and select the choice with the highest score. We also experiment with a variant where r^{ctx}_q and r^{ctx}_{c_i} are obtained using the token-wise interaction proposed in ESIM (Chen et al., 2017b).
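The question-match scorer can be sketched as a short reuse of the same attention form; here `att` is the pairwise attention function from the previous sketch.

```python
import torch

def question_match(att, r_q, r_choices):
    """alpha_{q,ci} = Att(r_ctx_q, r_ctx_ci); return the index of the best
    matching choice. r_q: (batch, 2h); r_choices: (batch, 4, 2h)."""
    scores = torch.cat(
        [att(r_q, r_choices[:, i]) for i in range(r_choices.size(1))], dim=1)
    return scores.argmax(dim=1)
```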

4.4 Trained Model With External Knowledge

Lastly, we implement a two-stage model for incorporating external common knowledge K. The first module performs information retrieval over K to select a fixed-size subset of potentially relevant facts K_{Q,C} for each instance in the dataset (see Appendix A). The second module is a neural network that takes (Q, C, K_{Q,C}) as input and predicts the answer a_{q,c} to question Q from the set of choices C.

Knowledge-Enhanced Reader. As a base knowledge-aware model, we use a variant of the model of Mihaylov and Frank (2018), implemented by extending our BiLSTM max-out question-match baseline (c). For each instance, the model reads the question q and the answers c_{1..4} independently and attends to the set of retrieved external knowledge facts K_{Q,C}. We encode each fact k_j from K_{Q,C} = k_{1..N_k} (where N_k is the number of facts) with the same BiLSTM used for q and c_{1..4}, and construct a single vector r^{ctx}_{k_j} ∈ R^{2h} using Eq. 1. Stacking these representations yields a knowledge memory matrix M_k = r^{ctx}_{k_{1..N_k}} ∈ R^{N_k × 2h}. Note that M_k is a dynamic memory: it is specific to each instance in the batch and re-encoded at each step during training. This memory is used to compute knowledge-aware (kn) representations of q and c_i, and the choice score is a linear combination of attentions between the ctx, kn, and ctx+kn representations:

r^{kn}_s = (M_k r^{ctx}_s)^T M_k ∈ R^{2h}

α_{q,c_i} = W^T [Att(r^{ctx}_q, r^{ctx}_{c_i}); Att(r^{kn}_q, r^{kn}_{c_i}); Att(r^{ctx+kn}_q, r^{ctx}_{c_i}); Att(r^{ctx}_q, r^{ctx+kn}_{c_i}); Att(r^{ctx}_q, r^{kn}_{c_i}); Att(r^{kn}_q, r^{ctx}_{c_i}); Att(r^{ctx+kn}_q, r^{kn}_{c_i}); Att(r^{kn}_q, r^{ctx+kn}_{c_i}); Att(r^{ctx+kn}_q, r^{ctx+kn}_{c_i})],

where W ∈ R^9 is a weight vector initialized with the ones vector and optimized during training. We then select the answer c_i with the highest score.
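A minimal sketch of the memory-attention step r^{kn}_s = (M_k r^{ctx}_s)^T M_k; whether a softmax is applied over the fact scores is a modeling choice we leave out here to stay close to the formula as written.

```python
import torch

def knowledge_attentive_reading(r_ctx, memory):
    """Score every encoded fact in the memory against the ctx representation,
    then take the score-weighted sum of the facts.
    r_ctx: (batch, 2h); memory (M_k): (batch, N_k, 2h)."""
    scores = torch.bmm(memory, r_ctx.unsqueeze(2))    # (batch, N_k, 1)
    r_kn = torch.bmm(scores.transpose(1, 2), memory)  # (batch, 1, 2h)
    return r_kn.squeeze(1)                            # (batch, 2h)
```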

5 Baseline Performance

The results for various baseline models are summarized in Table 4, grouped by method category. We make a few observations.

First, the task is largely solvable by a layperson, as evidenced by the 92% score of crowd-workers, measured as described in Section 3.2. We use annotations from Step 6 of the question generation process and report Ĥ(Q) − 3% as a conservative lower estimate. As an additional assessment, we also obtained 5 new annotations for 100 randomly chosen questions from each of the Train, Dev, and Test sets. The performance remained similar, at 88.6%, 90.2%, and 91.6%, respectively.

Table 4: Scores obtained by various solvers on OpenBookQA, reported as a percentage ± the standard deviation across 5 runs with different random seeds. Other baselines are described in the corresponding referenced section. For the oracle evaluation, we use the gold science fact f associated with each question, and optionally the additional fact k provided by the question author. Bold denotes the best Test score in each category.

The second group shows that pre-trained state-of-the-art solvers for multiple-choice science questions perform poorly. One explanation is their correlation with the IR method used for question filtering, as mentioned in Section 4.1.

The third group of results suggests that adding F to pre-trained models has a mixed effect, improving TupleInference by 8.7% but not changing DGEM. 12 Unlike DGEM, TupleInference relies on brittle word-overlap similarity measures very similar to the ones used by IR. Since IR (KB) gets 0% by design, TupleInference (KB) also has poor performance and adding F helps it find better support despite the brittle measures.

The fourth group demonstrates that carefully designed trainable neural models, even if simplistic and knowledge-free, can be surprisingly powerful. For example, the "plausible answer detector" can predict the correct answer with 49.6% accuracy without even looking at the question. The "odd-one-out" solver, by considering the other answer choices, raises this to 50.2%. The "question match" solver, which simply compares the BiLSTM max-out encoding of the question with that of each answer choice, also achieves 50.2%. 13 Similar findings have been reported for several recent datasets (Gururangan et al., 2018), making it imperative to perform such tests early.

Interestingly, all of these neural knowledge-free baselines simultaneously succeed on 34.4% of the Dev questions and simultaneously fail on 23.6%. For Question Match and ESIM we also experiment with ELMo (Peters et al., 2018), which improved their Test scores by 0.4% and 1.8%, respectively.

The final group demonstrates the need for external knowledge and deeper reasoning. When the "oracle" science fact f used by the question author is provided to the knowledge-enhanced reader, it improves over the knowledge-less models by about 5%. However, a large gap remains, showing that the core fact alone is insufficient to answer the question. When we also include facts retrieved from WordNet (Miller et al., 1990), the score improves by about 0.5%. Unlike the WordNet gain, adding ConceptNet (Speer et al., 2017) introduces a distraction and reduces the score. This suggests that ConceptNet is either not a good source of knowledge for our task, or that only a subset of its relations should be considered. Overall, external knowledge helps, although retrieving the right bits of knowledge remains difficult. In the last row of Table 4, we use the oracle core fact along with the question author's interpretation of the additional fact k. This increases the scores substantially, to about 76%. This big jump shows that improved knowledge retrieval should help on this task. At the same time, we are still not close to the human performance level of 92%, for several reasons: (a) the additional fact needed can be subjective, as hinted at by our earlier analysis; (b) the authored facts K tend to be noisy (incomplete, over-complete, or only distantly related), as also mentioned earlier; and (c) even given the true gold facts, performing reliable "reasoning" to link them properly remains a challenge.

Sample predictions and analysis of questions from Dev are provided in Appendix D.

6 Conclusion

We present a new dataset, OpenBookQA, of about 6000 questions for open book question answering. The task focuses on the challenge of combining a corpus of provided science facts (the open book) with external broad common knowledge. We show that this dataset requires simple common knowledge beyond the provided core facts, as well as multi-hop reasoning combining the two. While simple neural methods are able to achieve an accuracy of about 50%, this is still far from the human performance of 92% on this task. We leave closing this gap for future research, and illustrate, via oracle-style experiments, the potential of better retrieval and reasoning on this task.

The dataset and the code for the models are available at http://data.allenai.org/OpenBookQA.

Only ∼5% of the TQA questions of Kembhavi et al. (2017) require additional common knowledge.

3 To avoid ambiguity in the term 'human performance', Section 3.2 describes the specific randomized model we use.

SemEval-2018 Task 11: Machine Comprehension using Commonsense Knowledge https://competitions. codalab.org/competitions/17184

We used Amazon Mechanical Turk, with workers from North America and with a 'masters' level qualification.

6 Specifically, it looks for: 1) exactly 4 answer choices; 2) no negation words that could trivially fool baselines (no, none, not, isn't, doesn't, aren't, don't, won't, except, can't, shouldn't, wouldn't, couldn't, mustn't); 3) uniform answer choice length: all choices with at most 3 words, or all with at least 4.

Choice 'A' was the correct answer in 69% of the questions at the end of Step 4.

Realistically, there is some dependence across questions as a single worker may answer multiple questions. We leave a formal analysis of this setting as future work.

Of course, every question had lexical variations. We marked it when this was the only change to the core fact.

For all experiments we use d = 300 GloVe (Pennington et al., 2014) embeddings pre-trained on 840B tokens from Common Crawl (https://nlp.stanford.edu/projects/glove/).

By design, IR with its default corpus gets 0% on OpenBookQA. Hence we don't consider the effect of adding F, which appears artificially magnified.

13 This model also achieves the current best score, 33.87%, on the ARC Reasoning Challenge. When adapted to the textual entailment task by comparing BiLSTM max-out encodings of the premise and hypothesis, it achieves 85% on the SciTail dataset.