Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

Peter Clark
Isaac Cowhey
Oren Etzioni
Tushar Khot
Ashish Sabharwal
Carissa Schoenick
Oyvind Tafjord
ArXiv
2018

Abstract

We present a new question set, text corpus, and baselines assembled to encourage AI research in advanced question answering. Together, these constitute the AI2 Reasoning Challenge (ARC), which requires far more powerful knowledge and reasoning than previous challenges such as SQuAD or SNLI. The ARC question set is partitioned into a Challenge Set and an Easy Set, where the Challenge Set contains only questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm. The dataset contains only natural, grade-school science questions (authored for human tests), and is the largest public-domain set of this kind (7,787 questions). We test several baselines on the Challenge Set, including leading neural models from the SQuAD and SNLI tasks, and find that none are able to significantly outperform a random baseline, reflecting the difficult nature of this task. We are also releasing the ARC Corpus, a corpus of 14M science sentences relevant to the task, and implementations of the three neural baseline models tested. Can your model perform better? We pose ARC as a challenge to the community.

Introduction

Datasets are increasingly driving progress in AI, resulting in impressive solutions to several question-answering (QA) tasks (e.g., Rajpurkar et al., 2016; Joshi et al., 2017). However, many of these datasets focused on retrieval-style tasks, where surface-level cues alone were usually sufficient to identify an answer. This has not encouraged progress on questions requiring reasoning, use of commonsense knowledge, or other advanced methods for deeper text comprehension. The challenge presented here, called ARC (AI2 Reasoning Challenge), aims to address this limitation by posing questions that are hard to answer with simple baselines.

The ARC Dataset consists of a collection of 7787 natural science questions, namely questions authored for use on standardized tests. Standardized tests have previously been proposed as a Grand Challenge for AI (Brachman, 2005), as they involve a wide variety of linguistic and inferential phenomena, have varying levels of difficulty, and are measurable, motivating, and ambitious. However, making this challenge a reality is difficult, as such questions are difficult to obtain (most examination boards release only limited practice tests to the public).

For ARC we have addressed this through several months of extensive search and investigation.

In addition, to encourage focus on advanced phenomena, we have partitioned ARC into a Challenge Set (2590 questions), containing questions answered incorrectly by both a retrieval-based algorithm and a word co-occurrence algorithm, and an Easy Set (5197 questions), containing the remainder. For example, two typical challenge questions are:

Which property of a mineral can be determined just by looking at it? (A) luster [correct] (B) mass (C) weight (D) hardness

A student riding a bicycle observes that it moves faster on a smooth road than on a rough road. This happens because the smooth road has (A) less gravity (B) more gravity (C) less friction [correct] (D) more friction

Both these questions are difficult to answer via simple retrieval or word correlation. For example, there are no Web sentences of the form "luster can be determined by looking at something"; similarly, "mineral" is strongly correlated with "hardness" (an incorrect answer option). Rather, they require more advanced QA methods. We provide more example questions, and a categorization of the knowledge and reasoning types that the questions appeal to, in Tables 4 and 5.

To help the community engage with this dataset, we are also releasing a science text corpus and three baseline neural models as part of the ARC challenge:

1. The ARC Corpus, containing 14M science-related sentences with knowledge relevant to ARC. A sampled analysis suggests the corpus mentions knowledge relevant to 95% of the Challenge questions. (Use of the Corpus for the Challenge is optional).

2. Three neural baseline models, DecompAttn, BiDAF, and DGEM, for QA. These are multiple-choice QA adaptations of three neural models: the decomposable attention model (Parikh et al., 2016), a top performer on SNLI; Bidirectional Attention Flow (Seo et al., 2017b), a top performer on SQuAD; and the decomposed graph entailment model, a top performer on SciTail (Khot et al., 2018). While these score well on the Easy Set, they are unable to perform significantly better than a random baseline on the Challenge Set, illustrating its challenging nature.

This challenge differs from the Kaggle-hosted 2016 Allen AI Science Challenge (Schoenick et al., 2017) in three important ways¹. First, the Challenge partition is created to avoid scores being dominated by the performance of simple algorithms, and thus to encourage research on methods that the more difficult questions demand. Second, we provide a science corpus along with the questions to help get started (use of the corpus is optional, and systems are not restricted to this corpus). Finally, the questions, corpus, and models are all publicly available.

The paper is organized as follows. We first discuss related work, and then describe how the ARC dataset was collected and partitioned. We also provide an analysis of the types of problems present in the Challenge Set. We then describe the supporting ARC Corpus, and illustrate how it contains knowledge relevant to Challenge questions. Finally, we describe several baselines and their performance on ARC. Most notably, although some baseline algorithms perform well on the Easy Set (scoring up to 61%), none are able to perform significantly above random on the Challenge Set. We conclude by posing ARC as a challenge to the community. The ARC Dataset, Corpus, Models, and Leaderboard can be accessed at http://data.allenai.org/arc.

Related Work

There are numerous datasets available to drive progress in question-answering. Earlier reading comprehension datasets, e.g., MCTest (Richardson, 2013), SQuAD (Rajpurkar et al., 2016), NewsQA (Trischler et al., 2016), and CNN/DailyMail (Hermann et al., 2015), contained questions whose answers could be determined from surface-level cues alone (i.e., answers were "explicitly stated"). TriviaQA (Joshi et al., 2017) broadened this task by providing several articles with a question, and used questions authored independently of the articles. Again, though, the questions were largely factoid-style, e.g., "Who won the Nobel Peace Prize in 2009?". Although systems can now perform well on these datasets, even matching human performance (Simonite, 2018), they can be easily fooled (Jia and Liang, 2017); the degree to which they truly understand language or domain-specific concepts remains unclear.

To push towards more complex QA tasks, one approach has been to generate synthetic datasets, the most notable example being the bAbI dataset (Weston et al., 2015). bAbI was generated using a simple world simulator and language generator, producing data for 20 different tasks. It has stimulated work on the use of memory network neural architectures (Weston et al., 2014), supporting a form of multistep reasoning where a neural memory propagates information from one step to another (e.g., Henaff et al., 2016; Seo et al., 2017a). However, its use of synthetic text and a synthetic world limits the realism and difficulty of the task, with many systems scoring a perfect 100% on most tasks (e.g., Weston et al., 2014). In general, a risk of using large synthetic QA datasets is that neural methods are remarkably powerful at "reverse-engineering" the process by which a dataset was generated, or picking up on its idiosyncrasies to excel at it, without necessarily advancing language understanding or reasoning.

More recently, Welbl et al. (2017b) created the WikiHop dataset, containing questions that appear to require more than one Wikipedia document to answer ("multihop questions"). The dataset takes a step towards a more challenging task, but has several limitations: questions are binary predicates (e.g., date of birth("jeanne c. stein",?X)); the intended inference is typically a simple two-step chain (commonly a geographical substitution of a city for a country); and in many cases the correct answer can be guessed from the passage, without requiring multi-hop inference (∼44% of the answerable questions are single-hop, according to the authors (Welbl et al., 2017b)).

Datasets based on human standardized tests have also been used in AI for several years (e.g., Seo et al., 2014; Fujita et al., 2014; Strickland, 2013), as part of the NTCIR QALab (NII, 2017), and for the 2016 Allen AI Science Challenge described in the Introduction (Schoenick et al., 2017). However, there are two potentially significant challenges with these datasets. First, these datasets are often small (e.g., hundreds of questions), due to the scarcity of public, real-world test data. Second, because tests were designed for people rather than machines, large portions of these tests can be easily solved by simple AI methods (e.g., Davis, 2016). As a result, scores become dominated by the performance of simple algorithms (information retrieval, statistical correlations). This then biases research towards incrementally improving those algorithms, rather than exploring the larger AI challenges that the more difficult questions demand. Indeed, it is easy to mistake progress on these datasets as implying equal progress on easy and hard questions, while in reality progress may be heavily concentrated on easy questions alone (Gururangan et al., 2018), leaving more difficult challenges unaddressed. The ARC Dataset addresses both of these limitations.

The ARC Dataset

The ARC dataset consists of 7787 science questions, all non-diagram, multiple choice (typically 4-way multiple choice). They are drawn from a variety of sources, and sorted into a Challenge Set of 2590 "hard" questions (those that both a retrieval and a co-occurrence method fail to answer correctly) and an Easy Set of 5197 questions. Table 1 summarizes the sizes of the train/dev/test partitions in the ARC dataset. The question vocabulary uses 6329 distinct words (stemmed).

Table 1: Number of questions in the ARC partitions.

Questions vary in their target student grade level (as assigned by the examiners who authored the questions), ranging from 3rd grade to 9th, i.e., students typically of age 8 through 13 years. Table 2 shows a breakdown of the set by grade level, with absolute counts (#) and percentages (%) for the Challenge and Easy sets. In practice, there is substantial overlap in difficulty among grade levels.

Summary statistics of ARC are provided in Table 3, showing that questions and answers vary considerably in length. Finally, Table 7 in the Appendix lists the variety of sources the questions were drawn from.

Table 3: Properties of the ARC Dataset
[...]                      1 / 1.8 / 11    1 / 1.6 / 9
Answer option (# words)    1 / 4.9 / 39    1 / 3.7 / 26
# answer options           3 / 4.0 / 5     3 / 4.0 / 5

Table 2: Grade-level distribution of ARC questions

Table 7: The various question sources for ARC.

Identifying Challenge Questions

Operationally, we define a Challenge question as one that is answered incorrectly by both of two baseline solvers, described below. Although this only approximates the informal goal of it being a "hard" question, this definition nevertheless serves as a practical and useful filter, as reflected by the low scores of various baselines on the Challenge Set.

Information Retrieval (IR) Solver. The first filter we apply is the IR solver from Clark et al. (2016), briefly described here for completeness. The IR solver uses the Waterloo corpus, a Web-based corpus of 5×10^10 tokens (280GB). The solver searches to see if the question q along with an answer option is explicitly stated in the corpus, and returns the confidence that such a statement was found. To do this, for each answer option a_i, it sends q + a_i as a query to a search engine (we use Elasticsearch), and returns the search engine's score for the top retrieved sentence s where s also has at least one non-stopword overlap with q, and at least one with a_i; this ensures s has some relevance to both q and a_i. This is repeated for all options a_i to score them all, and the option with the highest score is selected.
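The selection logic of the IR solver can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: a small in-memory list of sentences and a token-overlap score stand in for Elasticsearch and the Waterloo corpus, and the names STOPWORDS, content_tokens, retrieval_score, and ir_solver are hypothetical.

```python
import re

# Tiny illustrative stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"a", "an", "the", "of", "is", "does", "which", "with", "it", "by", "be", "can"}

def content_tokens(text):
    """Return the set of lowercased non-stopword tokens in a string."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}

def retrieval_score(query_tokens, sentence_tokens):
    """Stand-in for a search engine score: fraction of query tokens matched."""
    if not query_tokens:
        return 0.0
    return len(query_tokens & sentence_tokens) / len(query_tokens)

def ir_solver(question, options, corpus):
    """Score each option by its best supporting sentence; return (best_option, scores)."""
    q_tok = content_tokens(question)
    scores = {}
    for option in options:
        a_tok = content_tokens(option)
        best = 0.0
        for sentence in corpus:
            s_tok = content_tokens(sentence)
            # The sentence must share a non-stopword with both the question and the option.
            if not (s_tok & q_tok) or not (s_tok & a_tok):
                continue
            best = max(best, retrieval_score(q_tok | a_tok, s_tok))
        scores[option] = best
    return max(scores, key=scores.get), scores

corpus = [
    "Air pressure is measured with a barometer.",
    "A thermometer measures temperature.",
]
question = "Which property of air does a barometer measure?"
options = ["speed", "pressure", "humidity", "temperature"]
best, scores = ir_solver(question, options, corpus)
print(best, scores)
```

In this toy setting only "pressure" has a sentence that overlaps with both the question and the option, so it is the only option with a non-zero score.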

The Pointwise Mutual Information (PMI) Solver. The second filter we apply is the PMI solver, also from Clark et al. (2016), again described here for completeness. This uses the same corpus as the IR solver, and formalizes a way of computing and applying associational knowledge. Given a question q and an answer option a_i, it uses PMI or pointwise mutual information (Church and Hanks, 1989) to measure the strength of the associations between parts of q and parts of a_i. Given a large corpus C, the PMI for two n-grams x and y is defined as

\[ \mathrm{PMI}(x, y) = \log \frac{p(x, y)}{p(x)\, p(y)} \]

Here p(x, y) is the joint probability that x and y occur together in C, within a certain window of text (we use a 10-word window). The term p(x)p(y), on the other hand, represents the probability with which x and y would occur together if they were statistically independent. The ratio of p(x, y) to p(x)p(y) is thus the ratio of the observed co-occurrence to the expected co-occurrence. The larger this ratio, the stronger the association between x and y.

The solver extracts unigrams, bigrams, trigrams, and skip-bigrams from the question q and each answer option a_i.

It outputs the answer with the largest average PMI, calculated over all pairs of question n-grams and answer option n-grams.
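The PMI computation and the averaging step can be sketched in Python as follows. This is a simplified illustration, not the solver from Clark et al. (2016): it handles unigrams only, estimates probabilities from raw counts over a toy token list, and treats pairs that never co-occur as contributing zero (an assumption made here to avoid log 0). The names make_pmi and pmi_solver are hypothetical.

```python
import math
from collections import Counter

def make_pmi(tokens, window=10):
    """Return a function pmi(x, y) estimated from a flat list of corpus tokens."""
    n = len(tokens)
    unigram = Counter(tokens)
    pair = Counter()
    for i, x in enumerate(tokens):
        # Each distinct word within `window` tokens after x counts as one co-occurrence.
        for y in set(tokens[i + 1 : i + 1 + window]):
            if y != x:
                pair[(x, y)] += 1
                pair[(y, x)] += 1

    def pmi(x, y):
        if pair[(x, y)] == 0:
            return 0.0  # assumption: unseen pairs contribute no association
        p_xy = pair[(x, y)] / n
        return math.log(p_xy / ((unigram[x] / n) * (unigram[y] / n)))

    return pmi

def pmi_solver(question_terms, options, tokens, window=10):
    """Pick the option with the largest average PMI over all (question, answer) term pairs."""
    pmi = make_pmi(tokens, window)
    scores = {}
    for name, answer_terms in options.items():
        values = [pmi(x, y) for x in question_terms for y in answer_terms]
        scores[name] = sum(values) / len(values)
    return max(scores, key=scores.get), scores

tokens = ("air pressure is measured with a barometer the barometer shows "
          "air pressure wind speed is measured with an anemometer").split()
options = {"pressure": ["pressure"], "speed": ["speed"]}
# A 3-token window keeps the toy corpus from making every pair co-occur.
best, scores = pmi_solver(["barometer"], options, tokens, window=3)
print(best, scores)
```

Here "barometer" and "pressure" co-occur once within the window while "barometer" and "speed" never do, so "pressure" receives the higher average PMI.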

The Challenge Set

To illustrate the impact of using these algorithms as filters when defining the Challenge Set, consider the following example:

Which property of air does a barometer measure? (A) speed (B) pressure [correct] (C) humidity (D) temperature

The question was excluded from the Challenge Set because it is correctly answered by (here) both the IR and PMI algorithms (note that it would have been excluded even if it had been answered correctly by just one of the solvers). The IR algorithm finds multiple sentences supporting the correct answer, e.g.,

• Air pressure is measured with a barometer.

• Air pressure will be measured with a barometer.

• The aneroid barometer is an instrument that does not use liquid in measuring the pressure of the air.

• A barometer measures the pressure of air molecules.

and similarly the PMI algorithm finds that "barometer" and "pressure" (and also "air" and "pressure") co-occur unusually frequently (high PMI) in its corpus. In contrast, consider the following question:

Which property of a mineral can be determined just by looking at it? (A) luster [correct] (B) mass (C) weight (D) hardness

This is incorrectly answered by both algorithms: There are no corpus sentences similar to "a material's luster can be determined by looking at it". Similarly, "mineral" co-occurs unusually frequently with several incorrect answer options.
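The filtering rule itself can be sketched in Python as follows. This is a minimal illustration of the operational definition only: the two solvers are replaced by hypothetical lookup tables of precomputed answers, tied answers are ignored, and the names partition, ir_answers, and pmi_answers are assumptions.

```python
def partition(questions, solvers):
    """Split questions into (challenge, easy).

    A question goes to the Challenge Set only if every solver answers it
    incorrectly; a single correct solver is enough to make it Easy.
    """
    challenge, easy = [], []
    for q in questions:
        any_correct = any(solver(q) == q["answer"] for solver in solvers)
        (easy if any_correct else challenge).append(q)
    return challenge, easy

questions = [
    {"id": "barometer", "answer": "B"},
    {"id": "mineral", "answer": "A"},
]

# Hypothetical precomputed solver outputs standing in for the IR and PMI solvers.
ir_answers = {"barometer": "B", "mineral": "D"}
pmi_answers = {"barometer": "B", "mineral": "D"}
solvers = [lambda q: ir_answers[q["id"]], lambda q: pmi_answers[q["id"]]]

challenge, easy = partition(questions, solvers)
print([q["id"] for q in challenge], [q["id"] for q in easy])
```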

Question Types

ARC questions appeal to both different styles of knowledge, and different styles of reasoning. In Tables 4 and 5, we enumerate the broad classes of each that we observe in the ARC challenge, based on a sample of 100 questions. The relative sizes of these categories are shown in Figures 1 and 2. These sizes are necessarily approximate, as it requires a subjective judgement about the main challenge exhibited by different questions. Nevertheless, it helps to provide a rough atlas of the knowledge and reasoning space underlying ARC.

The ARC Corpus

In addition to the ARC question set, we are also releasing the ARC Corpus, a large corpus of science-related sentences mined from the Web. It contains 14M sentences (1.4GB of text), and mentions much of the knowledge required to answer the Challenge Questions. Although some of these mentions are indirect, and exploiting them is not trivial, it nevertheless provides a starting point for attacking the ARC Challenge. Note that use of the corpus is optional, and also that systems are not restricted to this corpus.

The ARC Corpus was created by utilizing a major search engine to run a large series of search queries relevant to science. Queries were automatically constructed by instantiating ∼100 hand-written templates for 80 science topics covered by US elementary and middle schools, the subject areas of ARC. For example, for the topic "celestial phenomena", two templates "[astronomical-term] astronomy" and "[astronomical-term] astrophysics" were authored, and a list of (here, 360) terms for astronomical-term collected and used, resulting in 720 queries. The top several documents from each search were collected and de-duplicated, and then the content of these documents was stripped down to capture just the text in each document. The resulting text was then chunked into sentences. This was repeated for all templates. Note that some templates were parameterized by more than one parameter. From an informal analysis of a random sample of 805 documents that were collected, approximately 75% were judged as "science relevant". The corpus was then augmented with the AristoMini corpus², an earlier corpus containing dictionary definitions from Wiktionary, articles from Simple Wikipedia tagged as science, and additional science sentences collected from the Web. From a vocabulary analysis, 99.8% of the ARC question vocabulary is mentioned in the ARC Corpus³.
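The query construction step can be illustrated with a short Python sketch. It assumes single-slot templates only (the source notes that some templates take more than one parameter), and the three example terms and the function name instantiate are hypothetical; only the two templates come from the text.

```python
def instantiate(templates, terms, slot):
    """Fill a single-slot template with every term, yielding one query per pair."""
    return [template.replace(slot, term) for template in templates for term in terms]

templates = ["[astronomical-term] astronomy", "[astronomical-term] astrophysics"]
terms = ["comet", "eclipse", "nebula"]  # hypothetical stand-ins for the 360 collected terms
queries = instantiate(templates, terms, "[astronomical-term]")
print(queries)

# With 360 terms, the same two templates yield 2 x 360 = 720 queries.
many_terms = [f"term{i}" for i in range(360)]
print(len(instantiate(templates, many_terms, "[astronomical-term]")))
```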
In our baseline experiments discussed shortly, we find that if we change the corpus behind the IR solver from Waterloo to the ARC Corpus, this changes its Challenge Set score from near zero (by definition, Challenge questions are those that IR with the Waterloo corpus gets wrong) to a score similar to random guessing⁴. However, from an informal, sampled analysis, we find that this is more a limitation of the IR methodology than of the coverage of the ARC Corpus. The ARC Corpus, in fact, appears to mention knowledge relevant to approximately 95% of the ARC Challenge questions (from an analysis of a random sample of questions), even if simple retrieval methods are not able to exploit it to correctly answer the questions. For example, consider:

Scientists launch a rocket into space for a mission. Once the rocket escapes the gravitational pull of Earth, how will the mass and weight of the rocket be affected? (A) The mass and weight will change. (B) The mass and weight will stay the same. (C) The mass will stay the same, but the weight will change. [correct] (D) The mass will change, but the weight will stay the same.

While this particular scenario is of course not mentioned explicitly in the ARC Corpus, there are several statements about the relation between mass, weight, and gravity, for example:

• The main difference is that if you were to leave the Earth and go to the Moon, your weight would change but your mass would remain constant.

• Astronauts in orbit experience weightlessness just like objects in the falling aircraft.

• Weight is the force that something feels due to gravity: so the brick would have a much larger weight near the earth's surface than it does in deep space.

Such sentences provide evidence that weight but not mass will change when in space. Similarly, consider the question:

Which factor will prompt an animal's fight-or-flight response? (A) population size (B) competition for food [...] and that (C) is distinct for other eukaryotic organisms:

• Animals are multicellular eukaryotes; they are chemosynthetic heterotrophs that ingest their food.

Although these descriptions of reasoning paths are informal, and clearly many others are possible, they illustrate that the ARC Corpus mentions knowledge relevant to a question, even if no single sentence alone provides the answer. Of course, this does not address the challenge of correctly identifying and reasoning with this knowledge, nor the challenge of injecting unstated commonsense knowledge that may also be required. Rather, our claim is that the Corpus contains substantial linguistic signal relevant to most of the ARC questions, and is a useful starting point for corpus-based attacks on the Challenge.

Baseline Performance

Baseline Systems

We ran several baseline QA systems on the Challenge and Easy Sets, including two neural models, DecompAttn and BiDAF (details below), that have near state-of-the-art performance on the well-known SNLI and SQuAD datasets respectively. We scored systems using the following scoring rubric: For each question, a system receives 1 point if it chooses the correct answer and 1/k if it reports a k-way tie (i.e., chooses multiple answers) that includes the correct answer. For a question set, the overall score of the system is the sum of the points it receives for all questions, divided by the number of questions and reported as a percentage. We report performance of the following systems:

1. IR (dataset definition). IR method, described earlier.

2. PMI (dataset definition). PMI method, described earlier.

3. Guess-all ("random"). A naïve baseline that selects all answer options as equally valid, thereby scoring 1/k for each question with k answer choices. A system that chooses a single answer at random will also converge to this score after enough trials.

4. IR (ARC Corpus). The IR algorithm, rerun with the ARC Corpus. Note that changing the original corpus is expected to result in a different score, unless the two corpora are highly correlated. A corpus containing random strings, for instance, will have very low correlation with the original corpus and will result in a random-guessing score of around 25%.

5. TableILP, which performs matching and reasoning using a semi-structured knowledge base of science knowledge, expressed in tables.

6. TupleInference (Khot et al., 2017), which performs semi-structured matching of the question with retrieved sentences, where the structure consists of Open IE tuples.

7. DecompAttn, DGEM, and DGEM-OpenIE (Neural Entailment Models). We adapted two neural entailment models, DecompAttn (Parikh et al., 2016) and DGEM (Khot et al., 2018), to the task of answering multiple-choice questions. The models were trained on an extended version of the SciTail dataset (Khot et al., 2018).

To adapt these to multiple-choice QA, we first convert the question q plus an answer option a_i into a hypothesis sentence (or paragraph) h_i, use this as a search query to retrieve text sentences t_ij from a corpus, then compute the entailment scores between h_i and each t_ij. This is repeated for all answer options, and the option with the overall highest entailment score is selected. Further details are given in the Appendix. The DGEM model uses a structured representation of the hypothesis h_i, extracted with a proprietary parser plus Open IE. To create a releasable version of DGEM, we also evaluate (and release) a variant DGEM-OpenIE, a version of DGEM that only uses Open IE to create the structured representation of h_i, thus avoiding proprietary tools.
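The selection step of this adaptation (maximum entailment score over retrieved sentences, then the best option overall) can be sketched in Python as follows. This is an illustrative sketch only: a token-overlap function stands in for the neural entailment model, retrieval is assumed to have already happened, and the names overlap_entailment and answer_by_entailment are hypothetical.

```python
def overlap_entailment(premise, hypothesis):
    """Toy stand-in for a neural entailment model:
    fraction of hypothesis tokens that also appear in the premise."""
    premise_tokens = set(premise.lower().split())
    hypothesis_tokens = hypothesis.lower().split()
    if not hypothesis_tokens:
        return 0.0
    return sum(t in premise_tokens for t in hypothesis_tokens) / len(hypothesis_tokens)

def answer_by_entailment(hypotheses, retrieved, entail=overlap_entailment):
    """hypotheses: option -> hypothesis h_i; retrieved: option -> sentences t_ij.

    Each option is scored by its best-supporting sentence, and the option
    with the highest such score is returned along with all scores.
    """
    scores = {
        option: max((entail(t, h) for t in retrieved.get(option, [])), default=0.0)
        for option, h in hypotheses.items()
    }
    return max(scores, key=scores.get), scores

hypotheses = {
    "A": "a barometer measures the speed of air",
    "B": "a barometer measures the pressure of air",
}
sentences = ["a barometer measures the pressure of air molecules"]
retrieved = {"A": sentences, "B": sentences}
best, scores = answer_by_entailment(hypotheses, retrieved)
print(best, scores)
```

Taking the maximum over sentences means one strongly supporting sentence is enough to favour an option, regardless of how many weak matches the other options collect.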

Table 4: Types of knowledge suggested by ARC Challenge Set questions

Table 5: Types of reasoning suggested by ARC Challenge Set questions

8. BiDAF (Reading Comprehension Model). We also adapted BiDAF (Seo et al., 2017b), a top performer on the SQuAD dataset (Rajpurkar et al., 2016), to our task. As BiDAF is a direct-answer system, we adapted it to multiple-choice QA following the approach used in several previous projects (Khashabi et al., 2018b; Welbl et al., 2017a; Kembhavi et al., 2017) as follows: First, given a question, we create a single paragraph by concatenating a set of retrieved sentences. In this case, we use the same sentences retrieved by the entailment models for all answer options (above). We then use BiDAF to select an answer span from this paragraph, given the question. Finally, we pick the multiple-choice option that maximally overlaps that answer span (here defined as the option with the highest percentage of lemmatized, non-stopword tokens covered by the BiDAF answer span). BiDAF was trained on SQuAD and then further tuned to science questions using continued training.

Results

Table 6 summarizes the scores obtained by various baseline algorithms on the test partitions of the Challenge and Easy sets.

The IR and PMI (dataset definition) solvers, by design, score near zero on the Challenge set. The slightly above-zero score is due to the solver occasionally picking multiple (tied) answers, resulting in partial credit for a few questions. We include these questions in the Challenge set.

The most striking observation is that none of the algorithms score significantly higher than the random baseline on the Challenge set, where the 95% confidence interval is ±2.5%. In contrast, their performance on the Easy set is generally between 55% and 65%. This highlights the different nature and difficulty of the Challenge set.
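For reference, the scoring rubric described under Baseline Systems, including partial credit for ties and the Guess-all ("random") baseline, can be sketched in Python as follows. The function names and the four example questions are hypothetical; only the rubric itself comes from the text.

```python
def question_points(chosen, correct):
    """1 point for a single correct choice, 1/k for a k-way tie that
    includes the correct answer, and 0 otherwise."""
    chosen = set(chosen)
    return 1.0 / len(chosen) if correct in chosen else 0.0

def overall_score(predictions, gold):
    """Sum of per-question points divided by the number of questions, as a percentage."""
    total = sum(question_points(p, g) for p, g in zip(predictions, gold))
    return 100.0 * total / len(gold)

gold = ["B", "A", "C", "D"]
predictions = [["B"], ["A", "D"], ["A"], ["A", "B", "C", "D"]]
score = overall_score(predictions, gold)  # (1 + 1/2 + 0 + 1/4) / 4 questions

# Guess-all selects every option, earning 1/k on each k-way question.
guess_all = overall_score([["A", "B", "C", "D"]] * len(gold), gold)
print(score, guess_all)
```

On 4-way questions the Guess-all baseline scores 25%, which is the reference point against which the Challenge Set results are compared.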

The poor performance of the non-IR solvers is partly explained by their correlation with the IR solver: the first step for nearly all of them (except TableILP, which uses non-sentential knowledge but has low knowledge coverage) is to use a simple IR method to obtain relevant sentences, and then process them in different ways such as extracting structure, attempting matching, attempting chaining, etc. However, the retrieval bias of the underlying IR methods is towards sentences that are all very similar to the question, and away from sentences that individually only partially match the question, but together fully explain the correct answer (e.g., through chaining). This suggests the need for a more advanced retrieval strategy for questions that require combining multiple facts, as well as new methods for combining that information.

Table 6: Performance of the different baseline systems. Scores are reported as percentages on the test sets. For up-to-date results, see the ARC leaderboard at http://data.allenai.org/arc.
† These solvers were used to define the dataset, affecting scores.
‡ Code available at https://github.com/allenai/arc-solvers

Conclusion

Datasets have become highly influential in driving the direction of research. Recent datasets for QA have led to impressive advances, but have focused on factoid questions where surface-level cues alone are sufficient to find an answer, discouraging progress on questions requiring reasoning or other advanced methods. To help the field move towards more difficult tasks, we have presented the AI2 Reasoning Challenge (ARC), consisting of a new question set, text corpus, and baselines, and whose Challenge partition is hard for retrieval and co-occurrence methods. We find that none of the baseline systems tested can significantly outperform a random baseline on the Challenge set, including two neural models with high performance on SNLI and SQuAD. Progress on ARC would thus be an impressive achievement, given its design, and be a significant step forward for the community. To access ARC, view the leaderboard, and submit new entries, visit the ARC Website at http://data.allenai.org/arc.

Availability

The ARC Dataset, Corpus, three baseline neural models (DecompAttn, BiDAF, and DGEM), and the leaderboard are all available from http://data.allenai.org/arc.

[...] use any other neural entailment model implemented in AllenNLP.

Answer Scoring

Given the scores for each premise, score_e(q, a, p_{q,a}), we use the maximum supporting sentence score as the answer choice score: score_c(q, a) = max score_e(q, a, p_{q,a}).

¹ ARC includes the publicly releasable subset of the Kaggle questions (about 60% of the Kaggle set, making up 43% of the ARC set).

² Also available separately at http://allenai.org/data.html

³ Only 11 question words, mainly proper nouns, do not appear in the corpus: Daphne, Sooj, LaKeisha, Quickgrow, Hypergrow, CCGCAT, nonsnow, Quickgrow, Coaccretion, HZn, MgBr

⁴ In fact slightly lower than random guessing, likely due to some corpus similarities where the same distractors in Waterloo were also present in the ARC Corpus.

https://github.com/allenai/arc-solvers

Figure 1: Relative sizes of different knowledge types suggested by the ARC Challenge Set.

Figure 2: Relative sizes of different reasoning types suggested by the ARC Challenge Set.