Think you have Solved Direct-Answer Question Answering? Try ARC-DA, the Direct-Answer AI2 Reasoning Challenge
We present the ARC-DA dataset, a direct-answer (“open response”, “freeform”) version of the ARC (AI2 Reasoning Challenge) multiple-choice dataset. While ARC has been influential in the community, its multiple-choice format is unrepresentative of real-world questions, and multiple-choice formats can be particularly susceptible to artifacts. The ARC-DA dataset addresses these concerns by converting questions to direct-answer format using a combination of crowdsourcing and expert review. The resulting dataset contains 2985 questions with a total of 8436 valid answers (questions typically have more than one valid answer). ARC-DA is one of the first DA datasets of natural questions that often require reasoning, and where appropriate question decompositions are not evident from the questions themselves. We describe the conversion approach taken, appropriate evaluation metrics, and several strong models. Although high, the best scores (81% GENIE, 61.4% F1, 63.2% ROUGE-L) still leave considerable room for improvement. In addition, the dataset provides a natural setting for new research on explanation, as many questions require reasoning to construct answers. We hope the dataset spurs further advances in complex question-answering by the community.
Multiple-choice (MC) datasets are popular and common in the NLP community, e.g., CommonsenseQA (Talmor et al., 2019), OpenbookQA (Mihaylov et al., 2018), and VCR (Zellers et al., 2019), in particular because of the ease of automatic evaluation. However, they have two notable drawbacks: First, they are unnatural (real-world questions rarely come with answer options). Second, the multiple-choice format is particularly susceptible to artifacts, where systems learn short-cuts to obtain a high score (Gururangan et al., 2018).
Similarly, while there are many NLP datasets of direct-answer questions (also called "open response" or "freeform" questions), e.g., SQuAD (Rajpurkar et al., 2016), TriviaQA (Joshi et al., 2017), and NaturalQuestions (Kwiatkowski et al., 2019), the majority of these are span-retrieval ("lookup") tasks where a question is matched against a given/retrieved sentence or paragraph to identify an answer span. The few DA datasets that do target reasoning, e.g.,
1 ARC-DA is available at https://allenai.org/data/arc-da
DA: How are the stem of a tree and the stem of a flower most similar?
both support the plant | support leaves | both carry water | both carry nutrients | they support the plant
Figure 1: Multiple-choice (MC) questions from ARC, and their direct answer (DA) equivalents in the new ARC-DA dataset. Alternative DA answers are separated by a |.
What is missing, still, are direct-answer (DA) datasets of natural questions exploring a wide variety of problem types and reasoning styles, and where answers are not constrained to be spans of a source text. This work addresses this gap by supplying such a dataset, namely ARC-DA, a direct-answer version of the ARC (AI2 Reasoning Challenge) multiple-choice dataset. Note that ARC-DA questions are not necessarily more difficult than the original ARC questions (we find scores on ARC-DA are roughly similar to those on ARC); rather, they are more natural, avoiding the multiple-choice format.
arXiv:2102.03315v1 [cs.CL] 5 Feb 2021
The original ARC dataset contained questions collected from a large number of science exam and quiz sources. It has proven useful for the community, stimulating new research in reasoning-based QA, e.g., (Musa et al., 2019; Boratko et al., 2018; Ni et al., 2019; Xie et al., 2020), and as of January 2021 has 35 entries on its leaderboard 2. ARC is particularly interesting from an NLP perspective: the questions were authored by human experts (e.g., examination boards), they are sensible and high quality, they avoid the repetition common to crowdsourced datasets, they are highly varied in both the language they use and the reasoning skills they are designed to probe, and they are practical, understandable, and motivating. Arguably, the combination of these factors makes the dataset a useful "Grand Challenge" for the field. (The current top score on ARC-Challenge is 81.1%, thus still with room for improvement.) The work here, ARC-DA, thus builds on this, providing a direct-answer version of part of the ARC dataset. Several examples of original ARC questions and the ARC-DA versions are shown in Figure 1.
We first describe the method used for the conversion, and then present baseline scores using strong T5-based models. Evaluating DA questions poses an additional challenge, compared with scoring MC questions. To address this challenge, we use both human judgements (obtained with GENIE, an automated crowdscoring pipeline (Khashabi et al., 2021)), and automated metrics. Although high, the best scores (81% GENIE, 61.4% F1, 63.2% ROUGE-L) still leave considerable room for improvement. In addition, the dataset provides a natural setting for new research on explanation, as many questions require reasoning to construct answers. We encourage the community to make use of this dataset to make further progress in advanced question-answering.
Naïvely, one can convert MC to DA simply by removing the answer choices, and using the correct answer choice as the target answer. 3 However, there are several problems that can arise:
• There may be multiple ways of wording the correct answer.
• There may be multiple possible correct answers, and in some cases too many to enumerate all of them.
• The question itself may be ill-defined without answer options.
To address these problems, we convert the 7787 ARC MC questions to DA using the process described below.
We start with a large-scale crowdsourcing process to filter the questions down to those suitable for the DA setting and to collect alternative correct answers for them:
1. Initial Question Filtering: Remove questions where the question sentence 4 contains one of several empirically chosen filter phrases, e.g., "Which of". 5 Questions containing these phrases were observed to usually be ill-formed without the answer options, e.g., "Which of these items contains only a liquid?".
2. Collecting Answers: Each question was then posed to five independent crowdworkers as a DA question, and the workers were asked to:
• Answer the question (enter a free-form answer). If there were multiple answers, they were asked to enter two or three.
• Identify if the question had one, several, or many answers, or if the question was nonsensical.
If the question was too ambiguous or nonsensical, the crowdworker had the option of not providing an answer. The crowdworker interface is shown in Appendix A.
3. Additional Filtering: The questions were further filtered, only retaining:
• questions that had answers from at least two workers.
• questions where at least two worker-provided answers had some non-stop-word overlap.
Otherwise the question was deemed too open-ended and rejected.
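The two automatic filters in the crowdsourcing stage (steps 1 and 3 above) can be illustrated with a minimal Python sketch. It is not the authors' code and rests on several assumptions: the question sentence is taken to be the last sentence of the question, phrase matching is a case-insensitive substring test, only a few of the filter phrases from the footnote are included, the stop-word list is a tiny hypothetical one, and "overlap" is read as shared content words between answers from two different workers. All function names are hypothetical.

```python
import re

# A few of the filter phrases listed in the paper's footnote (the full list is longer).
FILTER_PHRASES = ["which of", "which is", "which are", "which statement", "which object"]

# Tiny illustrative stop-word list (an assumption; the paper does not specify one).
STOP_WORDS = {"a", "an", "the", "of", "in", "to", "is", "are", "and", "it"}


def question_sentence(question: str) -> str:
    """Return the last sentence, assumed here to be the actual question sentence."""
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", question.strip()) if s]
    return sentences[-1] if sentences else ""


def passes_initial_filter(question: str) -> bool:
    """Step 1: reject questions whose question sentence contains a filter phrase."""
    sentence = question_sentence(question).lower()
    return not any(phrase in sentence for phrase in FILTER_PHRASES)


def content_words(answer: str) -> set:
    """Lowercased words of an answer with stop words removed."""
    return {w for w in re.findall(r"[a-z0-9]+", answer.lower()) if w not in STOP_WORDS}


def passes_additional_filter(worker_answers: list) -> bool:
    """Step 3: keep a question only if at least two workers answered it and
    answers from two different workers share a non-stop word.

    worker_answers holds one list of answers per crowdworker (empty if none given).
    """
    answered = [answers for answers in worker_answers if answers]
    if len(answered) < 2:
        return False
    # Pool each worker's content words, then look for overlap between any two workers.
    pools = [set().union(*(content_words(a) for a in answers)) for answers in answered]
    return any(
        pools[i] & pools[j]
        for i in range(len(pools))
        for j in range(i + 1, len(pools))
    )


if __name__ == "__main__":
    print(passes_initial_filter("Which of these items contains only a liquid?"))
    print(passes_additional_filter([["sun", "sunlight"], ["the sun"], [], [], []]))
```

Under these assumptions the first call rejects the ill-formed example from step 1, and the second keeps a question for which two of the five workers gave overlapping answers.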
The resulting questions were then reviewed by in-house ("expert") workers, who performed the following operations:
1. Question Filtering: Rejected questions that still appeared too open-ended (e.g., "Name an insect.").
2. Answer Verification: Reviewed crowdworker answers to remove incorrect answers, and add additional missed answers.
3. Question Rewording: Reworded questions that were poorly phrased or incomplete as standalone questions, e.g., "The cell structure that makes a plant cell more rigid than an animal cell is the" becomes "The cell structure that makes a plant cell more rigid than an animal cell is called what?"
4. Answer Modification: For long (wordy) answers, ensured that a shorter version including just the salient terms is also present. For example, for the question: "In what form does water vapor exist in the atmosphere?", the crowdworkers gave two answers: "An invisible gas in the air", and "An invisible gas". As the simple answer "gas" is sufficient for this question, the expert would add "gas" as an additional answer option.
This process was run over the entire ARC question set. Approximately 60% of the original questions were removed during crowdworker annotation (50% in the initial question filtering, 10% more in the additional filtering), followed by another 10% during in-house review, resulting in 2985 questions in the final ARC-DA dataset. Although the final dataset is less than half the size of ARC, it is still large enough for models to learn the style of the task (e.g., see Table 3 later), without simply memorizing the task itself, thus avoiding large-scale supervised training pitfalls. This trend towards more realistically sized datasets is seen elsewhere also, e.g., OBQA (Mihaylov et al., 2018), QASC (Khot et al., 2019), and TRACIE.
We retain the same train/dev/test labels for questions as in the original ARC dataset, resulting in approximately similar proportions as ARC. We also do not separate the original ARC-Easy and ARC-Challenge questions, but instead merge them into a single dataset. We do this because the labels "Easy" and "Challenge" were based on the MC choices. (Switching from MC to DA can result in a "Hard" question becoming conceptually easy, and vice versa). However, we do retain the original Easy/Challenge labels as metadata in the ARC-DA dataset. The resulting dataset statistics are summarized in Table 1 .
Knowledge And Reasoning Types
We found the distribution of knowledge and reasoning types required by ARC-DA questions, as classified by Boratko et al. (2018), to be roughly the same as in ARC; see Figure 2 (created using Boratko et al.'s data). For a detailed description of these categories, see (Boratko et al., 2018).
It's not immediately clear how one should score answers to DA questions. Doing this is more difficult than for MC questions, as (usually) the set of gold DA answers is incomplete. Further, even if the answer is unique conceptually (e.g., the answer "gravity") it may be phrased in multiple ways ("the force of gravity", "gravitational force", "gravitation", ...). As a result, scoring is necessarily approximate. However, this should not be a reason to shy away from such problems; valid comparisons can still be made, and there are obvious benefits to working in the more realistic DA setting.
Figure 2: Comparison of the distribution of questions among different knowledge (top) and reasoning types (bottom), comparing ARC with ARC-DA. Overall, the distributions are roughly similar. Data is from sampled annotations created by (Boratko et al., 2018). For a detailed description of the categories, see (Boratko et al., 2018).
We propose two ways to score answers to ARC-DA: The first is human scoring via GENIE 6, a human-in-the-loop leaderboard framework that scores answers using an automated crowdsourced pipeline (Khashabi et al., 2021). GENIE streamlines the human scoring of machine-generated answers by automatically posting them on crowdsourcing platforms, collecting qualitative human judgements (converted to numeric scores using the rubric in Table 2), then performing statistical analyses to quantify uncertainty. It also includes various constraints to ensure quality control. To use GENIE, we submit our answers to the leaderboard, then wait for the task to complete (which follows a fixed, periodic schedule). Note that GENIE is publicly available for other researchers interested in this dataset.
Second, we consider two popular automatic metrics to score answers by comparing them to the (typically incomplete) set of gold answers, namely ROUGE and an F1 word-overlap measure. For ROUGE (Lin et al., 2006), we use the F1 score for the ROUGE-L variant which considers the longest common subsequence, thus penalizing words out of order. 7 For the simple F1 word-overlap measure, we adopt the conventions from the SQuAD dataset (Rajpurkar et al., 2016) in terms of ignoring punctuation and a few stop words. For both ROUGE and F1, we take the maximum score over all of the gold answers for a given question (i.e., an answer is scored against its best-matching gold answer), and then average over all the questions.
We note that both ROUGE and F1 have known intrinsic pitfalls. For example, as F1 ignores word order, the prediction "from solid to liquid" would be considered a perfect match for the gold answer "from liquid to solid".
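To make the two automatic metrics and the word-order pitfall concrete, the following is a small, simplified Python sketch. It is an illustration only, not the evaluation code used for the reported numbers: the normalization follows the usual SQuAD convention (lowercasing, stripping punctuation and the articles "a", "an", "the"), the ROUGE-L function is a plain longest-common-subsequence F1 without the stemming used by the referenced implementation, and all function names are hypothetical.

```python
import re
import string
from collections import Counter


def normalize(text: str) -> list:
    """SQuAD-style normalization: lowercase, drop punctuation and articles, tokenize."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def f1_overlap(prediction: str, gold: str) -> float:
    """Bag-of-words F1 between a prediction and one gold answer (ignores word order)."""
    pred, ref = normalize(prediction), normalize(gold)
    common = sum((Counter(pred) & Counter(ref)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(pred), common / len(ref)
    return 2 * precision * recall / (precision + recall)


def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            if x == y:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1]


def rouge_l_f1(prediction: str, gold: str) -> float:
    """Simplified ROUGE-L F1 (no stemming): LCS-based precision and recall."""
    pred, ref = normalize(prediction), normalize(gold)
    lcs = lcs_length(pred, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(pred), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def dataset_score(metric, predictions: list, gold_answers: list) -> float:
    """Score each prediction against its best-matching gold answer, then average."""
    per_question = [
        max(metric(pred, gold) for gold in golds)
        for pred, golds in zip(predictions, gold_answers)
    ]
    return sum(per_question) / len(per_question)


if __name__ == "__main__":
    print(f1_overlap("from solid to liquid", "from liquid to solid"))  # 1.0
    print(rouge_l_f1("from solid to liquid", "from liquid to solid"))  # 0.5
```

In this sketch the reversed phase-change answer receives a perfect F1 of 1.0, while the order-sensitive ROUGE-L variant gives it 0.5, which is still well above zero for an answer that is wrong.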
For these reasons, our preferred metric for ARC-DA is GENIE (despite the turnaround time), which also alleviates the problem of missing gold answers.
We next describe a few strong baseline systems for ARC-DA and report their performance.
To build a strong baseline model, we start with (a reimplementation of) UnifiedQA (Khashabi et al., 2020) , a QA system trained on multiple QA datasets using the text-to-text pretrained T5 transformer (Raffel et al., 2020) (we use the 11B version). We then fine-tune two models on ARC-DA, one using sentences retrieved from a general corpus of text K, and one without. The input to these models is the question Q (plus retrieved sentences, for the first model). The desired output is a correct answer to Q. We call the resulting models UnifiedQA + ARC-DA.
For the "with IR" (Information Retrieval) variant of UnifiedQA + ARC-DA, given a question Q, we retrieve 10 sentences K_1, ..., K_10 from the corpus K using Q as the search query (here, using ElasticSearch). For K, we use the Aristo Corpus, a Web-crawled corpus containing 280GB of general and science-related sentences augmented with ≈80k additional science textbook sentences. The input to the model is then:
$question$ = Q ; $context$ = K_1 ... K_10
The desired output of the model is a correct answer to the question. To train the model, since we (typically) have multiple, alternative gold target answers A_1, ..., A_n in the training data, we generate N_a training examples for each question, where each example uses a randomly sampled answer from A_i. In other words, each individual gold answer (of which there are a few per question) and unique question are used to construct an individual training example, capped at a max of N_a training examples per question. In our experiments, we used N_a = 4. Each training instance thus has a single gold answer, and the fine-tuning otherwise follows the T5 procedure of using teacher forcing (Williams and Zipser, 1989). Note there is a (deliberate) asymmetry in train/test: Each training instance encourages the system to predict a particular gold answer, while each test output is considered correct if it predicts any of the gold answers. This style of teaching for questions with multiple answers has been found effective in previous work, e.g., (Bosselut et al., 2019; Rashkin et al., 2018).
For the "without IR" variant, the same process is applied except the input to the model is simply:
$question$ = Q
Since UnifiedQA is question-format agnostic, 8 we also create variants of the above models (again with and without retrieval) by fine-tuning them jointly on ARC-DA as described above as well as on the original multiple choice questions of ARC. The resulting models are referred to as UnifiedQA + ARC-DA/MC.
Table 4: Results on ARC-DA dev set (338 questions). Here we show human evaluation by one of the authors (EXPERT), rather than GENIE scores.
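The input format and the construction of training examples can be sketched in a few lines of Python. This is an illustrative reading of the description, not the authors' training code: the function names are hypothetical, retrieved sentences are assumed to be joined with single spaces, and the cap is implemented by sampling up to N_a distinct gold answers without replacement.

```python
import random


def build_input(question: str, retrieved=None) -> str:
    """Format the model input, with retrieved sentences ("with IR") or without."""
    if retrieved:
        # Joining sentences with a space is an assumption about the exact format.
        return f"$question$ = {question} ; $context$ = {' '.join(retrieved)}"
    return f"$question$ = {question}"


def make_training_examples(question, gold_answers, retrieved=None,
                           max_per_question=4, rng=None):
    """Build (input, target) pairs: one gold answer per example, at most N_a per question."""
    rng = rng or random.Random(0)
    unique_answers = list(dict.fromkeys(gold_answers))  # de-duplicate, keep order
    count = min(max_per_question, len(unique_answers))
    chosen = rng.sample(unique_answers, count)
    source = build_input(question, retrieved)
    return [(source, answer) for answer in chosen]


if __name__ == "__main__":
    examples = make_training_examples(
        "How are the stem of a tree and the stem of a flower most similar?",
        ["both support the plant", "support leaves", "both carry water",
         "both carry nutrients", "they support the plant"],
    )
    for source, target in examples:
        print(source, "->", target)
```

With the five gold answers from Figure 1 and N_a = 4, this yields four training pairs that share the same input string but have different targets, while at test time a prediction matching any of the gold answers would count as correct.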
The results for the models are shown in Table 3 . To help interpret the GENIE scores, note that crowdworkers label answers according to the rubric and corresponding real values as shown in Table 2 . For comparison, one of the authors manually scored the answers on the development set, using a principle of partial credit for non-ideal answers; this is shown under the EXPERT column of Table 4 .
There are several results of note. First, the scores are high in absolute terms, with the human-scored GENIE/EXPERT numbers being roughly comparable to scores on the original MC questions, found to be 86.8%/92.6% without/with IR. 9 This suggests that the DA questions are not necessarily harder than the MC versions, despite the format change, although they are more natural (non-multiple-choice). While intuitively one might expect DA questions to be more difficult to answer as the number of potential answers changes from 4 to a potentially infinite number, some may also be easier as any correct answer is valid, allowing the model to sidestep subtle distinctions that may be used in the MC choices.
Second, the GENIE scores slightly underestimate the "true" score, which we take as the EXPERT score (Table 4), namely the score one might expect to receive in an examination setting with a professional grader. This may be due to occasional annotation errors and/or unreliable annotators that slip through GENIE's quality controls. (Note also that the GENIE score in Table 3 is on the test set, while the EXPERT score in Table 4 is on dev, which may account for some of the difference, as test performance is typically slightly worse than dev.) While in principle the upper bound on the EXPERT score is 100%, namely for a perfect set of answers, our preliminary tests suggest the GENIE upper bound (for ARC-DA) may be around 90% for a perfect set of answers due to this noise, given GENIE's current pipeline (additional improvements to GENIE are under consideration).
Third, the automated metrics are only a loose approximation of the true target. In absolute terms, there is a significant gap between the automated metrics (F1 and ROUGE-L) and the human evaluations (GENIE and EXPERT), suggesting that there are indeed additional answers and answer phrasings missing in ARC-DA gold answers. We also see that the rank-ordering of models based on human vs. automated metrics is not identical (although it is generally similar). Assuming that the human-based scores are the most accurate (although expensive), this indicates that automatic metrics should be used with caution: While they can serve as a useful proxy, it is not appropriate to draw conclusions from them based on small (e.g., 1%) differences.
Impact on MC Question-Answering
As an unexpected corollary, we ran the UnifiedQA + ARC-DA/MC model on the original ARC MC dataset, 10 and obtained new state-of-the-art results (81.4% on ARC-Challenge and 92.7% on ARC-Easy). 11 Note also that this model has the highest score on ARC-DA (GENIE score of 81%, Table 3 ). This suggests that there is some additional training signal provided by the DA training questions that is assisting in MC QA, and likewise that the additional MC training is helping answer DA questions. This phenomenon is reminiscent of the discovery in the original UnifiedQA paper that multi-format training can provide an overall boost in individual scores (Khashabi et al., 2020) .
Progress in QA requires new datasets in more realistic settings, for example using natural questions that require more than a "lookup" answer. The ARC-DA dataset addresses this need, containing a direct answer version of (a subset of) the ARC multiple-choice questions. These questions are expert (examination board) authored, high quality, sensible, and avoid the repetition common to crowdsourced datasets, making them of particular interest to NLP. We have also shown that baseline scores, although strong, are far from perfect, offering a new challenge to the NLP community, as well as a new setting to study explanation in the context of questions requiring reasoning. We invite readers to take up this challenge! The ARC-DA dataset is available at https://allenai.org/data/arc-da, and the GENIE human evaluation framework is publicly available at https://genie.apps.allenai.org.
Appendix A. Instructions to Crowdworkers
Below are the instructions provided to the (Amazon Mechanical Turk) crowdworkers for answering DA questions:
Instructions (click here to collapse/expand instructions)
This HIT is to write down some answers to 5 science questions, so that we can test an AI system (Aristo) that we are developing. The questions were originally taken from multiple choice exams, but we are wanting to convert them to "direct answer" format. Your task is to write down one or more answers to the questions.
As the questions originally came from multiple choice exams, there may often be more than one answer. In those cases, please enter two or three possible answers separated by a ";", e.g., For Q: Which is an animal? you might enter three answers "dog; cat; elephant".
Here is an example:
Question: A ball is tossed up in the air and it comes back down. The ball comes back down because of
Enter your answer(s): gravity
(If you see more than one answer, enter two or three separated by ";", e.g. "flower; tree; plant".)
Now select the appropriate option below about this question:
• There is a clear, single answer
• There is conceptually just one answer, but it could be expressed in different ways (enter 1-3 examples above)
• There are several (2-4) different, correct answers to this question (enter 2-3 examples above)
• There are many different, correct answers to this question (enter 2-3 examples)
• The question makes sense, but I don't know the answer (enter "don't know" as the answer)
• This question doesn't make sense or is unanswerable (enter "?" as the answer)
Comment: In this case, there's one clear answer ("gravity"), hence the worker has entered it and checked the first box.
Some more examples are below, please read them carefully!
Some Important Notes:
Some questions might sound a little strange. This is because they were originally a multiple choice question. Try and answer it as best you can.
For "Which..." questions, think of these as asking a "What..." question, for example:
Question: What is an example of an animal?
Your answer (for example): dog; cat; mouse
put down two or three example answers separated by a ";", e.g., "dog; cat; elephant".
If you can see a couple of ways of answering a question, put them down separated by a ";". For example:
Question: Sleet, rain, snow, and hail are forms of:
Your answer (for example): weather; bad weather; precipitation
Question: Which type of energy does a person use to pedal a bicycle?
Your answer (for example): motion; kinetic energy
Some answers might be a phrase or sentence, e.g.,:
Feel free to use the internet to help get information. BUT If you happen to find exactly this question on the internet (e.g., as part of a multiple-choice exam), please don't read the answer and in particular don't copy in the multiple-choice answer! We are wanting "natural" answers to this question rather than the original multiple choice answer, so copying in the multiple-choice answer defeats the point.
If you're unsure, or it's taking too long to work out the answer, enter "don't know" and select the "I don't know the answer" choice
If the question doesn't make sense or is unanswerable, enter "?".
For categorizing the question, just use your best judgement.
Thank you for your help! You rock!
1. Examples of questions where there is a clear, single answer
Q: In New York State, the longest period of daylight occurs during which month?
Your Answer: June
Q: Which form of energy is needed to change water from a liquid to a gas?
A: heat
Comment: In these cases, there's one clear answer.
2. Examples of questions where There is conceptually just one answer, but it could be expressed in different ways
Q: A dog opens its mouth and lets its tongue hang out. A human's body produces sweat. These are two ways that organisms may adjust to
Your Answer (for example): warm weather; hot temperatures; hot weather; heat
Q: What is the main source of energy for the water cycle?
A: sun; sunlight; sunshine
Comment: As there are several different ways of describing the answer, they are listed above separated by ";". Aim to enter two or three such variations. The above answers are just examples, others are possible.
6. Examples of questions where the question doesn't make sense or is unanswerable (enter "?" as the answer)
Q: Which is the largest?
Your Answer: ?
Q: Which animal is preparing for a seasonal change in the environment?
A: ?
Q: Which object is the best conductor of electricity?
A: ?
Comment: Enter a "?" if the question doesn't make sense or is unanswerable.
3. Examples Of Questions Where
Thank you for your help! You rock!
2 https://leaderboard.allenai.org/arc/submissions/public
3 Indeed, this is the approach taken by (Lin et al., 2020) to use (a filtered subset of) ARC in a direct-answer setting.
4 Many questions are multi-sentence, with a preamble before the actual question sentence.
5 The filter phrases are: which of, most, best, least, est, order, supports, characteristic, trait, which object, which statement, below, which is, which are, example, which term, conclusion, which would, which item, which action, which two, which sentence, which one, sequence, which fact, which
6 Available at https://genie.apps.allenai.org/
7 We use the implementation from https://github.com/google-research/google-research/tree/master/rouge, with stemming turned on.
8 That is, given an MC question, UnifiedQA will output an answer choice label; while given a DA question, UnifiedQA will generate an answer directly.
9 To obtain these MC scores, we ran the same UnifiedQA model, before fine-tuning on ARC-DA, on the original ARC multiple-choice versions of the 1397 ARC-DA test questions.
10 As before, note that UnifiedQA is format-agnostic, outputting an answer option label given an MC question, or a direct answer given a DA question.
11 https://leaderboard.allenai.org/arc/submissions/public