Natural Instructions: Benchmarking Generalization to New Tasks from Natural Language Instructions
Can we enable NLP models to appropriately respond to instructional prompts and consequently generalize to new tasks? To study this question, we leverage existing NLP datasets and the instructions that were used to crowdsource them to create NATURAL-INSTRUCTIONS, a dataset of instructions and task-specific input/output data. This dataset consists of 61 distinct tasks with their language instructions and about 600k task instances, and is used to evaluate existing state-of-the-art language models (LMs) in addressing new tasks, by few-shot prompting of GPT-3 and fine-tuning BART. Our analysis indicates that: (a) existing models indeed benefit from instructions and hence show improved generalization to new tasks; (b) while models like GPT-3 generally benefit from instructions, the extent of their gains varies across different fields of the instructions and also depends on the task being solved; (c) generalization to unseen tasks in NATURAL-INSTRUCTIONS remains far from perfect for the state-of-the-art, indicating significant room for more progress in this direction.1
McCarthy (1960), in his seminal work, outlined a hypothetical machine that takes declarative knowledge as input and incorporates it into its decision-making. This vision, however, remains elusive due to many challenges at the heart of artificial intelligence. Backed by the progress made in pretrained neural language models (Raffel et al., 2020; Brown et al., 2020), we revisit McCarthy's vision of enabling machines to act according to instructions.
While the current dominant paradigm (supervised learning with labeled examples) has been successful in building task-specific models (Wang et al., 2019), the resulting models fail to effectively generalize to unseen tasks (for example, a model supervised to answer questions cannot solve a classification task), which limits their applicability in real life. However, models equipped with understanding of and reasoning with natural language instructions (Goldwasser and Roth, 2014; Efrat and Levy, 2020) should be able to generalize to any task that can be defined via instructions.

Figure 1: NATURAL-INSTRUCTIONS contains a variety of tasks, each with natural language instructions. A model is expected to understand the given instructions and answer the given input accordingly. Models that make appropriate use of the instructions can generalize to unseen tasks.

1 https://github.com/allenai/natural-instructions
Work done while interning at the Allen Institute for AI.
In this work, we introduce NATURAL-INSTRUCTIONS, a high-quality dataset of 61 language understanding tasks and their accompanying instructions. As shown in Fig. 1, the challenge involves a variety of tasks, such as classifying a document, modifying an input sentence, answering questions, etc. A system is expected to understand the given human-readable natural language instructions and apply them accordingly to the given instance.

arXiv:2104.08773v1 [cs.CL] 18 Apr 2021
Inspired by Efrat and Levy (2020), our NATURAL-INSTRUCTIONS dataset uses the crowdsourcing instructions of existing NLP datasets and their data instances as a challenge for NLP models. Compared to previous work, NATURAL-INSTRUCTIONS includes a diverse set of tasks and instructions represented with a unified schema, which enables evaluation at large scale and generalization across tasks. In particular, we break up the crowdsourcing instructions (collected from a variety of tasks) into minimal but well-defined subproblems. For example, Quoref's crowdsourcing template is broken into two tasks of question generation and answer generation. Moreover, we map the instructions into a unified schema (§3.2) to make them more coherent and easier for language models to understand.
More detailed examples of our tasks are shown in Fig. 2. For instance, one task is defined as classifying a given question into one of three answer types ('span', 'number', 'date'). To contextualize the instructions, each task contains positive and negative examples. Finally, each subtask contains task instances, i.e., the inputs to a system and the expected responses.
We benchmark BART (Lewis et al., 2019) and GPT-3 (Brown et al., 2020), two recent generative LMs (§6). Our evaluation of these models indicates that natural language instructions improve generalization to new tasks. This pleasant surprise stands in contrast to the findings of Efrat and Levy (2020), who did not find any benefit in instructions. We hypothesize that the benefit shown here is due to our formulation (minimal instructions mapped into a unified schema) and the scale of tasks included in NATURAL-INSTRUCTIONS.
Our models generally benefit from instructions; however, we find that these gains are sensitive to various parameters, such as the semantics of the task at hand (Fig. 4). Ablations (§6.3) indicate that GPT-3 gains notably from positive examples, mildly benefits from the task definitions, and tends to be confused by negative examples or 'things to avoid' instructions.
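To make the few-shot prompting setup concrete, the sketch below shows one way instruction fields and positive examples could be linearized into a single prompt string for a generative LM. This is our own illustration with hypothetical field names and abbreviated task content; the exact encoding used in the experiments may differ.

```python
def encode_prompt(instructions: dict, instance_input: str, k: int = 2) -> str:
    """Linearize instruction fields and up to k positive examples into one
    prompt string. A simplified sketch; the paper's exact encoding may differ."""
    parts = [
        "Definition: " + instructions["definition"],
        "Emphasis & Caution: " + instructions["emphasis_and_caution"],
        "Things to Avoid: " + instructions["things_to_avoid"],
    ]
    # Demonstrations precede the target instance, few-shot style.
    for ex in instructions["positive_examples"][:k]:
        parts.append("Input: " + ex["input"] + "\nOutput: " + ex["output"])
    # The target instance ends with an empty "Output:" slot for the LM to fill.
    parts.append("Input: " + instance_input + "\nOutput:")
    return "\n\n".join(parts)

# Hypothetical MC-TACO-style question-generation task, abbreviated.
task = {
    "definition": "Write a question about event duration based on the sentence.",
    "emphasis_and_caution": "Questions need not have a single correct answer.",
    "things_to_avoid": "Do not ask about answers explicitly stated in the text.",
    "positive_examples": [
        {"input": "Jack played basketball after school.",
         "output": "How long did Jack play basketball?"},
    ],
}
prompt = encode_prompt(task, "He watched a movie after dinner.")
```

Ablating a field (e.g., dropping "Things to Avoid") then amounts to omitting the corresponding part of the prompt.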
While instructions improve model generalization, the resulting performance is far from what it should be. Specifically, the gap with our oracle upper bound (measured by task-specific models) indicates sizable room for progress. We hope this gap, as well as the availability of NATURAL-INSTRUCTIONS, will encourage the development of stronger models of language.
Contributions: In summary, the contributions of this work are as follows: (a) we introduce NATURAL-INSTRUCTIONS, a large dataset of natural language instructions curated from existing well-known datasets and mapped to a unified schema; (b) we benchmark state-of-the-art models and show improved generalization from using the instructions; (c) we conduct a variety of analyses that show the sensitivity of the models to the target task and the elements of the instructions.
2 Related Works
A recent line of work has explored the ability of models to reason with natural language descriptions of a goal task (Goldwasser and Roth, 2014; Weller et al., 2020; Efrat and Levy, 2020; Hase and Bansal, 2021; Ye and Ren, 2021). Prior work has shown the ability of LMs to incorporate logical rules/facts expressed in natural language into their reasoning. Weller et al. (2020) proposed a crowdsourced dataset with short question-like task descriptions. Compared to this work, our instructions are longer, more complex, and natural (in the sense that they were originally targeted at laypeople). The closest work to ours is Efrat and Levy (2020), who examined models' ability to follow natural language instructions built from existing datasets (SNLI, SQuAD, and NewsQA). Our approach differs from this work in the following ways: First, to have a unified encoding of the instructions, we map the html content of the instructions (shown to crowdworkers) to a unified schema (§3.2). Second, each of the source instructions (shown to human workers) is split into independent tasks (§3.3), allowing us to focus on individual sub-problems. Additionally, Efrat and Levy (2020)'s scope of study was limited to 3 datasets, while our proposed dataset contains many tasks, which allows evaluation on a wider variety of tasks, as well as fine-tuning experiments across different tasks (§6).
Another related line of work is the research on effective prompting of LMs (Schick and Schütze, 2020; Scao and Rush, 2021; Reynolds and McDonell, 2021; Liu et al., 2021; Tam et al., 2021). Most of the works in this bucket explore ways to form natural language prompts for generative LMs, often for benchmarks like SuperGLUE. We see several limitations with this line of work: since there is no established benchmark for studying "prompts" (and, more broadly, instructions), each work has a different setup for its study, which creates barriers for the reproducibility of the experiments. Additionally, most of the "prompts" being investigated are overly simple. For example, Scao and Rush (2021) use prompts like the following for their question-answering subtask: "p. Based on the previous passage, q?"

Figure 2: Example tasks from NATURAL-INSTRUCTIONS, each expressed in our schema (title, definition, emphasis & caution, things to avoid, positive and negative examples with reasons and suggestions, and a prompt): writing questions on "event duration" (from MC-TACO), answering a fill-in-the-blank question on objects, finding the answer type (span, number, or date) of a reasoning question, and modifying a fill-in-the-blank question on persons so that the answer flips from PersonX to PersonY.
Finally, we highlight other flavors of "instructions" studied in various NLP sub-communities: for example, mapping natural language to database commands (Kim et al., 2020), visualizations (Shao and Nakashole, 2020), console commands (Lin et al., 2018), and robot actions (Shridhar et al., 2020; Stepputtis et al., 2020), inter alia. A common attribute of all such sub-communities is that their language instructions are mapped to a fixed symbolic grammar (e.g., SQL commands). Conversely, in our setup there is no low-level grammar for the instructions; all the constraints and expectations of our tasks need to be inferred from the natural language statement of the instructions.
3 Constructing NATURAL-INSTRUCTIONS

This section elaborates on our construction of NATURAL-INSTRUCTIONS; its main focus is the addition of natural language instructions to existing tasks. We build our benchmark by utilizing existing datasets and their crowdsourcing templates (§3.1). We describe important properties of the instructions collected using our approach (§3.1.1), leading to the design of a unified schema for instructions (§3.2). Finally, we describe how the collected data is mapped into our unified schema (§3.3). We conclude this section with a series of quantitative analyses of the dataset (§3.4).
3.1 Data Collection
To construct NATURAL-INSTRUCTIONS, we re-use existing, widely-adopted natural language benchmarks that are collected via crowdsourcing platforms and hence come with crowdsourcing templates. We focus only on textual instructions and avoid datasets that involve visual or auditory steps. In particular, we use the templates used for building the following datasets: CosmosQA (Huang et al., 2019), DROP (Dua et al., 2019), Essential-Terms (Khashabi et al., 2017), MCTACO (Zhou et al., 2019), MultiRC (Khashabi et al., 2018), QASC (Khot et al., 2020), Quoref, and Winogrande (Sakaguchi et al., 2020).
Dividing instructions into minimal subtasks. Almost all the crowdworking instructions contain sequences of steps to guide crowdworkers towards the desired goal. For example, QASC and MC-TACO contain a relatively high number of steps in their data creation processes. Unlike the work of Efrat and Levy (2020), which tackles crowdsourcing instructions as-is, we divide each crowdsourcing template into minimal tasks, thus generating multiple subtasks out of a single template. For instance, the Quoref crowdsourcing template (input: context and instructions; output: dataset of QA pairs) is split into 2 subtasks: (1) question generation (input: context and instructions; output: question), and (2) answer generation (input: context, instructions, and question; output: answer). Similarly, MCTACO's crowdsourced annotation involves generating multiple types of questions, each with its own separate instructions; we therefore divide it into multiple subtasks, each representing a distinct type of question generation. Our intuition behind this process is to make the subtask definitions more consistent across subtasks, thereby increasing the chances of success for the models.
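The splitting procedure above can be sketched as follows. The template representation and field names here are our own illustration, not an artifact of the actual dataset; the sketch assumes each template lists its input and output fields, with later subtasks receiving earlier outputs as additional inputs.

```python
# Illustrative only: a Quoref-style crowdsourcing template (context -> QA pairs)
# split into two minimal subtasks. Field names are hypothetical.

quoref_template = {
    "name": "quoref",
    "input": ["context"],
    "output": ["question", "answer"],
}

def split_template(template: dict) -> list:
    """Derive one subtask per output field; each later subtask also
    receives the earlier outputs as additional inputs."""
    subtasks = []
    inputs = list(template["input"])
    for out in template["output"]:
        subtasks.append({
            "name": f"{template['name']}_{out}_generation",
            "input": list(inputs),
            "output": [out],
        })
        inputs.append(out)  # e.g., answer generation also sees the question
    return subtasks
```

Applied to `quoref_template`, this yields a question-generation subtask (context in, question out) and an answer-generation subtask (context and question in, answer out), mirroring the split described above.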
3.1.1 Unified Design Considerations
We adhere to the following design considerations when compiling our data.
Natural human-readable instructions. The collected instructions are in natural language and designed for crowdworkers, and hence understandable to laypeople, who can perform the intended task seamlessly after reading them.
Diversity. The collected instructions, naturally written by the creators of the datasets, cover various intricacies of language across a wide variety of reasoning skills. Importantly, we avoid templating our instructions.

Negative instructions. The conventional learning paradigms in NLP mostly rely on the inductive bias produced by positive examples. For humans, however, negative instructions (describing undesirable behaviors; Lin et al., 2003; Jindal and Roth, 2011) are an effective means to communicate a given task's scope, and concise negative examples can be as informative as many positive examples. Our collected instructions include the negative instructions originally provided to crowdworkers to add constraints on data generation.
3.2 Instruction Schema
Since the instructions used in crowdsourcing our source datasets were built by various authors for various purposes, they differ in a variety of ways (a summary of their differences is included in Appendix B). To reduce the excess variability of these instructions, we cast them into the unified representation presented in Fig. 3. This unified schema is the result of a preliminary pilot study conducted on a subset of our source datasets (§3.1). It is noteworthy that each of the examples in Fig. 2 follows this schema. Below we describe the ingredients of the schema:
• Title provides a high-level description of the task and its associated skill (such as question generation or answer generation).
• Definition provides the core detailed instructions for the task.
• Things to Avoid contains instructions regarding undesirable annotations that must be avoided. Such instructions are typically helpful in defining the scope of a task and the space of acceptable responses.
• Emphasis and Caution are statements highlighted in the crowdsourcing templates that were intended to be emphasized or warned against. These are typically short but "must see" instructions for crowdworkers.
• Positive Examples help crowdworkers understand a task better. They typically contain inputs/outputs similar to the input given to a worker/system and its expected output.
• Negative Examples contextualize "Things to Avoid" by providing examples that workers/systems must not produce.
• Reason helps crowdworkers understand why an example is good or bad.
• Suggestion contains suggestions on how a negative example could be modified to turn it into a positive example.
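As one way to make the schema concrete, here is a minimal sketch rendering it as Python dataclasses. The class and field names are our own mapping of the fields described above, not an official data format of the released dataset.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Example:
    """A positive or negative example with its explanation."""
    input: str
    output: str
    reason: str = ""
    suggestion: str = ""  # used for negative examples only

@dataclass
class TaskInstructions:
    """One task's instructions, following the schema fields above."""
    title: str
    definition: str
    things_to_avoid: str
    emphasis_and_caution: str
    positive_examples: List[Example] = field(default_factory=list)
    negative_examples: List[Example] = field(default_factory=list)

# Hypothetical instance loosely based on the answer-type task in Fig. 2.
task = TaskInstructions(
    title="Finding the answer type of a reasoning question",
    definition="Label the answer type of a given question as "
               "'span', 'number', or 'date'.",
    things_to_avoid="Do not label a question 'span' if its answer "
                    "is a number or a date.",
    emphasis_and_caution="Questions may require looking at more than "
                         "one part of the passage.",
)
```

Representing every task in one such structure is what enables uniform prompting and cross-task fine-tuning later in the paper.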
3.3 Mapping The Existing Instructions To The Proposed Schema
In this section, we describe various considerations in the process of mapping the raw instructions (designed for crowdworkers) to our proposed schema (§3.2), while adhering to our desiderata (§3.1.1). The modifications made in this step were applied by one author and verified by another.
Minimal descriptions by reducing repeated content. During our construction, we limit repetition wherever possible. While repetition often aids human understanding, short and concise instructions are often more effective for models, given their limited capacity for long inputs (Beltagy et al., 2020).
Augmenting the examples. There is large variance in the number of examples provided in the instructions; most instructions contain more positive examples than negative ones, and QASC instructions, in particular, contain no negative examples at all. Whenever possible, we add positive/negative examples to the instructions, according to the task's desirable and undesirable attributes.
Augmenting the example explanations. Not all datasets contain explanations for their examples. Quoref is the only dataset that provides reasons as to why each example is positive/negative and suggestions for improving each of its negative examples. For the datasets missing such information, we add these details wherever possible.
Mapping html styles to plain text. Typically, a crowdsourcing setup contains various effects, such as html colors highlighting certain pieces of text. Some such effects are captured by our schema; for instance, highlighted sentences (for emphasis) are typically incorporated into our emphasis/caution field. In some cases, html information (e.g., a certain organization of the information boxes) is lost while converting to our schema. We hope such issues will be addressed in future work.
Model-in-loop instructions. Some recent dataset instructions involve model-in-the-loop decisions. For instance, in one of the steps of QASC's annotation, crowdworkers have access to a retrieval engine via language queries. Since our instructions are static, we ignore such subtasks and delegate them to future work. We acknowledge that mapping crowdsourcing instructions to our schema is a lossy process. In the construction, we have tried to retain as much information as possible so that the resulting instructions remain faithful to the original task.
By the end of this step, we have a dataset of 61 subtasks (each with its own language instructions) and input/output instances. The complete list of instructions is included in the appendix. To make our task easier to study, we categorize the subtasks into 7 semantic categories: (1) question generation, (2) answer generation, (3) incorrect answer generation, (4) classification, (5) minimal text modification, (6) long text generation, and (7) verification. Table 1 shows the distribution of these categories. As can be observed, the question generation, answer generation, and classification categories have the highest numbers of subtasks. In terms of the number of instances, the classification category has the most (194k), whereas long text generation has the fewest (30k).

statistic                                value
# of subtasks                            61
# of instances                           620k
avg. length of "title" (tokens)          8.3
avg. length of "prompt" (tokens)
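Such per-category statistics can be tallied directly from the subtask metadata, as in the sketch below. The subtask entries here are illustrative stand-ins with hypothetical names and counts, not the real dataset.

```python
from collections import Counter

# Hypothetical subtask metadata; names, categories, and counts are
# illustrative stand-ins, not the actual NATURAL-INSTRUCTIONS contents.
subtasks = [
    {"name": "mctaco_question_generation",
     "category": "question generation", "n_instances": 4000},
    {"name": "quoref_answer_generation",
     "category": "answer generation", "n_instances": 7000},
    {"name": "drop_answer_type",
     "category": "classification", "n_instances": 9000},
]

# Number of subtasks per category.
category_counts = Counter(t["category"] for t in subtasks)

# Total number of instances per category.
instances_per_category = Counter()
for t in subtasks:
    instances_per_category[t["category"]] += t["n_instances"]
```

On the real dataset, the same tally would reproduce the distribution reported in Table 1.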