Natural Instructions: Benchmarking Generalization to New Tasks from Natural Language Instructions


Abstract

Can we enable NLP models to appropriately respond to instructional prompts and consequently generalize to new tasks? To study this question, we leverage existing NLP datasets and the instructions that were used to crowdsource them to create NATURAL-INSTRUCTIONS, a dataset of instructions and task-specific input/output data. This dataset consists of 61 distinct language instructions and about 600k task instances, and is used to evaluate existing state-of-the-art language models (LMs) in addressing new tasks by few-shot prompting of GPT-3 and fine-tuning of BART. Our analysis indicates that: (a) the existing models indeed benefit from instructions and hence show improved generalization to new tasks; (b) while models like GPT-3 generally benefit from instructions, the extent of their gains varies across different fields of the instructions and also depends on the task being solved; (c) generalization to unseen tasks in NATURAL-INSTRUCTIONS remains far from perfect for the state-of-the-art, indicating significant room for more progress in this direction.1

1 Introduction

McCarthy et al. (1960), in their seminal work, outlined a hypothetical machine that takes instructions, in the form of declarative knowledge, as input and incorporates them in its decision-making. This vision, however, remains elusive due to many challenges that are at the heart of artificial intelligence. Backed by the progress made in pretrained neural language models (Raffel et al., 2020; Brown et al., 2020), we revisit McCarthy's vision of enabling machines to act according to instructions.

While the current dominant paradigm (supervised learning with labeled examples) has been successful in building task-specific models (Wang et al., 2019), the resulting models fail to effectively generalize to unseen tasks (for example, a model that is supervised to answer questions cannot solve a classification task), which limits their applicability in real life. However, models equipped with understanding of and reasoning with natural language instructions (Goldwasser and Roth, 2014; Efrat and Levy, 2020) should be able to generalize to any task that can be defined via instructions.

1 https://github.com/allenai/natural-instructions
* Work done while interning at the Allen Institute for AI.

Figure 1: NATURAL-INSTRUCTIONS contains a variety of tasks, each with natural language instructions. A model is expected to understand the given instructions and answer the given input accordingly. Models that make appropriate use of the instructions can generalize to unseen tasks.

In this work, we introduce NATURAL-INSTRUCTIONS, a high-quality dataset of 61 language understanding tasks and their accompanying instructions. As shown in Fig. 1, the challenge involves a variety of tasks, such as classifying a document, modifying an input sentence, answering questions, etc. A system is expected to understand the given human-readable natural language instructions and apply them accordingly to the given instance.

Inspired by Efrat and Levy (2020), our NATURAL-INSTRUCTIONS dataset uses the crowdsourcing instructions of existing NLP datasets and their data instances as a challenge for NLP models. Compared to the previous work, NATURAL-INSTRUCTIONS includes a diverse set of tasks and instructions represented with a unified schema, which enables evaluation at large scale and generalization across tasks. In particular, we break up the crowdsourcing instructions (collected from a variety of tasks) into minimal, but well-defined, subproblems. For example, Quoref's crowdsourcing template is broken into two tasks of question generation and answer generation. Moreover, we map the instructions into a unified schema ( §3.2) to make them more coherent and easier to understand for language models.

More detailed examples of our tasks are shown in Fig. 2. For instance, one task defines the problem of classifying a given question into one of three answer types ('span', 'number', 'date'). To contextualize the instruction, each task contains positive and negative examples. Finally, each subtask contains task instances, i.e., the inputs to a system and its expected responses.

Figure 2: Examples from NATURAL-INSTRUCTIONS. Each task follows the schema provided in Fig. 3.

We benchmark BART (Lewis et al., 2019) and GPT-3 (Brown et al., 2020), two recent generative LMs ( §6). Our evaluation of these models indicates that natural language instructions improve generalization to new tasks. This pleasant surprise stands in contrast to the findings of Efrat and Levy (2020), who did not find any benefit in instructions. We hypothesize that the benefit shown here is due to our formulation (minimal instructions mapped into a unified schema) and the scale of tasks included in NATURAL-INSTRUCTIONS.

Figure 3: The schema used for representing instruction in NATURAL-INSTRUCTIONS (§3.2), shown in plate notation.

Our models generally benefit from instructions; however, we find that these gains are sensitive to various parameters, such as the semantics of the task at hand (Fig. 4). Ablations ( §6.3) indicate that GPT-3 gains notably from positive examples, mildly benefits from the task definitions, and tends to be confused by negative examples or 'things to avoid' instructions.

Figure 4: GPT-3 and BART were evaluated with various encodings on various categories of tasks. The benefit of instructions to the models depends on the semantics of the task. For instance, for GPT-3 (left) the minimal text modification category benefits substantially, while the benefits to verification tasks are minimal.

While instructions improve model generalization, the resulting performance is far from what it should be. Specifically, the gap with our oracle upper-bound (measured by task-specific models) indicates sizable room for progress. We hope this gap, as well as the availability of NATURAL-INSTRUCTIONS, will encourage the development of stronger models of language.

Contributions: In summary, the contributions of this work are as follows: (a) we introduce NATURAL-INSTRUCTIONS, a large dataset of natural language instructions curated from existing well-known datasets and mapped to a unified schema; (b) we benchmark state-of-the-art models and show better generalization when using the instructions; (c) we conduct a variety of analyses which show the sensitivity of the models to the target task and to the elements of the instruction.

2 Related Works

A recent line of work has explored the ability of models to reason with respect to natural language descriptions of a goal task (Goldwasser and Roth, 2014; Weller et al., 2020; Efrat and Levy, 2020; Hase and Bansal, 2021; Ye and Ren, 2021). Prior work has shown the ability of LMs to incorporate logical rules/facts expressed in natural language into their reasoning. Weller et al. (2020) proposed a crowdsourced dataset with short question-like task descriptions. Compared to this work, our instructions are longer, more complex, and more natural (in the sense that they were originally targeted at laypeople). The closest work to ours is Efrat and Levy (2020), who examined models' ability to follow natural language instructions that were built based on existing datasets (SNLI, SQuAD and NewsQA). Our approach differs from this work in the following ways: First, to have a unified encoding of the instructions, we map the html content of the instructions (shown to crowdworkers) to a unified schema ( §3.2). Second, each of the source instructions (shown to human workers) is split into independent tasks ( §3.3), which allows us to focus on individual sub-problems. Additionally, Efrat and Levy (2020)'s scope of study was limited to 3 datasets, while our proposed dataset contains many tasks, which allows evaluation on a wider variety of tasks, as well as fine-tuning experiments across different tasks ( §6).

Another related line of work is the research on effective prompting of LMs (Schick and Schütze, 2020; Scao and Rush, 2021; Reynolds and McDonell, 2021; Liu et al., 2021; Tam et al., 2021). Most of the works in this bucket explore ways to form natural language prompts for generative LMs, often for benchmarks like SuperGLUE. We see several limitations with this line of work: since there is no established benchmark for studying "prompts" (and, more broadly, instructions), each work has a different setup for its study, which creates barriers to the reproducibility of the experiments. Additionally, most of the "prompts" being investigated are often overly simple. For example, Scao and Rush (2021) use prompts like the following for their question-answering subtask: "p. Based on the previous passage, q?". NATURAL-INSTRUCTIONS can serve as a benchmark for the study of complex prompts formulated as natural language instructions. Additionally, the diversity of the tasks in NATURAL-INSTRUCTIONS allows the field to study generalization across subtasks through effective comprehension of the instructions, one of the key motivations for this work.

Figure 2 (example tasks; full content in the original figure): question generation from MC-TACO ("Writing questions that involve commonsense understanding of 'event duration'"), "Answering a fill in the blank question on objects", "Finding the answer type of a reasoning question", and "Modifying a fill in the blank question on persons"; each task is specified with a title, definition, emphasis & caution, things to avoid, a prompt, and positive/negative examples with reasons and suggestions.

Finally, we highlight other flavors of "instructions" studied in various NLP sub-communities. For example, mapping natural language to database commands (Kim et al., 2020), visualizations (Shao and Nakashole, 2020), console commands (Lin et al., 2018), robot actions (Shridhar et al., 2020; Stepputtis et al., 2020), inter alia. A common attribute of all such sub-communities is that their language instructions are mapped to a fixed symbolic grammar (e.g., SQL commands). Conversely, in our setup, there is no low-level grammar for our instructions. All the constraints and expectations of our tasks need to be inferred from the natural language statement of the instructions.

3 Natural-Instructions

This section elaborates on our construction of NATURAL-INSTRUCTIONS. The main focus of this construction is the addition of natural language instructions to existing tasks. We build our benchmark by utilizing existing datasets and their crowdsourcing templates ( §3.1). We describe important properties of instructions collected using our approach ( §3.1.1), leading to the design of a unified schema for instructions ( §3.2). Finally, we describe how the collected data is mapped into our unified schema ( §3.3). We conclude this section with a series of quantitative analyses on the dataset ( §3.4).

3.1 Data Collection

To construct NATURAL-INSTRUCTIONS, we re-use existing, widely-adopted natural language benchmarks that are collected via crowdsourcing platforms and hence come with crowdsourcing templates. We only focus on textual instructions and avoid datasets that involve visual or auditory steps. In particular, we use the templates used for building the following datasets: CosmosQA (Huang et al., 2019), DROP (Dua et al., 2019), Essential-Terms (Khashabi et al., 2017), MCTACO (Zhou et al., 2019), MultiRC (Khashabi et al., 2018), QASC (Khot et al., 2020), Quoref, and Winogrande (Sakaguchi et al., 2020).

Dividing instructions into minimal subtasks. Almost all the crowdworking instructions contain sequences of steps to guide crowdworkers towards their desired goal. For example, QASC and MCTACO contain a relatively high number of steps in the data creation process. Unlike the work of Efrat and Levy (2020), which tackles crowdsourcing instructions as-is, we divide each crowdsourcing template into minimal tasks, thus generating multiple subtasks out of a single crowdsourcing template. For instance, the Quoref crowdsourcing template (input: context and instructions, output: dataset of QA pairs) is split into two subtasks: (1) question generation (input: context and instructions, output: question), and (2) answer generation (input: context, instructions and question, output: answer). Similarly, MCTACO's crowdsourced annotation involves generating multiple types of questions, each with their own separate instructions; we therefore divide the task into multiple subtasks, each representing a distinct type of question generation. Our intuition behind this whole process is to make the subtask definitions more consistent across subtasks and hence increase the chances of success for the models.
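To make the splitting step concrete, here is a minimal Python sketch, assuming a QA-style template is stored as a dictionary; the field names and the helper split_qa_template are hypothetical and only mirror the Quoref example described above.

```python
def split_qa_template(template):
    """Split one QA crowdsourcing template (context -> QA pairs) into two
    minimal subtasks, mirroring the Quoref example in the text.
    `template` is assumed to be a dict with an "instructions" entry."""
    question_generation = {
        "instructions": template["instructions"],
        "input": ["context"],             # what the worker/model sees
        "output": ["question"],           # what it must produce
    }
    answer_generation = {
        "instructions": template["instructions"],
        "input": ["context", "question"],
        "output": ["answer"],
    }
    return [question_generation, answer_generation]

# usage: subtasks = split_qa_template({"instructions": "Write questions about ..."})
```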

3.1.1 Unified Design Considerations

We follow these design considerations when compiling our data.

Natural human-readable instructions. The collected instructions are written in natural language for crowdworkers and hence are understandable to laypeople, who can perform the intended task seamlessly after reading them.

Diversity. The collected instructions cover various intricacies of language for a wide variety of reasoning skills since they are naturally written by the creators of the datasets. Importantly, we avoid templating our instructions.

Negative instructions. The conventional learning paradigms in NLP mostly rely on the inductive bias produced by positive examples. However, for humans, negative instructions (describing undesirable behaviors; Lin et al., 2003; Jindal and Roth, 2011) are an effective means to communicate a given task's scope. For humans, concise negative examples can be as informative as many positive examples. Our collected instructions include the negative instructions originally provided to crowdworkers to add constraints for data generation.

3.2 Instruction Schema

Since the instructions used in crowdsourcing our source datasets were built by various authors for various purposes, they differ in a variety of ways (a summary of their differences is included in Appendix B). To reduce the excess variability of these instructions, we cast them into the unified representation presented in Fig. 3. This unified schema was the result of a preliminary pilot study conducted on a subset of our source datasets ( §3.1). It is noteworthy that each of the examples in Fig. 2 follows the same schema. Below we describe the ingredients of this schema:

• Title provides a high-level description of the task and its associated skill (such as question generation, answer generation).
• Definition provides the core detailed instructions for the task.
• Things to Avoid contains instructions regarding undesirable annotations that must be avoided. Such instructions are typically helpful in defining the scope of a task and the space of acceptable responses.
• Emphasis and Caution are the statements highlighted in the crowdsourcing templates which were intended to be emphasized or warned against. These are typically short but "must see" instructions for crowdworkers.
• Positive Examples help crowdworkers understand a task better. They typically contain inputs/outputs that are similar to the input given to a worker/system and its expected output.
• Negative Examples contextualize "Things to Avoid" by providing examples that workers/systems must not produce.
• Reason helps crowdworkers understand the reasons behind why an example is good or bad.
• Suggestion contains suggestions on how a negative example could be modified to turn it into a positive example.
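Concretely, the schema can be thought of as a structured record. Below is a minimal Python sketch of such a record; the class and field names are our own illustration of the schema and do not reflect the dataset's actual file format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Example:
    """A positive or negative example contextualizing the instruction."""
    input: str
    output: str
    reason: str = ""      # why the example is good or bad
    suggestion: str = ""  # only used for negative examples

@dataclass
class Instruction:
    """One task's instruction, roughly following the unified schema (Fig. 3)."""
    title: str
    definition: str
    prompt: str = ""
    things_to_avoid: str = ""
    emphasis_caution: str = ""
    positive_examples: List[Example] = field(default_factory=list)
    negative_examples: List[Example] = field(default_factory=list)
```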

3.3 Mapping The Existing Instructions To The Proposed Schema

In this section, we describe various considerations in the process of mapping raw instructions (designed for crowdworkers) to our proposed schema ( §3.2), while adhering to our desiderata ( §3.1.1).

The modifications suggested in this step were applied by one author and verified by another author.

Minimal descriptions by reducing repeated content. During our construction, we limit repetition wherever possible. While repetition often helps augment human understanding, short and concise instructions are often more effective for computers due to their limited attention span (Beltagy et al., 2020).

Augmenting the examples. There is a large variance in the number of examples provided in the instructions. Most of the instructions consistently contain more positive examples than negative examples. QASC instructions, in particular, do not contain any negative examples. Whenever possible, we add positive/negative examples to the instructions, according to their desirable and undesirable attributes.

Augmenting the example explanations. Not all datasets contain explanations for each of their examples. Quoref is the only dataset that provides reasons as to why each example is positive/negative and suggestions for improving each of its negative examples. For the datasets missing such information, we add these details wherever possible.

Mapping html styles to plain text. Typically a crowdsourcing setup contains various effects, such as html colors highlighting certain pieces of text. Some such effects are captured by our schema. For instance, highlighted sentences (for emphasis) are typically incorporated as part of our emphasis/caution field. In some cases, html information (e.g., a certain organization of the information boxes) is lost while converting the templates into our schema. We hope such issues will be addressed in future work.

Model-in-loop instructions. Some recent dataset instructions involve model-in-the-loop decisions. For instance, in one of the steps in QASC's annotation, crowdworkers have access to a retrieval engine via language queries. Since our instructions are static, we ignore such subtasks and delegate them to future work. We acknowledge that mapping crowdsourcing instructions to our schema is a lossy process. In the construction, we have tried to retain as much information as possible so that the resulting instructions are still faithful to the original task.

By the end of this step, we have a dataset of 61 subtasks (each with its own language instructions) and input/output instances. The complete list of instructions is included in the appendix. To make it easier to study our tasks, we have categorized them into 7 semantic categories: (1) question generation, (2) answer generation, (3) incorrect answer generation, (4) classification, (5) minimal text modification, (6) long text generation, and (7) verification. Table 1 shows the distribution of these categories. As can be observed, the question generation, answer generation, and classification categories have the highest numbers of subtasks. In terms of the number of instances, the classification category has the most instances (194k), whereas long text generation has the fewest (30k).

3.4 Dataset Statistics

Table 1: Statistics of NATURAL-INSTRUCTIONS

statistic | value
# of subtasks | 61
# of instances | 620k
avg. length of "title" (tokens) | 8.3
avg. length of "prompt" (tokens) | 12.6
avg. length of "definition" (tokens) | 65.5
avg. length of "things to avoid" (tokens) | 24.1
avg. length of "emphasis/caution" (tokens) | 45.0
avg. length of "reason" (tokens) | 24.9
avg. length of "suggestion" (tokens) | 19

4 Problem Setup And Evaluation

NATURAL-INSTRUCTIONS consists of a set of tasks (each with its own language instructions) and input/output instances. Here, we formally define our evaluation setup for learning from instructions within a task ( §4.1) and across tasks ( §4.2). We then explain how we use language models to encode instructions ( §5.1).

4.1 Evaluating Within Tasks

Each task t in NATURAL-INSTRUCTIONS consists of an instruction I_t and its corresponding instances Z_t, where Z_t = {(x_i^t, y_i^t)}_{i=1}^{n_t} denotes the set of n_t instances that belong to the t-th task. According to our schema ( §3.2), each instruction I_t for the t-th task is a set that contains the fields described in §3.2 (title, definition, things to avoid, etc.). A model M is expected to map a given instruction I_t and an input instance x_i^t to its corresponding output y_i^t:

M : (I_t, x_i^t) → y_i^t

Data split within tasks. For each task, we randomly create training/dev/test splits (sized 6.5k/0.5k/1k) and use the appropriate splits for supervision and evaluation. The training sets are limited to 6.5k instances so that we do not have issues related to data imbalance in our experiments.

Evaluation. We formulate three evaluation settings with different supervision types available to a model (Table 2). In the 'task-specific' setting, a model is supervised with the training instances of the evaluation task, similar to the conventional setup. In the 'few-shot' setting, a model only observes a few examples of the evaluation task.3 In the 'generalization' setting, a model does not observe any instances from the evaluation task.
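A minimal sketch of the within-task split described above, assuming a task's instances are held in a Python list; the 6.5k/0.5k/1k sizes follow the text, while everything else (function name, seeding) is illustrative.

```python
import random

def split_task_instances(instances, seed=0):
    """Randomly split one task's instances into train/dev/test,
    capping the training set at 6.5k as described in the text."""
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)
    test = shuffled[:1000]                   # 1k test
    dev = shuffled[1000:1500]                # 0.5k dev
    train = shuffled[1500:1500 + 6500]       # at most 6.5k train
    return train, dev, test
```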

Table 2: Different modes of supervision considered in this work, when evaluating a model on the instances of a fixed task T ∈ T_eval.

Setup | Evaluation | Supervision
task-specific | T ∈ T_eval | all the instances of T
few-shot | T ∈ T_eval | a few instances of T
generalization | T ∈ T_eval | instances of T_non-eval tasks

5 Evaluating Language Models To Address Natural-Instructions

We use generative language models BART (Lewis et al., 2019) and GPT-3 (Brown et al., 2020) to address tasks in NATURAL-INSTRUCTIONS. Here, we describe how we encode instructions and instances into plain text and feed them into generative language models ( §5.1). We then describe the model details ( §5.2).

5.1 Encoding Instructions

Let enc(I, x) denote a function that maps a given instruction I and input instance x to plain text. Evidently, there are many choices for this function.

In our study, we consider the following encodings (details in Appendix C):

"instance-only" encoding. This encoding is the conventional paradigm where no instructions exist, except the input instance.

"prompt" encoding. In this encoding, we append the prompt message before the input instance.

"prompt + definition" encoding. In this encoding, the prompt message and the task definition appear before the input instance. Intuitively, this encoding is more informative and more complex than "prompt" only encoding.

"all instructions" encoding. This encoding contains all the instruction content. We include as many examples as possible, before exceeding the token limit of LMs.

"positive examples" encoding. This encoding contains only positive examples of the subtask (no task description, etc.). We fit as many positive examples as it fits in the input. Such example-only encodings have been used in several recent studies (Zhao et al., 2021).

5.2 Models

We use GPT-3 (Brown et al., 2020), an autoregressive LM with 175 billion parameters that has shown a surprising ability to mimic a few demonstrations of a task provided at inference time as conditioning. For this model, we only experiment under the few-shot setting since we do not have access to fine-tune and update the model parameters.

To evaluate language models' capabilities under the generalization and task-specific settings, we use BART (base) (Lewis et al., 2019) and fine-tune its parameters. BART (base) is an encoder-decoder architecture with 140 million parameters (roughly 1.2k times smaller than GPT-3).

6.1 Experimental Setup

Few-shot GPT-3. We follow recent work on inference-time few-shot prompting of GPT-3 (Brown et al., 2020) to evaluate it on our dataset. In these experiments, we use the davinci engine and produce outputs with greedy decoding, generating up to a maximum of 16 tokens (the default value). We use the default stop condition, which is 2 newline tokens. Because of our usage limits, this evaluation was done on 50 instances of each task. We evaluate under a variety of encodings ( §5.1). Our base encoding is "prompt" only.4 We also study this model under other encodings ("prompt + definition", etc.) and report their scores relative to this base case.
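For illustration, this setup roughly corresponds to the following call with the legacy (pre-1.0) openai Python package; the surrounding function is our own, while the engine, greedy decoding, 16-token limit, and newline stop condition follow the description above.

```python
import openai  # legacy (<1.0) openai package; assumes OPENAI_API_KEY is set in the environment

def gpt3_few_shot(prompt_text):
    """Query GPT-3 (davinci) with greedy decoding, as in the few-shot setup above."""
    response = openai.Completion.create(
        engine="davinci",
        prompt=prompt_text,
        max_tokens=16,   # default maximum number of generated tokens
        temperature=0,   # greedy decoding
        stop="\n\n",     # stop after two consecutive newlines
    )
    return response["choices"][0]["text"].strip()
```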

Generalization BART. We train the BART model on all 26 non-evaluation subtasks T_non-eval ( §4.2); it is never exposed to the tasks it is evaluated on. We train and evaluate this model under a variety of encodings ( §5.1). Our base encoding is "instance" only (no instructions). We also study this model with other encodings ("prompt + definition", etc.) and report their scores relative to this base case.
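A minimal sketch of such fine-tuning with HuggingFace transformers, where each training example is an (encoded instruction + input, target output) text pair; the hyperparameters and data format here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
from torch.utils.data import DataLoader
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def collate(batch):
    # each item: (encoded_instruction_plus_input, target_output) as plain text
    sources, targets = zip(*batch)
    enc = tokenizer(list(sources), truncation=True, padding=True, return_tensors="pt")
    labels = tokenizer(list(targets), truncation=True, padding=True, return_tensors="pt").input_ids
    labels[labels == tokenizer.pad_token_id] = -100  # ignore padding positions in the loss
    enc["labels"] = labels
    return enc

def finetune(examples, epochs=3, lr=5e-5, batch_size=8):
    """examples: list of (source_text, target_text) pairs from the non-evaluation tasks."""
    loader = DataLoader(examples, batch_size=batch_size, shuffle=True, collate_fn=collate)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    return model
```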

Task-specific BART (oracle upper-bound).

We train BART on the input/output instances of each task (no instructions) and evaluate on the same task. This is the conventional setup where the model is fine-tuned to solve a single task, without any instructions involved. Such a model, by design, won't generalize across different tasks since it is specialized to each subtask. However, the numbers elicited from this setup can be viewed as upper-bounds for each task (i.e., how well BART can perform if it is trained on many instances of this particular task).

Evaluation metrics. We treat all of our tasks as text generation problems and evaluate them with automated evaluation metrics for text generation. In particular, we use BLEURT (BT) (Sellam et al., 2020) and ROUGE-L (R-L) (Lin, 2004). Automated evaluation metrics have the benefit of being easy to run and reproduce. However, they are known to be inaccurate (Owczarzak et al., 2012). To mitigate such issues, as noted earlier in the construction of task splits ( §4.2), the tasks included in T_eval admit a relatively reliable automatic evaluation.
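For reference, ROUGE-L can be computed with the rouge-score package as sketched below; BLEURT would be computed analogously with its own library, which we omit here. The averaging function is our own.

```python
from rouge_score import rouge_scorer

def mean_rouge_l(predictions, references):
    """Average ROUGE-L F1 over parallel lists of prediction and reference strings."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    scores = [scorer.score(ref, pred)["rougeL"].fmeasure
              for pred, ref in zip(predictions, references)]
    return sum(scores) / len(scores)

# e.g. mean_rouge_l(["How long did Jack play basketball?"],
#                   ["How long did Jack play basketball after school?"])
```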

Table 3: Empirical results on NATURAL-INSTRUCTIONS. The first row is a task-specific model trained without instructions. This model serves as our upper-bound, which is why it is grayed out. For each of the two systems (BART and GPT-3), the addition of the instructions helps propel the model performance, as indicated by the relative gains (the last column) compared to their base encodings.

6.2 Results

Table 3 evaluates language models under different supervision settings and different instruction encodings. We observe the following from the experiments.

Instructions improve generalization to unseen tasks. Table 3 shows the performance of BART and GPT-3 when they are evaluated with various forms of encodings. For both BART and GPT-3, the relative gains compared to their base encodings (indicated in light green) show that the addition of instructions improves the performance of the models when evaluated on an unseen task. Note that these two models have different modes of supervision: BART is supervised with the T_non-eval tasks, which have no overlap with the evaluation tasks T_eval ( §4.2). For GPT-3, the supervision is provided by a few instances (10 in our experiments) as part of the prompt.

Note that these findings are in contrast with those of Efrat and Levy (2020), who observe that large language models "fail to follow a series of gradually simpler instructions". As discussed in §2, we hypothesize that the success here is fueled by breaking the crowdsourcing instructions into minimal tasks and mapping them into a single coherent schema, which makes it easier for LMs to exploit the prevalence and diversity of training tasks.

NATURAL-INSTRUCTIONS has a wide margin to be solved. The first row of Table 3 shows the scores of task-specific supervised models. While these models have the weakest generalization, their scores indicate the room for improvement for our models that work based on generalization. This upper-bound score is considerably higher than those of our GPT-3 and BART models (39% and 37%, respectively, in terms of the ROUGE-L metric). We hope this performance gap between the generalization-based models (that use instructions) and task-specific models will motivate the field towards stronger models.

The benefit from instructions heavily depends on the task at hand. Figure 4 shows the performance of our models on our task categories, broken down by several coarse input encodings. Similar to our previous observations, the "all instructions" encoding typically performs better than the other encodings. However, these gains are not uniform across task categories.

6.3 Case Study: An Ablation Study Of Instructional Elements

We conduct an ablation study with GPT-3 on 3 distinct tasks (answer generation from Winogrande; question generation from QASC; verifying temporal reasoning category of a given question from MCTACO). Table 4 (top) shows the effect of eliminating various fields in the encoding while Table 4 (bottom) indicates the gains from adding each field. The overall observation is that GPT-3 benefits the most from positive examples, mildly from definition, and deteriorates with negative examples. We hypothesize it is easier for GPT-3 to mimic the patterns in positive examples while utilizing negative examples requires deeper understanding.

Table 4: An ablation study of the different fields included in NATURAL-INSTRUCTIONS based on GPT-3. This model benefits the most from positive examples and the least from negative examples.
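Such ablations can be implemented by blanking out individual schema fields before the instruction is encoded; a minimal sketch (assuming the dictionary representation of instructions used in the earlier sketches) is shown below.

```python
def ablate(instruction, drop_fields):
    """Return a copy of an instruction dict with the given schema fields blanked out,
    e.g. drop_fields={"negative_examples"} or {"definition"}."""
    empty = {"positive_examples": [], "negative_examples": []}
    return {key: (empty.get(key, "") if key in drop_fields else value)
            for key, value in instruction.items()}

# usage: encode(ablate(instruction, {"negative_examples"}), x, mode="all instructions")
```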

6.4 Error Analysis

We conduct error analysis on 3 distinct tasks (answer generation from Winogrande; question generation from QASC; incorrect answer generation for MCTACO event-duration questions). We randomly select 30 samples from each of these tasks and categorize the associated errors. Table 5 summarizes our analysis. We observe that GPT-3 mainly suffers from generating redundant content and ignoring the provided instructions. Our BART model, on the other hand, offers more control; however, it often fails to generate proper output.

Table 5: Breakdown of GPT-3 and BART errors on 30 instances of 3 different subtasks.
Figure 5: Variations in the number of subtasks
Figure 6: Variation in the number of positive and negative examples
Figure 7: Variation in the number of sentences in the crowdsourcing instructions across datasets
Figure 8: Variation in reasoning skills

7 Discussion And Conclusion

The instruction paradigm (Goldwasser and Roth, 2014; Efrat and Levy, 2020) is a strict generalization of the current instance-only paradigm. With the saturation of the existing benchmarks that follow the conventional paradigm, we believe the field should move towards models that can address a broader range of tasks. Learning to follow instructions is one alternative that brings more generality to our models, and it remains challenging for current models. In this paper, we revisit the goal of learning from instructions and introduce NATURAL-INSTRUCTIONS to fill the vacuum left by the lack of a realistic benchmark that covers a wide range of tasks. Our dataset is built from several recent datasets and covers a diverse set of tasks. Our experiments with state-of-the-art generative models reveal that they make non-zero use of language instructions, which is a positive development. However, these models remain remarkably far from the task-specific upper-bounds. We hope this work will bring more attention to building stronger models that can generalize to a wider range of tasks.

Figure 9 shows how a task is divided into multiple subtasks for the MCTACO dataset. MCTACO has five categories (Event Duration, Event Frequency, etc.). Each category contributes two subtasks: one for question generation and one for answer generation.

Figure 9: Dividing a data creation task into multiple subtasks for the MCTACO dataset.

B Analysis Of Crowdsourcing Templates

We analyzed crowdsourcing templates of 6 datasets:

CosmosQA (Huang et al., 2019), DROP (Dua et al., 2019), MCTACO (Zhou et al., 2019), QASC (Khot et al., 2020), Quoref, and Winogrande (Sakaguchi et al., 2020). Our intention behind the analysis is to identify similarities and differences across templates and subsequently decide regarding the collection of more templates.

Size Distribution of Fields: We observe significant variation in size across the 6 datasets (Figure 7). In the case of QASC, the instruction size associated with each step of the data creation process is very high, whereas for Winogrande it is exactly the opposite: the instruction size associated with each step of the data creation process is very low. Instead, the size of the common instruction (i.e., the instruction preceding the first step of the data creation process) is high in Winogrande; this is also seen for DROP. The major mode of instruction varies across datasets. Examples and instructions associated with each step of data creation respectively take up the majority of space in Quoref and CosmosQA. MCTACO relies on examples to explain the crowdsourcing task, while Winogrande and QASC depend mostly on common instructions and on instructions associated with each step of the data creation process, respectively, to explain the task to the crowdworker.

Number of Steps in the Data Creation Process: Figure 5 illustrates how the number of steps in the data creation process varies across the 6 datasets. QASC and MCTACO contain a relatively higher number of steps in the data creation process in comparison to DROP, Quoref, CosmosQA, and Winogrande.

Number Of Positive And Negative Examples:

Variation in the occurrence of "Positive" and "Negative Examples" across datasets is illustrated in Figure 6. Only Winogrande provides an equal number of "Positive" and "Negative Examples". QASC instructions do not contain any "Negative Examples". Overall, DROP instructions consist of a relatively higher number of examples than the other datasets.

Presence of Reasons and Suggestions in Examples: All datasets except QASC contain both "Positive" and "Negative Examples". However, Quoref is the only dataset to provide "Reasons" for all the "Positive" and "Negative Examples". There are explanations associated with each of the "Negative Examples", but the presence of explanations associated with "Positive Examples" varies across datasets. Finally, Quoref is the only dataset to provide "Suggestions" along with the "Reasons" associated with the "Negative Examples".

Dimensions of Input and Output: The input dimension of a step is defined as the number of previous step outputs that are fed to it as input. Similarly, the output dimension of a step is the number of distinct outputs the model needs to produce in that step; for example, if a model has to generate both a question and an answer in a step, the output dimension is 2. CosmosQA and QASC have relatively high-dimensional instances, whereas Quoref and MCTACO have relatively low-dimensional instances.

B.2 Qualitative Analysis

Writing Style: There exists significant variation in writing style across the instructions of the 6 datasets. For instance, though DROP, Quoref and QASC have the common objective of fooling an AI model, the instructions are stated differently across them. DROP instructions say "There is an AI running in the background which will also try to answer the question. You won't be able to submit the question if the AI gives the same response." The writing style in Quoref, however, is different: "We also want you to avoid questions that can be answered correctly by someone without actually understanding the paragraph. To help you do so, we provided an AI system running in the background that will try to answer the questions you write. You can consider any question it can answer to be too easy. However, please note that the AI system incorrectly answering a question does not necessarily mean that it is good." In QASC, the variation is as follows: "Two AI systems will try to answer your question. Make sure you fool at least on AI with an incorrect answer. If you fool both AIs, you will receive a bonus of $0.25."

Information: We observe that sometimes the instructions of a dataset contain information that is relevant to several other datasets which do not contain similar instruction information. For example, Quoref, DROP and CosmosQA are all datasets based on reading comprehension tasks. CosmosQA contains a step in the data creation process asking users to skip passages containing inappropriate or offensive content. This information is also relevant to Quoref and DROP, but is not mentioned in their respective instructions.

Topic: Figure 10 illustrates some examples where the reasoning skill associated with the datasets is the same, but the topic varies. The experience gained creating data for one topic may help with understanding instructions and creating data for another dataset with the same underlying reasoning skill.

Figure 10. Not extracted; please refer to original document.

Hardness: In a typical crowdsourcing setup, certain tasks may be harder than others; often these are the core tasks, e.g., question generation and adversarial data creation. Additional information, especially in the form of tips, is always helpful in solving these hard tasks. Figure 11 illustrates that the task of question generation is stated differently in Quoref, CosmosQA and QASC. QASC mentions an easy and detailed way to create questions, whereas CosmosQA mentions several different attributes of a good quality question. Knowing about the CosmosQA and QASC question generation processes may help with data creation for Quoref and other such question generation tasks, where less additional information is provided regarding question creation.

Associated Reasoning Skill: Finally, there are similarities among datasets in terms of their underlying skill requirements. Figure 8 illustrates datasets clustered based on similarity in their associated reasoning class.

Figure 11. Not extracted; please refer to original document.

C Encoding Of The Instructions

To feed the instances to LMs, we first encode them into plain text. Let enc(I, x) denote a function that maps a given instruction I and input instance x to plain text. Evidently, there are many choices for this function. In our study, we consider the following encodings:

"instance-only" encoding. This encoding is the conventional paradigm where no instructions exist:

enc(I_t, x) := "input: x output:"     (1)

"prompt" encoding. In this encoding, we append the prompt message before the input:

enc(I_t, x) := "prompt: I_t^{prompt} input: x output:"

"prompt + definition" encoding. In this encoding, the prompt message and the task definition appear before the input:

enc(I_t, x) := "definition: I_t^{def.} prompt: I_t^{prompt} input: x output:"

Intuitively, this encoding is more informative and more complex than the "prompt" encoding.

"all instructions" encoding. This encoding concatenates all the instruction content before the input, where enc_ex(I_t) is an alternating encoding of positive and negative examples. We include as many examples as possible before exceeding the input limit.

"positive examples" encoding. This encoding contains only positive examples of the subtask (no task description, etc.), followed by "input: x output:". We fit as many positive examples as fit in the input. Such example-only encodings have been used in several recent studies in the field (Zhao et al., 2021).

Table 7 shows detailed results of some additional task-specific experiments we performed. Interestingly, the addition of 10 task-specific data samples significantly increases the generalization performance of the models. Table 6 shows detailed results of the ablation experiments we performed.

Table 6: Detailed results of the encoding ablation performed on three distinct subtasks.
Table 7: Empirical results NATURAL-INSTRUCTIONS. The best numbers among the four encodings are indicated with bold. The first row is grayed out since it is our oracle upperbound.

2 The tasks are enumerated in the appendix.

3 We use "few-shot" to refer to any setup with a small number of labeled examples, regardless of whether these examples are used for fine-tuning or inference-time conditioning (no gradient updates).

"instance"-only encoding does not make sense for GPT-3; otherwise the task is undefined.

Table 8: Detailed Structure of NATURAL-INSTRUCTIONS