ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
Authors
Abstract
We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. ALFRED includes long, compositional tasks with non-reversible state changes to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demonstrations in interactive visual environments for 25k natural language directives. These directives contain both high-level goals like “Rinse off a mug and place it in the coffee maker.” and low-level language instructions like “Walk to the coffee maker on the right.” ALFRED tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets. We show that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.
1. Introduction
A robot operating in human spaces needs to connect natural language to the world. Work on this symbol grounding problem [21] has largely focused on connecting language to static images. However, robots need to understand task-oriented language, for example "Rinse off a mug and place it in the coffee maker", illustrated in Figure 1.
Platforms for translating language to action have become increasingly popular, spawning new test-beds [12, 3, 14, 41]. These benchmarks include language-driven navigation and embodied question answering, which have seen dramatic improvements in modeling thanks to environments like Matterport 3D [11, 3], AI2-THOR [25], and AI Habitat [43]. However, these datasets ignore complexities arising from describing task-oriented behaviors with objects.

[Figure 1: Example ALFRED task, with low-level instructions for its frames: "walk to the coffee maker on the right", "pick up the dirty mug from the coffee maker", "turn and walk to the sink", "wash the mug in the sink", "pick up the mug and go back to the coffee maker", "put the clean mug in the coffee maker". Above, we highlight several action sequence frames corresponding to portions of the accompanying language instruction. Unlike related datasets that focus only on navigation, ALFRED requires interactions with objects, keeping track of state changes, and callbacks to previous instructions.]

Affiliations: 1 Paul G. Allen School of Computer Science & Engineering, University of Washington; 2 Carnegie Mellon University LTI & Microsoft Research AI; 3 Allen Institute for AI; 4 NVIDIA.
We introduce ALFRED, a new benchmark for connecting human language to actions, behaviors, and objects in an interactive visual environment. Expert task demonstrations are accompanied by both high- and low-level human language instructions in 120 indoor scenes in the new AI2-THOR 2.0 [25]. These demonstrations involve partial observability, long action horizons, underspecified natural language, and irreversible actions.
ALFRED includes 25,743 English language directives describing 8,055 expert demonstrations averaging 50 steps each, resulting in 428,322 image-action pairs. Motivated by work in robotics on segmentation-based grasping [36], agents in ALFRED interact with objects visually, specifying a pixelwise interaction mask of the target object. This inference is more realistic than simple object class prediction, where localization is treated as a solved problem. Existing beam-search [17, 51, 46] and backtracking solutions [23, 28] are infeasible due to the larger action and state space, long horizon, and inability to undo certain actions. To establish baseline performance levels, we evaluate a sequence-to-sequence model shown to be successful on vision-and-language navigation tasks [27]. This model is not effective on the complex tasks in ALFRED, achieving less than 5% success rates. For analysis, we also evaluate individual sub-goals like the routine of cooking something in a microwave. While performance is better for isolated sub-goals, the model lacks the reasoning capacity for long-horizon and compositional task planning.

[Table 1: Dataset comparison. ALFRED is the first interactive visual dataset to include high- and low-level natural language instructions for object and environment interactions. TACoS [42] provides detailed high- and low-level text descriptions of cooking videos, but does not facilitate task execution. For navigation, ALFRED enables discretized, grid-based movement, while other datasets use topological graph navigation or avoid navigation altogether. ALFRED requires an agent to generate spatially located interaction masks for action commands. By contrast, other datasets only require choosing from a discrete set of available interactions and object classes or offer no interactive capability.]
In summary, ALFRED facilitates learning models that translate from language to sequences of actions and predictions of visual interaction masks for object interactions. This benchmark captures many reasoning challenges present in real-world settings for translating human language to robot actions for accomplishing household tasks. Models that can overcome these challenges will begin to close the gap towards real-world, language-driven robotics. Table 1 summarizes the benefits of ALFRED relative to other visual action datasets with language annotations.

2. Related Work

Vision & Language Navigation. In vision-and-language navigation tasks, either natural or templated language describes a route to a goal location through egocentric visual observations [30, 13, 12, 3, 14]. Since the proposal of R2R [3], researchers have dramatically improved the navigation performance of models [52, 17, 51, 23, 28] with techniques like progress monitoring [27], as well as introduced task variants with additional, on-route instructions [38, 37, 49]. Much of this research is limited to static environments. By contrast, ALFRED tasks include navigation, object interactions, and state changes.

Vision & Language Task Completion. There are several existing benchmarks based on simple block worlds and fully observable scenes [9, 33]. ALFRED provides more difficult tasks in richer, visually complex scenes, and uses partially observable environments. The CHAI benchmark [32] evaluates agents performing household instructions, but includes only a single interact action outside navigation. ALFRED has seven manipulation actions, such as pick up, turn on, and open, state changes like clean versus dirty, and variation in language and visual complexity.
Previous work using the original AI2-THOR environment also investigated the task of visual semantic planning [56, 19] . Artificial language in those datasets comes from templates, and environment interaction is handled with discrete class predictions, for example selecting apple as the target object from predefined options. ALFRED features human language instructions, and object selections are carried out with class-agnostic, pixelwise interaction masks. In VirtualHome [41] , demonstration programs are generated from video demonstration and natural language instructions, but inference does not involve egocentric visual and action feedback or partial observability.
There is extensive literature on language-based instruction following in the natural language processing community. There, research has focused on mapping instructions to actions [13, 47, 5, 31, 35]. Benchmarks for visual question answering in embodied environments use templated language or static scenes [20, 15, 55, 53]. In ALFRED, rather than answering a question, the agent must complete a task specified using natural language, which requires both navigation and interaction with objects.

[Figure 2: ALFRED annotations. We introduce seven different task types with many combinations of objects in 120 scenes. An example of each task type is given above. For the Clean & Place demonstration, we also show the three crowdsourced language directives. Please see the supplemental material for example demonstrations and language for each task.]

Instruction Alignment. There has also been work on aligning natural language with video clips to find visual correspondences between words and concepts [42, 54, 44, 1, 57]. ALFRED requires performing tasks in an interactive setting as opposed to learning from recorded videos.

Robotics Instruction Following. Instruction following is a long-standing topic of interest in robotics [7, 10, 34, 50, 29, 40, 39, 45]. Lines of research consider different tasks such as cooking [10], table clearing [39], and mobile manipulation [29]. In general, they are limited to a few scenes [34], consider a small number of objects [29], or use the same environment for training and testing [7]. In contrast, ALFRED includes 120 indoor scenes, many object classes with diverse appearance across scenes and states, and a test set of unseen environments.
3. The ALFRED Dataset
The ALFRED dataset comprises 25,743 language directives corresponding to 8,055 expert demonstration episodes. Each directive includes a high-level goal and a set of step-by-step instructions. Each expert demonstration can be deterministically replayed in the AI2-THOR 2.0 simulator.
3.1. Expert Demonstrations
Expert demonstrations are composed of the agent's egocentric visual observations, the action taken at each timestep, and ground-truth interaction masks. Navigation actions move the agent or change its camera orientation, while manipulation actions include picking and placing objects, opening and closing cabinets and drawers, and turning appliances on and off. Interactions can involve multiple objects, such as using a knife to slice an apple, cleaning a mug in the sink, and browning a potato in the microwave. Manipulation actions are accompanied by a ground-truth segmentation of the target object. At inference time, interaction masks must be generated along with an action to indicate an object for interaction.

Figure 2 gives examples of the high-level agent tasks in ALFRED, like putting a cleaned object at a destination. These tasks are parameterized by the object of focus, the destination receptacle (e.g., table top), the scene in which to carry out the task, and a base object for a stack (for Stack & Place). ALFRED contains expert demonstrations of these seven tasks executed using combinations of 58 unique object classes and 26 receptacle object classes across 120 different indoor scenes. For object classes like potato slice, the agent must first pick up a knife and find a potato to create slices. All object classes contain multiple variations with different shapes, textures, and colors. For example, there are 30 unique variants of the apple class. Indoor scenes include different room types: 30 each of kitchens, bathrooms, bedrooms, and living rooms.
For 2,685 combinations of task parameters, we gather three expert demonstrations per parameter set, for a total of 8,055 unique demonstrations with an average of 50 action steps. The distribution of action steps in ALFRED demonstrations versus related datasets is given in Figure 3. As an example, for task parameters {Heat & Place, potato, counter top, KITCHEN-8}, we generate three different expert demonstrations by starting the agent and objects in randomly chosen locations. Object start positions have some commonsense, class-specific constraints; for example, a fork can start inside a drawer, but an apple cannot.
In contrast to navigation-only datasets, where expert trajectories can be produced by a shortest-path planner, ALFRED expert demonstrations are generated by encoding each scene into PDDL rules and invoking a classical planner. We then define task-specific PDDL goal conditions, for example that a heated potato is resting on a table top. Note that the planner encodes the environment as fully observable and has perfect knowledge about world dynamics. For training and testing agent models, however, the environment is partially observable: it is only viewed through the agent's egocentric vision as actions are carried out.

We split these expert demonstrations into training, validation, and test folds (Table 2). Following work in vision-and-language navigation [3], we further split the validation and test sets into two conditions: seen and unseen environments. This split facilitates examining how well models generalize to entirely new spaces with novel object class variations.
3.2. Language Directives
For every expert demonstration, we collect open vocabulary, free-form language directives from at least three different annotators using Amazon Mechanical Turk (AMT), resulting in 25k total language directives. Language directives include a high-level goal together with low-level instructions, as shown in Figures 1 and 2 . The distribution of language annotation token lengths in ALFRED versus related datasets is given in Figure 3 .
AMT workers are told to write instructions to tell a "smart robot" how to accomplish what is shown in a video. We create a video of each expert demonstration and segment it such that each segment corresponds to an instruction. We consult the PDDL plan for the expert demonstration to identify task sub-goals, for example the many low-level steps to navigate to a knife, or the several steps to heat a potato slice in the microwave once standing in front of it. We visually highlight action sequences related to sub-goals via colored timeline bars below the video. In each HIT (Human Intelligence Task), a worker watches the video, then writes low-level, step-by-step instructions for each highlighted sub-goal segment. The worker also writes a high-level goal that summarizes what the robot should accomplish during the expert demonstration.

[Figure 3: Comparison to Existing Datasets. Expert demonstration steps and instruction tokens of ALFRED compared to other datasets with human language for action sequences: Touchdown (TD) [14], VirtualHome (VH) [41], and Room-to-Room (R2R) [3]. The total number of demonstrations or annotations is given with the dataset label.]

These directives are validated through a second HIT by at least two annotators, with a possible third tie-breaker. For validation, we show a worker all three language directive annotations without the video. The worker selects whether the three directives describe the same actions, and if not, which is most different. If a directive is chosen as most different by a majority of validation workers, it is removed and the demonstration is subsequently re-annotated by another worker. Qualitatively, these rejected annotations contain incorrect object referents (e.g., "egg" instead of "potato") or directions (e.g., "go left towards..." instead of "right").
4. Baseline Models
An agent trained for ALFRED tasks needs to jointly reason over vision and language input and produce a sequence of low-level actions to interact with the environment.
4.1. Sequence-To-Sequence Models
We model the interactive agent with a CNN-LSTM sequence-to-sequence (SEQ2SEQ) architecture. A CNN encodes the visual input, a bidirectional LSTM generates a representation of the language input, and a decoder LSTM infers a sequence of low-level actions while attending over the encoded language. See Figure 4 for an overview and the supplementary material for implementation details.

Supervision. We train all models using the teacher-forcing paradigm on the expert trajectories, which ensures the language directives match the visual inputs. At each timestep, the model is trained to produce the expert action and, for manipulation actions, the associated interaction mask.
We note that student-forcing in ALFRED is non-trivial, even disregarding language alignment. Obtaining expert demonstration actions on the fly in navigation-only datasets like R2R [3] involves precomputing all optimal navigation paths. In ALFRED, obtaining these on-the-fly demonstrations requires re-planning, and in some cases is not possible at all. For example, if during a task of {Clean & Place, apple, refrigerator, KITCHEN-3} a student-forcing model slices the only apple in the scene, the action cannot be recovered from and the task cannot be completed.

[Figure 4: Model overview. At each step, our model reweights the instruction based on the history ($\hat{x}_t$), and combines the current observation features ($v_t$) and the previously executed action ($a_{t-1}$). These are passed as input to an LSTM cell to produce the current hidden state. Finally, the new hidden state ($h_t$) is combined with the previous features to predict both the next action ($a_t$) and a pixelwise interaction mask over the observed image to indicate an object.]
Visual encoding. Each visual observation $o_t$ is encoded with a frozen ResNet-18 [22] CNN, where we take the output of the final convolution layer to preserve spatial information necessary for grounding specific objects in the visual frame. We embed this output using two more 1×1 convolution layers and a fully-connected layer. During training, a set of $T$ observations from the expert demonstration is encoded as $V = \langle v_1, v_2, \ldots, v_T \rangle$.

Language encoding. Given a natural language goal $G = \langle g_1, g_2, \ldots, g_{L_g} \rangle$ of $L_g$ words and step-by-step instructions $S = \langle s_1, s_2, \ldots, s_{L_s} \rangle$ of $L_s$ words, we append them into a single input sequence $X = \langle g_1, g_2, \ldots, g_{L_g}, s_1, s_2, \ldots, s_{L_s} \rangle$, which the bidirectional LSTM encodes into the language features $x$.
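For concreteness, below is a minimal PyTorch sketch of the two encoders described above. The class names, layer widths, and the 224×224 input resolution are illustrative assumptions, not the paper's exact implementation (see the supplementary material for those details).

```python
import torch
import torch.nn as nn
from torchvision import models

class VisualEncoder(nn.Module):
    """Frozen ResNet-18 features embedded with two 1x1 convs and a linear layer."""
    def __init__(self, out_dim=512):  # out_dim is an assumed size
        super().__init__()
        resnet = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # keep everything up to the final conv block (drop avgpool/fc) to retain spatial info
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        for p in self.backbone.parameters():
            p.requires_grad = False  # frozen, as in the paper
        self.embed = nn.Sequential(
            nn.Conv2d(512, 256, kernel_size=1), nn.ReLU(),
            nn.Conv2d(256, 64, kernel_size=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, out_dim),  # 7x7 spatial map for 224x224 inputs
        )

    def forward(self, frames):             # frames: (T, 3, 224, 224)
        with torch.no_grad():
            feats = self.backbone(frames)  # (T, 512, 7, 7)
        return self.embed(feats)           # v_1..v_T: (T, out_dim)

class LanguageEncoder(nn.Module):
    """Bidirectional LSTM over the concatenated goal + step-by-step instructions."""
    def __init__(self, vocab_size, emb_dim=100, hidden_dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)

    def forward(self, tokens):             # tokens: (1, L_g + L_s) word ids
        out, _ = self.lstm(self.emb(tokens))
        return out                         # x: (1, L, 2*hidden_dim), one feature per word
```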
Attention over language. The agent's action at each timestep is based on an attention mechanism that identifies relevant tokens in the instruction. We perform soft-attention on the language features $x$ to compute the attention distribution $\alpha_t$ conditioned on the hidden state of the decoder $h_{t-1}$ from the last timestep:

$$z_t = (W_x h_{t-1})^\top x, \quad \alpha_t = \mathrm{Softmax}(z_t), \quad \hat{x}_t = \alpha_t^\top x \tag{1}$$

where $W_x$ are learnable parameters of a fully-connected layer, $z_t$ is a vector of scalar values that represent the attention mass for each word in $x$, and $\hat{x}_t$ is the weighted sum of $x$ over the attention distribution $\alpha_t$ induced from $z_t$.

Action decoding. At each timestep $t$, upon receiving a new observation image $o_t$, the LSTM decoder takes in the visual feature $v_t$, language feature $\hat{x}_t$, and the previous action $a_{t-1}$, and outputs a new hidden state $h_t$:

$$u_t = [v_t; \hat{x}_t; a_{t-1}], \quad h_t = \mathrm{LSTM}(u_t, h_{t-1}) \tag{2}$$

where $[\,;\,]$ denotes concatenation. The hidden state $h_t$ is used to obtain the attention-weighted language feature $\hat{x}_{t+1}$ at the next timestep.

Action and mask prediction. The agent interacts with the environment by choosing an action and producing a dense pixelwise binary mask to indicate specific objects in the frame. Although AI2-THOR supports continuous control for agent navigation and object manipulation, we discretize the action space for modeling simplicity. The agent chooses from among 13 actions. There are 5 navigation actions: MoveAhead, RotateRight, RotateLeft, LookUp, and LookDown, together with 7 interaction actions: Pickup, Put, Open, Close, ToggleOn, ToggleOff, and Slice. Interaction actions require a pixelwise mask to denote the object of interest. Finally, the agent predicts a Stop action to end the episode. We concatenate the hidden state $h_t$ with the input features $u_t$ and train two separate networks to predict the next action $a_t$ and interaction mask $m_t$:

$$a_t = \arg\max\left(W_a\,[h_t; u_t]\right), \quad m_t = \sigma\left(\mathrm{deconv}\,[h_t; u_t]\right) \tag{3}$$

where $W_a$ is a fully-connected layer, deconv is a three-layer deconvolution network, and $\sigma$ is a sigmoid activation function. Action selection is trained using softmax cross entropy with the expert action. The interaction masks are learned end-to-end in a supervised manner based on ground-truth object segmentations using a binary cross-entropy loss. The mask loss is weight-balanced to account for sparsity in these dense masks, in which target objects can take up a small portion of the visual frame.
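A minimal sketch of one decoder step, implementing Equations (1)-(3) in PyTorch, is given below. The hidden sizes, the deconvolution configuration, and the output mask resolution are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ActionDecoder(nn.Module):
    """One step: attend over language, update the LSTM state, predict action and mask."""
    def __init__(self, lang_dim=512, vis_dim=512, n_actions=13, hidden_dim=512):
        super().__init__()
        in_dim = vis_dim + lang_dim + n_actions
        self.attn = nn.Linear(hidden_dim, lang_dim, bias=False)        # W_x
        self.cell = nn.LSTMCell(in_dim, hidden_dim)
        self.action_head = nn.Linear(hidden_dim + in_dim, n_actions)   # W_a
        # project to a small spatial map, then a three-layer deconv produces mask logits (assumed sizes)
        self.mask_head = nn.Sequential(
            nn.Linear(hidden_dim + in_dim, 64 * 7 * 7),
            nn.Unflatten(1, (64, 7, 7)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 4, stride=2, padding=1),
        )

    def step(self, x, v_t, a_prev, state):
        """x: (L, lang_dim) language features; v_t: (vis_dim,); a_prev: one-hot (n_actions,)."""
        h_prev, c_prev = state                           # each (1, hidden_dim)
        # Eq. (1): soft attention over the language features
        z_t = x @ self.attn(h_prev).squeeze(0)           # (L,)
        alpha_t = F.softmax(z_t, dim=0)
        x_hat = alpha_t @ x                              # weighted sum: (lang_dim,)
        # Eq. (2): LSTM update on the concatenated input
        u_t = torch.cat([v_t, x_hat, a_prev]).unsqueeze(0)
        h_t, c_t = self.cell(u_t, (h_prev, c_prev))
        # Eq. (3): action logits and pixelwise mask
        hu = torch.cat([h_t, u_t], dim=1)
        action_logits = self.action_head(hu)             # trained with softmax cross entropy
        mask = torch.sigmoid(self.mask_head(hu))         # trained with weighted binary cross entropy
        return action_logits, mask, (h_t, c_t)
```

At inference time, the argmax over the action logits selects one of the 13 discrete actions, and the sigmoid mask output is thresholded to produce the binary interaction mask for manipulation actions.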
4.2. Progress Monitoring
ALFRED tasks require reasoning over long sequences of images and instruction words. We propose two auxiliary losses that use additional temporal information to reduce this burden, introducing a sequence-to-sequence model with progress monitoring (SEQ2SEQ+PM).
Ma et al. showed that agents benefit from maintaining an internal estimate of their progress towards the goal for navigation tasks [27]. Akin to learning a value function in reinforcement learning, progress monitoring helps to learn the utility of each state in the process of achieving the overall task. Intuitively, this allows our agent to better distinguish between visually similar states such as just before putting an object in the microwave versus just after taking the object out. We introduce a simple module that predicts progress, $p_t \in [0, 1]$, conditioned on the decoder hidden state $h_t$ and the concatenated input $u_t$:

$$p_t = \sigma\left(W_p\,[h_t; u_t]\right) \tag{4}$$

where $W_p$ are learnable parameters of a fully-connected layer, and $\sigma$ is a sigmoid activation function. The supervision for $p_t$ is based on normalized time-step values $t/T$, where $t$ is the current time-step and $T$ is the total length of the expert demonstration. We train with an L2 loss.

We also train the agent to predict the number of sub-goals completed so far, $c_t$. These sub-goals represent segments in the demonstration corresponding to sequences of actions like navigation, pickup, and heating as identified in the PDDL plan, discussed in Section 3.2. Each segment has a corresponding language instruction, but the alignment must be learned. This sub-goal prediction encourages the agent to coarsely track its progress through the language directive. This prediction is also conditioned on the decoder hidden state $h_t$ and the concatenated input $u_t$:

$$c_t = \sigma\left(W_c\,[h_t; u_t]\right) \tag{5}$$

where $W_c$ are learnable parameters of a fully-connected layer and $\sigma$ is a sigmoid activation function. We train $c_t$ in a supervised fashion by using the normalized number of sub-goals accomplished by the expert at each timestep, $c_t/C$, as the ground-truth label for a task with $C$ sub-goals. We again train with an L2 loss.
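A minimal sketch of the two auxiliary heads (Equations (4) and (5)) and their L2 losses, assuming the same $h_t$ and $u_t$ as in the decoder sketch above; names and shapes are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressMonitor(nn.Module):
    """Auxiliary heads predicting overall progress p_t and fraction of sub-goals completed c_t."""
    def __init__(self, hidden_dim, input_dim):
        super().__init__()
        self.progress = nn.Linear(hidden_dim + input_dim, 1)  # W_p
        self.subgoal = nn.Linear(hidden_dim + input_dim, 1)   # W_c

    def forward(self, h_t, u_t):
        hu = torch.cat([h_t, u_t], dim=1)
        p_t = torch.sigmoid(self.progress(hu))   # Eq. (4)
        c_t = torch.sigmoid(self.subgoal(hu))    # Eq. (5)
        return p_t, c_t

def progress_losses(p_t, c_t, t, T, completed, C):
    """L2 losses against t/T and (sub-goals completed)/C from the expert demonstration."""
    target_p = p_t.new_tensor([[t / T]])
    target_c = c_t.new_tensor([[completed / C]])
    return F.mse_loss(p_t, target_p) + F.mse_loss(c_t, target_c)
```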
5. Experiments
We evaluate the baseline models in the AI2-THOR simulator. When evaluating on the test folds, we use the corresponding models with the lowest validation loss. Episodes that exceed 400 steps or cause more than 10 API execution failures are terminated. Execution failures arise from bumping into walls or predicting interaction masks for incompatible objects, such as attempting to Pickup a counter top. These limits encourage efficiency and reliability. We assess the overall and partial success of models' task executions across episodes.
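A rough sketch of the episode rollout implied by these limits is shown below; the `env` and `agent` interfaces are hypothetical placeholders, not an API provided by the benchmark.

```python
MAX_STEPS = 400        # episodes exceeding 400 steps are terminated
MAX_API_FAILURES = 10  # e.g., bumping into walls or invalid interaction masks

def run_episode(env, agent, directive):
    """Roll out the agent until Stop, the step limit, or too many execution failures."""
    obs = env.reset()
    failures = 0
    for _ in range(MAX_STEPS):
        action, mask = agent.act(obs, directive)   # discrete action + pixelwise mask
        if action == "Stop":
            break
        obs, success = env.step(action, mask)      # success=False on an API execution failure
        if not success:
            failures += 1
            if failures > MAX_API_FAILURES:
                break
    return env.task_success(), env.goal_conditions_met()
```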
5.1. Evaluation Metrics
ALFRED allows us to evaluate both full task and task goal-condition completion. In navigation-only tasks, one can only measure how far the agent is from the goal. In ALFRED, we can also evaluate whether task goal-conditions have been completed, for example that a potato has been sliced. For all of our experiments, we report both Task Success and Goal-Condition Success. Each Goal-Condition relies on multiple instructions, for example navigating to an object and then slicing it.

Task Success. Each expert demonstration is parameterized by a task to be performed, as illustrated in Figure 2. Task Success is defined as 1 if the object positions and state changes correspond correctly to the task goal-conditions at the end of the action sequence, and 0 otherwise. Consider the task: "Put a hot potato slice on the counter". The agent succeeds if, at the end of the episode, any potato slice object has changed to the heated state and is resting on any counter top surface.

Goal-Condition Success. The goal-condition success of a model is the ratio of goal-conditions completed at the end of an episode to those necessary to have finished a task. For example, in the previous Heat & Place example, there are four goal-conditions. First, a potato slice must be created from a full potato. Second, a potato slice should become heated. Third, a potato slice should come to rest on a counter top. Fourth, the same potato slice that is heated should be on the counter top. If the agent slices a potato, then moves a slice to the counter top without heating it, then the goal-condition success score is 2/4 = 0.5. On average, tasks in ALFRED have 2.55 goal-conditions. The final score is calculated as the average goal-condition success of each episode. Task success is 1 only if goal-condition success is 1.

Path Weighted Metrics. We include a Path Weighted version of both metrics that considers the length of the expert demonstration [2]. Expert demonstrations found via a PDDL solver on global information are not guaranteed to be optimal. However, they avoid exploration, use shortest path navigation, and are generally efficient. The path weighted score $p_s$ for metric $s$ is given as
$$p_s = s \cdot \frac{L^*}{\max(L^*, \hat{L})} \tag{6}$$

where $\hat{L}$ is the number of actions the model took in the episode, and $L^*$ is the number of actions in the expert demonstration. Intuitively, a model receives half-credit for taking twice as long as the expert to accomplish a task.
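The metrics reduce to a few lines of arithmetic; the sketch below uses the Heat & Place example above, with illustrative counts.

```python
def goal_condition_success(satisfied: int, total: int) -> float:
    """Fraction of goal-conditions satisfied at the end of the episode."""
    return satisfied / total

def path_weighted(score: float, expert_len: int, agent_len: int) -> float:
    """Eq. (6): down-weight a score by how much longer the agent's path was than the expert's."""
    return score * expert_len / max(expert_len, agent_len)

# "Put a hot potato slice on the counter": 4 goal-conditions, 2 satisfied -> 0.5
gc = goal_condition_success(satisfied=2, total=4)        # 0.5
task_success = 1.0 if gc == 1.0 else 0.0                 # the task succeeds only if all conditions hold
pw = path_weighted(gc, expert_len=50, agent_len=100)     # half credit for taking twice as long
```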
5.2. Sub-Goal Evaluation
Completing the entire sequence of actions required to finish a task is challenging. In addition to assessing full task success, we study the ability of a model to accomplish the next sub-goal conditioned on the preceding expert sequence. The agent is tested by first forcing it to follow the expert demonstration to maintain a history of states leading up to the sub-goal, then requiring it to complete the sub-goal conditioned on the entire language directive and current visual observation. For the task "Put a hot potato slice on the counter" for example, we can evaluate the sub-goal of navigating to the potato after rolling the expert demonstration forward through picking up a knife. The tasks in ALFRED contain on average 7.5 such sub-goals (results in Table 4 ).
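A sketch of this protocol, again with hypothetical `env`/`agent` interfaces: the expert prefix is replayed so the model's recurrent state reflects the true history, then control is handed to the model for the sub-goal.

```python
def evaluate_subgoal(env, agent, directive, expert_actions, subgoal_start,
                     subgoal_checker, max_steps=400):
    """Force the expert prefix, then let the agent attempt the next sub-goal on its own."""
    obs = env.reset()
    agent.reset()
    # 1) Replay the expert actions leading up to the sub-goal (agent observes every step).
    for action, mask in expert_actions[:subgoal_start]:
        agent.observe(obs, directive, action)   # keeps the decoder history consistent
        obs, _ = env.step(action, mask)
    # 2) The agent acts on its own until it stops, succeeds, or hits the step limit.
    for _ in range(max_steps):
        action, mask = agent.act(obs, directive)
        if action == "Stop" or subgoal_checker(env):
            break
        obs, _ = env.step(action, mask)
    return subgoal_checker(env)
```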
6. Analysis
Results from our experiments are presented in Table 3. We find that the initial model, without spatial or semantic maps, object segmentations, or explicit object-state tracking, performs poorly on ALFRED's long-horizon tasks with high-dimensional state-spaces. The SEQ2SEQ model achieves a 10% goal-condition success rate, showing that the agent does learn to partially complete some tasks. This headroom motivates further research into models that can perform the complex vision-and-language planning introduced by ALFRED. The model's performance stands in stark contrast to other vision-and-language datasets focused on navigation, where sequence-to-sequence models with progress monitoring perform well [27].
6.1. Random Agent
A random agent is commonly employed as a baseline in vision-and-language tasks. In ALFRED, an agent that chooses a uniform random action and generates a uniform random interaction mask at each timestep achieves 0% on all folds, even without an API failure limit.
6.2. Unimodal Ablations
Previous work established that agents without visual inputs, language inputs, or both performed better than random agents and were competitive with initial baselines for several navigation and question answering tasks. These performance gaps were due to structural biases in the datasets or issues with model capacity [48] . We evaluate these ablation baselines to study vision and language bias in ALFRED.
The unimodal ablation performances in Table 3 indicate that both vision and language modalities are necessary to achieve the tasks in ALFRED. The VISION-ONLY model finishes some goal-conditions by interacting with familiar objects seen during training. The LANGUAGE-ONLY model similarly finishes some goal-conditions by following low-level language instructions for navigation and memorizing interaction masks for common objects like microwaves that are centered in the visual frame.
6.3. Model Ablations
We ablate the amount of language supervision available to the model, as language directives are given as both a high-level goal and step-by-step instructions. Providing only high-level, underspecified goal language is insufficient to complete the tasks, but is enough to complete some goal-conditions. Using just low-level, step-by-step instructions performs similarly to using both high- and low-levels. Thus, this simple model does not seem to exploit the goal instruction to plan out sub-goals for step-by-step execution.
The two progress monitoring signals are marginally helpful, increasing the success rate from ∼3% to ∼4%. Progress monitoring leads to more efficient task completion, as indicated by the consistently higher path weighted scores. The auxiliary signals may help avoid action repetition and aid prediction of the Stop action.
The agent takes more steps than the expert in all cases, as indicated by the lower path weighted scores. Sometimes, this is caused by failing to keep track of state-changes, for example heating up an egg in the microwave multiple times. Further, the models also do not generalize to unseen scenes, due to the overall visual complexity in ALFRED arising from new scenes and novel object class instances.
6.4. Human Evaluation
For a random subset of 73 expert demonstrations from the unseen test fold, we obtained a human evaluation of the 219 corresponding language directives. Six new annotators viewed the language directives alongside the expert demonstration. The annotators marked "Yes" or "No" to indicate whether they felt they could carry out the demonstration based on the language and egocentric frames. This evaluation corresponds to a model with "perfect" language understanding and human-level understanding of the rendered visual scene. For a conservative evaluation, we asked that annotators mark "No" if any step in the language directive was incorrect or confusing. On average, annotators marked 85% of the directives as sufficient to accurately produce the expert demonstration, indicating that the language directives in ALFRED are well aligned with the demonstrations.
6.5. Sub-Goal Performance
We examine performance of the SEQ2SEQ model on individual sub-goals in ALFRED. For this experiment, we use the expert trajectory to move the agent through the episode up to the sub-task. Then, the agent begins inference based on the language directive and current visual frame. Table 4 presents path-length weighted success scores for 8 sub-goals. Goto and Pickup are the lowest performing sub-tasks, with the SEQ2SEQ+PM model achieving ∼48% and ∼35%, respectively, even in seen environments. Visual semantic navigation is considerably harder in unseen environments. Similarly, interaction masks for Pickup actions in unseen environments are worse due to unfamiliar object-background contexts and new object instances. Simple sub-goals like Cool and Heat are achieved at a high success rate of ∼90% because these tasks are mostly object-agnostic. For example, the agent becomes familiar with using microwaves to heat things regardless of the object in hand, because microwaves have little visual diversity across kitchens. Overall, the sub-goal evaluations indicate that models that exploit modularity and hierarchy, or make use of pretrained object segmentation models, may make headway on full task sequences.

[Table 4: Evaluations by path weighted sub-goal success. The highest values per fold and task are shown in blue. We note that the LANGUAGE-ONLY model achieves less than 2% on all sub-goals. See supplemental material for more.]
7. Conclusions
We introduced ALFRED, a benchmark for learning to map natural language instructions and egocentric vision to sequences of actions. ALFRED poses a challenging modeling problem, towards the long-term vision of language-driven robots capable of navigation and interaction. The environment dynamics and interaction mask predictions required in ALFRED narrow the gap between agents in simulation and robots operating in the real world [36].
We use ALFRED to evaluate a sequence-to-sequence model with progress monitoring, shown to be effective in vision-and-language navigation tasks [27]. While this model is relatively competent at accomplishing some sub-goals (e.g., operating microwaves is similar across Heat & Place tasks), the overall task success rates are poor. The long horizon of ALFRED tasks poses a significant challenge with sub-problems including visual semantic navigation, object detection, referring expression grounding, and action grounding. These challenges may be approachable by models that exploit hierarchy [26, 8], modularity [4, 16], and structured reasoning and planning [6]. Such approaches have not been applied to data with the language complexity and long horizon action sequences of ALFRED. We are encouraged by the possibilities and challenges that the ALFRED benchmark introduces to the community.
Each expert demonstration is parameterized by a tuple of task parameters (t, s, o, r, m):
• t = the task type;
• s = the scene in AI2-THOR;
• o = the object class to be picked up;
• r = the final destination for o or ∅ for Examine;
• m = the secondary object class for Stack & Place tasks (∅ for other task types).
To construct the next tuple, we first find the largest source of imbalance in the current set of tuples. For example, if o = apple is more common than o = plunger, then o = plunger will be ranked higher than o = apple. We additionally account for the prior distribution of each entity (e.g., if cup is already represented in the data often as both o and m, it becomes disfavored by the sampling algorithm for all slots). We do this greedily across all slots until the tuple is complete. Given any partial piece of information about the task, the distributions of the remaining task parameters remain heterogeneous under this sampling, weakening baseline priors such as ignoring the language input and always executing a common task in the environment.
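The sampling procedure can be sketched as follows; the exact imbalance measure, tie-breaking, and validity constraints are assumptions about the described algorithm rather than the released implementation.

```python
import random
from collections import Counter

SLOTS = ["task_type", "scene", "obj", "receptacle", "movable"]  # (t, s, o, r, m)

def sample_tuple(existing_tuples, candidates_per_slot, valid):
    """Greedily fill the most imbalanced remaining slot with its least-represented value."""
    # per-slot usage counts, plus a global count so an entity common in any slot is disfavored everywhere
    slot_counts = {s: Counter(tup[s] for tup in existing_tuples) for s in SLOTS}
    global_counts = Counter(v for tup in existing_tuples for v in tup.values())

    tup = {}
    while len(tup) < len(SLOTS):
        remaining = [s for s in SLOTS if s not in tup]

        def imbalance(s):
            uses = [slot_counts[s][c] for c in candidates_per_slot[s]]
            return max(uses) - min(uses)

        slot = max(remaining, key=imbalance)   # largest current source of imbalance
        options = [c for c in candidates_per_slot[slot] if valid(tup, slot, c)]
        random.shuffle(options)                # random tie-breaking among equally rare values
        tup[slot] = min(options, key=lambda c: (slot_counts[slot][c], global_counts[c]))
    return tup
```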
Once a task parameter sample is complete, the chosen scene is instantiated, objects and agent start position are randomized, and the relevant room data is encoded into PDDL rules for an expert demonstration. If the PDDL planner cannot generate an expert demonstration given the room configuration, or if the agent fails an action during execution, for example by running into walls or opening doors onto itself due to physical constraints, the episode is abandoned. We gather three distinct expert demonstrations per task parameter sample. These demonstrations are further vetted by rolling them forward using our wrapper to the AI2-THOR API to ensure that a "perfect" model can reproduce the demonstration. The full sampling, generation, and verification code will be published along with the dataset.

[Figure F1: Task distribution across train, validation seen and unseen dataset splits.]
[Figure F2: The number of unique tokens introduced per annotation of language directives.]
A.2. Example Language Directives
We chose to gather three directives per demonstration empirically. For a subset of over 700 demonstrations, we gathered up to 6 language directives from different annotators. We find that after three annotations, fewer than 10 unique tokens on average are introduced by additional annotators (Figure F2). Figure F3 shows the Mechanical Turk interface used to gather language annotations. Workers were presented with a video of the expert demonstration with timeline segments indicating sub-goals. The workers annotated each segment while scrubbing through the video, and wrote a short summary description for the entire sequence. We paid workers $0.70 per annotation. During vetting, annotators were paid $0.35 per HIT (Human Intelligence Task) to compare 5 sets of three directives each. These wages were set based on local minimum-wage rates and average completion time. Figure F7 shows 7 expert trajectories (one per task type) and their accompanying annotations.

The average branching factor of Room-to-Room navigation [3], for example, is $4^6 \approx 4000$ (6 average steps and 4 navigation actions). By contrast, the ALFRED average branching factor is $12^{50} \approx 10^{53}$ (50 average steps for 12 actions). Beyond action type prediction, the ALFRED state space resulting from dynamic environments and the need to produce pixel-wise masks for interactive actions explodes further.

Figure F4 shows a few examples of masks generated by the SEQ2SEQ+PM model in seen and unseen validation scenes. The Microwave mask accurately captures the contours of the object since the model is familiar with receptacles in seen environments. In contrast, the Sink mask in the unseen scene poorly fits the unfamiliar object topology.

[Figure F6: Pickup distributions in the train, validation seen and unseen folds.]
[Figure F8: Visual diversity of AI2-THOR [25] scenes. Top to bottom rows: kitchens, living rooms, bedrooms, and bathrooms. Object locations are randomized based on placeable surface areas and class constraints. See https://ai2thor.allenai.org/demo/ for an interactive demo.]