ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks
Authors
Abstract
We present ALFRED (Action Learning From Realistic Environments and Directives), a benchmark for learning a mapping from natural language instructions and egocentric vision to sequences of actions for household tasks. ALFRED includes long, compositional tasks with non-reversible state changes to shrink the gap between research benchmarks and real-world applications. ALFRED consists of expert demonstrations in interactive visual environments for 25k natural language directives. These directives contain both high-level goals like "Rinse off a mug and place it in the coffee maker." and low-level language instructions like "Walk to the coffee maker on the right." ALFRED tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets. We show that a baseline model based on recent embodied vision-and-language tasks performs poorly on ALFRED, suggesting that there is significant room for developing innovative grounded visual language understanding models with this benchmark.
1. Introduction
A robot operating in human spaces needs to connect natural language to the world. Work on this symbol grounding problem [21] has largely focused on connecting language to static images. However, robots need to understand task-oriented language, for example "Rinse off a mug and place it in the coffee maker," as illustrated in Figure 1.
Platforms for translating language to action have become increasingly popular, spawning new test-beds [12, 3, 14, 41]. These benchmarks include language-driven navigation and embodied question answering, which have seen dramatic improvements in modeling thanks to environments like Matterport 3D [11, 3], AI2-THOR [25], and AI Habitat [43]. However, these datasets ignore complexities arising from describing task-oriented behaviors with objects.

Figure 1: Frames from an expert demonstration, highlighted with portions of the accompanying language instruction: "walk to the coffee maker on the right", "pick up the dirty mug from the coffee maker", "turn and walk to the sink", "wash the mug in the sink", "pick up the mug and go back to the coffee maker", "put the clean mug in the coffee maker". Unlike related datasets that focus only on navigation, ALFRED requires interactions with objects, keeping track of state changes, and callbacks to previous instructions.
We introduce ALFRED, a new benchmark for connecting human language to actions, behaviors, and objects in an interactive visual environment. Expert task demonstrations are accompanied by both high- and low-level human language instructions in 120 indoor scenes in the new AI2-THOR 2.0 [25]. These demonstrations involve partial observability, long action horizons, underspecified natural language, and irreversible actions.
ALFRED includes 25,743 English language directives describing 8,055 expert demonstrations averaging 50 steps each, resulting in 428,322 image-action pairs. Motivated by work in robotics on segmentation-based grasping [36], agents in ALFRED interact with objects visually, specifying a pixelwise interaction mask of the target object. This inference is more realistic than simple object class prediction, where localization is treated as a solved problem. Existing beam-search [17, 51, 46] and backtracking solutions [23, 28] are infeasible due to the larger action and state space, long horizon, and inability to undo certain actions.

Table 1: Dataset comparison. ALFRED is the first interactive visual dataset to include high- and low-level natural language instructions for object and environment interactions. TACoS [42] provides detailed high- and low-level text descriptions of cooking videos, but does not facilitate task execution. For navigation, ALFRED enables discretized, grid-based movement, while other datasets use topological graph navigation or avoid navigation altogether. ALFRED requires an agent to generate spatially located interaction masks for action commands. By contrast, other datasets only require choosing from a discrete set of available interactions and object classes, or offer no interactive capability.

To establish baseline performance levels, we evaluate a sequence-to-sequence model shown to be successful on vision-and-language navigation tasks [27]. This model is not effective on the complex tasks in ALFRED, achieving less than 5% success rates. For analysis, we also evaluate individual sub-goals, like the routine of cooking something in a microwave. While performance is better for isolated sub-goals, the model lacks the reasoning capacity for long-horizon and compositional task planning.
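To make this object-interaction interface concrete, the sketch below shows one plausible way a per-timestep agent output could pair a discrete action with a pixelwise mask. The class, action names, and frame size are illustrative assumptions, not part of the ALFRED release.

```python
# Hypothetical sketch of the per-timestep output interface implied by ALFRED:
# a discrete action plus, for manipulation actions, a pixelwise interaction mask
# over the egocentric frame. Names and frame size are assumptions for illustration.
from dataclasses import dataclass
from typing import Optional
import numpy as np

NAVIGATION_ACTIONS = {"MoveAhead", "RotateLeft", "RotateRight", "LookUp", "LookDown"}
MANIPULATION_ACTIONS = {"Pickup", "Put", "Open", "Close", "ToggleOn", "ToggleOff", "Slice"}

@dataclass
class AgentOutput:
    action: str
    # Binary mask over the egocentric frame (e.g., 300 x 300) selecting the target object.
    interaction_mask: Optional[np.ndarray] = None

    def validate(self) -> None:
        if self.action in MANIPULATION_ACTIONS and self.interaction_mask is None:
            raise ValueError(f"Manipulation action '{self.action}' requires an interaction mask.")
        if self.action in NAVIGATION_ACTIONS and self.interaction_mask is not None:
            raise ValueError(f"Navigation action '{self.action}' should not carry a mask.")
```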
In summary, ALFRED facilitates learning models that translate from language to sequences of actions and predictions of visual interaction masks for object interactions. This benchmark captures many reasoning challenges present in real-world settings for translating human language to robot actions for accomplishing household tasks. Models that can overcome these challenges will begin to close the gap towards real-world, language-driven robotics. Table 1 summarizes the benefits of ALFRED relative to other visual action datasets with language annotations.

2. Related Work

Vision & Language Navigation. In vision-and-language navigation tasks, either natural or templated language describes a route to a goal location through egocentric visual observations [30, 13, 12, 3, 14]. Since the proposal of R2R [3], researchers have dramatically improved the navigation performance of models [52, 17, 51, 23, 28] with techniques like progress monitoring [27], as well as introduced task variants with additional, on-route instructions [38, 37, 49]. Much of this research is limited to static environments. By contrast, ALFRED tasks include navigation, object interactions, and state changes.

Vision & Language Task Completion. There are several existing benchmarks based on simple block worlds and fully observable scenes [9, 33]. ALFRED provides more difficult tasks in richer, visually complex scenes, and uses partially observable environments. The CHAI benchmark [32] evaluates agents performing household instructions, but includes only a single interact action outside navigation. ALFRED has seven manipulation actions, such as pick up, turn on, and open, state changes like clean versus dirty, and variation in language and visual complexity.
Previous work using the original AI2-THOR environment also investigated the task of visual semantic planning [56, 19]. Artificial language in those datasets comes from templates, and environment interaction is handled with discrete class predictions, for example selecting apple as the target object from predefined options. ALFRED features human language instructions, and object selections are carried out with class-agnostic, pixelwise interaction masks. In VirtualHome [41], demonstration programs are generated from video demonstrations and natural language instructions, but inference does not involve egocentric visual and action feedback or partial observability.
There is extensive literature on language-based instruction following in the natural language processing community, where research has focused on mapping instructions to actions [13, 47, 5, 31, 35]. Benchmarks for visual question answering in embodied environments use templated language or static scenes [20, 15, 55, 53]. In ALFRED, rather than answering a question, the agent must complete a task specified using natural language, which requires both navigation and interaction with objects.

Instruction Alignment. There has also been work on aligning natural language with video clips to find visual correspondences between words and concepts [42, 54, 44, 1, 57]. ALFRED requires performing tasks in an interactive setting as opposed to learning from recorded videos.

Robotics Instruction Following. Instruction following is a long-standing topic of interest in robotics [7, 10, 34, 50, 29, 40, 39, 45]. Lines of research consider different tasks such as cooking [10], table clearing [39], and mobile manipulation [29]. In general, they are limited to a few scenes [34], consider a small number of objects [29], or use the same environment for training and testing [7]. In contrast, ALFRED includes 120 indoor scenes, many object classes with diverse appearance across scenes and states, and a test set of unseen environments.

Figure 2: ALFRED annotations. We introduce seven different task types with many combinations of objects in 120 scenes. An example of each task type is given above. For the Clean & Place demonstration, we also show the three crowdsourced language directives. Please see the supplemental material for example demonstrations and language for each task.
3. The ALFRED Dataset
The ALFRED dataset comprises 25,743 language directives corresponding to 8,055 expert demonstration episodes. Each directive includes a high-level goal and a set of step-by-step instructions. Each expert demonstration can be deterministically replayed in the AI2-THOR 2.0 simulator.
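As a rough illustration of how an episode is organized (the field names and scene identifier below are our own, not the dataset's actual schema), each record pairs a replayable expert action sequence with several crowdsourced directives:

```python
# Illustrative sketch only: not the official ALFRED file format.
episode = {
    "scene": "Kitchen-8",                                # assumed scene identifier
    "task_type": "Clean & Place",
    "expert_actions": ["LookDown", "MoveAhead", "..."],  # deterministically replayable in AI2-THOR 2.0
    "directives": [
        {
            "goal": "Rinse off a mug and place it in the coffee maker.",
            "instructions": [
                "Walk to the coffee maker on the right.",
                "Pick up the dirty mug from the coffee maker.",
                "...",
            ],
        },
        # at least two more annotations per demonstration
    ],
}
```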
3.1. Expert Demonstrations
Expert demonstrations are composed of an agent's egocentric visual observations of the environment, the action taken at each timestep, and ground-truth interaction masks. Navigation actions move the agent or change its camera orientation, while manipulation actions include picking and placing objects, opening and closing cabinets and drawers, and turning appliances on and off. Interactions can involve multiple objects, such as using a knife to slice an apple, cleaning a mug in the sink, and browning a potato in the microwave. Manipulation actions are accompanied by a ground-truth segmentation of the target object. At inference time, interaction masks must be generated along with an action to indicate an object for interaction.

Figure 2 gives examples of the high-level agent tasks in ALFRED, like putting a cleaned object at a destination. These tasks are parameterized by the object of focus, the destination receptacle (e.g., table top), the scene in which to carry out the task, and, for Stack & Place, a base object for a stack. ALFRED contains expert demonstrations of these seven tasks executed using combinations of 58 unique object classes and 26 receptacle object classes across 120 different indoor scenes. For object classes like potato slice, the agent must first pick up a knife and find a potato to create slices. All object classes contain multiple variations with different shapes, textures, and colors; for example, there are 30 unique variants of the apple class. Indoor scenes include different room types: 30 each of kitchens, bathrooms, bedrooms, and living rooms.
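The task parameterization just described can be summarized with a small record type; the sketch below uses our own naming, not an API from the benchmark.

```python
# Sketch of the seven-task parameterization (naming is ours, for illustration).
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TaskParams:
    task_type: str                      # e.g., "Heat & Place"
    object_class: str                   # object of focus, e.g., "Potato"
    receptacle_class: str               # destination receptacle, e.g., "CounterTop"
    scene: str                          # one of the 120 indoor scenes
    base_object: Optional[str] = None   # stacking base, used only for Stack & Place

params = TaskParams("Heat & Place", "Potato", "CounterTop", "KITCHEN-8")
```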
For 2,685 combinations of task parameters, we gather three expert demonstrations per parameter set, for a total of 8,055 unique demonstrations with an average of 50 action steps. The distribution of action steps in ALFRED demonstrations versus related datasets is given in Figure 3. As an example, for task parameters {Heat & Place, potato, counter top, KITCHEN-8}, we generate three different expert demonstrations by starting the agent and objects in randomly chosen locations. Object start positions obey some commonsense class-specific constraints; for example, a fork can start inside a drawer, but an apple cannot.
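The start-state randomization with class-specific constraints could be realized along the following lines; the constraint table and function are hypothetical, meant only to illustrate the idea.

```python
import random

# Hypothetical commonsense constraints: which receptacle classes an object may
# start inside (a fork can start in a drawer, an apple cannot).
VALID_START_RECEPTACLES = {
    "Fork": {"Drawer", "CounterTop", "SinkBasin", "DiningTable"},
    "Apple": {"CounterTop", "Fridge", "DiningTable", "Bowl"},
}

def sample_start_receptacle(object_class: str, scene_receptacles: list) -> str:
    """Pick a random start receptacle for an object, respecting class constraints."""
    allowed = [r for r in scene_receptacles
               if r in VALID_START_RECEPTACLES.get(object_class, set())]
    if not allowed:
        raise ValueError(f"No valid start receptacle for {object_class} in this scene.")
    return random.choice(allowed)
```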
In contrast to navigation-only datasets, where expert demonstrations can be defined simply in terms of reaching a goal location, ALFRED expert demonstrations are generated with a classical planner: we define task-specific PDDL goal conditions, for example that a heated potato is resting on a table top. Note that the planner encodes the environment as fully observable and has perfect knowledge about world dynamics. For training and testing agent models, however, the environment is partially observable: it is only viewed through the agent's egocentric vision as actions are carried out.

We split these expert demonstrations into training, validation, and test folds (Table 2). Following work in vision-and-language navigation [3], we further split the validation and test folds into two conditions: seen and unseen environments. This split facilitates examining how well models generalize to entirely new spaces with novel object class variations.
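A task-specific goal condition such as "a heated potato is resting on a table top" is simply a predicate over the final environment state. The sketch below mirrors that idea in Python; the state layout and class names are assumptions for illustration, not the actual PDDL encoding used by the planner.

```python
# Minimal sketch of a goal-condition check for Heat & Place (Potato, TableTop).
# The state layout (object dicts with class, heated flag, parent receptacle) is
# an assumption for illustration, not the actual PDDL or simulator metadata.
def heat_and_place_satisfied(state: dict) -> bool:
    return any(
        obj["class"] == "Potato"
        and obj.get("is_heated", False)
        and obj.get("parent_receptacle") == "TableTop"
        for obj in state["objects"]
    )
```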
3.2. Language Directives
For every expert demonstration, we collect open-vocabulary, free-form language directives from at least three different annotators using Amazon Mechanical Turk (AMT), resulting in 25k total language directives. Language directives include a high-level goal together with low-level instructions, as shown in Figures 1 and 2. The distribution of language annotation token lengths in ALFRED versus related datasets is given in Figure 3.
Figure 3: Comparison to Existing Datasets. Expert demonstration steps and instruction tokens of ALFRED compared to other datasets with human language for action sequences: Touchdown (TD) [14], VirtualHome (VH) [41], and Room-to-Room (R2R) [3]. The total number of demonstrations or annotations is given with the dataset label.

AMT workers are told to write instructions to tell a "smart robot" how to accomplish what is shown in a video. We create a video of each expert demonstration and segment it such that each segment corresponds to an instruction. We consult the PDDL plan for the expert demonstration to identify task sub-goals, for example the many low-level steps to navigate to a knife, or the several steps to heat a potato slice in the microwave once standing in front of it. We visually highlight action sequences related to sub-goals via colored timeline bars below the video. In each HIT (Human Intelligence Task), a worker watches the video, then writes low-level, step-by-step instructions for each highlighted sub-goal segment. The worker also writes a high-level goal that summarizes what the robot should accomplish during the expert demonstration.

These directives are validated through a second HIT by at least two annotators, with a possible third tie-breaker. For validation, we show a worker all three language directive annotations without the video. The worker selects whether the three directives describe the same actions and, if not, which is most different. If a directive is chosen as most different by a majority of validation workers, it is removed and the demonstration is subsequently re-annotated by another worker. Qualitatively, these rejected annotations contain incorrect object referents (e.g., "egg" instead of "potato") or directions (e.g., "go left towards..." instead of "right").
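The majority-vote filtering described above reduces to a simple tally. Here is a minimal sketch assuming each validator either votes for the index of the most-different directive or abstains when all three match; the function name and vote encoding are ours.

```python
from collections import Counter
from typing import List, Optional

def directive_to_reject(votes: List[Optional[int]]) -> Optional[int]:
    """Return the index (0-2) of the directive to remove and re-annotate, or None.

    Each validator votes for the directive that is most different, or None if all
    three describe the same actions. A directive is rejected only when a majority
    of validators single it out.
    """
    counts = Counter(v for v in votes if v is not None)
    if not counts:
        return None
    index, n = counts.most_common(1)[0]
    return index if n > len(votes) / 2 else None
```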
4. Baseline Models
An agent trained for ALFRED tasks needs to jointly reason over vision and language input and produce a sequence of low-level actions to interact with the environment.
4.1. Sequence-To-Sequence Models
We model the interactive agent with a CNN-LSTM sequence-to-sequence (SEQ2SEQ) architecture. A CNN encodes the visual input, a bidirectional LSTM generates a representation of the language input, and a decoder LSTM infers a sequence of low-level actions while attending over the encoded language. See Figure 4 for an overview and the supplementary material for implementation details.

Supervision. We train all models with teacher forcing on the expert trajectories, which ensures the language directives match the visual inputs. At each timestep, the model is trained to produce the expert action and, for manipulation actions, the associated interaction mask.
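As a concrete reference point, below is a minimal PyTorch sketch of such a CNN-LSTM decoding step. Layer sizes, the attention form, and the omission of the mask-prediction head are our simplifications, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class Seq2SeqAgent(nn.Module):
    """Minimal sketch of the CNN-LSTM baseline described above. Per-frame visual
    features (from a CNN such as a ResNet) are assumed to be precomputed and
    passed in as `vis_feats`; sizes and the attention form are illustrative,
    and the interaction-mask head is omitted for brevity."""

    def __init__(self, vocab_size: int, num_actions: int,
                 emb: int = 100, hid: int = 512, vis_feat: int = 512):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb)
        self.lang_enc = nn.LSTM(emb, hid // 2, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(hid, hid)
        self.decoder = nn.LSTMCell(vis_feat + hid + num_actions, hid)
        self.action_head = nn.Linear(hid, num_actions)

    def forward(self, lang_tokens, vis_feats, prev_action_onehot, dec_state):
        # Encode the language directive (in practice this is done once per episode).
        lang_feats, _ = self.lang_enc(self.word_emb(lang_tokens))         # (B, L, hid)
        h, c = dec_state
        # Soft attention over the encoded language, conditioned on the decoder state.
        scores = torch.bmm(lang_feats, self.attn(h).unsqueeze(2))         # (B, L, 1)
        weights = torch.softmax(scores, dim=1)
        lang_ctx = (weights * lang_feats).sum(dim=1)                      # (B, hid)
        # One decoding step from visual features, language context, and previous action.
        h, c = self.decoder(
            torch.cat([vis_feats, lang_ctx, prev_action_onehot], dim=1), (h, c))
        return self.action_head(h), (h, c)
```

At training time, teacher forcing supplies the expert action as `prev_action_onehot` at every step; the full baseline additionally predicts a pixelwise interaction mask for manipulation actions, which this sketch omits.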