Abstract
We describe a framework for research and evaluation in Embodied AI. Our proposal is based on a canonical task: Rearrangement. A standard task can focus the development of new techniques and serve as a source of trained models that can be transferred to other settings. In the rearrangement task, the goal is to bring a given physical environment into a specified state. The goal state can be specified by object poses, by images, by a description in language, or by letting the agent experience the environment in the goal state. We characterize rearrangement scenarios along different axes and describe metrics for benchmarking rearrangement performance. To facilitate research and exploration, we present experimental testbeds of rearrangement scenarios in four different simulation environments. We anticipate that other datasets will be released and new simulation platforms will be built to support training of rearrangement agents and their deployment on physical systems.
1. Introduction
Embodied AI is the study and development of intelligent systems with a physical or virtual embodiment. Over the past few years, significant advances have been made in developing intelligent agents that can navigate in previously unseen environments. These advances have been accelerated by dedicated software platforms [44, 61, 75, 62] and clear experimental protocols [1, 4]. Navigation research is thriving in part due to healthy infrastructure and experimental methodology.
The authors are listed in alphabetical order.
An exciting frontier for Embodied AI research concerns interaction and contact between the agent and the environment: tasks that call on the agent to actively engage with and modify the environment in order to accomplish its goals. A number of software platforms support such interaction scenarios [37, 44, 62, 76, 28]. These software platforms simulate realistic onboard perception and in some cases the physical dynamics of the agent, the environment, and their interaction.
One missing ingredient is a clear task definition that can span different software platforms and catalyze coordinated accumulation of knowledge and ability across research groups. Clear task definitions and evaluation metrics are essential in the common task framework, which is substantially responsible for progress in computer vision and natural language processing [22].
In computer vision, standard tasks such as image classification and object detection have facilitated the development of foundational techniques and architectures that have enriched the whole field [23, 59]. Language modeling and machine translation have served a similar role in natural language processing [69, 70, 2, 73]. In both fields, these standard tasks focus the development and validation of new representations and algorithms, and serve as a source of trained models that can be transferred to other tasks, as with convolutional backbones pretrained for image classification [35], object detectors [58, 34], and transformers pretrained on natural language [20, 57, 77, 11].
Figure 1. Object rearrangement example (panels: goal state, current state, task specification). The goal and the current state of the scene are shown in the left and right images, respectively. The agent is required to move objects (e.g., chair) or change their state (e.g., close the fridge) to recover the goal configuration. The rightmost panel shows different ways of specifying the rearrangement task.
In this report, we develop a task definition that can likewise align and accelerate research in Embodied AI. The task is rearrangement: Given a physical environment, bring it into a specified goal state. Figure 1 provides an example. We propose rearrangement as a canonical task for Embodied AI because it naturally unifies instances that are of clear practical interest: setting the table, cleaning the bedroom, loading the dishwasher, picking and placing orders in a fulfillment center, rearranging the furniture, and many more. Rearrangement scenarios can be defined with stationary manipulators that operate locally or with mobile systems that traverse complex scenes such as houses and apartments. Many experimental settings that have been explored in robotics can be viewed as instances of rearrangement, as well as many compelling settings that are beyond the reach of present-day systems (Figure 2).
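To make the task statement concrete, the following is a minimal illustrative sketch of one of the specification modes mentioned in the abstract, a goal given by object poses. It is not the formal definition or the metrics developed in this report. The names `Pose`, `unsatisfied_objects`, and `is_rearranged`, the restriction to 3D position (ignoring orientation and object states such as an open or closed fridge), and the 5 cm tolerance are all assumptions made for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    """Hypothetical, position-only pose in meters (orientation omitted)."""
    x: float
    y: float
    z: float


def position_error(a: Pose, b: Pose) -> float:
    """Euclidean distance between two positions."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


def unsatisfied_objects(current: dict, goal: dict, tol: float = 0.05) -> list:
    """Names of goal objects that are missing from the current state or
    lie farther than `tol` (an assumed threshold) from their goal position."""
    return sorted(
        name
        for name, goal_pose in goal.items()
        if name not in current or position_error(current[name], goal_pose) > tol
    )


def is_rearranged(current: dict, goal: dict, tol: float = 0.05) -> bool:
    """True when every object named in the goal is within tolerance."""
    return not unsatisfied_objects(current, goal, tol)
```

For instance, with a goal of `{"chair": Pose(1.0, 0.0, 0.0), "mug": Pose(0.0, 1.0, 0.8)}` and a current state in which only the mug is displaced by half a meter, `unsatisfied_objects` reports the mug alone, and `is_rearranged` becomes true once the mug is returned to within the tolerance.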