ManipulaTHOR: A Framework for Visual Object Manipulation
The domain of Embodied AI has recently witnessed substantial progress, particularly in navigating agents within their environments. These early successes have laid the building blocks for the community to tackle tasks that require agents to actively interact with objects in their environment. Object manipulation is an established research domain within the robotics community and poses several challenges, including manipulator motion, grasping, and long-horizon planning, particularly when dealing with oft-overlooked practical setups involving visually rich and complex scenes, manipulation using mobile agents (as opposed to tabletop manipulation), and generalization to unseen environments and objects. We propose a framework for object manipulation built upon the physics-enabled, visually rich AI2-THOR framework and present a new challenge to the Embodied AI community known as ArmPointNav. This task extends the popular point navigation task to object manipulation and offers new challenges including 3D obstacle avoidance, manipulating objects in the presence of occlusion, and multi-object manipulation that necessitates long-term planning. Popular learning paradigms that are successful on PointNav challenges show promise, but leave substantial room for improvement.
Embodied AI, the sub-specialty of artificial intelligence at the intersection of robotics, computer vision, and natural language processing, continues to gain popularity amongst researchers within these communities. This has expedited progress on several fronts: open-source simulators are getting faster, more robust, and more realistic via photorealism and sophisticated physics engines; a variety of tasks, such as navigation and instruction following, are being worked on; new algorithms are inching us towards more powerful and generalizable models; and the recent development of multiple sim-to-real environments, with paired worlds in simulation and reality, is enabling researchers to study the challenges of overcoming the domain gap from virtual to physical spaces. A notable outcome has been the development of near-perfect pure learning-based Point Navigation agents, far outperforming classical approaches.
Most of the focus and progress in Embodied AI has revolved around the task of navigation, including navigating to coordinates, to object instances, and to rooms. Navigating around in an environment is a critical means to an end, not an end in itself. The aspiration of the Embodied AI community remains the development of embodied agents that can perform complex tasks in the real world, tasks that involve actively manipulating objects in one's environment. The early successes and interest in Embodied AI have laid a foundation for the community to tackle the myriad of challenges that lie within the problem of object manipulation.
arXiv:2104.11213v1 [cs.CV] 22 Apr 2021
Object manipulation has long posed daunting challenges to roboticists. Moving manipulators within an environment requires estimating free spaces and avoiding obstacles in the scene, tasks which are rendered even harder by the unwieldy nature of robotic arms. Generalizing to novel environments and objects is another important challenge. Finally, real-world tasks often involve manipulating multiple objects in succession in cluttered scenes, which requires fairly complex visual reasoning and planning. In addition, developing simulators for object manipulation poses a unique set of challenges. In contrast to navigation tasks, which require camera translation and fairly rudimentary collision checking, object manipulation requires fine-grained collision detection between the agent, its arms, and surrounding objects, and the use of advanced physics emulators to compute the resulting displacements of the constituent entities. These computations are expensive and require significant engineering effort to produce effective simulations at reasonably high frame rates.
We extend the AI2-THOR framework by adding arms to its agents, enabling these agents to not only navigate around their environments but also actively manipulate objects within them. The newly introduced arm rig is designed to work with both forward and inverse kinematics, which allows one to control the arm either through joint actuations or by specifying the desired wrist translation. This flexibility allows Embodied AI practitioners to train policies requiring fine-grained actuator controls for all joints if they so desire, or instead use inbuilt kinematics functionalities and focus solely on the desired positioning of the end of the arm and manipulator.
As a first step towards generalizable object manipulation, we present the task of ARMPOINTNAV: moving in the scene towards an object, picking it up, and moving it to the desired location (Figure 1). ARMPOINTNAV builds upon the navigation task of PointNav in that it is an atomic locomotive task and a key component of more complex downstream goals, it specifies source and target locations using relative coordinates as opposed to other means such as language or images, and it utilizes a compass as part of its sensor suite. In contrast to PointNav, however, it offers significant new challenges. Firstly, the task requires the motion of both the agent and the arm within the environment. Secondly, it frequently entails reaching behind occluding obstacles to pick up objects, which requires careful arm manipulation to avoid collisions with occluding objects and surfaces. Thirdly, it may also require the agent to manipulate multiple objects in the scene as part of a successful episode, either to remove objects or to make space to move the target object, which requires long-term planning with multiple entities. Finally, the motion of the arm frequently occludes a significant portion of the view, as one may expect, which is in sharp contrast to PointNav, which only encounters static unobstructed views of the world.
The end-to-end ARMPOINTNAV model provides strong baseline results and shows an ability to generalize not just to new environments but also to novel objects within these environments, a strong foundation towards learning generalizable object manipulation models. This end-to-end model is superior to a disjoint model that learns a separate policy for each skill within an episode.
In summary, we (a) introduce a novel, efficient framework (ManipulaTHOR) for low-level object manipulation, (b) present a new dataset for this task with new challenges for the community, and (c) train an agent that generalizes to manipulating novel objects in unseen environments. Our framework, dataset, and code will be publicly released. We hope that this new framework encourages the Embodied AI community to take on complex but exciting challenges in visual object manipulation.
2. Related Works
Object Manipulation. A long-standing problem in robotics research is object manipulation [12, 4, 5, 31, 39, 6, 9, 21, 10]. Here, we describe some recent works that are more relevant to ours.  address the problem of multi-step manipulation to interact with objects in the presence of clutter and occlusion.  propose a planning approach to grasp objects in a cluttered scene by relying on partial point cloud observation.  learn a 3D scene representation to predict the dynamics of objects during manipulation.  propose a reinforcement learning approach for robotic manipulation where they construct new policies by composing existing skills.  propose a model-based planner for multi-step manipulation. [22, 35] study mobile manipulation by generating sub-goal tasks. A combination of visually complex scenes, generalization to novel objects and scenes, joint navigation and manipulation, and navigation with an object in hand are the key factors that distinguish our work from previous work on object manipulation.

Environments for object manipulation. While several popular Embodied AI frameworks have focused on the navigation task, recently proposed improvements and frameworks such as iGibson, SAPIEN, and TDW have enabled new research into manipulation. SAPIEN is a virtual environment designed for low-level control of a robotic agent with an arm. In contrast, our framework includes a variety of visually rich and reconfigurable scenes, allowing for a better exploration of the perception problem. Meta-World is developed to study multi-task learning in the context of robotic manipulation. The Meta-World framework includes a static table-top robotic arm and a limited set of objects. In contrast, our framework enables studying the problem of joint navigation and manipulation using a variety of objects. RLBench also provides a simulated environment for a table-top robotic arm. RoboTurk is a crowdsourcing platform to obtain human trajectories for robotic manipulation. RoboTurk also considers table-top manipulation scenarios.  provide a large-scale dataset of grasping and manipulation to evaluate the generalization of the models to unstructured visual environments. Unlike our framework, their dataset is non-interactive and includes only pre-recorded manipulation trajectories. iGibson involves object interaction, but it does not support low-level manipulation (the interactions primarily involve pushing objects and rotation around hinges). Recently, an extension of iGibson has enabled object manipulation with contact forces.

Visual navigation. Our problem can be considered an extension of the visual navigation work [43, 17, 24, 26, 34, 40, 7, 33] in the Embodied AI literature. There are a few key differences between our manipulation task and navigation. In manipulation, the shape of the agent changes dynamically due to the extension of the arm. Also, the manipulation of objects is performed in 3D and through clutter, while the navigation works assume 2D motion on a plane in fairly clean scenes. Finally, our proposed task requires the agent to plan its own motion and the motion of its arm simultaneously.
The growing popularity of Embodied AI can be partly attributed to the availability of numerous free and fast 3D simulators such as AI2-THOR, Habitat, and iGibson. Some of these simulators excel at photorealism, some at speed, some at the interactivity they afford, and others at physics simulation. While researchers have many options to choose from when it comes to researching embodied navigation, fewer choices exist to study object manipulation, particularly in visually rich environments with varied objects and scenes. Simulating object manipulation presents unique challenges to simulator builders beyond the ones posed by navigation, including the need for fine-grained physics emulation, the modeling of object and manipulator properties, and obtaining acceptable frame rates.
We present ManipulaTHOR, an extension to the AI2-THOR framework that adds arms to its agents. AI2-THOR is a suitable base framework due to its powerful physics engine, Unity, its variety of realistic indoor scenes, and its large asset library of open-source manipulable objects as well as articulated receptacles such as cabinets, microwaves, boxes, and fridges. While AI2-THOR has been previously used to train agents that interact with objects, this interaction has been invoked at a high level: for instance, a cabinet is opened by choosing a point on the cabinet and invoking the "open" command. ManipulaTHOR allows agents to interact with objects at a lower level via their arm manipulators, and thus opens up a whole new direction for Embodied AI research. The sensors that are available for use are RGB image, depth frame, GPS, agent's location, and arm configuration.

Arm Design. In ManipulaTHOR, each agent has a single arm. The physical design of the arm is deliberately simple: a three-jointed arm with equal limb lengths, attached to the body of the agent. This design is inspired by Kinova's line of robots, with smooth contours and seamless joint transitions, and it is composed entirely of swivel joints, each with a single axis of rotation. The shoulder and wrist support 360 degrees of rotation and the hand grasper has 6 degrees of freedom (6DOF) (see Figure 2). The robot's arm rig has been designed to work with either forward or inverse kinematics (IK), meaning its motion can be driven joint-by-joint or directly from the wrist, respectively.
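To make the distinction between the two control modes concrete, the sketch below contrasts forward and inverse kinematics on a deliberately simplified arm. It assumes a planar chain with two equal-length links, whereas the ManipulaTHOR rig is three-jointed and moves in 3D; the link length `LINK` and both function names are hypothetical and are not part of the framework.

```python
import math

LINK = 0.3  # hypothetical link length in metres (both links are equal)


def forward_kinematics(shoulder, elbow, link=LINK):
    """Joint-by-joint control: map two joint angles (radians) to the wrist position."""
    elbow_x = link * math.cos(shoulder)
    elbow_y = link * math.sin(shoulder)
    wrist_x = elbow_x + link * math.cos(shoulder + elbow)
    wrist_y = elbow_y + link * math.sin(shoulder + elbow)
    return wrist_x, wrist_y


def inverse_kinematics(wrist_x, wrist_y, link=LINK):
    """Wrist-driven control: recover joint angles that place the wrist at a target."""
    dist_sq = wrist_x ** 2 + wrist_y ** 2
    # Law of cosines for two equal links: d^2 = 2 l^2 (1 + cos(elbow)).
    cos_elbow = (dist_sq - 2 * link ** 2) / (2 * link ** 2)
    if abs(cos_elbow) > 1.0 + 1e-9:
        raise ValueError("target is outside the reachable workspace")
    cos_elbow = max(-1.0, min(1.0, cos_elbow))  # guard against rounding error
    elbow = math.acos(cos_elbow)  # one of the two mirror-image solutions
    shoulder = math.atan2(wrist_y, wrist_x) - math.atan2(
        link * math.sin(elbow), link + link * math.cos(elbow)
    )
    return shoulder, elbow


if __name__ == "__main__":
    angles = inverse_kinematics(0.25, 0.4)
    print("joint angles:", angles)
    print("wrist reached:", forward_kinematics(*angles))
```

A policy that outputs joint angles corresponds to calling `forward_kinematics` directly, while a policy that outputs a desired wrist position relies on a solver like `inverse_kinematics` to resolve the intermediate joints.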
[Figure: arm rig, panels (a) to (f), with X/Y/Z axes marked per joint and labeled dimensions 1.06 m, 0.95 m, and 0.6335 m.]
Grasper. The Grasper is defined as a sphere at the end of the arm. Objects that intersect with the sphere can be picked up by the grasper. This abstract design follows the 'abstracted grasping' actuation model of  in lieu of more involved designs like jaw grippers or humanoid hand graspers, enabling researchers to solve problems involved with object manipulation through the environment without having to account for the complexities of grasping. Object grasping is a challenging problem with a rich history in the robotics community and we hope to add this explicit functionality into ManipulaTHOR in future versions.
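The intersection rule described above can be sketched as a simple geometric test. This is only an illustration under stated assumptions: objects are approximated by axis-aligned bounding boxes, and the `Box` class and both function names are hypothetical; the simulator itself relies on its physics engine's colliders rather than this code.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Hypothetical axis-aligned bounding box standing in for an object."""
    name: str
    min_corner: tuple
    max_corner: tuple


def sphere_intersects_box(center, radius, box):
    """Return True if the grasper sphere touches or overlaps the box."""
    dist_sq = 0.0
    for c, lo, hi in zip(center, box.min_corner, box.max_corner):
        nearest = min(max(c, lo), hi)  # closest point of the box on this axis
        dist_sq += (c - nearest) ** 2
    return dist_sq <= radius * radius


def pickup_candidates(center, radius, boxes):
    """Names of all objects that currently intersect the grasper sphere."""
    return [b.name for b in boxes if sphere_intersects_box(center, radius, b)]


if __name__ == "__main__":
    scene = [
        Box("mug", (0.05, 0.95, -0.05), (0.15, 1.05, 0.05)),
        Box("pan", (0.5, 0.9, 0.0), (0.7, 1.0, 0.2)),
    ]
    print(pickup_candidates((0.0, 1.0, 0.0), 0.1, scene))
    print(pickup_candidates((0.0, 1.0, 0.0), 0.6, scene))
```

Enlarging the sphere radius makes more objects eligible for pickup, which is one way to trade grasp precision against task difficulty.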
Arm Interface. The arm comes with the following functionalities: 1) manipulating the location and orientation of the wrist (the joints connecting the base of the arm to the wrist are resolved via IK as the wrist moves), 2) adjusting the height of the arm, 3) obtaining the arm state's metadata including joint positions, 4) picking up the objects colliding with the grasper's sphere, 5) dropping the held object and 6) changing the radius of the grasper's sphere.
Physics Engine. We use NVIDIA's PhysX engine through Unity's engine integration to enable physically realistic object manipulation. This engine allows us to realistically move objects around, move the arm in the space, and cascade forces when the arm hits an object.
Rendering Speed. Accurate collision detection and object displacement estimation are very time-consuming but are important requirements for our simulator. Through extensive engineering efforts, we are able to obtain a training speed of 300 frames per second (fps) on a machine with 8 NVIDIA T4 GPUs and 40 cores. To put this into perspective, POINTNAV using AI2-THOR on the same machine achieves a training speed of roughly 800 fps, but has very rudimentary collision checks and no arm to manipulate. At 300 fps researchers may train for 20M steps per day, a rate fast enough to advance research in this direction, and one which we hope to improve significantly with more optimizations in our code base.
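The throughput figures above can be related with a short back-of-the-envelope calculation. The sketch assumes one environment step per frame and a hypothetical `utilization` factor for time not spent stepping the simulator; neither assumption is stated in the text.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def frames_per_day(fps, utilization=1.0):
    """Frames produced in one day at a sustained rate, scaled by utilization."""
    return fps * SECONDS_PER_DAY * utilization


if __name__ == "__main__":
    ceiling = frames_per_day(300)           # theoretical ceiling at 300 fps
    implied = 20_000_000 / ceiling          # utilization implied by 20M steps/day
    print(f"ceiling at 300 fps: {ceiling:,.0f} frames/day")
    print(f"implied utilization for 20M steps/day: {implied:.0%}")
    print(f"ceiling at 800 fps: {frames_per_day(800):,.0f} frames/day")
```

Under these assumptions a sustained 300 fps gives a ceiling of about 25.9M frames per day, so the quoted 20M steps per day corresponds to roughly 77% of that ceiling.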
As a first step towards generalizable object manipulation, we present the task of ARMPOINTNAV: moving an object in the scene from a source location to a target location. This involves navigating towards the object, moving the arm gripper close to the object, picking it up, navigating towards the target location, moving the arm gripper (with the object in place) close to the target location, and finally releasing the object so it lands carefully. In line with the agent navigation task of POINTNAV, source and target locations of the object are specified via (x, y, z) coordinates in the agent coordinate frame.

Dataset. To study the task of ARMPOINTNAV, we present the Arm POINTNAV Dataset (APND). This consists of 30 kitchen scenes in AI2-THOR that include more than 150 object categories (69 interactable object categories) with a variety of shapes, sizes, and textures. We use 12 pickupable categories as our target objects. As shown in Figure 3, we use 20 scenes in the training set and the remaining 10 are evenly split into Val and Test. We train with 6 object categories and use the remaining 6 to test our model in a Novel-Obj setting.

Metrics. We report the following metrics:
• Success rate without disturbance (SRwD): Fraction of successful episodes in which the arm (or the agent) does not collide with/move other objects in the scene.
• Success rate (SR): Similar to SRwD, but less strict since it does not penalize collisions and movements of objects.
• Pick up success rate (PuSR): Fraction of episodes where the agent successfully picks up the object.
• Episode Length (Len): Episode length for both success and failure episodes.
• Successful episode Length (SuLen): Episode length for successful episodes.
• Pick up successful episode length (PuLen): Episode length for episodes with successful pickups.

Figure 3: Scene and object splits in APND. In order to benchmark the performance on ARMPOINTNAV, in addition to providing a large pool of datapoints for training, we provide a small subset of tasks per data split. We randomly subsampled 60 tasks per object per scene for evaluation purposes.

APND offers significant new challenges. The agent must learn to navigate not only itself but also its arm relative to its body. Also, as the agent navigates in the environment, it should avoid colliding with other objects, which brings new complexities given the addition of the arm and the possibility of carrying an object in its gripper. Further, reaching to pick up objects involves free-space estimation and obstacle avoidance, which becomes challenging when the source or target locations are behind occluders. Moreover, the agent needs to choose the right time to attempt a pickup as well as to end the episode. Finally, these occluders themselves may need to be manipulated in order to complete the task. The agent should overcome these challenges while its view is frequently obstructed by the arm and/or the object being carried. Figure 4 illustrates a few of the challenges involved with picking up the object from its source location. Figure 5 shows the distribution of the distances of the target location of the object from its initial state. For comparison, we show the step size for agent navigation and arm navigation as well. Note that the initial distance of the agent from the object is not taken into account.
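The metrics listed above can be computed from per-episode records along the following lines. This is a minimal sketch, not the evaluation code of the benchmark: the `Episode` fields and the `summarize` function are hypothetical, SRwD is assumed to be normalized by the total number of episodes, and the length metrics are assumed to be means.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Episode:
    """Hypothetical per-episode record."""
    success: bool    # object delivered to the target location
    picked_up: bool  # object was picked up at some point
    disturbed: bool  # arm or agent collided with / moved other objects
    length: int      # number of steps taken


def _mean_or_none(values):
    """Mean of a list, or None when no episode qualifies."""
    return mean(values) if values else None


def summarize(episodes):
    """Compute SRwD, SR, PuSR, Len, SuLen and PuLen over a list of episodes."""
    n = len(episodes)
    if n == 0:
        raise ValueError("no episodes to summarize")
    return {
        "SRwD": sum(e.success and not e.disturbed for e in episodes) / n,
        "SR": sum(e.success for e in episodes) / n,
        "PuSR": sum(e.picked_up for e in episodes) / n,
        "Len": mean([e.length for e in episodes]),
        "SuLen": _mean_or_none([e.length for e in episodes if e.success]),
        "PuLen": _mean_or_none([e.length for e in episodes if e.picked_up]),
    }


if __name__ == "__main__":
    runs = [
        Episode(True, True, False, 40),
        Episode(True, True, True, 60),
        Episode(False, True, False, 200),
        Episode(False, False, False, 200),
    ]
    print(summarize(runs))
```

By construction SRwD can never exceed SR, and SR can never exceed PuSR as long as every successful episode includes a pickup.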