Abstract
We have observed significant progress in visual navigation for embodied agents. A common assumption in studying visual navigation is that the environments are static; this is a limiting assumption. Intelligent navigation may involve interacting with the environment beyond just moving forward/backward and turning left/right. Sometimes, the best way to navigate is to push something out of the way. In this paper, we study the problem of interactive navigation, where agents learn to change the environment to navigate more efficiently to their goals. To this end, we introduce the Neural Interaction Engine (NIE) to explicitly predict the changes in the environment caused by the agent's actions. By modeling the changes while planning, we find that agents exhibit significant improvements in their navigational capabilities. More specifically, we consider two downstream tasks in the physics-enabled, visually rich AI2-THOR environment: (1) reaching a target while the path to the target is blocked, and (2) moving an object to a target location by pushing it. For both tasks, agents equipped with an NIE significantly outperform agents without an understanding of the effects of their actions, indicating the benefits of our approach.
1. Introduction
Embodied AI has witnessed remarkable progress over the past few years owing to advances in learning algorithms, benchmarks, and standardized tasks. A popular task that has received a considerable amount of attention is visual navigation [3, 5, 8, 29, 39, 48], where the goal is to navigate towards a specific coordinate or object within an unseen environment. One of the common implicit assumptions for these navigation methods is that the scene is static, and the agent cannot interact with the objects to change their pose.
Consider the scenario in which the path of the agent towards the target location is blocked by an obstacle (e.g., a chair), as shown in Fig. 1 (top). To reach the target, the agent has to move the obstacle out of the way. Therefore, planning to reach the target requires understanding not only the outcome of agent actions but also the dynamics of agent-object interactions. Many factors, such as object size, spatial relationships with other objects in the scene, and the reaction of the object to applied forces, influence the outcome of interacting with the object. Hence, long-horizon planning for navigation conditioned on object dynamics poses unique challenges that are often overlooked in the recent navigation literature.

The first challenge is to learn whether or not an action affects the pose of an object. Navigation actions (e.g., rotate right or move ahead) typically do not affect the position of objects in the world coordinate frame, while interaction actions (e.g., pushing an object) can change the object pose. Objects also move in the ego-centric view of the agent, either due to the agent's own movements or due to interaction with objects; learning how objects move as a result of camera motion versus interaction poses the second challenge. Learning how to interact with objects is a third challenge: for example, the agent should learn that pushing an object against a wall does not change its pose.
In this paper, we propose a novel model for navigation while interacting with objects within a scene that jointly plans a sequence of actions and predicts the changes in the scene conditioned on those actions. More specifically, the model includes a Neural Interaction Engine (NIE) module that predicts the affine transformation of objects from the perspective of the agent conditioned on the actions. The goal is to learn if and how the actions affect the pose of the objects. The NIE module receives gradients not only from the prediction of the pose in the next frame but also from the navigation policy. We evaluate our model on two downstream tasks, ObsNav and ObjPlace. The goal of ObsNav is to reach specific coordinates in a scene while the paths from the initial location of the agent to the target are blocked by objects. The goal of ObjPlace is to push an object on the floor while navigating so that it reaches a target point. These are challenging tasks since the agent requires an accurate understanding of the dynamics of objects and their interactions with other objects in the scene. We perform our experiments in 120 scenes of the physics-enabled AI2-THOR [19] environment. Our experiments show significant improvement over baselines that are not capable of explicitly predicting the effect of interactions, demonstrating the merit of our NIE model.

In summary, we highlight three primary contributions. (1) We propose the Neural Interaction Engine, a model for predicting the state of observed objects conditioned on the agent's actions. (2) We propose new datasets for two navigation-based tasks using a physics-enabled framework, which enables changing the pose of objects and models rich object-object and agent-object interactions. (3) We show that predicting the outcome of actions is a crucial capability for embodied agents by demonstrating significant improvements over baselines that do not possess this capability.
2. Related Work
Action-conditioned learning of rigid body dynamics. The goal of these works is to learn the dynamics of rigid body motion under the effect of applied actions. Byravan and Fox [6] segment a point cloud into salient regions and predict the rigid body motion. Li et al. [21] learn to re-position and re-orient an object with unknown physical properties. Several works [11, 12, 13, 46] have proposed formulations of visual Model Predictive Control, where the central insight is that a predictive model of sensory input is a powerful signal for learning to perform tasks. A number of other strategies for action-conditioned learning have been proposed; these include: learning latent physical properties of objects using visual observation of interactions with those objects [45], learning forward and inverse scene dynamics from object interaction data [26], representing scenes as object-centric graphs and learning to predict changes in object pose after applying a push action [28], learning the dynamics of balls and walls in the game of billiards [14], and modeling the dynamics of robot interactions by jointly estimating forward and inverse models of dynamics [1]. In contrast to all of these approaches, we consider the more complex mobile robot scenario, where we factorize the effect of robot motion and object motion.

Learning dynamics from perception. The dynamics of objects can be inferred from images and videos alone without any interaction. [35] decompose frame-to-frame pixel motion into scene depth, 3D camera rotation and translation, and a set of object regions with their corresponding 3D motion. [22] reason about the underlying physical properties of objects that appear in a sequence of frames and predict the future motion of those objects. [17, 41] jointly train a perception module, an object-based physics engine, and a renderer to generate future predictions. [7] propose the Neural Physics Engine, which outputs the future states of objects and their properties. [36] also infer the physical state of objects from video input and predict their future trajectories. [40] infer physical properties of objects such as mass and density from videos. [24, 25, 46, 47] predict the dynamics of objects and their future trajectories. These approaches focus on simple scenarios (such as balls of uniform mass or a stack of cubes), do not consider agent actions, or assume a static camera.

Visual navigation. The tasks that we consider in this paper involve visual navigation, which has been addressed in various papers in the recent Embodied AI literature. Most works focus on point navigation (PointNav) [3, 9, 29, 38] or object navigation (ObjectNav) [5, 8, 10, 39]. Our task is different since these works only consider static scenes.
Our task is closer to existing tasks that consider navigation among movable obstacles [4, 18, 23, 31, 32, 43, 44] . The difference with [31, 32] is that those works are not learning-based and generalization to unseen scenes is not evaluated. Our task differs from that of [44] in that our agent applies forces to objects with different magnitudes and directions (as opposed to moving objects by colliding with them). Our approach also shows significant improvements over the vanilla RL approaches used in [44] .
3. Model
In this section, we begin by providing an overview of the proposed model. We then introduce our Neural Interaction Engine (NIE) and explain how we integrate the NIE into the policy network. Finally, we describe the learning objective and how we learn the entire model with the NIE module.
3.1. Model Overview
Our model has three main components: a visual encoder, the Neural Interaction Engine, and a policy network, as illustrated in Fig. 2. First, the visual encoder produces a representation $v$ from a visual observation $i$. The visual observation includes an RGB image captured by a mounted camera and a depth image captured by a depth sensor. The visual encoder is a convolutional neural network aiming to extract informative features from the given observation. Second, the NIE, which receives the same input observation $i$, extracts keypoints $p_o$ of an object $o \in O$ and predicts the keypoint locations $p_o^a$ after applying each action $a \in A$. Fig. 3 shows typical examples of $p_{\text{chair}}$ and $p_{\text{chair}}^a$ after applying the Push, Pull, RightPush, and MoveAhead actions. More specifically, the NIE predicts affine transformation matrices $m_o^a \in \mathbb{R}^{4 \times 4}$ corresponding to each object and each action. Then, we derive $p_o^a$ by translating and rotating $p_o$ via $m_o^a$ in 3D space. Applying the affine transformation to the keypoints preserves the rigid body constraint while moving keypoints of the same object. The NIE summarizes both the extracted keypoints and the action-conditioned keypoints into an action-conditioned state feature $r^a$. In this way, the NIE provides possible outcomes resulting from each action to the policy network. Finally, given a goal representation $g$, the policy network utilizes both $v$ and $r^a$ to generate an action $a$ for the agent.
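To make the data flow concrete, below is a minimal PyTorch-style sketch of this three-component pipeline. All module names, layer sizes, and the NIE stub are illustrative assumptions, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class NIEStub(nn.Module):
    """Placeholder for the Neural Interaction Engine; its internals are sketched in Sec. 3.2."""
    def __init__(self, out_dim=128):
        super().__init__()
        self.out_dim = out_dim

    def forward(self, rgbd, v):
        # Returns a dummy action-conditioned state feature r^a of fixed size.
        return v.new_zeros(v.shape[0], self.out_dim)

class AgentModel(nn.Module):
    def __init__(self, num_actions, feat_dim=512, nie_dim=128, goal_dim=32):
        super().__init__()
        # Visual encoder: a small CNN over the 4-channel RGB-D observation i.
        self.visual_encoder = nn.Sequential(
            nn.Conv2d(4, 32, 8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        self.nie = NIEStub(nie_dim)
        # Policy network: maps the global representation f to a distribution over actions.
        self.policy = nn.Sequential(
            nn.Linear(goal_dim + feat_dim + nie_dim, 256), nn.ReLU(),
            nn.Linear(256, num_actions),
        )

    def forward(self, rgbd, goal):
        v = self.visual_encoder(rgbd)              # visual representation v
        r_a = self.nie(rgbd, v)                    # action-conditioned state feature r^a
        f = torch.cat([goal, v, r_a], dim=-1)      # global representation f = [g; v; r^a]
        return torch.distributions.Categorical(logits=self.policy(f))
```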
3.2. Neural Interaction Engine
The NIE operates by first extracting object keypoints $p \in \mathbb{R}^{O \times (N \times 3)}$, where $N$ denotes the number of keypoints, $O$ denotes the observed objects, and each keypoint is a point in three-dimensional space, and then, based on these keypoints, predicting the action-conditioned keypoints $p^a \in \mathbb{R}^{O \times |A| \times (N \times 3)}$ for each action $a$ in the action space $A$. The engine captures a summary of the possible outcomes for each action and object. These summaries are used by the policy network to sample an action $a$.
As shown in Fig. 4, the input to the NIE includes the observation $i$, which includes an RGB frame and a depth map, the visual representation $v$ from the visual encoder, the object category embedding, and the action index embedding. The observation is first passed through a MaskRCNN [16] to obtain object segmentations. To extract the keypoints, we heuristically detect 8 corner points in each object segment as the keypoints for that object (see Sec. A for more details). We used a heuristic approach to find the keypoints, but any other keypoint detection approach (e.g., [20, 33]) could be used instead. Further, using the depth map and the camera parameters of the agent, we back-project the keypoints into 3D space.
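A sketch of one plausible version of this keypoint step is shown below; the corner heuristic here (extreme pixels of the mask along eight directions) and the pinhole intrinsics interface are assumptions standing in for the exact procedure in Sec. A.

```python
import numpy as np

def extract_keypoints(mask, depth, fx, fy, cx, cy, n_points=8):
    """Detect n_points 'corner' pixels of a segment and back-project them to 3D.
    mask: (H, W) boolean object segment; depth: (H, W) metric depth map;
    fx, fy, cx, cy: pinhole camera intrinsics (illustrative interface)."""
    ys, xs = np.nonzero(mask)
    if len(xs) == 0:
        return np.zeros((n_points, 3))
    pixels = np.stack([xs, ys], axis=1).astype(np.float64)
    center = pixels.mean(axis=0)
    keypoints = []
    for angle in np.linspace(0.0, 2.0 * np.pi, n_points, endpoint=False):
        direction = np.array([np.cos(angle), np.sin(angle)])
        idx = ((pixels - center) @ direction).argmax()   # extreme pixel along this direction
        u, v = int(xs[idx]), int(ys[idx])
        z = float(depth[v, u])                           # depth at the keypoint pixel
        keypoints.append([(u - cx) * z / fx,             # pinhole back-projection to camera frame
                          (v - cy) * z / fy,
                          z])
    return np.asarray(keypoints)                         # (n_points, 3) keypoints p_o
```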
To predict the outcome of each action, the NIE predicts affine transformation matrices for each object and action, as shown in the Affine Transformation module in Fig. 4. In practice, we first embed the keypoints $p$ into hidden features and concatenate them with the object category embedding as well as the action index embedding. Then, we use an MLP to predict the affine transformation matrices $m \in \mathbb{R}^{O \times |A| \times 4 \times 4}$ for all objects $O$ and all actions in the action space $A$. We translate and rotate the keypoints $p$ according to $m$ to obtain $p^a$. Since each $m_o^a \in m$ encodes the information associated with the object category and the action $a$, the predicted keypoints not only contain semantic meaning, but also carry action-dependent information.

Figure 4: Neural Interaction Engine. The inputs to the neural interaction engine are action indices, object categories, the visual representation $v$ from the visual encoder, and the visual observation $i$, which includes an RGB image and a depth map. After encoding each input modality, the engine uses an MLP to predict the affine transformation matrices that translate and rotate the keypoints $p$ to $p^a$ corresponding to all objects and all actions. Then, the engine encodes the average of the keypoints into hidden features $s$ as well as $s^a$. Finally, the engine utilizes a self-attention layer to summarize the hidden features into a semantic action-conditioned state representation $r^a$.
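The following is a minimal sketch of such an affine-prediction module for a single (object, action) pair; the embedding sizes and the single-pair interface are simplifications of the batched $O \times |A|$ prediction described above.

```python
import torch
import torch.nn as nn

class AffinePredictor(nn.Module):
    """Predicts a 4x4 affine matrix from keypoints, object category, and action index,
    then applies it to the keypoints (illustrative dimensions)."""
    def __init__(self, num_categories, num_actions, n_points=8, hidden=128):
        super().__init__()
        self.kp_embed = nn.Linear(n_points * 3, hidden)
        self.cat_embed = nn.Embedding(num_categories, hidden)
        self.act_embed = nn.Embedding(num_actions, hidden)
        self.mlp = nn.Sequential(nn.Linear(3 * hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 16))      # 16 = flattened 4x4 matrix m

    def forward(self, keypoints, category, action):
        # keypoints: (B, n_points, 3); category, action: (B,) integer indices
        h = torch.cat([self.kp_embed(keypoints.flatten(1)),
                       self.cat_embed(category),
                       self.act_embed(action)], dim=-1)
        m = self.mlp(h).view(-1, 4, 4)                       # predicted affine transformation
        ones = keypoints.new_ones(keypoints.shape[0], keypoints.shape[1], 1)
        p_h = torch.cat([keypoints, ones], dim=-1)           # homogeneous coordinates
        p_a = torch.einsum('bij,bnj->bni', m, p_h)[..., :3]  # action-conditioned keypoints p^a
        return p_a, m
```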
To encode the keypoints and their corresponding action-conditioned keypoints, we first compute the centers ($c$ and $c^a$) of both $p$ and $p^a$ by averaging the coordinates along each axis (i.e., $c_x = \frac{1}{N}\sum_{n=1}^{N} p_x^n$, $c_y = \frac{1}{N}\sum_{n=1}^{N} p_y^n$, $c_z = \frac{1}{N}\sum_{n=1}^{N} p_z^n$).
Further, we employ a state encoder to encode $c$ and $c^a$ into hidden features ($s$ and $s^a$), as shown in the Encode module in Fig. 4.
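A compact sketch of the center computation and state encoder follows; the tensor shapes and hidden size are our own convention, not necessarily the paper's.

```python
import torch
import torch.nn as nn

# Shared state encoder applied to both c and c^a (hidden size is an assumption).
state_encoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64))

def encode_centers(p, p_a):
    # p: (O, N, 3) keypoints; p_a: (O, |A|, N, 3) action-conditioned keypoints.
    c = p.mean(dim=-2)          # (O, 3) object centers c
    c_a = p_a.mean(dim=-2)      # (O, |A|, 3) action-conditioned centers c^a
    s = state_encoder(c)        # hidden features s
    s_a = state_encoder(c_a)    # hidden features s^a
    return s, s_a
```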
The hidden features $s$ and $s^a$ are then concatenated with the object category embedding to construct a semantic action-conditioned state representation $r$. Furthermore, we perform Self-Attention [34] on $r$ over the object category axis, followed by an Average-Pooling layer, to obtain the action-conditioned state representation $r^a$, as illustrated in the Attention module in Fig. 4. The reason for this step is not only to make the action-conditioned representation more compact, but also to directly associate it with each action.

Integrating NIE output into the Policy Network. We construct a global representation $f$ by concatenating the goal representation $g$ (e.g., the target location encoding for the point navigation task), the visual representation $v$, and the action-conditioned state features $r^a$. The policy network takes $f$ as input and outputs a probability distribution over the action space. The agent samples an action from this distribution to execute in the environment.
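Below is a rough sketch of this attention-and-pooling step. The use of multi-head attention and the specific dimensions are assumptions; the paper only specifies self-attention over the object category axis followed by average pooling.

```python
import torch
import torch.nn as nn

class NIESummary(nn.Module):
    """Summarizes per-object hidden features into the action-conditioned representation r^a."""
    def __init__(self, dim=64, cat_dim=32, num_categories=20, num_heads=4):
        super().__init__()
        self.cat_embed = nn.Embedding(num_categories, cat_dim)
        self.attn = nn.MultiheadAttention(dim + cat_dim, num_heads, batch_first=True)

    def forward(self, s_a, categories):
        # s_a: (O, |A|, dim) action-conditioned hidden features; categories: (O,) indices.
        num_objects, num_actions, _ = s_a.shape
        cat = self.cat_embed(categories).unsqueeze(1).expand(-1, num_actions, -1)
        r = torch.cat([s_a, cat], dim=-1)     # semantic action-conditioned representation r
        r = r.permute(1, 0, 2)                # (|A|, O, dim + cat_dim): attend over the object axis
        r, _ = self.attn(r, r, r)             # self-attention across objects, per action
        r_a = r.mean(dim=1).flatten()         # average-pool over objects and flatten -> r^a
        return r_a

# The policy input is then f = cat([g, v, r_a]), as in the pipeline sketch of Sec. 3.1.
```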
3.3. Learning Objective
To train the model to learn the affine transformation matrices, we use the pose of an object before and after applying an action $a$ in the environment to construct the ground-truth affine transformation matrix. Then, we apply this ground-truth matrix to the keypoints $p$ to obtain the ground-truth action-conditioned keypoints $t^a$. We cast the learning as a regression problem and use an L1 loss to optimize the NIE. The agent can only pick one action to execute at each timestep, so it is not possible to obtain the ground-truth action-conditioned keypoints $t^a$ for all possible actions $a \in A$. Moreover, the agent only observes a few objects among the object categories $O$, so we do not backpropagate gradients to the object categories that are not observed. As a result, during the training stage (as illustrated in Fig. 5), we only compute the loss for the executed action and backpropagate the gradients only through the path corresponding to $a^*$, the action that is actually executed by the agent, and the observed object categories $O^* \subset O$:
$$L_{\text{NIE}} = \sum_{o \in O^*} \left\lVert p_o^{a^*} - t_o^{a^*} \right\rVert_1 \qquad (1)$$
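A hedged sketch of this masked objective is given below; the tensor layout and the normalization over observed objects are assumptions.

```python
import torch
import torch.nn.functional as F

def nie_loss(p_a, t_a_star, a_star, observed):
    """L1 loss restricted to the executed action a* and the observed categories O*.
    p_a: (O, |A|, N, 3) predicted keypoints; t_a_star: (O, N, 3) ground truth for a*;
    a_star: int index of the executed action; observed: (O,) boolean mask for O*."""
    pred = p_a[:, a_star]                                          # keep only the executed action
    per_object = F.l1_loss(pred, t_a_star, reduction='none').mean(dim=(1, 2))
    mask = observed.float()
    return (per_object * mask).sum() / mask.sum().clamp(min=1.0)   # average over observed objects
```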
Further, to learn the policy network, we employ Proximal Policy Optimization (PPO) [30] to perform on-policy reinforcement learning, as illustrated in Fig. 5. The overall learning objective is $L = L_{\text{PPO}} + \alpha L_{\text{NIE}}$, where $\alpha \geq 0$ is a hyperparameter controlling the relative importance of the NIE loss.
4. Experiments
To evaluate the effectiveness of the proposed Neural Interaction Engine, we apply it to two downstream tasks. In the following, we first describe the two downstream tasks. We then describe environment details and the datasets we have collected for training and evaluating the proposed framework. Further, we provide implementation details in Sec. 4.1. In Sec. 4.2 and Sec. 4.3, we introduce our comparative baselines and variations of our model. Finally, we present quantitative and qualitative results in Sec. 4.4.

Downstream tasks. We consider two downstream tasks for our experiments:
• ObsNav -The goal of ObsNav is to move from a random starting location in a scene to specific coordinates while the path to the target point is blocked by obstacles on the floor. This is similar to PointNav [3] with the difference that the agent should move objects out of the way to reach the target.
• ObjPlace - The second downstream task that we consider is ObjPlace. The goal is to move an object on the floor from a random starting location to a specified coordinate in a scene. This task requires successive application of a force to an object while navigating towards the target point.

Fig. 3 shows four typical examples where the agent applies the Push, Pull, RightPush, and LeftPush actions. Finally, we set the height and width of the RGB and depth images to 224; accordingly, the ground-truth object segmentation used to learn the NIE has the same dimensions.

Data collection. We use Kitchens, Living Rooms, Bedrooms, and Bathrooms for our experiments (120 scenes in total). We follow the common practice for AI2-THOR wherein, in each scene category, the first 20 scenes are used for training, the next 5 for validation, and the last 5 for testing. To collect the datasets, we use 20 categories of objects such as Chair, SideTable, and DogBed; please see Sec. B for the list of objects used. These objects are used as obstacles for ObsNav and as objects that should be displaced in ObjPlace, and they are spawned on the floor for the downstream tasks. For each object category we have 5 different variations. We randomly select the first 4 variations to collect the training and validation data and use the 5th variation to collect the test data.
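As a concrete illustration of this split, the snippet below enumerates the scene names; the exact identifiers assume AI2-THOR's standard FloorPlan numbering and are our own convention.

```python
# Scene-category offsets in AI2-THOR's FloorPlan naming (assumed convention).
CATEGORY_OFFSETS = {"Kitchen": 0, "LivingRoom": 200, "Bedroom": 300, "Bathroom": 400}

def scene_split():
    train, val, test = [], [], []
    for offset in CATEGORY_OFFSETS.values():
        names = [f"FloorPlan{offset + i}" for i in range(1, 31)]  # 30 scenes per category
        train += names[:20]     # first 20 scenes for training
        val += names[20:25]     # next 5 for validation
        test += names[25:30]    # last 5 for testing
    return train, val, test     # 80 / 20 / 20 scenes (120 in total)
```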
To generate the dataset for ObsNav, we utilize an undirected graph to compute the path from the agent's starting location to the target location. Then, we randomly spawn an object to block the path. To ensure that the agent cannot reach the target location without moving an object, we repeat this process until there is no path between the agent's starting location (source node) and the target location (end node). The top row of Fig. 6 shows five examples from this dataset.
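This blocking procedure can be sketched as the loop below over an undirected graph of traversable positions; the graph construction and the occupy_fn helper (which returns the graph nodes an object covers once spawned) are hypothetical.

```python
import random
import networkx as nx

def block_all_paths(nodes, edges, source, target, object_pool, occupy_fn):
    """Spawn obstacles until no path remains from source to target (illustrative sketch)."""
    graph = nx.Graph()
    graph.add_nodes_from(nodes)
    graph.add_edges_from(edges)                  # edges connect adjacent traversable positions
    placed = []
    while nx.has_path(graph, source, target):
        path = nx.shortest_path(graph, source, target)
        if len(path) <= 2:                       # nothing left to block between source and target
            break
        node = random.choice(path[1:-1])         # pick an intermediate node on the current path
        obj = random.choice(object_pool)
        blocked = [n for n in occupy_fn(obj, node) if n not in (source, target)]
        graph.remove_nodes_from(blocked)         # positions covered by the spawned object
        placed.append((obj, node))
    return placed                                # list of (object, spawn position) pairs
```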