Learning to Learn How to Learn: Self-Adaptive Visual Navigation Using Meta-Learning

Authors

  • Mitchell Wortsman
  • Kiana Ehsani
  • Mohammad Rastegari
  • Ali Farhadi
  • R. Mottaghi
  • 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2019

Abstract

Learning is an inherently continuous phenomenon. When humans learn a new task there is no explicit distinction between training and inference. As we learn a task, we keep learning about it while performing the task. What we learn and how we learn it varies during different stages of learning. Learning how to learn and adapt is a key property that enables us to generalize effortlessly to new settings. This is in contrast with conventional settings in machine learning where a trained model is frozen during inference. In this paper we study the problem of learning to learn at both training and test time in the context of visual navigation. A fundamental challenge in navigation is generalization to unseen scenes. In this paper we propose a self-adaptive visual navigation method (SAVN) which learns to adapt to new environments without any explicit supervision. Our solution is a meta-reinforcement learning approach where an agent learns a self-supervised interaction loss that encourages effective navigation. Our experiments, performed in the AI2-THOR framework, show major improvements in both success rate and SPL for visual navigation in novel scenes. Our code and data are available at: https://github.com/allenai/savn.

1. Introduction

Learning is an inherently continuous phenomenon. We learn further about tasks that we have already learned and can learn to adapt to new environments by interacting in these environments. There is no hard boundary between the training and the testing phases while we are learning and performing tasks: we learn as we perform. This stands in stark contrast with many modern deep learning techniques, where the network is frozen during inference.

What we learn and how we learn it varies during different stages of learning. To learn a new task we often rely on explicit external supervision. After learning a task, we further learn as we adapt to new settings. This adaptation does not necessarily need explicit supervision; we often do this via interaction with the environment.

In this paper, we study the problem of learning to learn and adapt at both training and test time in the context of visual navigation, one of the most crucial skills for any visually intelligent agent. The goal of visual navigation is to move towards certain objects or regions of an environment. A key challenge in navigation is generalizing to a scene that has not been observed during training, as the structure of the scene and appearance of objects are unfamiliar. In this paper we propose a self-adaptive visual navigation (SAVN) model which learns to adapt during inference without any explicit supervision using an interaction loss (Figure 1).

Figure 1. Traditional navigation approaches freeze the model during inference (top row); this may result in difficulties generalizing to unseen environments. In this paper, we propose a metareinforcement learning approach for navigation, where the agent learns to adapt in a self-supervised manner (bottom row). In this example, the agent learns to adapt itself when it collides with an object once and acts correctly afterwards. In contrast, a standard solution (top row) makes multiple mistakes of the same kind when performing the task.

Formally, our solution is a meta-reinforcement learning approach to visual navigation, where an agent learns to adapt through a self-supervised interaction loss. Our approach is inspired by gradient based meta-learning algorithms that learn quickly using a small amount of data [13] . In our approach, however, we learn quickly using a small amount of self-supervised interaction. In visual navigation, adaptation is possible without access to any reward function or positive example. As the agent trains, it learns a self-supervised loss that encourages effective navigation. During training, we encourage the gradients induced by the self-supervised loss to be similar to those we obtain from the supervised navigation loss. The agent is therefore able to adapt during inference when explicit supervision is not available.

In summary, during both training and testing, the agent modifies its network while performing navigation. This approach differs from traditional reinforcement learning where the network is frozen after training, and contrasts with supervised meta-learning as we learn to adapt to new environments during inference without access to rewards.

We perform our experiments using the AI2-THOR [23] framework. The agent aims to navigate to an instance of a given object category (e.g., microwave) using only visual observations. We show that SAVN outperforms the non-adaptive baseline in terms of both success rate (40.8 vs 33.0) and SPL (16.2 vs 14.7) . Moreover, we demonstrate that learning a self-supervised loss provides improvement over hand-crafted self-supervised losses. Additionally, we show that our approach outperforms memory-augmented non-adaptive baselines.

2. Related Work

Deep Models for Navigation. Traditional navigation methods typically perform planning on a given map of the environment or build a map as the exploration proceeds [26, 40, 21, 24, 9, 4] . Recently, learning-based navigation methods (e.g., [50, 15, 27] ) have become popular as they implicitly perform localization, mapping, exploration and semantic recognition end-to-end.

Zhu et al. [50] address target-driven navigation given a picture of the target. A joint mapper and planner has been introduced by [15]. [27] use auxiliary tasks such as loop closure to speed up RL training for navigation. We differ in our approach as we adapt dynamically to a novel scene. [37] propose the use of topological maps for the task of navigation. They explore the test environment for a long period to populate the memory. In our work, we learn to navigate without an exploration phase. [20] propose a self-supervised deep RL model for navigation. However, no semantic information is considered. [31] learn navigation policies based on object detectors and semantic segmentation modules. We do not rely on heavily supervised detectors and learn from a limited number of examples. [46, 44] incorporate semantic knowledge to better generalize to unseen scenarios. Both of these approaches dynamically update their manually defined knowledge graphs. However, our model learns which parameters should be updated during navigation and how they should be updated. Learning-based navigation has been explored in the context of other applications such as autonomous driving (e.g., [7]), map-based city navigation (e.g., [5]) and game play (e.g., [43]). Navigation using language instructions has been explored by various works [3, 6, 17, 47, 29]. Our goal is different since we focus on using meta-learning to more effectively navigate new scenes using only the class label for the target.

Meta-learning. Meta-learning, or learning to learn, has been a topic of continued interest in machine learning research [41, 38]. More recently, various meta-learning techniques have pushed the state of the art in low-shot problems across domains [13, 28, 12].

Finn et al. [13] introduce Model Agnostic Meta-Learning (MAML) which uses SGD updates to adapt quickly to new tasks. This gradient based meta-learning approach may also be interpreted as learning a good parameter initialization such that the network performs well after only a few gradient updates. [25] and [48] augment the MAML algorithm so that it uses supervision in one domain to adapt to another. Our work differs as we do not use supervision or labeled examples to adapt.

Xu et al. [45] use meta-learning to significantly speed up training by encouraging exploration of the state space outside of what the actor's policy dictates. Additionally, [14] use meta-learning to augment the agent's policy with structured noise. At inference time, the agent is able to better adapt from a few episodes due to the variability of these episodes. Our work instead emphasizes self-supervised adaptation while executing a single visual navigation task. Neither of these works consider this domain.

Clavera et al. [8] consider the problem of learning to adapt to unexpected perturbations using meta-learning. Our approach is similar as we also consider the problem of learning to adapt. However, we consider the problem of visual navigation and adapt via a self-supervised loss.

Both [18] and [48] learn an objective function. However, [18] use evolutionary strategies instead of meta-learning. Our approach for learning a loss is inspired by and similar to [48]. However, we adapt in the same domain without explicit supervision while they adapt across domains using a video demonstration.

Self-supervision. Different types of self-supervision have been explored in the literature [1, 19, 11, 42, 49, 36, 34, 32]. Some works aim to maximize the prediction error in the representation of future states [33, 39]. In this work, we learn a self-supervised objective which encourages effective navigation.

3. Adaptive Navigation

In this section, we begin by formally presenting the task and our base model without adaptation. We then explain how to incorporate adaptation and perform training and testing in this setting.

Figure 2. Model overview. Our network optimizes two objective functions, 1) self-supervised interaction loss L φ int and 2) navigation loss Lnav. The inputs to the network at each time t are the egocentric image from the current location and word embedding of the target object class. The network outputs a policy πθ(st). During training, the interaction and navigation-gradients are back-propagated through the network, and the parameters of the self-supervised loss are updated at the end of each episode using navigation-gradients. At test time the parameters of the interaction loss remain fixed while the rest of the network is updated using interaction-gradients. Note that the green color in the figure represents the intermediate and final outputs.

3.1. Task Definition

Given a target object class, e.g. microwave, our goal is to navigate to an instance of an object from this class using only visual observations.

Formally, we consider a set of scenes S = {S 1 , ..., S n } and target object classes O = {o 1 , ..., o m }. A task τ ∈ T consists of a scene S, target object class o ∈ O, and initial position p. We therefore denote each task τ by the tuple τ = (S, o, p). We consider disjoint sets of scenes for the training tasks T train and testing tasks T test . We refer to the trial of a navigation task as an episode.
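To make the task specification concrete, the snippet below sketches the task tuple and the disjoint train/test scene split in Python; the class and function names are illustrative only and are not part of the paper's released code.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple

@dataclass(frozen=True)
class NavTask:
    """A navigation task tau = (scene S, target object class o, initial position p)."""
    scene: str
    target_class: str
    initial_position: Tuple[float, float, float]

def split_tasks(tasks: List[NavTask], test_scenes: Set[str]):
    """Training and testing tasks are drawn from disjoint sets of scenes."""
    train = [t for t in tasks if t.scene not in test_scenes]
    test = [t for t in tasks if t.scene in test_scenes]
    return train, test
```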

The agent is required to navigate using only the egocentric RGB images and the target object class (the target object class is given as a GloVe embedding [35]). At each time t the agent takes an action a from the action set A until the termination action is issued by the agent. We consider an episode to be successful if, within a certain number of steps, the agent issues a termination action when an object from the given target class is sufficiently close and visible. If a termination action is issued at any other time, then the episode concludes and the agent has failed.

3.2. Learning

Before we discuss our self-adaptive approach we begin with an overview of our base model and discuss deep reinforcement learning for navigation in a traditional sense.

We let s_t, the egocentric RGB image, denote the agent's state at time t. Given s_t and the target object class, the network (parameterized by θ) returns a distribution over the actions which we denote π_θ(s_t) and a scalar v_θ(s_t). The distribution π_θ(s_t) is referred to as the agent's policy while v_θ(s_t) is the value of the state. Finally, we let π_θ^(a)(s_t) denote the probability that the agent chooses action a.

We use a traditional supervised actor-critic navigation loss as in [50, 27] which we denote L nav . By minimizing L nav , we maximize a reward function that penalizes the agent for taking a step while incentivizing the agent to reach the target. The loss is a function of the agent's policies, values, actions, and rewards throughout an episode.

The network architecture is illustrated in Figure 2 . We use a ResNet18 [16] pretrained on ImageNet [10] to extract a feature map for a given image. We then obtain a joint feature-map consisting of both image and target information and perform a pointwise convolution. The output is then flattened and given as input to a Long Short-Term Memory network (LSTM). For the remainder of this work we refer to the LSTM hidden state and agent's internal state representation interchangeably. After applying an additional linear layer we obtain the policy and value. In Figure 2 we do not show the ReLU activations we use throughout, or reference the value v θ (s t ).
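A minimal PyTorch sketch of this architecture is shown below. The exact layer sizes and the way the target embedding is fused with the image features are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class NavModel(nn.Module):
    """Illustrative base model: ResNet18 features + target embedding ->
    pointwise conv -> LSTM -> policy and value heads."""

    def __init__(self, num_actions=6, glove_dim=300, hidden_dim=512):
        super().__init__()
        resnet = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)
        # Keep the convolutional trunk only; for a 224x224 input it yields a 512 x 7 x 7 map.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])
        for p in self.backbone.parameters():
            p.requires_grad = False  # frozen, pretrained features
        # Project the GloVe target embedding so it can be tiled over the feature map.
        self.target_proj = nn.Linear(glove_dim, 64)
        # Pointwise (1x1) convolution over the concatenated image/target map.
        self.pointwise = nn.Conv2d(512 + 64, 64, kernel_size=1)
        self.lstm = nn.LSTMCell(64 * 7 * 7, hidden_dim)
        self.policy_head = nn.Linear(hidden_dim, num_actions)
        self.value_head = nn.Linear(hidden_dim, 1)

    def forward(self, image, target_embedding, hidden=None):
        feat = self.backbone(image)                           # B x 512 x 7 x 7
        tgt = torch.relu(self.target_proj(target_embedding))  # B x 64
        tgt = tgt[:, :, None, None].expand(-1, -1, feat.shape[2], feat.shape[3])
        joint = torch.relu(self.pointwise(torch.cat([feat, tgt], dim=1)))
        h, c = self.lstm(joint.flatten(1), hidden)
        return self.policy_head(h), self.value_head(h), (h, c)
```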

3.3. Learning To Learn

In visual navigation there is ample opportunity for the agent to learn and adapt by interacting with the environment. For example, the agent may learn how to handle obstacles it is initially unable to circumvent. We therefore propose a method in which the agent learns how to adapt from interaction. The foundation of our method lies in recent works which present gradient-based algorithms for learning to learn (meta-learning).

Background on Gradient-Based Meta-Learning. We rely on the meta-learning approach detailed by the MAML algorithm [13]. The MAML algorithm optimizes for fast adaptation to new tasks. If the distributions of training and testing tasks are sufficiently similar then a network trained with MAML should quickly adapt to novel test tasks.

MAML assumes that during training we have access to a large set of tasks T_train where each task τ ∈ T_train has a small meta-training dataset D^tr_τ and meta-validation set D^val_τ. For example, in the problem of k-shot image classification, τ is a set of image classes and D^tr_τ contains k examples of each class. The goal is then to correctly assign one of the class labels to each image in D^val_τ. A testing task τ ∈ T_test then consists of unseen classes.

The training objective of MAML is given by

$$\min_{\theta} \sum_{\tau \in \mathcal{T}_{train}} \mathcal{L}\left(\theta - \alpha \nabla_\theta \mathcal{L}\left(\theta, \mathcal{D}^{tr}_\tau\right),\; \mathcal{D}^{val}_\tau\right) \qquad (1)$$

where the loss L is written as a function of a dataset and the network parameters θ. Additionally, α is the step-size hyper-parameter, and ∇ denotes the differential operator (gradient). The idea is to learn parameters θ such that they provide a good initialization for fast adaptation to test tasks. Formally, Equation (1) optimizes for performance on D^val_τ after adapting to the task with a gradient step on D^tr_τ. Instead of using the network parameters θ for inference on D^val_τ, we use the adapted parameters θ − α∇_θ L(θ, D^tr_τ). In practice, multiple SGD updates may be used to compute the adapted parameters.

Training Objective for Navigation. Our goal is for an agent to be continually learning as it interacts with an environment. As in MAML, we use SGD updates for this adaptation. These SGD updates modify the agent's policy network as it interacts with a scene, allowing the agent to adapt to the scene. We propose that these updates should occur with respect to L_int, which we call an interaction loss. Minimizing L_int should assist the agent in completing its navigation task, and it can be learned or hand-crafted. For example, a hand-crafted variation may penalize the agent for visiting the same location twice. In order for the agent to have access to L_int during inference, we use a self-supervised loss. Our objective is then to learn a good initialization θ, such that the agent will learn to effectively navigate in an environment after a few gradient updates using L_int.
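As a minimal illustration of the adapted-parameter computation θ' = θ − α∇_θL and the outer update that differentiates through it, the following PyTorch snippet uses placeholder quadratic losses standing in for the task losses; it is a sketch of the update pattern, not the paper's implementation.

```python
import torch

def adapted_forward(model_params, inner_loss_fn, data_inner, alpha):
    """Compute theta' = theta - alpha * grad_theta L_inner(theta), keeping the
    graph so the outer loss can differentiate through the inner step."""
    loss = inner_loss_fn(model_params, data_inner)
    grads = torch.autograd.grad(loss, model_params, create_graph=True)
    return [p - alpha * g for p, g in zip(model_params, grads)]

# Toy example: a linear model with placeholder quadratic losses.
theta = [torch.randn(3, requires_grad=True)]
inner_loss = lambda params, x: ((x @ params[0]) ** 2).mean()
outer_loss = lambda params, x: ((x @ params[0] - 1.0) ** 2).mean()

x_inner, x_outer = torch.randn(8, 3), torch.randn(8, 3)
theta_prime = adapted_forward(theta, inner_loss, x_inner, alpha=1e-2)

# Outer update differentiates through the inner step (second-order terms).
meta_objective = outer_loss(theta_prime, x_outer)
meta_grads = torch.autograd.grad(meta_objective, theta)
with torch.no_grad():
    theta[0] -= 1e-2 * meta_grads[0]
```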

For clarity, we begin by formally presenting our method in a simplified setting in which we allow for a single SGD update with respect to L_int. For a navigation task τ we let D^int_τ denote the actions, observations, and internal state representations (defined in Section 3.2) for the first k steps of the agent's trajectory. Additionally, let D^nav_τ denote this same information for the remainder of the trajectory. Our training objective is then formally given by

$$\min_{\theta} \sum_{\tau \in \mathcal{T}_{train}} \mathcal{L}_{nav}\left(\theta - \alpha \nabla_\theta \mathcal{L}_{int}\left(\theta, \mathcal{D}^{int}_\tau\right),\; \mathcal{D}^{nav}_\tau\right) \qquad (2)$$

which mirrors the MAML objective from Equation (1). However, we have replaced the small training set D^tr_τ from MAML with an interaction phase. The intuition for our objective is as follows: at first we interact with the environment and then we adapt to it. More specifically, the agent interacts with the scene using the parameters θ. After k steps, an SGD update with respect to the self-supervised loss is used to obtain the adapted parameters θ − α∇_θ L_int(θ, D^int_τ). In domain-adaptive meta-learning, two separate losses are used for adaptation from one domain to another [25, 48]. A similar objective to Equation (2) is employed by [48] for one-shot imitation from observing humans. Our method differs in that we are learning how to adapt in the same domain through self-supervised interaction.

As in [25], a first-order Taylor expansion provides intuition for our training objective. Equation (2) is approximated by

$$\min_{\theta} \sum_{\tau \in \mathcal{T}_{train}} \left[ \mathcal{L}_{nav}\left(\theta, \mathcal{D}^{nav}_\tau\right) - \alpha \left\langle \nabla_\theta \mathcal{L}_{int}\left(\theta, \mathcal{D}^{int}_\tau\right),\; \nabla_\theta \mathcal{L}_{nav}\left(\theta, \mathcal{D}^{nav}_\tau\right) \right\rangle \right] \qquad (3)$$

where ⟨·, ·⟩ denotes an inner product. We are therefore learning to minimize the navigation loss while maximizing the similarity between the gradients we obtain from the self-supervised interaction loss and the supervised navigation loss. If the gradients we obtain from both losses are similar, then we are able to continue "training" during inference when we do not have access to L_nav. However, it may be difficult to choose an L_int which allows for similar gradients. This directly motivates learning the self-supervised interaction loss.
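The inner-product term above can be inspected directly; the snippet below, with placeholder losses in place of L_int and L_nav, measures the alignment between the two gradients that the training objective implicitly encourages.

```python
import torch

def gradient_alignment(loss_a, loss_b, params):
    """Inner product between the gradients of two losses w.r.t. shared params."""
    grads_a = torch.autograd.grad(loss_a, params, retain_graph=True)
    grads_b = torch.autograd.grad(loss_b, params, retain_graph=True)
    return sum((ga * gb).sum() for ga, gb in zip(grads_a, grads_b))

# Placeholder parameters and losses standing in for L_int and L_nav.
params = [torch.randn(4, requires_grad=True)]
l_int = (params[0] ** 2).sum()
l_nav = ((params[0] - 1.0) ** 2).sum()
print(gradient_alignment(l_int, l_nav, params))  # positive when the gradients agree
```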

3.4. Learning To Learn How To Learn

We propose to learn a self-supervised interaction objective that is explicitly tailored to our task. Our goal is for the agent to improve at navigation by minimizing this self-supervised loss in the current environment.

During training, we both learn this objective and learn how to learn using this objective. We are therefore "learning to learn how to learn". As input to this loss we use the agent's previous k internal state representations concatenated with the agent's policy.

Formally, we consider the case where L_int is a neural network parameterized by φ, which we denote L^φ_int. Our training objective then becomes

$$\min_{\theta, \phi} \sum_{\tau \in \mathcal{T}_{train}} \mathcal{L}_{nav}\left(\theta - \alpha \nabla_\theta \mathcal{L}^{\phi}_{int}\left(\theta, \mathcal{D}^{int}_\tau\right),\; \mathcal{D}^{nav}_\tau\right) \qquad (4)$$

and we freeze the parameters φ during inference. There is no explicit objective for the learned loss. Instead, we simply encourage that minimizing this loss allows the agent to navigate effectively. This may occur if the gradients from both losses are similar. In this sense we are training the self-supervised loss to imitate the supervised L_nav loss. As in [48], we use one-dimensional temporal convolutions for the architecture of our learned loss. We use two layers, the first with 10×1 filters and the next with 1×1. As input we concatenate the past k hidden states of the LSTM and the previous k policies. To obtain the scalar objective we take the ℓ2 norm of the output. Though we omit the ℓ2 norm, we illustrate our interaction loss in Figure 2.

Algorithm 1. Training with SAVN.
1: Initialize θ and φ
2: while not converged do
3:   for mini-batch of tasks τ_i ∈ T_train do
4:     θ_i ← θ
5:     t ← 0
6:     while termination action is not issued do
7:       Take action a sampled from π_θi(s_t)
8:       t ← t + 1
9:       if t is divisible by k then
10:        θ_i ← θ_i − α∇_θi L^φ_int(θ_i, D^(t,k)_τ)
11:   θ ← θ − β_1 Σ_i ∇_θ L_nav(θ_i, D_τ)
12:   φ ← φ − β_2 Σ_i ∇_φ L_nav(θ_i, D_τ)
13: return θ, φ

Hand-Crafted Interaction Objectives. We also experiment with two variations of simple hand-crafted interaction losses which can be used as an alternative to the learned loss. The first is a diversity loss L^div_int which encourages the agent to take varied actions. If the agent does happen to reach the same state multiple times, it should definitely not repeat the action it previously took. Accordingly,

$$\mathcal{L}^{div}_{int}\left(\theta, \mathcal{D}^{int}_\tau\right) = \sum_{i < j} g(s_i, s_j)\, \log\left(\pi^{(a_i)}_\theta(s_j)\right) \qquad (5)$$

where s_t is the agent's state at time t, a_t is the action the agent takes at time t, and g calculates the similarity between two states. For simplicity we let g(s_i, s_j) be 1 if the pixel difference between s_i and s_j is below a certain threshold and 0 otherwise.

Additionally, we consider a prediction loss L^pred_int where the agent aims to predict the success of each action. The idea is to avoid taking actions that the network predicts will fail. We say that the agent's action has failed if we detect sufficient similarity in two consecutive states. This may occur when the agent bumps into an object or wall. In addition to producing a policy π_θ over actions, the agent also predicts the success of each action. For state s_t we denote the predicted probability that action a succeeds as q^(a)_θ(s_t).

Algorithm 2. Testing with SAVN.
1: for each task τ ∈ T_test do
2:   θ_i ← θ
3:   t ← 0
4:   while termination action is not issued do
5:     Take action a sampled from π_θi(s_t)
6:     t ← t + 1
7:     if t is divisible by k then
8:       θ_i ← θ_i − α∇_θi L^φ_int(θ_i, D^(t,k)_τ)

For L^pred_int we use a standard binary cross-entropy loss between our success prediction q^(a)_θ and observed success. Using the same g from Equation (5) we write our loss as

$$\mathcal{L}^{pred}_{int}\left(\theta, \mathcal{D}^{int}_\tau\right) = \sum_{t=0}^{k-1} H\left(q^{(a_t)}_\theta(s_t),\; 1 - g(s_t, s_{t+1})\right) \qquad (6)$$

where H(•, •) denotes binary cross-entropy.
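A sketch of this prediction loss in PyTorch is given below. The pixel-difference similarity g, the use of logits for numerical stability, and the tensor shapes are simplifying assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def pixel_similarity(s_i, s_j, threshold=1e-3):
    """Simplified g: 1 if two frames are nearly identical (the action likely failed)."""
    return (torch.mean(torch.abs(s_i - s_j)) < threshold).float()

def prediction_loss(action_success_logits, actions, states):
    """Binary cross-entropy between the predicted success q of the taken action
    and the observed success, where failure is flagged by near-identical
    consecutive frames; `states` holds k+1 frames for k actions."""
    losses = []
    for t in range(len(actions)):
        observed_success = 1.0 - pixel_similarity(states[t], states[t + 1])
        q = action_success_logits[t, actions[t]]
        losses.append(F.binary_cross_entropy_with_logits(q, observed_success))
    return torch.stack(losses).sum()
```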

We acknowledge that in a non-synthetic environment it may be difficult to produce a reliable function g. Therefore we only use g in the hand-crafted variations of the loss.
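For concreteness, the module below sketches one plausible reading of the learned interaction loss described above: two 1-D temporal convolutions over the concatenated hidden states and policies of the last k steps, followed by an ℓ2 norm. The dimensions and layer interpretation are assumptions rather than the exact released architecture.

```python
import torch
import torch.nn as nn

class LearnedInteractionLoss(nn.Module):
    """Sketch of the learned self-supervised loss L^phi_int."""

    def __init__(self, hidden_dim=512, num_actions=6):
        super().__init__()
        in_channels = hidden_dim + num_actions
        self.conv1 = nn.Conv1d(in_channels, 10, kernel_size=1)
        self.conv2 = nn.Conv1d(10, 1, kernel_size=1)

    def forward(self, hidden_states, policies):
        # hidden_states: k x hidden_dim, policies: k x num_actions
        x = torch.cat([hidden_states, policies], dim=1)  # k x (hidden_dim + num_actions)
        x = x.t().unsqueeze(0)                           # 1 x channels x k (time as last dim)
        x = torch.relu(self.conv1(x))
        x = self.conv2(x)
        return torch.norm(x, p=2)                        # scalar objective
```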

3.5. Training And Testing

So far we have implicitly decomposed the agent's trajectory into an interaction and a navigation phase. In practice, we would like the agent to keep adapting until the object is found, during both training and testing. We therefore perform an SGD update with respect to the self-supervised interaction loss every k steps. We compute the interaction loss at time t by using the information from the previous k steps of the agent's trajectory, which we denote D^(t,k)_τ. Note that D^(t,k)_τ is analogous to D^int_τ in Equation (4). In addition, the agent should be able to navigate efficiently. Hence, we compute the navigation loss L_nav using the information from the complete trajectory of the agent, denoted by D_τ.

For the remainder of this work we refer to the gradient with respect to L int as the interaction-gradient and the gradient with respect to L nav as the navigation-gradient. These gradients are illustrated in Figure 2 by red and green arrows, respectively. Note that we do not update the loss parameters φ via the interaction-gradient.

Though traditional works use testing and inference interchangeably, we may regard inference more abstractly as any setting in which the task is performed without supervision. This occurs not only during testing but also within each episode of navigation during training. Algorithms 1 and 2 detail our method for training and testing, respectively. In Algorithm 1 we learn a policy network π_θ and a loss network parameterized by φ with step-size hyper-parameters α, β_1, β_2. Recall that k is a hyper-parameter which prescribes the frequency of the interaction-gradients.

Figure 3. Qualitative examples. We compare our method with the non-adaptive baseline. We illustrate the trajectory of the agent (white corresponds to the beginning of the trajectory and dark blue shows the end). Black arrows represent rotation. We also show the egocentric view of the agent at a few time steps. Our method may learn from its mistakes (e.g., getting stuck behind an object).

If we are instead considering a hand-crafted self-supervised loss then we ignore φ and omit line 12.

Recall that the adapted parameters, which we denote θ_i in Algorithms 1 and 2, are implicitly a function of θ and φ. Therefore, the differentiation in lines 11 and 12 is well defined, though it requires the computation of Hessian-vector products. We never compute more than 4 interaction-gradients due to computational constraints.

At test time we may adapt in an environment with respect to the self-supervised interaction loss, but we no longer have access to L nav . Note that the shared parameter θ is not updated during testing, as detailed in Algorithm 2.
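The test-time behavior of Algorithm 2 can be summarized by the sketch below, where env, policy_fn, and loss_fn_phi are hypothetical stand-ins for the environment, the policy network evaluated with a given parameter list, and the frozen learned loss; only the per-episode copy of the policy parameters is updated, while the shared θ and φ stay fixed.

```python
import torch

def adapt_during_episode(model_params, loss_fn_phi, env, policy_fn,
                         alpha=1e-4, k=6, max_steps=200):
    """Test-time adaptation sketch: every k steps, take one SGD step on the
    (frozen) learned interaction loss, updating only this episode's copy of
    the policy parameters."""
    theta_i = [p.detach().clone().requires_grad_(True) for p in model_params]
    buffer = []  # (hidden state, policy) pairs from recent steps
    state, done, t = env.reset(), False, 0
    while not done and t < max_steps:
        policy, hidden = policy_fn(theta_i, state)
        action = torch.multinomial(policy, 1).item()
        state, done = env.step(action)
        buffer.append((hidden, policy))
        t += 1
        if t % k == 0:
            loss = loss_fn_phi(buffer[-k:])          # phi stays fixed
            grads = torch.autograd.grad(loss, theta_i)
            theta_i = [(p - alpha * g).detach().requires_grad_(True)
                       for p, g in zip(theta_i, grads)]
    return theta_i
```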

4. Experiments

Our goal in this section is to (1) evaluate our self-adaptive navigation model in comparison to non-adaptive baselines, (2) determine if the learned self-supervised objective provides any improvement over hand-crafted self-supervised losses, and (3) gain insight into how and why our method may be improving performance.

4.1. Experiment Setup

We train and evaluate our models using the AI2-THOR [23] environment. AI2-THOR provides indoor 3D synthetic scenes in four room categories: kitchen, living room, bedroom, and bathroom. For each room type, we use 20 scenes for training, 5 for validation and 5 for testing (a total of 120 scenes).

We choose a subset of target object classes as our navigation targets such that (1) they are not hidden in cabinets, fridges, etc., and (2) they are not so large that they take up a big portion of the room and are visible from most parts of it (e.g., beds in bedrooms). We choose the following sets of objects for each type of room: 1) Living room: pillow, laptop, television, garbage can, box, and bowl. 2) Kitchen: toaster, microwave, fridge, coffee maker, garbage can, box, and bowl. 3) Bedroom: plant, lamp, book, and alarm clock. 4) Bathroom: sink, toilet paper, soap bottle, and light switch.

We consider the actions A = {MoveAhead, RotateLeft, RotateRight, LookDown, LookUp, Done}. Horizontal rotation occurs in increments of 45 degrees while looking up and down change the camera tilt angle by 30 degrees. Done corresponds to the termination action discussed in Section 3.1. The agent successfully completes a navigation task if this action is issued when an instance from the target object class is within 1 meter from the agent's camera and within the field of view. This follows from the primary recommendation of [2] . Note that if the agent ever issues the Done action when it has not reached a target object then we consider the task a failure.
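The action space and success criterion can be expressed compactly; the helper below is illustrative only (in practice the check is performed by the environment).

```python
from dataclasses import dataclass

ACTIONS = ["MoveAhead", "RotateLeft", "RotateRight", "LookDown", "LookUp", "Done"]
ROTATION_DEGREES = 45   # horizontal rotation increment
TILT_DEGREES = 30       # camera tilt increment for LookUp / LookDown

@dataclass
class EpisodeOutcome:
    issued_done: bool
    target_visible: bool
    target_distance_m: float

def is_success(outcome: EpisodeOutcome, threshold_m: float = 1.0) -> bool:
    """An episode succeeds only if Done is issued while an instance of the
    target class is visible and within the distance threshold."""
    return (outcome.issued_done
            and outcome.target_visible
            and outcome.target_distance_m <= threshold_m)
```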

4.2. Implementation Details

We train our method and baselines until the success rate saturates on the validation set. We train one model across all scene types with an equal number of episodes per type using 12 asynchronous workers. For L_nav, we use a reward of 5 for finding the object and -0.01 for taking a step. For each scene we randomly sample an object from the scene as a target along with a random initial position. For our interaction-gradient updates we use SGD and for our navigation-gradients we use Adam [22]. For the step-size hyper-parameters (α, β_1, β_2 in Algorithm 1) we use 10^-4 and for k we use 6. Recall that k is the hyper-parameter which prescribes the frequency of interaction-gradients. We experimented with a schedule for k but saw no significant improvement in performance.
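A sketch of the corresponding optimizer setup, with hypothetical parameter lists, is shown below; the inner interaction-gradient step is applied manually so it stays differentiable for the outer update during training.

```python
import torch

ALPHA = 1e-4        # inner (interaction-gradient) step size
BETA = 1e-4         # outer (navigation-gradient) step size
K = 6               # steps between interaction-gradient updates

# Hypothetical parameter lists for the policy network and the learned loss.
policy_params = [torch.nn.Parameter(torch.randn(8))]
loss_params = [torch.nn.Parameter(torch.randn(8))]

# Adam handles the navigation-gradient (outer) updates of theta and phi.
outer_optimizer = torch.optim.Adam(policy_params + loss_params, lr=BETA)
# Inner updates are plain SGD steps taken manually (theta_i <- theta_i - ALPHA * grad),
# since they must remain part of the computation graph for the outer update.
```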

For evaluation we perform inference for 1000 different episodes (250 for each scene type). The scene, initial state of the agent and the target object are randomly chosen. All models are evaluated using the same set. For each training run we select the model that performs best on the validation set in terms of success.

4.3. Evaluation Metrics

We evaluate our method on unseen scenes using both Success Rate and Success weighted by Path Length (SPL). SPL was recently proposed by [2] and captures information about navigation efficiency. Success is defined as

$\frac{1}{N}\sum_{i=1}^{N} S_i$ and SPL is defined as $\frac{1}{N}\sum_{i=1}^{N} S_i \frac{L_i}{\max(P_i, L_i)}$,

where N is the number of episodes, S i is a binary indicator of success in episode i, P i denotes path length and L i is the length of the optimal trajectory to any instance of the target object class in that scene. We evaluate the performance of our model both on all trajectories and trajectories where the optimal path length is at least 5. We denote this by L ≥ 5 (L refers to optimal trajectory length).
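Both metrics are straightforward to compute from per-episode records, as in the sketch below.

```python
def success_rate(successes):
    """Fraction of successful episodes: (1/N) * sum_i S_i."""
    return sum(successes) / len(successes)

def spl(successes, path_lengths, optimal_lengths):
    """Success weighted by Path Length: (1/N) * sum_i S_i * L_i / max(P_i, L_i)."""
    terms = [s * l / max(p, l)
             for s, p, l in zip(successes, path_lengths, optimal_lengths)]
    return sum(terms) / len(terms)

# Example: two episodes, one success with a near-optimal path.
print(success_rate([1, 0]))                    # 0.5
print(spl([1, 0], [12.0, 30.0], [10.0, 8.0]))  # 0.4166...
```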

4.4. Baselines

We compare our models with the following baselines:

Random agent baseline. At each time step the agent randomly samples an action using a uniform distribution.

Nearest neighbor (NN) baseline. At each time step we select the most similar visual observation (in terms of Euclidean distance between ResNet features) among training scenes which contain an object of the class we are searching for. We then take the action that is optimal in the train scene when navigating to the same object class.

No adaptation (A3C) baseline. The architecture for this baseline is the same as ours; however, there is no interaction-gradient and therefore no interaction loss. The training objective for this baseline is then $\min_{\theta} \sum_{\tau \in \mathcal{T}_{train}} \mathcal{L}_{nav}(\theta, \mathcal{D}_\tau)$, which is equivalent to setting α = 0 in Equation (4). This baseline is trained using A3C [30].

Table 1 summarizes the results of our approach and the baselines. We consider three variations of our method, which include SAVN (learned self-supervised loss) and the hand-crafted prediction and diversity loss alternatives.

Table 1. Quantitative results. We compare variations of our method with random, nearest neighbor and non-adaptive baselines. We consider two evaluation metrics, Success Rate and SPL. We provide results for all targets ‘All’ and a subset of targets whose optimal trajectory length is greater than 5. We report the average over 5 training runs with standard deviations shown in sub-scripted parentheses.

4.5. Results

Our learned self-supervised loss outperforms all baselines by a large margin in terms of both success rate and SPL. Most notably, we observe about 8% absolute improvement in success rate and 1.5 in SPL over the non-adaptive (A3C) baseline. With the learned self-supervised objective, the agent not only navigates more effectively but also more efficiently.

The models trained with hand-crafted exploration losses outperform our baselines by large margins in success rate; however, their SPL performance is not as impressive as with the learned loss. We hypothesize that minimizing these hand-crafted exploration losses is not as conducive to efficient navigation.

Failed actions. We now explore a behavior which sets us apart from the non-adaptive baseline. In the beginning of an episode the agent looks around or explores the free space in front of it. However, as the episode progresses, the non-adaptive agent might issue the termination action or get stuck. Our method (SAVN), however, exhibits this pattern less frequently.

To examine this behavior we compute the ratio of actions which fail. Recall that an agent's action has failed if two consecutive frames are sufficiently similar. Typically, this will occur if an agent collides with an object. As shown in Figure 4, our method experiences significantly fewer failed actions than the baseline as the episode progresses.

Qualitative examples. Figure 3 qualitatively compares our method with the non-adaptive (A3C) baseline. In scenario (a) our baseline gets stuck behind the box and tries to move forward multiple times, while our method adapts dynamically and finds the way towards the television. Similarly in scenario (c), the baseline tries to move towards the lamp but, after bumping into the bed 5 times and rotating 9 times, it issues Done at a location distant from the target.

Figure 4. Failed actions. Our approach learns to adapt and not to take unsuccessful actions as the navigation proceeds.

4.6. Ablation Study

In this section we perform ablations of our method to gain further insight into our results.

Adding modules to the non-adaptive baseline. In Table 2 we experiment with the addition of various modules to our non-adaptive baseline. We begin by augmenting the baseline with an additional memory module which performs self-attention on the latest k = 6 hidden states of the LSTM. SAVN outperforms the memory-augmented baseline as well.

Table 2. Ablation results. We compare our approach with the non-adaptive baseline augmented with memory and our hand-crafted loss. We also provide the result when we use ground truth object information (bottom two rows).

Additionally, we add the prediction loss detailed in Section 3.4 to the training objective. This experiment reveals that our result is not simply a consequence of additional losses. By using our training objective with the added hand-crafted prediction loss (referred to as 'Ours - prediction'), we outperform the non-adaptive baseline with prediction (referred to as 'A3C w/ prediction') by 3.3% for all trajectories and 4.8% for trajectories of at least length 5 in terms of success rate. As discussed in Section 4.5, minimizing the hand-crafted objectives during the episode may not be optimal for efficient exploration. This may be why we show a boost in SPL for trajectories of at least length 5 but not overall. We run the same experiment with the diversity loss but find that the baseline model is unable to converge with this additional loss.

Ablation of the number of gradients. To showcase the efficacy of our method we modify the number of interaction-gradient steps that we perform during the adaptation phase during training and testing. As discussed in Section 3.5, we never perform more than 4 interaction-gradients due to computational constraints. As illustrated by Figure 5, there is an increase in success rate when more gradient updates are used, demonstrating the importance of the interaction-gradients.

Perfect object information. Issuing the termination action at the correct location plays an important role in our navigation task. We observe that SAVN still outperforms the baseline even when the termination signal is provided by the environment (referred to as 'GT obj' in Table 2).

Figure 5. Number of Gradients Ablation. Our success rate increases as more interaction-gradients are taken during training/testing.

5. Conclusions

We introduce a self-adaptive visual navigation agent (SAVN) that learns during both training and inference. During training, the model learns a self-supervised interaction loss that can be used when there is no supervision. Our experiments show that this approach outperforms non-adaptive baselines by a large margin. Furthermore, we show that the learned interaction loss performs better than hand-crafted losses. Additionally, we find that SAVN navigates more effectively than a memory-augmented non-adaptive baseline. We conjecture that this idea may be applied in other domains where agents may learn from self-supervised interactions.