RobustNav: Towards Benchmarking Robustness in Embodied Navigation
Authors
Abstract
As an attempt towards assessing the robustness of embodied navigation agents, we propose ROBUSTNAV, a framework to quantify the performance of embodied navigation agents when exposed to a wide variety of visual – affecting RGB inputs – and dynamics – affecting transition dynamics – corruptions. Most recent efforts in visual navigation have typically focused on generalizing to novel target environments with similar appearance and dynamics characteristics. With ROBUSTNAV, we find that some standard embodied navigation agents significantly underperform (or fail) in the presence of visual or dynamics corruptions. We systematically analyze the kind of idiosyncrasies that emerge in the behavior of such agents when operating under corruptions. Finally, for visual corruptions in ROBUSTNAV, we show that while standard techniques to improve robustness such as data-augmentation and self-supervised adaptation offer some zero-shot resistance and improvements in navigation performance, there is still a long way to go in terms of recovering lost performance relative to clean “non-corrupt” settings, warranting more research in this direction. Our code is available at https://github.com/allenai/robustnav.
1. Introduction
A longstanding goal of the artificial intelligence community has been to develop algorithms for embodied agents that are capable of reasoning about rich perceptual information and thereby accomplishing tasks by navigating in and interacting with their environments. In addition to being able to exhibit these capabilities, it is equally important that such embodied agents are able to do so in a robust and generalizable manner.
A major challenge in Embodied AI is to ensure that agents can generalize to environments with different appearance statistics and motion dynamics than the environment used for training those agents. For instance, an agent that is trained to navigate in "sunny" weather should continue to operate in rain despite the drastic changes in appearance, and an agent that is trained to move on carpet should decidedly navigate when on a hardwood floor despite the discrepancy in friction. While a potential solution may be to calibrate the agent for a specific target environment, it is not a scalable one since there can be enormous varieties of unseen environments and situations. A more robust, efficient and scalable solution is to equip agents with the ability to autonomously adapt to new situations by interaction without having to train for every possible target scenario. Despite the remarkable progress in Embodied AI, especially in embodied navigation [62, 48, 50, 57, 8], most efforts focus on generalizing trained agents to unseen environments, but critically assume similar appearance and dynamics attributes across train and test environments.

* Part of the work done when PC was a research intern at AI2.

Figure 1. ROBUSTNAV. (a) A navigation agent pretrained in clean environments is asked to navigate to targets in unseen environments in the presence of (b) visual and (c) dynamics based corruptions. Visual corruptions (ex. camera crack) affect the agent's egocentric RGB observations while Dynamics corruptions (ex. drift in translation) affect transition dynamics in the unseen environment.
As a first step towards assessing general purpose robustness of embodied agents, we propose ROBUSTNAV, a framework to quantify the performance of embodied navigation agents when exposed to a wide variety of common visual (vis) and dynamics (dyn) corruptions -artifacts that affect the egocentric RGB observations and transition dynamics, respectively. We envision ROBUSTNAV as a testbed for adapting agent behavior across different perception and actuation properties. While assessing robustness to changes (stochastic or otherwise) in environments has been investigated in the robotics community [33, 14, 15, 22] , the simulated nature of ROBUSTNAV enables practitioners to explore robustness against a rich and very diverse set of changes, while inheriting the advantages of working in simulation -speed, safety, low cost and reproducibility.
ROBUSTNAV consists of two widely studied embodied navigation tasks, Point-Goal Navigation (POINTNAV) [3] and Object-Goal Navigation (OBJECTNAV) [5] -the tasks of navigating to a goal-coordinate in a global reference frame or an instance of a specified object, respectively. Following the standard protocol, agents learn using a set of training scenes and are evaluated within a set of held out test scenes, but differently, ROBUSTNAV test scenes are subject to a variety of realistic visual and dynamics corruptions. These corruptions can emulate real world scenarios such as a malfunctioning camera or drift (see Fig.1 ).
As zero shot adaptation to test time corruptions may be out of reach for our current algorithms, we provide agents with a fixed "calibration budget" (number of interactions) within the target world for unsupervised adaptation. This mimics a real-world analog where a shipped robot is allowed to adapt to changes in the environment by executing a reasonable number of unsupervised interactions. Post calibration, agents are evaluated on the two tasks in the corrupted test environments using standard navigation metrics.
Our extensive analysis reveals that both POINTNAV and OBJECTNAV agents experience significant degradation in performance across the range of corruptions, particularly when multiple corruptions are applied together. We show that this degradation reduces in the presence of a clean depth sensor, suggesting the advantages of incorporating multiple sensing modalities to improve robustness. We find that data augmentation and self-supervised adaptation strategies offer some zero-shot resistance and improvement over degraded performance, but are unable to fully recover this gap in performance. Interestingly, we also note that visual corruptions affect embodied tasks differently from static tasks like object recognition - suggesting that visual robustness should be explored within an embodied task. Finally, we analyze several interesting behaviors our agents exhibit in the presence of corruptions - such as an increase in the number of collisions and an inability to terminate episodes successfully.
In summary, our contributions include: (1) We present ROBUSTNAV - a framework for benchmarking and assessing the robustness of embodied navigation agents to visual and dynamics corruptions. (2) Our findings show that present-day navigation agents trained in simulation underperform severely when evaluated in corrupt target environments. (3) We systematically analyze the kinds of mistakes embodied navigation agents make when operating under such corruptions. (4) We find that although standard data-augmentation techniques and self-supervised adaptation strategies offer some improvement, much remains to be done in terms of fully recovering lost performance.
ROBUSTNAV provides a fast framework to develop and test robust embodied policies before they are deployed onto real robots. While ROBUSTNAV currently supports navigation-heavy tasks, the supported corruptions can be easily extended to more tasks as they gain popularity within the Embodied AI community.
2. Related Work
Visual Navigation. Tasks involving navigation based on egocentric visual inputs have witnessed exciting progress in recent years [50, 11, 25, 9, 20, 10]. Some of the widely studied tasks in this space include POINTNAV [3], OBJECTNAV [5] and goal-driven navigation where the target is specified by a goal-image [62]. Approaches to solve POINTNAV and OBJECTNAV can broadly be classified into two categories - (1) learning neural policies end-to-end using RL [56, 60, 48, 50, 57] or (2) decomposing navigation into a mapping (building a semantic map) and path planning stage [7, 8, 26, 44]. Recent research has also focused on assessing the ability of policies trained in simulation to transfer to real-world robots operating in physical spaces [34, 13]. Robustness Benchmarks. Assessing robustness of deep neural models has received quite a bit of attention in recent years [31, 47, 32, 4]. Most relevant and closest to our work is [31], where the authors show that computer vision models are susceptible to several synthetic visual corruptions, as measured in the proposed ImageNet-C benchmark. In [35, 40], the authors study the effect of similar visual corruptions for semantic segmentation and object detection on standard static benchmarks. ROBUSTNAV integrates several visual corruptions from [31] and adds others such as low-lighting and a crack in the camera lens, but within an embodied scenario. Our findings (see Sec. 5) show that visual corruptions affect embodied tasks differently from static tasks like object recognition. In [53], the authors repurpose the ImageNet validation split to be used as a benchmark for assessing robustness to natural distribution shifts (unlike the ones introduced in [31]) and [18] identifies statistical biases in the same. Recently, [30] proposes three extensive benchmarks assessing robustness to image-style, geographical location and camera operation.
Real-world RL Suite. Efforts similar to ROBUSTNAV have been made in [17], where the authors formalize 9 different challenges holding back RL from real-world use - including actuator delays, high-dimensional state and action spaces, latency, and others. In contrast, ROBUSTNAV focuses on challenges in visually rich domains and the complexities associated with visual observations. Recently, Habitat [50] also introduced actuation (from [41]) and visual noise models for navigation tasks. In contrast, ROBUSTNAV is designed to benchmark robustness of models against a variety of visual and dynamics corruptions (7 vis and 4 dyn corruptions for both POINTNAV and OBJECTNAV). Adapting Visuo-Motor Policies. Significant progress has been made in the problem of adapting policies trained with RL from a source to a target environment. Unlike ROBUSTNAV, major assumptions involved in such transfer settings are either access to task-supervision in the target environment [24] or access to paired data from the source and target environments [23, 54]. Domain Randomization (DR) [2, 46, 38, 42] is another common approach to train policies robust to various environmental factors. Notably, [38] perturbs features early in the visual encoders of the policy network so as to mimic DR and [42] selects optimal DR parameters during training based on sparse data obtained from the real world. In the absence of task supervision, another common approach is to optimize self-supervised objectives in the target [57, 49], which has been used to adapt policies to visual disparities (see Sec. 5) in new environments [27]. To adapt to changes in transition dynamics, a common approach is to train on a broad family of dynamics models and perform system-identification (ex. with domain classifiers [19]) in the target environment [59, 61]. [34, 13] study the extent to which embodied navigation agents transfer from simulated environments to real-world physical spaces. Among these, we investigate two of the most popular approaches - self-supervised adaptation [27] and aggressive data augmentation - and measure if they can help build resistance to vis corruptions.
3. Robustnav
We present ROBUSTNAV, a benchmark to assess the robustness of embodied agents to common visual (vis) and dynamics (dyn) corruptions. ROBUSTNAV is built on top of ROBOTHOR [12] . In this work, we study the effects corruptions have on two kinds of embodied navigation agents -namely, POINTNAV (navigate to a specified goal coordinate) and OBJECTNAV (navigate to an instance of an object category). While we restrict our experiments to navigation, in practice, our vis and dyn corruptions can also be extended to other embodied tasks that share the same modalities, for instance tasks involving interacting with objects.
In ROBUSTNAV, agents are trained within the training scenes and evaluated on "corrupt" unseen target scenes. Corruptions in target scenes are drawn from a set of predefined vis and dyn corruptions. As is the case with any form of modeling of corruptions (or noise) in simulation [33, 12], there will always be an approximation error when the vis and dyn corruptions are compared to their real world counterparts. Our aim is to ensure that the ROBUSTNAV benchmark acts as a stepping stone towards the larger goal of obtaining robust agents, ready to be deployed in the real world. To adapt to a corrupt target scene, we provide agents with a "calibration budget" - an upper bound on the number of interactions an agent is allowed to have with the target environment without any external task supervision. This is done to mimic a real-world analog where a shipped robot is allowed to adapt to changes in the environment by executing a reasonable number of unsupervised interactions. We adopt a modest definition of the calibration budget based on the number of steps it takes an agent to reasonably recover degraded performance in the most severely corrupted environments when finetuned under complete supervision (see Table 3) - set to ∼166k steps for all our experiments. We attempt to understand if self-supervised adaptation approaches [27] improve performance when allowed to adapt under this calibration budget (see Sec. 5, resisting corruptions). We now describe in detail the vis and dyn corruptions present in ROBUSTNAV. Visual Corruptions. Visual corruptions are artifacts that degrade the navigation agent's egocentric RGB observation (see Fig. 2). We provide seven visual corruptions within ROBUSTNAV, four of which (including Defocus Blur, Spatter and Speckle Noise) are drawn from the set of corruptions and perturbations proposed in [31]; Speckle Noise, for instance, degrades the signal captured by the camera (modeled as additive noise with the noise being proportional to the original pixel intensity). Each of these corruptions can manifest at five levels of severity indicating increase in the extent of visual degradation (1 → 5).
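As an illustration of how such a corruption can be applied to the agent's egocentric frame, the sketch below adds intensity-proportional (speckle-like) noise at a chosen severity. The per-severity noise scales here are placeholders and do not correspond to the exact ImageNet-C parameters.

```python
import numpy as np

# Hypothetical per-severity noise scales; the actual ImageNet-C parameters differ.
SPECKLE_SCALES = {1: 0.06, 2: 0.10, 3: 0.12, 4: 0.16, 5: 0.20}

def speckle_noise(rgb: np.ndarray, severity: int = 1) -> np.ndarray:
    """Apply speckle-like noise to an HxWx3 uint8 egocentric RGB frame.

    The noise is additive and proportional to the original pixel intensity,
    with larger severities producing stronger visual degradation.
    """
    scale = SPECKLE_SCALES[severity]
    img = rgb.astype(np.float32) / 255.0
    noisy = img + img * np.random.normal(size=img.shape, scale=scale)
    return (np.clip(noisy, 0.0, 1.0) * 255).astype(np.uint8)

# Example: corrupt a random 224x224 observation at severity 3.
obs = np.random.randint(0, 256, (224, 224, 3), dtype=np.uint8)
corrupted = speckle_noise(obs, severity=3)
```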
In addition to these, we also add low-lighting (low-lighting conditions in the target environment, with associated severity levels 1 → 5), lower-FOV (agents operating with a lower camera field of view compared to the one used during training, 79° → 39.5°) and camera-crack (a randomized crack in the camera lens). For camera-crack, we use fixed random seeds for the 15 validation scenes which dictate the location and kind of crack on the camera lens. Dynamics Corruptions. Dynamics corruptions affect the transition dynamics of the agents in the target environment (see Fig. 3). We consider three classes of dynamics corruptions - Motion Bias, Motion Drift and Motor Failure. Our dyn corruptions are motivated by and in line with the well-known systematic and/or stochastic drifts (due to error accumulation) and biases in robot motion [37, 6, 21, 43].
A common dynamics corruption observed in the real world is friction. Unfortunately ROBOTHOR does not yet natively support multiple friction zones within a scene, as may be commonly observed in a real physical environment (for instance the kitchen floor in a house may have smooth tiles while the bedroom may have rough hardwood floors). In lieu of this, we present the Motion Bias corruption. In the absence of this corruption, the move ahead action moves an agent forward by 0.25m; Motion Bias perturbs the agent's movements with either a constant (C) or stochastic (S) bias, intended to emulate scene-level and intra-scene friction effects, respectively. Motion Drift models a setting where an agent's translation movements in the environment include a slight bias towards turning left or right. Specifically, the move ahead action, instead of moving an agent forward 0.25m in the direction of its heading (intended behavior), drifts towards the left or right direction (chosen stochastically for an episode) by α = 10° and takes it to a location which deviates in a direction perpendicular to the original heading by a maximum of ∼0.043m. Motor Failure is the setting where either the rotate left or the rotate right action malfunctions throughout an evaluation episode.
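To make the Motion Drift geometry concrete, the sketch below computes the drifted position for a single move ahead step under an assumed x/z coordinate convention (heading measured from +z); for α = 10° the perpendicular deviation is 0.25 · sin(10°) ≈ 0.043m, matching the value above.

```python
import math

def drifted_step(x, z, heading_deg, step=0.25, drift_deg=10.0, direction=1):
    """Compute the agent's next (x, z) position under Motion Drift.

    Instead of moving `step` meters along its heading, the agent's translation
    is rotated by `drift_deg` degrees to the left (+1) or right (-1), chosen
    once per episode. The coordinate convention is an illustrative assumption.
    """
    effective = math.radians(heading_deg + direction * drift_deg)
    new_x = x + step * math.sin(effective)
    new_z = z + step * math.cos(effective)
    return new_x, new_z

# Perpendicular deviation per move_ahead: 0.25 * sin(10 deg) ~= 0.043 m.
print(0.25 * math.sin(math.radians(10.0)))
```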
With the exception of Motion Bias (S) - the stochastic version - the agent also operates under standard actuation noise models as calibrated from a LoCoBot in [13]. Recently, PyRobot [41] has also introduced LoCoBot calibrated noise models that demonstrate strafing and drifting. While we primarily rely on the noise models calibrated in [12], for completeness, we also include results with the PyRobot noise models.

Tasks. ROBUSTNAV consists of two major embodied navigation tasks - namely, POINTNAV and OBJECTNAV. In POINTNAV, an agent is initialized at a random spawn location and orientation in an environment and is asked to navigate to target coordinates specified relative to the agent's position. The agent must navigate based only on sensory inputs from an RGB (or RGB-D) and a GPS + Compass sensor. An episode is declared successful if the agent stops within 0.2m of the goal location (by intentionally invoking an end action). In OBJECTNAV, an agent is instead asked to navigate to an instance of a specified object category (e.g., Television, 1 out of a total of 12 object categories) given only ego-centric sensory inputs - RGB or RGB-D. An episode is declared successful if the agent stops within 1.0m of the target object (by invoking an end action) and has the target object in its egocentric view. Due to the lack of perfect localization (no GPS + Compass sensor) and the implicit need to ground the specified object within its view, OBJECTNAV may be considered a harder task compared to POINTNAV - also evident in lower OBJECTNAV performance (Table 2).

Metrics. We report performance in terms of the following well-established navigation metrics reported in past works - Success Rate (SR) and Success Weighted by Path Length (SPL) [3]. SR indicates the fraction of successful episodes. SPL provides a score for the agent's path based on how close its length is to the shortest path from the spawn location to the target. If I_success denotes whether an episode was successful, l the shortest-path length and p the length of the path actually taken, the per-episode SPL is I_success · l / max(p, l), averaged over all evaluation episodes (a short computation sketch is included at the end of this section).

Scenes. ROBUSTNAV is built on top of the ROBOTHOR scenes [13]. ROBOTHOR consists of 60 training and 15 validation environments based on indoor apartment scenes drawn from different layouts. To assess robustness in the presence of corruptions, we evaluate on 1100 (and 1095) episodes of varying difficulties (easy, medium and hard) 2 for POINTNAV (and OBJECTNAV) across the 15 val scenes.

Benchmarking. Present day embodied navigation agents are typically trained without any corruptions. However, we anticipate that researchers may incorporate corruptions as augmentations at training time to improve the robustness of their algorithms in order to make progress on our ROBUSTNAV framework. For the purposes of fair benchmarking, we recommend that future comparisons using ROBUSTNAV do not draw from the set of corruptions reserved for the target scenes - ensuring the corruptions encountered in the target scenes are indeed "unseen".
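As referenced in the Metrics paragraph above, here is a minimal sketch of computing SR and SPL from per-episode records; the record field names are hypothetical.

```python
from typing import Dict, List

def compute_metrics(episodes: List[Dict]) -> Dict[str, float]:
    """Compute Success Rate (SR) and SPL over evaluation episodes.

    Each episode record is assumed to contain:
      success      - 1 if the agent stopped within the success criterion, else 0
      path_length  - length (m) of the path the agent actually took
      shortest     - geodesic shortest-path length (m) from spawn to target
    """
    n = len(episodes)
    sr = sum(ep["success"] for ep in episodes) / n
    spl = sum(
        ep["success"] * ep["shortest"] / max(ep["path_length"], ep["shortest"])
        for ep in episodes
    ) / n
    return {"SR": sr, "SPL": spl}
```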
4. Experimental Setup
Agent. Our POINTNAV agents have 4 actions available to them - namely, move ahead (0.25m), rotate left (30°), rotate right (30°) and end. The action end indicates that the agent believes that it has reached the goal, thereby terminating the episode. During evaluation, we allow an agent to execute a maximum of 300 steps - if an agent does not call end within 300 steps, we forcefully terminate the episode. For OBJECTNAV, in addition to the aforementioned actions, the agent also has the ability to look up or look down - indicating a change in the agent's view 30° above or below the forward camera horizon. The agent receives 224 × 224 sized ego-centric observations (RGB or RGB-D). All agents are trained under LoCoBot calibrated actuation noise models from [13] - N(0.25m, 0.005m) for translation and N(30°, 0.5°) for rotation. Our agent architectures (akin to [56]) are composed of a CNN head to process input observations followed by a recurrent (GRU) policy network (more details in Sec. A.3 of the appendix).

Training. We train our agents using DD-PPO [56] - a decentralized, distributed and synchronous version of the Proximal Policy Optimization (PPO) [52] algorithm. If R = 10.0 denotes the terminal reward obtained at the end of a successful episode (with I_success being an indicator variable denoting whether an episode was successful), ΔGeo_t denotes the change in geodesic distance to the target at timestep t from t − 1 and λ = −0.01 denotes a slack penalty to encourage efficiency, then the reward received by the agent at time-step t can be expressed as,
r_t = R · I_success (success reward) − ΔGeo_t (reward shaping) + λ (slack reward)
We train our agents using the AllenAct [55] framework.
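A minimal sketch of the per-step reward described above; the success indicator is assumed to be nonzero only at the terminal step of a successful episode.

```python
def step_reward(success: bool, geo_dist_t: float, geo_dist_prev: float,
                R: float = 10.0, slack: float = -0.01) -> float:
    """Per-step reward r_t = R * I_success - delta_geo_t + slack (sketch).

    `success` is True only at the final step of a successful episode, so the
    terminal reward R is received once; the shaping term rewards decreases in
    geodesic distance to the target and `slack` penalizes each extra step.
    """
    delta_geo = geo_dist_t - geo_dist_prev  # change in geodesic distance to goal
    return R * float(success) - delta_geo + slack
```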
5. Results And Findings
In this section, we show that the performance of POINTNAV and OBJECTNAV agents degrades in the presence of corruptions (see Table 2). We first highlight how vis corruptions affect static vision and embodied navigation tasks differently (see Table 1). Following this, we analyze behaviors that emerge in these agents when operating in the presence of vis, dyn, and vis+dyn corruptions. Finally, we investigate whether standard data-augmentation and self-supervised adaptation [27] techniques help recover the degraded performance (see Table 3).
5.1. Degradation In Performance
We now present our findings regarding degradation in performance relative to agents being evaluated in clean (no corruption) target environments (row 1 in Table 2). Visual corruptions affect static and embodied tasks differently. In Table 1, we report object recognition performance for models trained on the ImageNet [16] train split and evaluated on the corrupt validation splits. In Table 2, we report performance degradation of POINTNAV and OBJECTNAV agents under corruptions (row 1, clean & rows 2-8 corrupt). It is important to note that the nature of the tasks (one-shot prediction vs sequential decision making) is different enough that the difficulty of corruptions for classification may not indicate the difficulty of corruptions for navigation. We verify this hypothesis by comparing results in Tables 1 and 2. For POINTNAV, while some corruptions (Table 2) lead to a worst-case absolute drop of < 10% in SPL (and < 10% in SR), corruptions like Spatter and Motor Failure (rows 8, 13) are more extreme and significantly affect task performance (absolute drops of > 57% in SPL, > 65% in SR). For OBJECTNAV, however, the drop in performance is more gradual across corruptions (partly because it is a harder task and even clean performance is fairly low).
A "clean" depth sensor helps resisting degradation. We compare the RGB and RGB-D variants of the trained POINT-NAV and OBJECTNAV agents (RGB corrupt, Depth clean) in Table 2 ) which is likely the major contributing factor for increased resistance to corruptions. Sensors of different modalities are likely to degrade in different scenarios -e.g., a depth sensor may continue to perceive details in low lighting settings. The obtained results suggest that adding multiple sensors, while expensive can help train robust models. Additional sensors can also be helpful for unsupervised adaptation during the calibration phase. For instance, in the presence of a "clean" depth sensor, one can consider comparing depth based egomotion estimates with expected odometry readings in the target environment to infer changes in dynamics. In Sec. A.5 of appendix, we further investigate the degree to which more sophisticated POINTNAV agents, composed of map-based architectures, are susceptible to vis corruptions. Specifically, we evaluate the performance of the winning POINTNAV entry of Habitat-Challenge (HC) 2020 [1] -Occupancy Anticipation (OccAnt) [45] on Gibson [58] val scenes under noise-free, Habitat Challenge conditions and vis corruptions. We find that introducing corruptions under noise-free conditions degrades navigation performance significantly only for RGB agents. Under HC conditions, RGB-D agents suffer drop in performance as RGB noise is replaced with progressively severe vis corruptions. Presence of vis+dyn corruptions further degrades performance. Rows 14-19 in Table 2 indicate the extent of performance degradation when vis+dyn corruptions are present. With the exception of a few cases, as expected, the drop in performance is slightly more pronounced compared to the presence of just vis or dyn corruptions. The relative drop in performance from vis → vis+dyn is more pronounced for OBJECTNAV as opposed to POINTNAV. Navigation performance for RGB agents degrades consistently with escalating episode difficulty. Recall that we evaluate navigation performance over epsisodes of varying difficulty levels (see Sec. 3). We break down the performance of POINTNAV & OBJECTNAV agents by episode difficulty levels (in Sec. A.5 of appendix). Under "clean" settings, we find that POINTNAV (RGB and RGB-D) have comparable performance across all difficulty levels. Under corruptions, we note that unlike the RGB-D counterparts, performance of POINTNAV-RGB agents consistently deteriorates as the episodes become harder. OBJECTNAV (both RGB & RGB-D) agents show a similar trend of decrease in navigation performance with increasing episode difficulty.
5.2. Behavior Of Visual Navigation Agents
We now study the idiosyncrasies (see Fig 4) exhibited by these agents (POINTNAV-RGB and OBJECTNAV-RGB) which lead to their degraded performance. Agents tend to collide more often. Fig 4 (first column) reports the number of collisions (observed through the number of failed actions in ROBOTHOR) over the course of an episode; collisions increase as corruptions are introduced and become progressively severe. Agents tend to be farther from the target. Fig 4 (second column) shows the minimum distance from the target over the course of an episode. While we note that as corruptions become progressively severe, agents tend to terminate farther away from the target (see Sec. A.4 of the appendix), Fig 4 (second column) indicates that the overall proximity of the agent to the goal over an episode decreases - minimum distance to target increases as we go from Clean → vis or dyn; vis or dyn → vis+dyn. While this may be intuitive in the presence of a dyn corruption, it is interesting to note that this trend is also consistent for vis corruptions (Clean → D.B. or S.N.). Corruptions hurt the OBJECTNAV stopping mechanism.
Recall that for both POINTNAV and OBJECTNAV, success depends on the notion of "intentionality" [5] - the agent calls an end action when it believes it has reached the goal. In Fig 4 (last two columns), we assess how corruptions affect this stopping mechanism. Specifically, we look at two quantitative measures - (1) Stop-Failure (Positive), the proportion of times the agent invokes an end action when the goal is not in range, out of the number of times the agent invokes an end action; and (2) Stop-Failure (Negative), the proportion of times the agent does not invoke an end action when the goal is in range, out of the number of times the goal is in range. 4 We observe that prematurely calling an end action is a significant issue only for OBJECTNAV (Fig 4, third column). Similarly, the inability of an agent to invoke an end action is also more pronounced for OBJECTNAV as opposed to POINTNAV (Fig 4, fourth column). To investigate the extent to which this impacts the agent's performance, we compare the agent's Success Rate (SR) with a setting where the agent is equipped with an oracle stopping mechanism (call end as soon as the goal is in range). We find that this makes a significant difference only for OBJECTNAV - absolute +7.12% for Clean, +7.76% for M.D. and +13.88% for D.B. + M.D. We hypothesize that equipping agents with robust stopping mechanisms can significantly improve performance on ROBUSTNAV. For instance, equipping the agent with a progress monitor module [39] (estimating progress made towards the goal in terms of distance) robust to vis corruptions can potentially help decide when to explicitly invoke an end action in the target environment.
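For concreteness, here is one way the two stop-failure measures could be computed from per-episode logs; the record fields are hypothetical and the denominators follow the definitions above.

```python
from typing import Dict, List, Tuple

def stop_failure_metrics(episodes: List[Dict]) -> Tuple[float, float]:
    """Compute Stop-Failure (Positive) and (Negative) over evaluation episodes.

    Each episode record is assumed to provide:
      called_end      - whether the agent invoked `end`
      end_in_range    - whether the goal was in range when `end` was invoked
      steps_in_range  - number of steps at which the goal was in range
      missed_in_range - in-range steps at which the agent did not invoke `end`
    """
    # Positive: fraction of end calls made while the goal was NOT in range.
    end_eps = [ep for ep in episodes if ep["called_end"]]
    stop_fail_pos = (
        sum(not ep["end_in_range"] for ep in end_eps) / len(end_eps)
        if end_eps else 0.0
    )
    # Negative: per-episode fraction of in-range steps with no end call, averaged.
    range_eps = [ep for ep in episodes if ep["steps_in_range"] > 0]
    stop_fail_neg = (
        sum(ep["missed_in_range"] / ep["steps_in_range"] for ep in range_eps) / len(range_eps)
        if range_eps else 0.0
    )
    return stop_fail_pos, stop_fail_neg
```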
5.3. Resisting Corruptions
To assist near-term progress, we investigate if some standard approaches towards training robust models or adapting to visual disparities can help resist vis corruptions under a calibration budget (Sec. 3) - set to ∼166k steps. 5

Extent of attainable improvement by finetuning under task supervision. As an anecdotal upper bound on attainable improvements under the calibration budget, we also report the extent to which degraded performance can be recovered when fine-tuned under complete task supervision. We report these results for vis corruptions in Table 3 (row 7). We note that unlike Lower-FOV, the agent is able to almost recover performance for Defocus Blur, Camera-Crack and Spatter (Table 3, rows 1, 7).

Do data-augmentation strategies help? In Table 3, we study if data-augmentation strategies improve zero-shot resistance to vis corruptions (rows 1, 6). We compare POINTNAV RGB agents trained with Random-Crop, Random-Shift and Color-Jitter (row 6) with the vanilla versions (row 1) and find that while data augmentation (row 6) offers some improvements (Spatter being an exception) over degraded performance (row 1) - absolute improvements of (22.81% SPL, 29.21% SR) for Lower-FOV, (7.77% SPL, 5.37% SR) for Defocus Blur and (7.74% SPL, 6.37% SR) for Camera-Crack - obtained performance is still significantly below Clean settings (row 1, Clean col). Improvements are more pronounced for Lower-FOV compared to others (likely due to Random-Shift and Random-Crop). We note that data-augmentation provides improvements only for a subset of vis corruptions and, when it does, the obtained improvements are still not sufficient to recover lost performance.

Do self-supervised adaptation approaches help? In the absence of reward supervision in the target environment, Hansen et al. [27] proposed Policy Adaptation during Deployment (PAD) - source pretraining with an auxiliary self-supervised objective and optimizing only the self-supervised objective when deployed in the target environment. We investigate the degree to which PAD helps adapt to the target environments in ROBUSTNAV. The adopted self-supervised tasks are (1) Action-Prediction (AP) - given two successive observations in a trajectory, predict the intermediate action - and (2) Rotation-Prediction (RP) - rotate the input observation by 0°, 90°, 180°, or 270° before feeding it to the agent and task an additional auxiliary head with predicting the rotation. We report numbers with AP (rows 2, 3) and RP (rows 4, 5) in Table 3. For AP, we find that (1) pre-training (row 2 vs row 1) results in little or no improvements over degraded performance (maximum absolute improvements of 7.96% SPL, 7.46% SR for Defocus Blur) and (2) further adaptation (row 3 vs rows 2, 1) under the calibration budget consistently degrades performance. For RP, we observe that (1) with the exception of Clean and Lower-FOV, pre-training (row 4 vs row 1) results in worse performance and (2) while self-supervised adaptation under corruptions improves performance over pre-training (row 5 vs row 4), it is still significantly below Clean settings (row 1, Clean col) - minimum absolute gap of 20.39% SPL, 19.66% SR between Defocus Blur (row 5) and Clean (row 1). While improvements over degraded performance might highlight the utility of PAD (with AP / RP) as a potential unsupervised adaptation approach, there is still a long way to go in terms of closing the performance gap between clean and corrupt settings.
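To make the RP objective concrete, below is a minimal PyTorch sketch of the rotation-prediction auxiliary loss; the `encoder` and `rot_head` modules are placeholders (the agent's visual encoder and a small linear classifier) and do not correspond to the exact implementation used here or in PAD [27].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def rotation_prediction_loss(obs: torch.Tensor, encoder: nn.Module,
                             rot_head: nn.Module) -> torch.Tensor:
    """Rotation-Prediction (RP) auxiliary objective (sketch).

    Each observation in the batch (B x C x H x W) is rotated by 0/90/180/270
    degrees chosen uniformly at random; an auxiliary head predicts the rotation
    bin from the encoded (flattened) visual features.
    """
    k = torch.randint(0, 4, (obs.shape[0],), device=obs.device)  # rotation bins
    rotated = torch.stack(
        [torch.rot90(o, int(r), dims=(-2, -1)) for o, r in zip(obs, k)]
    )
    feats = encoder(rotated).flatten(start_dim=1)  # assumed to yield a feature map
    logits = rot_head(feats)
    return F.cross_entropy(logits, k)
```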
6. Conclusion
In summary, as a step towards assessing general purpose robustness of embodied navigation agents, we propose ROBUSTNAV, a challenging framework well-suited to benchmark the robustness of embodied navigation agents under a wide variety of visual and dynamics corruptions. To succeed on ROBUSTNAV, an agent must be insensitive to corruptions and also be able to adapt to unforeseen changes in new environments with minimal interaction. We find that standard POINTNAV and OBJECTNAV agents underperform (or fail) significantly in the presence of corruptions, and while standard techniques to improve robustness or adapt to environments with visual disparities (data-augmentation, self-supervised adaptation) provide some improvements, a large room for improvement remains in terms of fully recovering lost navigation performance. Lastly, we plan on evolving ROBUSTNAV in terms of the sophistication and diversity of corruptions as more features are supported in the underlying simulator. We release ROBUSTNAV in ROBOTHOR, and hope that our findings provide insights into developing more robust navigation agents.
A.1. Overview
This appendix is organized as follows. In Sec. A.2, we describe in detail the task specifications for POINTNAV and OBJECTNAV. In Sec. A.3, we provide details about the architecture adopted for POINTNAV and OBJECTNAV agents and how they are trained. In Sec. A.4, we include more plots demonstrating the kinds of behaviors POINTNAV and OBJECTNAV agents exhibit under corruptions (RGB-D variants in addition to the RGB variants in Sec. 5.2 of the main paper). In Sec. A.5, we provide more results demonstrating degradation in performance at severity set to 3 (for vis corruptions with controllable severity levels; excluded from the main paper due to space constraints) and break down performance degradation by episode difficulty.
A.2. Task Specifications
We describe in detail the task specifications (as outlined in Sec. 3 of the main paper) for the tasks included in ROBUSTNAV. Note that while ROBUSTNAV currently supports navigation-heavy tasks, the corruptions included can easily be extended to other embodied tasks that share the same modalities, for instance, tasks involving vision and language guided navigation or having interaction components.

POINTNAV. In POINTNAV, an agent is spawned at a random location and orientation in an environment and asked to navigate to goal coordinates specified relative to the agent's position. This is equivalent to the agent being equipped with a GPS+Compass sensor (providing relative location and orientation with respect to the agent's current position). Note that the agent does not have access to any "map" of the environment and must navigate based solely on sensory inputs from a visual RGB (or RGB-D) and GPS+Compass sensor. An episode is declared successful if the POINTNAV agent stops (by "intentionally" invoking an end action) within 0.2m of the goal location.

OBJECTNAV. In OBJECTNAV, an agent is spawned at a random location and orientation in an environment and is asked to navigate to a specified "object" category (e.g., Television) that exists in the environment. Unlike POINTNAV, an OBJECTNAV agent does not have access to a GPS+Compass sensor and must navigate based solely on the specified target and visual sensor inputs - RGB (or RGB-D). An episode is declared successful if the OBJECTNAV agent (1) stops (by "intentionally" invoking an end action) within 1.0m of the target object and (2) has the target object within its ego-centric view. We consider 12 object categories present in the ROBOTHOR scenes for our OBJECTNAV experiments. These are AlarmClock, Apple, BaseballBat, BasketBall, Bowl, GarbageCan, HousePlant, Laptop, Mug, SprayBottle, Television and Vase (see Fig. 5 for a few examples in the agent's ego-centric frame).

Figure 5. OBJECTNAV Target Objects. We present a few examples of the target objects considered for OBJECTNAV agents in ROBOTHOR as viewed from the agent's ego-centric RGB frame under successful episode termination conditions.

Figure 6. Agent Architecture. We show the general architecture adopted for our POINTNAV and OBJECTNAV agents - convolutional units to encode observations followed by recurrent policy networks. The auxiliary task heads are used when we consider pre-training or adaptation using PAD [27].
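A minimal sketch of the two success criteria described above (distances in meters; the OBJECTNAV visibility flag is assumed to be provided by the simulator):

```python
def pointnav_success(dist_to_goal: float, called_end: bool,
                     threshold: float = 0.2) -> bool:
    """POINTNAV success: agent intentionally calls `end` within 0.2m of the goal."""
    return called_end and dist_to_goal <= threshold

def objectnav_success(dist_to_target: float, called_end: bool,
                      target_visible: bool, threshold: float = 1.0) -> bool:
    """OBJECTNAV success: `end` is called within 1.0m of the target object and
    the target object is visible in the agent's ego-centric view."""
    return called_end and dist_to_target <= threshold and target_visible
```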
A.3. Navigation Agents
We describe the architecture of the agents studied in ROBUSTNAV and provide additional training details.

Base Architecture. We consider standard neural architectures (akin to [56, 55]) for both POINTNAV and OBJECTNAV - convolutional units to encode observations followed by recurrent policy networks to predict action distributions. Concretely, our agent architecture consists of four major components - a visual encoder, a goal encoder, a target observation combiner and a policy network (see Fig. 6). The visual encoder (for RGB and RGB-D agents) consists of a frozen ResNet-18 [29] encoder (till the last residual block) pretrained on ImageNet [16], followed by a learnable compressor network consisting of two convolutional layers of kernel size 1, each followed by ReLU activations (512 × 7 × 7 → 128 × 7 × 7 → 32 × 7 × 7). The goal encoder encodes the specified target - a goal location in polar coordinates (r, θ) for POINTNAV and the target object token (e.g., Television) for OBJECTNAV. For POINTNAV, the goal is encoded via a linear layer (2 × 32). For OBJECTNAV, the goal is encoded via an embedding layer (12 × 32) set to encode one of the 12 object categories. The goal embedding and output of the visual encoder are then concatenated and further passed through the target observation combiner network consisting of two convolutional layers of kernel size 1 (64 × 7 × 7 → 128 × 7 × 7 → 32 × 7 × 7). The output of the target observation combiner is flattened and then fed to the policy network - specifically, to a single layer GRU (hidden size 512), followed by linear actor and critic heads used to predict action distributions and value estimates (a minimal PyTorch sketch of this architecture is included at the end of this subsection).

Auxiliary Task Heads. In Sec. 5.3 of the main paper, we investigate if self-supervised approaches, particularly Policy Adaptation during Deployment (PAD) [27], help in resisting performance degradation due to vis corruptions. Incorporating PAD involves training the vanilla agent architectures (as highlighted before) with self-supervised tasks (for pretraining as well as adaptation in a corrupt target environment) - namely, Action Prediction (AP) and Rotation Prediction (RP). In Action-Prediction (AP), given two successive observations in a trajectory, an auxiliary head is tasked with predicting the intermediate action; in Rotation-Prediction (RP), the input observation is rotated by 0°, 90°, 180°, or 270° uniformly at random before being fed to the agent and an auxiliary head is asked to predict the rotation bin. For both AP and RP, the auxiliary task heads operate on the encoded visual observation (as shown in Fig. 6). To gather samples in the target environment (corrupt or otherwise), we use data collected from trajectories under the source (clean) pre-trained policy - i.e., the visual encoder is updated online as observations are encountered under the pre-trained policy.

Training and Evaluation Details. As mentioned earlier, we train our agents with DD-PPO [56] (a decentralized, distributed version of the Proximal Policy Optimization algorithm [52]) with Generalized Advantage Estimation [51]. We use rollout lengths T = 128 and 4 epochs of PPO with 1 mini-batch per epoch. We set the discount factor to γ = 0.99, the GAE factor to τ = 0.95, the PPO clip parameter to 0.1, the value loss coefficient to 0.5 and clip the gradient norms at 0.5. We use the Adam optimizer [36] with a learning rate of 3e-4 with linear decay.
The reward structure used is as follows - if R = 10.0 denotes the terminal reward obtained at the end of a "successful" episode and λ = −0.01 denotes a slack penalty to encourage efficiency, then the reward received by the agent at time-step t can be expressed as,

r_t = R · I_Success    if a_t = end
r_t = −ΔGeo_t + λ      otherwise     (1)
where ΔGeo_t is the change in geodesic distance to the goal, a_t is the action taken by the agent and I_Success indicates whether the episode was successful (1) or not (0). During evaluation, we allow an agent to execute a maximum of 300 steps - if an agent does not call end within 300 steps, we forcefully terminate the episode. All agents are trained under LoCoBot calibrated actuation noise models from [13] - N(0.25m, 0.005m) for translation and N(30°, 0.5°) for rotation. During evaluation, with the exception of circumstances when Motion Bias (S) is present, we use the same actuation noise models (in addition to dyn corruptions when applicable). We train our POINTNAV agents for ∼75M steps and OBJECTNAV agents for ∼300M steps (both RGB and RGB-D variants).
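As referenced in the Base Architecture paragraph above, the following is a minimal PyTorch sketch of the described agent (RGB variant). The goal-conditioning mechanism (tiling the 32-d goal embedding spatially before concatenation), module names and the use of torchvision's ResNet-18 are assumptions; the actual AllenAct implementation may differ, and the RGB-D variant and auxiliary heads are omitted.

```python
import torch
import torch.nn as nn
import torchvision

class NavAgent(nn.Module):
    """Sketch of the POINTNAV/OBJECTNAV architecture described above."""

    def __init__(self, objectnav: bool = False, num_actions: int = 4):
        super().__init__()
        resnet = torchvision.models.resnet18(pretrained=True)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # frozen, 512x7x7
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.compressor = nn.Sequential(                              # 512 -> 128 -> 32
            nn.Conv2d(512, 128, 1), nn.ReLU(),
            nn.Conv2d(128, 32, 1), nn.ReLU(),
        )
        # Goal encoder: (r, theta) for POINTNAV, object-category embedding for OBJECTNAV
        # (OBJECTNAV additionally has look up / look down, i.e. 6 actions).
        self.goal_encoder = nn.Embedding(12, 32) if objectnav else nn.Linear(2, 32)
        self.combiner = nn.Sequential(                                # 64 -> 128 -> 32
            nn.Conv2d(64, 128, 1), nn.ReLU(),
            nn.Conv2d(128, 32, 1), nn.ReLU(),
        )
        self.policy = nn.GRU(32 * 7 * 7, 512, batch_first=True)
        self.actor = nn.Linear(512, num_actions)
        self.critic = nn.Linear(512, 1)

    def forward(self, rgb, goal, hidden=None):
        feats = self.compressor(self.backbone(rgb))                   # B x 32 x 7 x 7
        g = self.goal_encoder(goal)                                   # B x 32
        g = g.view(-1, 32, 1, 1).expand(-1, 32, 7, 7)                 # tile spatially
        x = self.combiner(torch.cat([feats, g], dim=1))               # B x 32 x 7 x 7
        out, hidden = self.policy(x.flatten(1).unsqueeze(1), hidden)  # single-step GRU
        out = out.squeeze(1)
        return self.actor(out), self.critic(out), hidden
```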
A.4. Behavior Analysis
In Sec. 5.2 of the main paper, we try to understand the idiosyncrasies exhibited by the navigation agents under corruptions. Specifically, we look at the number of collisions as observed through the number of failed actions in ROBOTHOR, the closest the agent arrives to the target in an episode, and Stop-Fail (Pos) and Stop-Fail (Neg). Since for both POINTNAV and OBJECTNAV, success depends on a notion of "intentionality" [5] - the agent calls an end action when it believes it has reached the goal - we use both Stop-Fail (Pos) and Stop-Fail (Neg) to assess how corruptions impact this "stopping" mechanism of the agents. Stop-Fail (Pos) measures the fraction of times the agent calls an end action when the goal is not in range 6, out of the number of times the agent calls an end action. Stop-Fail (Neg) measures the fraction of times the agent fails to invoke an end action when the goal is in range, out of the number of steps the goal is in range in an episode. Both are averaged across evaluation episodes. In addition to the above aspects, we also measure the average distance to the goal at episode termination. Here we report these measures for POINTNAV and OBJECTNAV agents trained with RGB and RGB-D sensors in Fig. 7 (RGB-D variants in addition to the RGB agents in Fig. 4 of the main paper).
We find that across RGB and RGB-D variants, (1) agents tend to collide more often under corruptions (Fig. 7, col 1), (2) agents tend to be farther from the target at episode termination under corruptions (Fig. 7, col 2) and (3) agents tend to be farther from the target under corruptions even in terms of minimum distance over an episode (Fig. 7, col 3). We further note that the effect of corruptions on the agent's stopping mechanism is more pronounced for OBJECTNAV as opposed to POINTNAV (Fig. 7, cols 4 & 5).

Figure 8. To understand the extent to which a degraded stopping mechanism under corruptions affects OBJECTNAV RGB agent performance, we look at the difference between the agent's success rate (SR) and the setting where the agent is equipped with an oracle stopping mechanism. SR_Or denotes the success rate when an end action is forcefully called in an episode whenever the goal is in range. We consider one clean and five corrupt settings.
To further understand the extent to which a worse stopping mechanism impacts the agent's performance, in Fig. 8 we compare the agents' success rate (SR) with a setting where the agent is equipped with an oracle stopping mechanism (forcefully call end when goal is in range). For both OBJECTNAV RGB and RGB-D, we find that the presence of vis and vis+dyn corruptions affects success significantly compared to the clean settings (Fig. 8, black bars) .
(a) Motion Bias (C) is intended to model scene-level friction (e.g., a different floor material in the target environment); (b) Motion Bias (S) is intended to model high and low friction zones within a scene. Including more sophisticated models of friction is on the future roadmap for ROBUSTNAV.
Based on shortest path lengths - (1) POINTNAV: 0.00 − 2.28 for easy, 2.29 − 4.39 for medium, 4.40 − 9.61 for hard; (2) OBJECTNAV: 0.00 − 1.50 for easy, 1.51 − 3.78 for medium, 3.79 − 9.00 for hard.
The goal in range criterion for POINTNAV checks if the target is within the threshold distance. For OBJECTNAV, this includes an additional visibility criterion.
Based on the number of steps it takes an agent to reasonably recover degraded performance in corrupted environments when finetuned with complete task supervision.
The goal in range criterion for POINTNAV checks whether the target is within the threshold distance. For OBJECTNAV, this includes a visibility criterion in addition to taking distance into account.