A Cordial Sync: Going Beyond Marginal Policies for Multi-Agent Embodied Tasks


  • Unnat Jain
  • Luca Weihs
  • Eric Kolve
  • Ali Farhadi
  • Svetlana Lazebnik
  • Aniruddha Kembhavi
  • Alexander Schwing
  • ECCV
  • 2020


Autonomous agents must learn to collaborate. It is not scalable to develop a new centralized agent every time a task's difficulty outpaces a single agent's abilities. While multi-agent collaboration research has flourished in gridworld-like environments, relatively little work has considered visually rich domains. Addressing this, we introduce the novel task FurnMove in which agents work together to move a piece of furniture through a living room to a goal. Unlike existing tasks, FurnMove requires agents to coordinate at every timestep. We identify two challenges when training agents to complete FurnMove: existing decentralized action sampling procedures do not permit expressive joint action policies and, in tasks requiring close coordination, the number of failed actions dominates successful actions. To confront these challenges we introduce SYNC-policies (synchronize your actions coherently) and CORDIAL (coordination loss). Using SYNC-policies and CORDIAL, our agents achieve a 58% completion rate on FurnMove, an impressive absolute gain of 25 percentage points over competitive decentralized baselines. Our dataset, code, and pretrained models are available at this https URL .

1 Introduction

Collaboration is the defining principle of our society. Humans have refined strategies to efficiently collaborate, developing verbal, deictic, and kinesthetic means. In contrast, progress towards enabling artificial embodied agents to learn collaborative strategies is still in its infancy. Prior work mostly studies collaborative agents in grid-world-like environments. Visual, multi-agent, collaborative tasks have not been studied until very recently [23, 42]. While existing tasks are well designed to study some aspects of collaboration, they often don't require agents to closely collaborate throughout the task; instead, such tasks often require only initial coordination (distributing tasks) followed by almost independent execution. To study our algorithmic ability to address tasks which require close and frequent collaboration, we introduce the furniture moving (FurnMove) task (see Fig. 1), set in the AI2-THOR environment. Given only their egocentric visual observations, agents jointly hold a lifted piece of furniture in a living room scene and must collaborate to move it to a visually distinct goal location. As a piece of furniture cannot be moved without both agents agreeing on the direction, agents must explicitly coordinate at every timestep. Beyond coordinating actions, high performance in our task requires agents to visually anticipate possible collisions, handle occlusion due to obstacles and other agents, and estimate free space. Akin to the challenges faced by a group of roommates relocating a widescreen television, this task necessitates extensive and ongoing coordination amongst all agents at every timestep.

Fig. 1: Two agents communicate and synchronize their actions to move a heavy object through a complex indoor environment towards a goal. (a) Agents are initialized holding the object in a randomly chosen location. (b) Note the agents' egocentric views. Successful navigation requires agents to communicate their intent to reposition themselves, and the object, while contending with collisions, mutual occlusion, and partial information. (c) Agents successfully moved the object above the goal.

In prior work, collaboration between multiple agents has been enabled primarily by (i) sharing observations or (ii) learning low-bandwidth communication. (i) is often implemented using a centralized agent, i.e., a single agent with access to all observations from all agents [9, 71, 93] . While effective it is also unrealistic: the real world poses restrictions on communication bandwidth, latency, and modality. We are interested in the more realistic decentralized setting enabled via option (ii). This is often implemented by one or more rounds of message passing between agents before they choose their actions [27, 58, 42] . Training decentralized agents when faced with FurnMove's requirement of coordination at each timestep leads to two technical challenges. Challenge 1: as each agent independently samples an action from its policy at every timestep, the joint probability tensor of all agents' actions at any given time is rank-one. This severely limits which multi-agent policies are representable. Challenge 2: the number of possible mis-steps or failed actions increases dramatically when requiring that agents closely coordinate with each other, complicating training.

Addressing challenge 1, we introduce SYNC (Synchronize Your actioNs Coherently) policies which permit expressive (i.e., beyond rank-one) joint policies for decentralized agents while using interpretable communication. To ameliorate challenge 2 we introduce the Coordination Loss (CORDIAL) that replaces the standard entropy loss in actor-critic algorithms and guides agents away from actions that are mutually incompatible. A 2-agent system using SYNC and CORDIAL obtains a 58% success rate on test scenes in FurnMove, an impressive absolute gain of 25 percentage points over the baseline from [42] (76% relative gain). In a 3-agent setting, this difference is even more extreme.

In summary, our contributions are: (i) FurnMove, a new multi-agent embodied task that demands ongoing coordination, (ii) SYNC, a collaborative mechanism that permits expressive joint action policies for decentralized agents, (iii) CORDIAL, a training loss for multi-agent setups which, when combined with SYNC, leads to large gains, and (iv) improvements to the open-source AI2-THOR environment including a 16× faster gridworld equivalent enabling fast prototyping.

2 Related Work

We start by reviewing single agent embodied AI tasks followed by non-visual Multi-Agent RL (MARL) and end with visual MARL. Single-agent embodied systems: Single-agent embodied systems have been considered extensively in the literature. For instance, literature on visual navigation, i.e., locating an object of interest given only visual input, spans geometric and learning based methods. Geometric approaches have been proposed separately for mapping and planning phases of navigation. Methods entailing structure-from-motion and SLAM [91, 80, 25, 13, 72, 81] were used to build maps. Planning algorithms on existing maps [14, 46, 52] and combined mapping & planning [26, 50, 49, 30, 6] are other related research directions.

While these works propose geometric approaches, the task of navigation can also be cast as a reinforcement learning (RL) problem, mapping pixels to policies in an end-to-end manner. RL approaches [68, 1, 20, 33, 44, 92, 62, 86] have been proposed to address navigation in synthetic layouts like mazes, arcade games, and other visual environments [100, 8, 47, 54, 43, 84]. Navigation within photo-realistic environments [11, 79, 15, 48, 102, 5, 35, 101, 59] led to the development of embodied AI agents. The early work [107] addressed object navigation (finding an object given an image) in AI2-THOR. Soon after, [35] showed how imitation learning permits agents to learn to build a map from which they navigate. Methods also investigate the utility of topological and latent memory maps [35, 78, 37, 99], graph-based learning [99, 103], meta-learning [98], unimodal baselines [90], 3D point clouds [97], and effective exploration [95, 78, 16, 74] to improve embodied navigational agents. Embodied navigation also aids AI agents in developing behaviors such as instruction following [38, 4, 82, 95, 3], city navigation [18, 64, 63, 94], question answering [21, 22, 34, 97, 24], and active visual recognition [105, 104]. Recently, with visual and acoustic rendering, agents have been trained for audio-visual embodied navigation [19, 31].

In contrast to the above single-agent embodied tasks and approaches, we focus on collaboration between multiple embodied agents. Porting the above single-agent architectural novelties (or a combination of them) to multi-agent systems such as ours is an interesting direction for future work. Non-visual MARL: Multi-agent reinforcement learning (MARL) is challenging due to non-stationarity when learning. Multiple methods have been proposed to address the resulting issues [88, 89, 87, 29]. For instance, permutation-invariant critics have been developed recently [57]. In addition, for MARL, cooperation and competition between agents has been studied [51, 70, 60, 12, 69, 36, 58, 28, 57]. Similarly, communication and language in the multi-agent setting has been investigated [32, 45, 10, 61, 53, 27, 83, 67, 7] in maze-based setups, tabular tasks, or Markov games. These algorithms mostly operate on low-dimensional observations such as kinematic measurements (position, velocity, etc.) and top-down occupancy grids. For a survey of centralized and decentralized MARL methods, kindly refer to [106]. Our work differs from the aforementioned MARL works in that we consider complex visual environments. Our contribution of SYNC-policies is largely orthogonal to the RL loss function or method. For a fair comparison to [42], we use the same RL algorithm (A3C), but it is straightforward to integrate SYNC into other MARL methods [75, 28, 58] (for details, see Sec. A.3 of the supplement). Visual MARL: Recently, Jain et al. [42] introduced a collaborative task for two embodied visual agents, which we refer to as FurnLift. In this task, two agents are randomly initialized in an AI2-THOR living room scene, must visually navigate to a TV, and, in a single coordinated PickUp action, work to lift that TV up. Note that FurnLift doesn't demand that agents coordinate their actions at each timestep; instead, such coordination only occurs at the last timestep of an episode. Moreover, as the success of an action executed by an agent is independent of the other agent's action (with the exception of the PickUp action), a high-performance joint policy need not be complex, i.e., it may be nearly rank-one. More details on this analysis and the complexity of our proposed FurnMove task are provided in Sec. 3.

Similarly, a recent preprint [17] proposes a visual hide-and-seek task where agents can move independently. Das et al. [23] enable agents to learn who to communicate with, on predominantly 2D tasks. In visual environments they study the task where multiple agents navigate in parallel to the same object. Jaderberg et al. [41] recently studied the game of Quake III, and Weihs et al. [96] develop agents to play an adversarial hiding game in AI2-THOR. Collaborative perception for semantic segmentation and recognition has also been investigated recently [55, 56].

To the best of our knowledge, all previous visual or non-visual MARL methods in the decentralized setting operate with a single marginal probability distribution per agent, i.e., a rank-one joint distribution. Moreover, FurnMove is the first multi-agent collaborative task in a visually rich domain requiring close coordination between agents at every timestep.

3 The Furniture Moving Task (FurnMove)

We describe our new multi-agent task FurnMove, grounded in the real-world experience of moving furniture. We begin by introducing notation.

RL background and notation. Consider $N \ge 1$ collaborative agents $A^1, \dots, A^N$. At every timestep $t \in \mathbb{N} = \{0, 1, \dots\}$ the agents, and environment, are in some state $s_t \in S$ and each agent $A^i$ obtains an observation $o^i_t$ recording some partial information about $s_t$. For instance, $o^i_t$ might be the egocentric visual view of an agent $A^i$ embedded in some simulated environment. From observation $o^i_t$ and history $h^i_{t-1}$, which records prior observations and decisions made by the agent, each agent $A^i$ forms a policy $\pi^i_t : A \to [0, 1]$, where $\pi^i_t(a)$ is the probability that agent $A^i$ chooses to take action $a \in A$ from a finite set of options $A$ at time $t$. After the agents execute their respective actions $(a^1_t, \dots, a^N_t)$, which we call a multi-action, they enter a new state $s_{t+1}$ and receive individual rewards $r^1_t, \dots, r^N_t \in \mathbb{R}$.

For more on RL see [85, 65, 66].

Task definition. FurnMove is set in the near-photorealistic and physics-enabled simulated environment AI2-THOR [48]. In FurnMove, $N$ agents collaborate to move a lifted object through an indoor environment with the goal of placing this object above a visually distinct target as illustrated in Fig. 1. Akin to humans moving large items, agents must navigate around other furniture and frequently walk in-between obstacles on the floor.

In FurnMove, each agent at every timestep receives an egocentric observation (a 3 × 84 × 84 RGB image) from AI2-THOR. In addition, agents are allowed to communicate with other agents at each timestep via a low-bandwidth communication channel. Based on their local observation and communication, each agent must take an action from the set A. The space of actions A = A_NAV ∪ A_MWO ∪ A_MO ∪ A_RO available to an agent is comprised of the four single-agent navigational actions A_NAV = {MoveAhead, RotateLeft, RotateRight, Pass} used to move the agent independently, four actions A_MWO = {MoveWithObjectX | X ∈ {Ahead, Right, Left, Back}} used to move the lifted object and the agents simultaneously in the same direction, four actions A_MO = {MoveObjectX | X ∈ {Ahead, Right, Left, Back}} used to move the lifted object while the agents stay in place, and a single action used to rotate the lifted object clockwise, A_RO = {RotateObjectRight}. We assume that all movement actions for agents and the lifted object result in a displacement of 0.25 meters (similar to [42, 59]) and all rotation actions result in a rotation of 90 degrees (counter-)clockwise when viewing the agents from above.

Close and ongoing collaboration is required in FurnMove due to restrictions on the set of actions which can be successfully completed jointly by all the agents. These restrictions reflect physical constraints: for instance, if two people attempt to move in opposite directions while carrying a heavy object they will either fail to move or drop the object. For two agents, we summarize these restrictions using the coordination matrix shown in Fig. 2a. For comparison, we include a similar matrix in Fig. 2b corresponding to the FurnLift task from [42]. We defer a more detailed discussion of these restrictions to Sec. A.1 of the supplement. Generalizing the coordination matrix shown in Fig. 2a, at every timestep $t$ we let $S_t$ be the $\{0,1\}$-valued $|A|^N$-dimensional tensor where $(S_t)_{i_1,\dots,i_N} = 1$ if and only if the agents are configured such that multi-action $(a_{i_1}, \dots, a_{i_N})$ satisfies the restrictions detailed in Sec. A.1. If $(S_t)_{i_1,\dots,i_N} = 1$ we say the actions $(a_{i_1}, \dots, a_{i_N})$ are coordinated.
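To make these coordination constraints concrete, here is a minimal sketch of one such matrix. It assumes, for simplicity, that both agents face the same direction so that object-movement actions need no direction remapping (the actual $S_t$ in Sec. A.1 depends on the agents' relative orientation), and that a pair is coordinated when both agents issue the same object action, or when at most one agent navigates while the other passes. Under these assumptions it reproduces the 16/169 ≈ 9.5% figure from Fig. 2a:

```python
import numpy as np

# Hypothetical sketch of the 13-action space and one coordination matrix.
NAV = ["MoveAhead", "RotateLeft", "RotateRight", "Pass"]
MWO = [f"MoveWithObject{d}" for d in ["Ahead", "Right", "Left", "Back"]]
MO = [f"MoveObject{d}" for d in ["Ahead", "Right", "Left", "Back"]]
RO = ["RotateObjectRight"]
ACTIONS = NAV + MWO + MO + RO  # |A| = 13

def coordinated(a1: str, a2: str) -> bool:
    """Assumed rule: object actions must match exactly; otherwise both
    actions are single-agent navigation and at most one agent may move."""
    obj = set(MWO + MO + RO)
    if a1 in obj or a2 in obj:
        return a1 == a2
    return a1 == "Pass" or a2 == "Pass"

S = np.array([[coordinated(a, b) for b in ACTIONS] for a in ACTIONS])
print(S.sum(), S.size)  # 16 169  ->  16/169 ~ 9.5% coordinated
```

The 16 coordinated pairs decompose as 4 (MoveWithObject) + 4 (MoveObject) + 1 (RotateObject) matches, plus 7 navigation pairs in which at least one agent passes.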

Fig. 2: Coordination matrices for tasks. The matrix St records the validity of multi-action (a1, a2) for different relative orientations of agents A1 & A2. (a) Overlay of St for all four relative orientations of two agents, for FurnMove. Notice that only 16/169 = 9.5% of multi-actions are coordinated at any given relative orientation. (b) FurnLift, where single-agent actions are always valid and coordination is needed only for the PickUp action, i.e., at least 16/25 = 64% of actions are always valid.


3.1 Technical Challenges

As we show in our experiments in Sec. 6, standard communication-based models similar to the ones proposed in [42] perform rather poorly when trained to complete the FurnMove task. In the following we identify two key challenges that contribute to this poor performance.

Challenge 1: rank-one joint policies. In classical multi-agent settings [12, 70, 58, 42], each agent independently samples an action from its individual policy. Due to this independent sampling, at time $t$, the probability of the agents taking multi-action $(a^1, \dots, a^N)$ equals $\prod_{i=1}^N \pi^i_t(a^i)$. This means that the joint probability tensor of all actions at time $t$ can be written as the rank-one tensor

$$\Pi_t = \pi^1_t \otimes \cdots \otimes \pi^N_t.$$

This rank-one constraint limits the joint policies that can be executed by the agents, and this has real impact: Sec. A.2 considers two agents playing rock-paper-scissors with an adversary, where the rank-one constraint reduces the expected reward achieved by an optimal policy from 0 to -0.657 (the minimal possible reward being -1). Intuitively, a high-rank joint policy is not well approximated by a rank-one probability tensor obtained via independent sampling.

Challenge 2: exponential failed actions. The number of possible multi-actions $|A|^N$ increases exponentially as the number of agents $N$ grows. While this is not problematic if agents act relatively independently, it is a significant obstacle when the agents are tightly coupled, i.e., when the success of agent $A^i$'s action $a^i$ is highly dependent on the actions of the other agents. Consider a randomly initialized policy (the starting point of almost all RL problems): agents stumble upon positive rewards with extremely low probability, which leads to slow learning. We focus on small $N$; nonetheless, the proportion of coordinated action tuples is small (9.5% when $N = 2$ and 2.1% when $N = 3$).
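The rank-one limitation can be seen in a few lines of NumPy with a toy two-action example: the joint policy "both agents pick the same action, each action with probability 1/2" has matrix rank two, so no pair of independently sampled marginals can represent it; independent sampling with the matching marginals necessarily wastes half its probability mass on disagreements:

```python
import numpy as np

# Target joint policy: agents always agree, uniformly over 2 actions.
target = 0.5 * np.eye(2)
print(np.linalg.matrix_rank(target))  # 2: not expressible as p (outer) q

# Independent sampling yields a rank-one joint distribution.
p = np.array([0.5, 0.5])
q = np.array([0.5, 0.5])
joint = np.outer(p, q)
print(np.linalg.matrix_rank(joint))  # 1
print(joint[0, 1] + joint[1, 0])     # 0.5: mass on uncoordinated pairs
```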

4 A Cordial Sync

To address the aforementioned two challenges we develop: (a) a novel action sampling procedure named Synchronize Your actioNs Coherently (SYNC) and (b) an intuitive & effective multi-agent training loss named the Coordination Loss (CORDIAL).

Addressing challenge 1: SYNC-policies. For readability, we consider N = 2 agents and illustrate an overview in Fig. 3. The joint probability tensor $\Pi_t$ is hence a matrix of size $|A| \times |A|$. Recall our goal: using little communication, multiple agents should sample their actions from a high-rank joint policy. This is difficult as (i) little communication means that, except in degenerate cases, no agent can form the full joint policy and (ii) even if all agents had access to the joint policy it is not obvious how to ensure that the decentralized agents will sample a valid coordinated action.

Fig. 3: Model overview for 2 communicative agents in the decentralized setting. Left : all decentralized methods in this paper have the same TBONE [42] backbone architecture. Right : marginal vs SYNC-policies. With marginal policies, the standard in prior work, each agent constructs its own policy and independently samples from this policy. With SYNC-policies, agents communicate to construct a distribution α over multiple “strategies” which they then sample from using a shared random seed

To achieve this, note that for any rank-$m$ ($m \le |A|$) matrix $L \in \mathbb{R}^{|A| \times |A|}$ there are vectors $v_1, w_1, \dots, v_m, w_m \in \mathbb{R}^{|A|}$ such that $L = \sum_{j=1}^m v_j \otimes w_j$. Here, $\otimes$ denotes the outer product. Also, the non-negative rank of a matrix $L \in \mathbb{R}^{|A| \times |A|}_{\ge 0}$ equals the smallest integer $s$ such that $L$ can be written as the sum of $s$ non-negative rank-one matrices. Furthermore, a non-negative matrix $L \in \mathbb{R}^{|A| \times |A|}_{\ge 0}$ has non-negative rank bounded above by $|A|$. Since $\Pi_t$ is an $|A| \times |A|$ joint probability matrix, i.e., $\Pi_t$ is non-negative and its entries sum to one, it has non-negative rank $m \le |A|$, i.e., there exist non-negative vectors $\alpha \in \mathbb{R}^m_{\ge 0}$ and $p_1, q_1, \dots, p_m, q_m \in \mathbb{R}^{|A|}_{\ge 0}$ whose entries sum to one such that $\Pi_t = \sum_{j=1}^m \alpha_j \cdot p_j \otimes q_j$.
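This bound is constructive. As a minimal NumPy sketch: taking $\alpha_j$ to be the $j$-th row sum of the joint matrix, $p_j$ the $j$-th standard basis vector, and $q_j$ the normalized $j$-th row recovers any joint probability matrix with at most $|A|$ rank-one terms:

```python
import numpy as np

# A random 5x5 joint policy (non-negative, entries sum to one).
rng = np.random.default_rng(0)
P = rng.random((5, 5))
P /= P.sum()

alpha = P.sum(axis=1)  # mixture weights alpha_j; they sum to 1
# Reconstruct P as sum_j alpha_j * e_j (outer) (row_j / alpha_j).
recon = sum(a * np.outer(np.eye(5)[j], P[j] / a)
            for j, a in enumerate(alpha))
print(np.allclose(P, recon))  # True
```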

We call a sum of the form $\sum_{j=1}^m \alpha_j \cdot p_j \otimes q_j$ a mixture-of-marginals. With this decomposition at hand, randomly sampling action pairs $(a^1, a^2)$ from $\sum_{j=1}^m \alpha_j \cdot p_j \otimes q_j$ can be interpreted as a two-step process: first sample an index $j \sim \mathrm{Multinomial}(\alpha)$ and then sample $a^1 \sim \mathrm{Multinomial}(p_j)$ and $a^2 \sim \mathrm{Multinomial}(q_j)$.
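The two-step process can be sketched as follows. A shared random seed (as SYNC uses) lets two decentralized agents agree on the index $j$ without exchanging it; the sizes $m = 3$ and $|A| = 13$ and the randomly drawn mixture below are illustrative, not the learned quantities:

```python
import numpy as np

m, A = 3, 13
rng = np.random.default_rng(7)
alpha = rng.dirichlet(np.ones(m))       # shared strategy weights
p = rng.dirichlet(np.ones(A), size=m)   # agent 1's marginals p_1..p_m
q = rng.dirichlet(np.ones(A), size=m)   # agent 2's marginals q_1..q_m

def sample_multi_action(shared_seed: int):
    # Each agent draws j from alpha with its own copy of the shared stream.
    j1 = np.random.default_rng(shared_seed).choice(m, p=alpha)  # agent 1
    j2 = np.random.default_rng(shared_seed).choice(m, p=alpha)  # agent 2
    assert j1 == j2  # same strategy index, with no extra communication
    # Actions are then sampled independently from the chosen marginals.
    a1 = np.random.default_rng().choice(A, p=p[j1])
    a2 = np.random.default_rng().choice(A, p=q[j2])
    return a1, a2

a1, a2 = sample_multi_action(shared_seed=42)
print(0 <= a1 < A and 0 <= a2 < A)  # True
```

The resulting pair is distributed according to the mixture $\sum_j \alpha_j \cdot p_j \otimes q_j$, which may have rank up to $m$.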

This stage-wise procedure suggests a strategy for sampling actions in a multi-agent setting, which we refer to as SYNC-policies. Generalizing to an $N$-agent setup, suppose that agents $(A^i)_{i=1}^N$ have access to a shared random stream of numbers. This can be accomplished if all agents share a random seed or if all agents initially communicate their individual random seeds and sum them to obtain a shared seed. Furthermore, suppose that all agents locally store a shared function $f_\theta : \mathbb{R}^K \to \Delta^{m-1}$, where $\theta$ are learnable parameters, $K$ is the dimensionality of all communication between the agents in a timestep, and $\Delta^{m-1}$ is the standard $(m-1)$-probability simplex. Finally, at time $t$ suppose that each agent $A^i$ produces not a single policy $\pi^i_t$ but instead a collection of policies $\pi^i_{t,1}, \dots, \pi^i_{t,m}$. Let $C_t \in \mathbb{R}^K$ be all communication sent between agents at time $t$. Each agent $A^i$ then samples its action as follows: (i) compute the shared probabilities $\alpha_t = f_\theta(C_t)$, (ii) sample an index $j \sim \mathrm{Multinomial}(\alpha_t)$ using the shared random number stream, (iii) sample, independently, an action $a^i$ from the policy $\pi^i_{t,j}$. Since both $f_\theta$ and the random number stream are shared, the quantities in (i) and (ii) are equal across all agents despite being computed individually. This sampling procedure is equivalent to sampling from the tensor $\sum_{j=1}^m \alpha_j \cdot \pi^1_{t,j} \otimes \cdots \otimes \pi^N_{t,j}$, which, as discussed above, may have rank up to $m$. Intuitively, SYNC enables decentralized agents to have a more expressive joint policy by allowing them to agree upon a strategy by sampling from $\alpha_t$.

Addressing challenge 2: CORDIAL. We encourage agents to rapidly learn to choose coordinated actions via a new loss. In particular, letting $\Pi_t$ be the joint policy of our agents, we propose the coordination loss (CORDIAL)

$$\mathrm{CL}_\beta(S_t, \Pi_t) = -\frac{\beta}{\|S_t\|_1} \left\langle S_t, \log \Pi_t \right\rangle \qquad (1)$$

where $\log$ is applied element-wise, $\langle \cdot, \cdot \rangle$ is the usual Frobenius inner product, $\|S_t\|_1$ equals the number of coordinated multi-actions, and $S_t$ is defined in Sec. 3. Notice that CORDIAL encourages agents to have a near-uniform policy over the actions which are coordinated. We use this loss to replace the standard entropy-encouraging loss in policy gradient algorithms (e.g., the A3C algorithm [66]). Similarly to the parameter for the entropy loss in A3C, $\beta$ is chosen to be a small positive constant so as to not overly discourage learning.
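A minimal sketch of this loss, assuming normalization by the number of coordinated multi-actions $\|S_t\|_1$; the value of $\beta$, the small $4 \times 4$ matrices, and the epsilon for numerical stability below are all illustrative. A joint policy that is uniform over the coordinated actions incurs a lower loss than one spread uniformly over all actions:

```python
import numpy as np

def cordial(S, Pi, beta=0.01, eps=1e-8):
    """Frobenius inner product of S with elementwise log of Pi,
    normalized by the number of coordinated multi-actions."""
    return -beta * np.sum(S * np.log(Pi + eps)) / S.sum()

A = 4
S = np.eye(A)                             # coordinated = agents agree
uniform_coord = np.eye(A) / A             # mass only on coordinated pairs
uniform_all = np.full((A, A), 1 / A**2)   # mass spread over all pairs

print(cordial(S, uniform_coord) < cordial(S, uniform_all))  # True
```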

Note that the coordination loss is less meaningful when $\Pi_t = \pi^1 \otimes \cdots \otimes \pi^N$, i.e., when $\Pi_t$ is rank-one. For instance, suppose that $S_t$ has ones along the diagonal and zeros elsewhere, so that we wish to encourage the agents to all take the same action. In this case it is straightforward to show that

$$\mathrm{CL}_\beta(S_t, \Pi_t) = -\beta \sum_{i=1}^N \sum_{j=1}^M \frac{1}{M} \log \pi^i_t(a_j)$$

(here $M = |A|$), so that $\mathrm{CL}_\beta(S_t, \Pi_t)$ simply encourages each agent to have a uniform distribution over its actions and thus actually encourages the agents to place a large amount of probability mass on uncoordinated actions. Indeed, Tab. 4 shows that using CORDIAL without SYNC leads to poor results.

Table 1: Quantitative results on three tasks. ↑ (or ↓) indicates that a higher (or lower) value of the metric is desirable, while ↕ denotes that the metric is simply informational and no value is, a priori, better than another. † denotes that a centralized agent serves only as an upper bound to decentralized methods and cannot be fairly compared with them. Note that, among decentralized agents, our SYNC model has the best metric values across all reported metrics (bolded).

Table 2: Quantitative results on the FurnLift task. For legend, see Tab. 1.

Table 3: Effect of the number of mixture components m on SYNC's performance (in FurnMove). Generally, larger m means larger TVD values and better performance.