Learning Generalizable Visual Representations via Interactive Gameplay
A growing body of research suggests that embodied gameplay, prevalent not just in human cultures but across a variety of animal species including turtles and ravens, is critical in developing the neural flexibility for creative problem solving, decision making, and socialization. Comparatively little is known regarding the impact of embodied gameplay upon artificial agents. While recent work has produced agents proficient in abstract games, these environments are far removed from the real world, and thus these agents can provide little insight into the advantages of embodied play. Hiding games, such as hide-and-seek, are played universally and provide a rich ground for studying the impact of embodied gameplay on representation learning in the context of perspective taking, secret keeping, and false belief understanding. Here we are the first to show that embodied adversarial reinforcement learning agents playing Cache, a variant of hide-and-seek, in a high-fidelity, interactive environment learn generalizable representations of their observations, encoding information such as object permanence, free space, and containment. Moving closer to biologically motivated learning strategies, our agents' representations, enhanced by intentionality and memory, are developed through interaction and play. These results serve as a model for studying how facets of vision develop through interaction, provide an experimental framework for assessing what is learned by artificial agents, and demonstrate the value of moving from large, static datasets towards experiential, interactive representation learning.
grounded in the real world. This requires a fundamental shift away from existing popular environments and a rethinking of how the capabilities of artificial agents are evaluated. Our agents must first be embodied within an environment that allows for diverse interaction and provides rich visual output. For this we leverage AI2-THOR, a near photo-realistic, interactive, simulated 3D environment of indoor living spaces (see Fig. 1a). After our agents are trained to play Cache, we probe how they have learned to represent their environment. Our first set of experiments shows that our agents develop a sophisticated low-level visual understanding of individual images, measured by their capacity to perform a collection of standard tasks from the computer vision literature, such as depth and surface normal prediction, from a single image. Our second collection of experiments, created in analogy to experiments performed on infants and young children, demonstrates our agents' ability to integrate observations through time and to understand spatial relationships between objects, occlusion, object permanence, seriation of free space, and perspective taking.
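One common way to probe what a frozen representation encodes is to fit a simple read-out on top of it and measure held-out error on the target quantity. The sketch below illustrates that idea only; it is not the evaluation protocol used in this work, which this section does not specify. The feature vectors are synthetic stand-ins for encoder outputs, the scalar target stands in for a quantity such as depth, and the names `fit_linear_probe` and `probe_mse` are hypothetical.

```python
import random


def fit_linear_probe(features, targets, lr=0.1, epochs=500):
    """Fit y = w.x + b by full-batch gradient descent on mean squared error.

    The features are treated as frozen: only the probe parameters (w, b) change.
    """
    n, dim = len(features), len(features[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, targets):
            err = sum(wi * xi for wi, xi in zip(w, x)) + b - y
            for j in range(dim):
                grad_w[j] += 2.0 * err * x[j] / n
            grad_b += 2.0 * err / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def probe_mse(w, b, features, targets):
    """Mean squared error of the linear probe on the given examples."""
    total = 0.0
    for x, y in zip(features, targets):
        pred = sum(wi * xi for wi, xi in zip(w, x)) + b
        total += (pred - y) ** 2
    return total / len(features)


rng = random.Random(0)
num_examples, dim, num_train = 200, 3, 150

# Synthetic stand-in for frozen features that linearly encode the target.
informative = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_examples)]
targets = [2.0 * x[0] - 1.0 * x[1] + 0.5 * x[2] + 0.3 for x in informative]

# Synthetic stand-in for frozen features that carry no information about the target.
uninformative = [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(num_examples)]


def held_out_error(features):
    """Train the probe on the first split and report error on the remainder."""
    w, b = fit_linear_probe(features[:num_train], targets[:num_train])
    return probe_mse(w, b, features[num_train:], targets[num_train:])


informative_mse = held_out_error(informative)
uninformative_mse = held_out_error(uninformative)
print(f"informative features:   held-out MSE = {informative_mse:.3e}")
print(f"uninformative features: held-out MSE = {uninformative_mse:.3e}")
```

Because the probe has very little capacity of its own, a low held-out error is evidence that the information is present in the features rather than learned by the read-out. Dense targets such as per-pixel depth or surface normals would need a richer decoder than this scalar example.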