Language (Re)modelling: Towards Embodied Language Understanding


Abstract

While natural language understanding (NLU) is advancing rapidly, today’s technology differs from human-like language understanding in fundamental ways, notably in its inferior efficiency, interpretability, and generalization. This work proposes an approach to representation and learning based on the tenets of embodied cognitive linguistics (ECL). According to ECL, natural language is inherently executable (like programming languages), driven by mental simulation and metaphoric mappings over hierarchical compositions of structures and schemata learned through embodied interaction. This position paper argues that the use of grounding by metaphoric reasoning and simulation will greatly benefit NLU systems, and proposes a system architecture along with a roadmap towards realizing this vision.

1 Introduction

"Not those speaking the same language, but those sharing the same feeling understand each other." -Jalal ad-Din Rumi While current NLU systems "speak" human language by learning strong statistical models, they do not possess anything like the rich mental representations that people utilize for language understanding. Indeed, despite the tremendous progress in NLU, recent work shows that today's state-ofthe-art (SOTA) systems differ from human-like language understanding in crucial ways, in particular in their generalization, grounding, reasoning, and explainability capabilities (Glockner et al., 2018; McCoy et al., 2019a,b; Nie et al., 2019; Yogatama et al., 2019; Lake et al., 2019) .

Question-answering (QA) is currently one of the predominant methods of training deep-learning models for general, open-domain language understanding (Gardner et al., 2019b). While QA is a versatile, broadly-applicable framework, recent studies have shown it to be fraught with pitfalls (Gardner et al., 2019a; Mudrakarta et al., 2018). A recent workshop on QA for reading comprehension suggested that "There is growing realization that the traditional supervised learning paradigm is broken [...] - we're fitting artifacts" (Gardner, 2019).

In many respects, the problems of NLU mirror those of artificial intelligence (AI) research in general. Lake et al.'s (2017a) seminal work identified a significant common factor at the root of problems in general AI. The current deep-learning paradigm is a statistical pattern-recognition approach predominantly applied to relatively narrow, task-specific prediction. In contrast, human cognition supports a wide range of inferences (planning, action, explaining, etc.), hinting at a view of intelligence focused on model building, specifically, mental models: rich, structured, manipulable, and explainable representations useful for performing in dynamic, uncertain environments. This distinction motivates the quest for a new cognitively-inspired model-building learning paradigm for general AI, which has inspired fruitful subsequent research and discussion (e.g., Lake et al. (2017b)).

The observation that NLU and general AI share a common central problem (task-specific, prediction-based learning), and the growing realization that deeper text understanding requires building mental models (Gardner et al., 2019a; Forbes et al., 2019), motivate the search for an NLU analog of the cognitively-inspired model-building paradigm.

Amid recent position papers highlighting significant differences between human language understanding and current NLU systems (McClelland et al., 2019; Bisk et al., 2020), here we take a more focused look at mental models: the challenges arising due to their embodied nature, their importance in general NLU, and how we might begin integrating them into current approaches.

Mainstream NLU work, be it entirely distributional, such as BERT (Devlin et al., 2018), or also involving symbolic knowledge representation (Liu et al., 2019a; Bosselut et al., 2019), seldom addresses mental models directly. Crucially, such approaches lack the interactive worlds within which mental models 1 are learned jointly through language and embodied action. The lines of work most closely related to the present proposal are grounded approaches, which feature worlds in the form of interactive environments, and address mapping text to programs (executable semantic parses) (e.g., Gauthier and Mordatch, 2016; Liang, 2016; Kiela et al., 2016; Chevalier-Boisvert et al., 2019). However, while well-aligned with a model-building paradigm, such approaches have typically been limited to short or synthetic literal language and narrow domains, assuming predefined environments. Embodied approaches to general NLU, as advocated here, are few and far between; most examples fall under the construction grammar framework (Steels and de Beule, 2006; Bergen and Chang, 2005). However, despite their intellectual merit, they were not operationalized to scale readily for mainstream applications (see §3).

This position paper argues that executable semantic parsing and grounded approaches to NLU constitute a first step in a much larger program, whose outline is set forth here: general language understanding through embodied cognitive linguistics (ECL). Following much cognitive science research (see §3, §4), this paper posits that (1) execution or simulation is a central part of semantics, essential for addressing some of the persistent difficulties in text understanding, and (2) metaphoric inference capabilities are central to knowledge representation and facilitate grounded understanding of general language. Importantly, capacities for both simulation and metaphor are emergent, borne of embodied interaction within an external world.

Our contributions are: we analyze inherent limitations of SOTA statistical language models applied to NLU and propose a framework to address these limitations. The novelty of this approach stems from bringing together ideas from the cognitive science literature, the general-AI community, and NLU. This framework constitutes a path to generalize current execution-based methods towards more general language understanding.

The world contains 2 crates. Each crate contains 4 boxes. Oranges and apples are objects. Each box may contain up to 5 objects. Objects can be moved from one box to another. Objects can be removed from boxes or crates. There are two apples in the first box in the first crate. There is one orange and one apple in the second box of the second crate. First, the apples were transferred from the first box of the first crate to the first box of the second crate. Next, all apples were removed from the second crate.

Figure 1: Open-domain challenge – a world with boxes, crates and objects.

This paper proposes a system architecture and a roadmap towards implementing the vision outlined here, suggesting preliminary directions for future work (learned world models, incorporating interaction into datasets). We believe that this framework will facilitate consolidation with multiple related lines of research across the different communities.

2 Challenges For Current NLU Systems

This section presents concrete example problems demonstrating inherent limitations in SOTA NLU. Fig. 1 includes a short story about a world with crates, boxes, and objects inside them. It is a short and simple narrative, far from capturing the full-blown complexity of natural language. Following Gardner et al. (2019a), we assume that a system understands the story if it can correctly answer arbitrary questions about it. Doing so requires basic commonsense and mathematical reasoning, referent grounding, tracking events, handling declarative knowledge, and more. The task is similar to narrative comprehension tasks in datasets such as bAbI (Bordes et al., 2015) and SCONE (Long et al., 2016), and could be solved given large amounts of annotated training data. But the goal here is different: to develop models that, like humans, can understand such language on-the-fly (as in zero-shot learning).

2.1 Open-Domain Literal Language Simulation

QA approaches. Current QA systems, used in an off-the-shelf manner, do not generalize well to tasks on which they have not been trained; NLU models are known to be brittle even to slight changes in style and vocabulary (Gardner et al., 2020; Keysers et al., 2020). The closest QA setting is the DROP challenge (Dua et al., 2019), requiring reading comprehension and basic numerical reasoning over paragraphs. As a simple sanity check, we tested a near-SOTA model and baseline 2 on this example, asking questions about the initial and final state. The models were notably better at answering questions about the initial state than about the final state. This result is perhaps expected, as the answers to questions about the initial state are closer to the input text. Answering questions about later states is more challenging. A key missing component of these systems is the ability to simulate the effects of actions, especially commonsense effects (e.g., moving a container moves the elements in it).

Executable semantic parsing approaches. The problem of Fig. 1 could also naturally be cast as an executable semantic parsing (ex. SP) task. Similar tasks already exist; for example, the "Alchemy" sub-task of the SCONE dataset features beakers of chemicals that are mixed, poured, and drained. Executable approaches can leverage simulation to learn structured world models, but are limited by hard-coded, domain-specific executors; adding tasks requires substantial manual effort.

For humans, through largely subconscious metaphorical inference (related to transfer and meta-learning in general AI (Lake et al., 2017a)), it is obvious that both SCONE and Fig. 1 share much the same structure. This similarity allows for effortless generalization, effectively re-purposing a relatively simple executor (for literal language) flexibly across many tasks.
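To make the shared structure concrete, below is a minimal sketch of such a literal-language executor built around a generic CONTAINMENT schema, applied to the Fig. 1 story. It is purely illustrative: the class and function names (Container, move, remove_all) are our own, and a real system would pair this executor with a parser mapping sentences (or SCONE's beakers and chemicals) onto the same operations.

```python
# Illustrative, domain-general "container" executor; all names are hypothetical.

class Container:
    def __init__(self, name, capacity=None):
        self.name = name
        self.capacity = capacity
        self.contents = []          # objects or nested containers

    def add(self, obj):
        if self.capacity is not None and len(self.contents) >= self.capacity:
            raise ValueError(f"{self.name} is full")
        self.contents.append(obj)


def move(obj_type, src, dst, count=None):
    """Move up to `count` objects of obj_type from src to dst (all if None)."""
    matching = [o for o in src.contents if o == obj_type]
    if count is not None:
        matching = matching[:count]
    for o in matching:
        src.contents.remove(o)
        dst.add(o)


def remove_all(obj_type, container):
    """Recursively remove obj_type from a container and its sub-containers."""
    container.contents = [o for o in container.contents if o != obj_type]
    for o in container.contents:
        if isinstance(o, Container):
            remove_all(obj_type, o)


# Fig. 1, executed literally: 2 crates, each with 4 boxes of capacity 5.
crates = {i: Container(f"crate{i}") for i in (1, 2)}
boxes = {(i, j): Container(f"box{i}.{j}", capacity=5)
         for i in (1, 2) for j in (1, 2, 3, 4)}
for (i, j), box in boxes.items():
    crates[i].add(box)
boxes[(1, 1)].contents += ["apple", "apple"]
boxes[(2, 2)].contents += ["orange", "apple"]

move("apple", boxes[(1, 1)], boxes[(2, 1)])   # transfer apples, box 1.1 -> box 2.1
remove_all("apple", crates[2])                # remove all apples from the second crate
print(boxes[(2, 1)].contents)                 # [] -- the transferred apples are gone too
print(boxes[(2, 2)].contents)                 # ['orange']
```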

2.2 Non-Literal Language

The previous challenge involved literal language, amenable to symbolic execution.

However, non-literal language is pervasive in everyday speech (Lakoff and Johnson, 1980). Consider the example in Fig. 2: the phrase "head of the French Army" is non-literal, implying that the army can be treated as a human body. The execution semantics of verbs like "attacked" and "defend" are also non-literal; they are highly contextual, requiring interpretation beyond word-sense disambiguation alone. "Russian hackers attacked the Pentagon networks" or "The senator attacked the media" entail very different simulations. This ambiguity is challenging for non-neural (symbolic) simulation-based approaches. Humans compose a structured mental model from the language through schemata and mental simulation, as discussed in §3, §4.

Figure 2: Non-literal language challenge. To understand this sentence, humans rely on metaphoric inference over embodied concepts (in blue, also called schema; see §3). For example, here “attack” evokes a FORCE or MOTION schema, used to construct a mental model of the scene via mental simulation (§4).

To summarize, the limitations outlined above motivate the attempt to extend the capability of simulation to general linguistic inputs. Doing so would enable the construction of grounded, manipulable, and interpretable representations from text. Two desiderata follow from the challenges: (1) more flexible utilization of symbolic executors by exploiting shared (analogical) structures between texts ( §2.1), and (2) learned, neural executors for non-literal language comprehension ( §2.2).

3 Embodied Cognitive Linguistics: A Model Building Paradigm

Turning to cognitive science for inspiration, we focus on embodied cognitive linguistics (ECL), an important paradigm directly addressing both desiderata. This section presents a brief overview and the key tenets of ECL, specifically the theoretical foundations developed by Lakoff and Johnson (1980) and Feldman and Narayanan (2004). Most contemporary cognitive accounts of language incorporate concepts from ECL to some degree. A full review is out of the scope of this work; see Gärdenfors (2014) and §4, §5 for discussion in the NLU context.

Early cognitive theories assumed a disembodied, symbolic representation of knowledge (Lewis, 1976; Kintsch and Van Dijk, 1978), separate from the brain's modal systems (vision, motor control, etc.). In contrast, the embodied cognition (EC) view, based on widespread empirical findings, focuses on the role of the body in cognition. In this view, knowledge is stored using multimodal representations (mental imagery, memories, etc.) that arise from embodied experience and action in the world (Barsalou, 2008; Proffitt, 2006). ECL postulates that linguistic representations and other, higher-level cognitive functions are deeply grounded in neural modal systems (Lakoff and Johnson, 1980; Barsalou, 2008). This view is compelling, as it addresses the grounding problem (Harnad, 1990) by linking high-level symbolic constituents of mental representations with experience or action in the physical world (Varela et al., 2017). Note that embodiment is far from an end-all for language comprehension: for example, social and cultural aspects too are crucial (Arbib et al., 2014). Still, ECL laid important conceptual foundations also underlying subsequent accounts:

• Embodied schemata: Pre-linguistic structures formed from bodily interactions and recurring experience, such as CONTAINMENT, PART-WHOLE, FORCE, MOVEMENT (Langacker, 1987; Talmy, 1983, 1985).

• Metaphoric inference (also called analogical reasoning; we use "metaphorical" and "analogical" interchangeably): The process by which new information may be inferred via structural similarities to a better-understood instantiated system (Lakoff and Johnson, 1980; Gallese and Lakoff, 2005; Day and Gentner, 2007). For example, "I have an example IN mind" suggests that the abstract concept mind is mapped to the more concrete domain of containers.

• Mental simulation: The reenactment of perceptual, motor, and introspective states acquired during experience with the world, body, and mind. In EC, diverse simulation mechanisms (also called mental or forward models (Rumelhart et al., 1986; Grush, 2004)) support a wide spectrum of cognitive activities, including language and decision making (Barsalou, 2008).

We believe that ECL is a useful paradigm for addressing the challenges of §2, as it articulates the role of analogy and mental simulation in NLU. The following two ECL hypotheses summarize them (Lakoff and Johnson, 1980; Feldman and Narayanan, 2004):

Hypothesis 1 (Simulation): Humans understand the meaning of language by mentally simulating its content. Language in context evokes a simulation structured by embodied schemata and metaphoric mappings, utilizing the same neural structures for action and perception in the environment. Understanding involves inferring and running the best fitting simulation.

Hypothesis 2 (Metaphoric Representation):

Human concepts are expressible through hierarchical, compositional, metaphoric mappings over a limited vocabulary of embodied schemata. Abstract concepts are expressed using more literal concepts.

Early ECL Implementations. Early attempts to implement ECL in actual language understanding systems were founded on Narayanan's (1997) x-schema simulation framework and Embodied Construction Grammar (Bergen and Chang, 2005). While notable for approaching challenging problems involving mental simulation and complex, metaphoric language, early implementation efforts were not operationalized to scale to mainstream applications (Lakoff and Narayanan, 2010). These works also focused on a particular type of simulation (sensorimotor), understood to be only one of many mechanisms used in language understanding (Stolk et al., 2016).

FrameNet (Ruppenhofer et al., 2016) and MetaNet (David and Dodge, 2014) are closely related projects in that each provides an extensive collection of schemata used in everyday and metaphoric language comprehension, respectively, via the concept of a semantic frame (Fillmore, 1985) . However, neither incorporates simulation semantics, as needed for a full realization of the ECL vision (Chang et al., 2002) .

4 Linking ECL To NLU And Embodied AI Research

We propose a unifying view of ECL, bringing it closer to contemporary cognitive science and deep learning approaches. This section presents notations and motivating intuitions; the computational framework is further developed in §5, §6. The proposal centers around the view of natural language as a kind of neural programming language (Lupyan and Bergen, 2016), or higher-level cognitive control system for systematically querying and inducing changes in the mental and physical states of recipients (Elman, 2004; Stolk et al., 2016; Borghi et al., 2018). This approach builds on the ECL hypotheses and suggests a broader view of mental simulation, one that is readily amenable to the same computational formulation as current embodied AI and executable semantic parsing approaches.

Preliminaries. At the core of embodied approaches is the Partially Observable Markov Decision Process (POMDP; Kaelbling et al., 1998). It governs the relations between states (s), actions (a), observations (o), and rewards (r). Of particular interest are the recognition O⁻¹: O → S, policy π: S → A, and transition T: S × A → S functions. Focusing on mental simulation rather than actual external action, we assume a degree of equivalence between external and internal representations (Rumelhart et al., 1986; Hamrick, 2019). We consider internal mental states and actions (s̃, ã), effecting change to mental models via a learned neural emulator T̃ (Grush, 2004). Finally, language is considered a form of action (Glenberg, 2008) via external and internal utterances (i.e., semantic parses).
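For concreteness, the following sketch writes these POMDP-style objects out as type signatures, including the internal counterparts used in mental simulation. The class names are placeholders for the abstractions above, not an existing API or implementation.

```python
# Illustrative signatures for the POMDP-style functions of this section.
# All class names are placeholders for the paper's abstractions.
from typing import Protocol, Sequence

class Observation: ...
class State: ...          # external state s
class Action: ...         # external action a
class MentalState: ...    # internal state s~ (structured mental model)
class MentalAction: ...   # internal action a~ (e.g., a semantic parse)

class Recognition(Protocol):        # O^-1 : O -> S
    def __call__(self, o: Observation) -> State: ...

class Policy(Protocol):             # pi : S -> A
    def __call__(self, s: State) -> Action: ...

class Transition(Protocol):         # T : S x A -> S
    def __call__(self, s: State, a: Action) -> State: ...

class Emulator(Protocol):           # T~, applied iteratively over a~ = (a~_0, ..., a~_{L-1})
    def __call__(self, s: MentalState, a: Sequence[MentalAction]) -> MentalState: ...
```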

Connecting symbolic & embodied language understanding. Table 1 presents a structured version of the neural programming language conceptualization. Importantly, this view highlights the commonalities and differences between ECL, symbolic programming languages, and embodied neural mechanisms for perception and action. We illustrate these relations more explicitly through a comparison between ECL and executable semantic parsing.

ECL semantic parsing. This shares the same structure as executable semantic parsing, with the important distinction that simulation is enacted via internal neural representations: T̃(O⁻¹(o), ã) = s̃*. The fully neural formulation enables grounded understanding of non-literal language, demonstrated here for the Fig. 2 example. Metaphoric inference (hyp. 2) facilitates parsing a novel linguistic input o into internal, structured, neural state representations s̃, ã. Accordingly, the utterance u = "Napoleon, the head of the French Army" might be parsed to an internal state s̃ composed of a PART-WHOLE schema, as shown in the figure. The phrase "attacked the Russian fort" could be grounded to a parse ã driving simulation over MOTION and FORCE schemata. The requirement that s̃ and ã should afford mental simulation (hyp. 1) by the neural world emulator T̃ marks an important difference from current neural word embeddings, one that contributes to deeper language understanding: in the resulting mental model T̃(s̃, ã), Napoleon and the French Army likely moved together due to the PART-WHOLE relation between them. This inference is non-trivial, since it requires implicit knowledge (heads and bodies often move together). Indeed, a SOTA NLI model 5 considers it "very likely" that the Fig. 2 sentence contradicts the entailment that "The French Army moved towards the fort but did not enter it." To summarize:

Table 1: Natural language as a neural programming language conceptualization, with correspondence between symbolic programming, ECL, and embodied AI, using standard POMDP notation. Tilde notation refers to internal counterparts of T, s, a used in mental simulation. †Also called mental simulation (Bergen and Chang, 2005), we adopt emulator (Glenberg, 2008) to conform with contemporary cognitive science accounts.

• Executable semantic parsing approaches address grounding literal language to symbolic primitives; metaphoric inference suggests a mechanism for grounding general language using neural primitives (schemata).

• Executable semantic parsing approaches utilize hard-coded, external symbolic executors, whereas ECL highlights the role of learned neural world emulators, as in current embodied AI research efforts (see §7.2).

5 Proposal For An Embodied Language Understanding Model

Formalizing the view characterized above suggests a novel computational model of language understanding. While current statistical models focus on the linguistic signal, research shows that most of the relevant information required for understanding a linguistic message is not present in the words (Stolk et al., 2016; David et al., 2016). Accordingly, the ECL view suggests shifting the focus to the mental models that communicators use, and to the neural mechanisms used to construct them, e.g., mental simulation. What follows adapts a relevant cognitively-inspired framework from general AI to the present NLU setting (§5.1), and discusses computational challenges (§5.2). Note that similar insights have been applied to multi-agent communication problems (Andreas et al., 2017), but their application to general NLU has been limited.

5.1 Formal Framework

The recently introduced Consciousness Prior (CP; Bengio, 2017) is a framework to represent the mental model of a single agent, through the notion of abstract state representations. 6 Here, an abstract state corresponds with s̃ (§4), a low-dimensional, structured, interpretable state encoding, useful for planning, communication, and predicting upcoming observations (François-Lavet et al., 2019). One example is a dynamic knowledge graph embedding used to represent a scene (Kipf et al., 2020).

We adapt CP to a two-player cooperative linguistic communication setting (Tomasello, 2008). We assume a communicator (A) and recipient (B), as shown in Fig. 3. The computational problem of communicators is a "meeting of minds" (Gärdenfors, 2014), or achieving some alignment of their mental models (Rumelhart, 1981; Stolk et al., 2016): the communicator A wishes to induce in B some (possibly ordered) set of goal abstract states G*.

Figure 3: Schema of linguistic communication framework. Communicator’s intent (1) is a high dimensional mental state, i.e., remove apples from the second crate. The low capacity of the linguistic channel (2) leaves the burden of understanding primarily on Communicator and Recipient (embodiment principle). The Recipient’s goal is to understand (3), i.e., reconstruct the intent by integrating linguistic input, knowledge of the state of the world, and internal knowledge (memories, commonsense). Reconstruction results in a successful alignment (4).

We leave exploration of the communicator side to future work, and focus here on understanding. We assume that A sequentially generates utterances u_t ∈ U (we assume equivalence between utterances u and observations o) using an utterance model (Bengio, 2017). Analogously, B uses a comprehension model C such that s̃_t = C(s̃_{t-1}, u_t). We assume that alignment is possible: there exists some sequence of utterances that will induce G*.

This framework is readily applicable to static text (reading comprehension). For example, in Fig. 1, G* would be the sequence of desired states, and each sentence corresponds to an utterance (u_1 = "The world contains 2 crates.", ...).
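A minimal sketch of this comprehension loop, s̃_t = C(s̃_{t-1}, u_t), is given below. The parser and emulator are stand-in stubs with invented names; in the proposal they are learned components (or, in the sub-goal 1 setting of §7.1, a hard-coded executor).

```python
# Sketch of the recipient's comprehension model C folded over a static text.
# `parser` and `emulator` are illustrative stubs, not the proposed components.

def parser(utterance, mental_state):
    """Map an utterance u_t to an internal action a~_t (a semantic parse)."""
    return {"op": "noop", "text": utterance}          # placeholder parse

def emulator(mental_state, parse):
    """Mental simulation T~: apply the parse to the current mental model."""
    return mental_state + [parse]                     # placeholder: accumulate a trace

def comprehend(mental_state, utterance):
    """One step of the comprehension model: s~_t = C(s~_{t-1}, u_t)."""
    return emulator(mental_state, parser(utterance, mental_state))

def read(sentences, initial_state=()):
    """Fold C over the text; the result should align with the goal states G*."""
    state = list(initial_state)
    for u in sentences:            # u_1 = "The world contains 2 crates.", ...
        state = comprehend(state, u)
    return state

final_model = read(["The world contains 2 crates.", "Each crate contains 4 boxes."])
```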

5.2 Computational Challenges Of Embodiment

We can now more precisely characterize the challenges that the recipient faces. At the root of the problem is the embodiment principle (Lawrence, 2017): human internal representations and computation capacity, as represented by s̃ and T̃, respectively, are many orders of magnitude larger than their linguistic communication "bandwidth". We note that though s̃_t is only a subspace of the full mental state, following Stolk et al. (2016) and Bengio (2017) we assume that it still holds that dim(s̃_t) ≫ dim(u_t). The embodiment principle dictates extreme economy in language use (Grice et al., 1975), and results in three major challenges:

Common ground (prior world knowledge). Meaning cannot be spelled out in words but rather must be evoked in the listener (Rumelhart, 1981) by assuming and exploiting common ground (Clark and Schaefer, 1989; Tomasello, 2008), i.e., shared structures of mental representations. In other words, to achieve some aligned goal state g*, the communicators must rely heavily on pre-existing similarities in s̃, ã, and T̃. Developing computational versions of human world models (T̃) is likely AI-complete or close, but useful middle ground may be attained by partial approximations.

Common ground (discourse). In the context of discourse, new information must be accumulated efficiently to update the mental model (Clark and Schaefer, 1989; Stolk et al., 2016). Consider "Remove all apples from the second crate" (Figure 1). Full comprehension is only possible in the context of a sufficiently accurate mental model. Using our previous notations, the comprehension of u_t depends both on the previous utterances u_{1:(t-1)} and the intermediate mental model s̃_{t-1}.

Abstract vs. Literal Language. Interpretation of literal language is relatively straightforward; it is the language first acquired by children, directly related to the physical world. However, much of human language is more abstract, relying on metaphors borne of embodiment. The symbolic programming analog fails for utterances like "these elections seem like a circus": symbolic programming languages cannot handle non-literal interpretations (how are elections like a circus?). This relates to selective analogical inference (Gentner and Forbus, 2011), closely related to ECL: not everything in the source domain (circus) is mapped to the target (elections). Humans easily perceive the salient metaphoric mappings (clown→candidate), but this feat remains extremely complex for machines.

6 Architecture Sketch

This section presents a schematic ECL-inspired architecture towards the implementation of the comprehension model (C), which addresses the challenges presented in §5.2. Fig. 4 shows the proposed architecture. For simplicity, the focus is on a static reading comprehension setting, but the architecture supports richer environments as well.

Figure 4: Architecture for comprehender (§5), demonstrated on a symbolic version of the example task of Fig. 1. The agent receives natural language input from the environment. The agent has global memory – short-term, keeping track of the mental model of the world, and long-term, containing compiled knowledge ("library classes and functions"). The parser interprets input into a parse ã_t, enacting mental simulation using the emulator. The mental model is then updated, ready for the next input. The sub-goals refer to the order in which components are learned (as opposed to hard-coded) in our proposed roadmap (§7).

6.1 Environment

The environment provides an "interaction API" to the agent, as well as the reward signal. The supported interaction may vary considerably depending on the task; for reading comprehension, it allows structured access to the text while supporting flexible reading strategies. This flexibility is important for long documents, where navigation may be required (Geva and Berant, 2018). For executable semantic parsing, there might be external systems to interact with besides the text, such as a database.

6.2 Agent

The agent architecture approximates the important ECL functions outlined in §4, and consists of four main modules:

Memory. We distinguish between two forms of memory. The first is an episodic, short-term mental model: the system's current abstract state representation (s̃_t). The symbolic programming analog is the execution trace of a program, containing the states of relevant working variables at each execution step. Fig. 4 displays the updated mental model, after the removal of the apples. Compiled knowledge, or long-term memory, reflects highly familiar object representations, behaviors, and schemata, such as common sense, intuitive psychology and physics. The symbolic programming language analogs of this are libraries: largely static, hierarchical, and compositional repositories of functions and classes. In the course of language interpretation, these libraries are "importable": for the symbolic example in Fig. 4, the parser might instantiate a new variable of an imported type (e.g., crate2 = Container()). Both types of memory are accessible to all components of the agent.

Parser. An abstraction of higher-level perception, control, reasoning, and linguistic functions. It handles interpretation of new linguistic inputs based on prior knowledge and the current mental state. Consonant with the view of analogy-making as a kind of higher-level perception or recognition (Mitchell, 1993), metaphoric inference is involved in grounding a novel input u_t into internal, neural state representations s̃_t, ã_t affording simulation. See Fig. 4 and Fig. 2 for examples on literal and non-literal language, respectively.

Emulator. Functionally similar to the executor module in executable semantic parsing, but learned, and obviously far greater in scale. This module is an abstraction of neural emulation mechanisms (T̃), representing a wide range of functions, from lower-level motor control and imagery to higher-level models used for planning and theory of mind (Grush, 2004). It operates over the current mental model and the semantic parse from the parser; the output is an updated mental model.

Importantly, the proposed architecture is designed to address the challenges outlined in §5.2: compiled knowledge underlies human common ground, the building blocks of s̃, ã, and T̃; memory and emulation are instrumental for accumulation in discourse; and the ability to understand abstract language involves all modules in the system.
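The following sketch summarizes how these modules might fit together as code. It is schematic only: the module interfaces are our own guesses, and each class body is a placeholder for a component that is learned, or initially hard-coded, per the roadmap in §7.

```python
# Schematic decomposition of the agent of Fig. 4; all names are illustrative.
from dataclasses import dataclass, field

@dataclass
class Memory:
    mental_model: dict = field(default_factory=dict)   # short-term: current s~_t
    compiled: dict = field(default_factory=dict)        # long-term: "importable" schemata

class Parser:
    def __call__(self, utterance, memory):
        """Ground u_t into a parse a~_t using compiled knowledge and the mental model."""
        return {"utterance": utterance}                  # placeholder parse

class Emulator:
    def __call__(self, parse, memory):
        """Mental simulation T~: update the mental model according to the parse."""
        memory.mental_model.setdefault("trace", []).append(parse)

class Agent:
    def __init__(self):
        self.memory = Memory()
        self.parse = Parser()
        self.emulate = Emulator()

    def step(self, utterance):
        """Read one input from the environment and update the mental model."""
        self.emulate(self.parse(utterance, self.memory), self.memory)
        return self.memory.mental_model
```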

7 Implementation Roadmap

The architecture outlined in §6 is very ambitious; its implementation requires much further research. This section proposes a roadmap to this goal, identifying three sub-goals ( Fig. 4) , presented in order of increasing difficulty. Broadly speaking, the level of difficulty is determined by which components are assumed as given in the input (here this also means they are hard-coded in a symbolic programming language), and which must be learned.

7.1 Sub-Goal 1: Learning Open-Domain Simulation

Observing that literal language is close to the embodied primitives level, its interpretation is simpler (than that of non-literal language, see §4). Therefore, in this phase, the emulator and compiled knowledge are hard-coded; here the focus is learning the parser. In other words, this sub-goal focuses on extending executable semantic parsing from relatively narrow domains to handle more general literal language on-the-fly, similarly to zero-shot semantic parsing (Givoli and Reichart, 2019) . For the example in §2.1, the parser could be expected to infer the types (boxes as containers, fruits as objects) either by context (Yao et al. (2018) explore a preliminary schema-based approach) or explicit declarative language, using them to configure the emulator to handle the specific required problem setting (Tamari et al., 2020) .
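As a rough illustration of what such on-the-fly configuration might look like, the sketch below maps declarative sentences from Fig. 1 to a configuration that a hard-coded container emulator (as in §2.1) could consume. The patterns and schema labels are invented for the example; a learned parser would replace the hand-written rules.

```python
import re

# Hypothetical mapping from declarative sentences to emulator configuration.
def configure(sentences):
    config = {"types": {}, "capacity": {}}
    for s in sentences:
        m = re.match(r"Each (\w+) may contain up to (\d+) objects", s)
        if m:
            config["capacity"][m.group(1)] = int(m.group(2))   # e.g., box -> 5
        m = re.match(r"(\w+) and (\w+) are objects", s)
        if m:
            config["types"][m.group(1).lower()] = "OBJECT"
            config["types"][m.group(2).lower()] = "OBJECT"
    return config

print(configure(["Each box may contain up to 5 objects.",
                 "Oranges and apples are objects."]))
# {'types': {'oranges': 'OBJECT', 'apples': 'OBJECT'}, 'capacity': {'box': 5}}
```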

As in similar projects exploring embodied understanding (Pustejovsky and Krishnaswamy, 2016; Baldridge et al., 2018), new simulator frameworks must be developed. While full embodiment calls for multiple modalities, the degree to which it is required remains an important open question (Lupyan and Lewis, 2019). Accordingly, and for immediate applicability to purely textual NLU problems, we propose also focusing on the simpler setting of interactive text (Nelson, 2005). Recent research on text-based games shows how agents can learn to "program" in such languages (Ammanabrolu and Riedl, 2019), and how real language understanding problems can be framed as executable semantic parsing using configurable text-based simulators (Tamari et al., 2019).

7.2 Sub-Goal 2: Learning To Simulate

This phase assumes that the compiled knowledge is given (hard-coded), while the parser and emulator modules are neural (learned). A hard-coded emulator will likely be needed to train a learned emulator. The learned event execution of Narayanan (1997) provides a useful starting point towards computational models capable of such inference. In general, learned simulation is relatively unexplored in the context of natural language, though recent work has explored it in a "blocks-world" instruction-following setup (Gaddy and Klein, 2019). Outside of NLU, learning structured world models is a long-studied, fast-growing field in embodied AI research (Schmidhuber, 1990; Ha and Schmidhuber, 2018; Hamrick, 2019; Anand et al., 2019; Kipf et al., 2020), and recently also in learned executors for neural programming (Kant, 2018). We expect much useful cross-fertilization with these fields.

7.3 Sub-Goal 3: Learning Compiled Knowledge

This phase focuses on the component that is seemingly hardest to learn: compiled knowledge. Out of scope here is a fully neural setting in which all components are jointly learned, as in continual learning research (Parisi et al., 2019). Instead, we focus on a simpler setting, in which the compiled knowledge is learned but represented by symbolic code, i.e., learning the static code library underlying the simulation framework. This sub-goal is relevant for training the parser (§7.1) as well as the emulator (§7.2), and can be pursued in parallel to them.

In this setting, learning compiled knowledge is closely related to automated knowledge base construction (Winn et al., 2019) or frame induction from text (QasemiZadeh et al., 2019). Our proposed paradigm suggests enriching classic symbolic knowledge representations (Speer et al., 2017) into executable form (Tamari et al., 2020). Preliminary steps in this direction are seen in inferential knowledge bases such as ATOMIC, which provides limited execution logic using edges typed with if-then relations.

Alongside FrameNet and MetaNet, others have collected schema and metaphor mappings, by learning them from large corpora (Beigman Klebanov et al., 2016; Gao et al., 2018) . Pastra et al. (2011) built a database of concepts directly groundable to sensorimotor representations, primarily for robotics applications.

8 Conclusions

This position paper has proposed an approach to representation and learning based on the tenets of ECL. The proposed architecture, drawing on contemporary cognitive science, aims to address key limitations of current NLU systems through mental simulation and grounded metaphoric inference. We outlined major challenges and suggested a roadmap towards realizing the proposed vision.

Growing empirical evidence shows that language is intricately intertwined with a vast range of other neural processes. Accordingly, this work suggests a symbiotic view of cognitive science, embodied AI, and computational linguistics. As these fields share foundational problems, they may better co-evolve common solutions. Finally, we believe that attaining deeper language understanding must be a large-scale effort, beyond the scope of any one research group. We hope that the paradigm presented here will help provide coherence to such efforts. One of our main goals was to stimulate discussion; moving forward, we welcome comments, feedback, and suggestions.

Typically, mental models are construed as "world simulators"; see §3.

Slightly abusing notation, we apply T iteratively on a sequence of actions a = (a_0, ..., a_{L-1}).

We use Liu et al. (2019b) with https://demo.allennlp.org/textual-entailment/.

For brevity we omit discussion of deriving abstract states from the full mental state; see Bengio (2017) for details.