# Value-based Search in Execution Space for Mapping Instructions to Programs

## Authors

## Abstract

Training models to map natural language instructions to programs, given target world supervision only, requires searching for good programs at training time. Search is commonly done using beam search in the space of partial programs or program trees, but as the length of the instructions grows finding a good program becomes difficult. In this work, we propose a search algorithm that uses the target world state, known at training time, to train a critic network that predicts the expected reward of every search state. We then score search states on the beam by interpolating their expected reward with the likelihood of programs represented by the search state. Moreover, we search not in the space of programs but in a more compressed state of program executions, augmented with recent entities and actions. On the SCONE dataset, we show that our algorithm dramatically improves performance on all three domains compared to standard beam search and other baselines.

## 1 Introduction

Training models that can understand natural language instructions and execute them in a realworld environment is of paramount importance for communicating with virtual assistants and robots, and therefore has attracted considerable attention (Branavan et al., 2009; Vogel and Jurafsky, 2010; Chen and Mooney, 2011) . A prominent approach is to cast the problem as semantic parsing, where instructions are mapped to a high-level programming language Long et al., 2016; Guu et al., 2017) . Because annotating programs at scale is impractical, it is desirable to train a model from instructions, an initial world state, and a target world state only, letting the program itself be a latent variable.

Learning from such weak supervision results in a difficult search problem at training time. The model must search for a program that when executed leads to the correct target state. Early work employed lexicons and grammars to constrain the search space (Clarke et al., 2010; Liang et al., 2011; Krishnamurthy and Mitchell, 2012; Berant et al., 2013; , but recent success of sequence-to-sequence models (Sutskever et al., 2014) shifted most of the burden to learning. Search is often performed simply using beam search, where program tokens are emitted from left-to-right, or program trees are generated top-down (Krishnamurthy et al., 2017; Yin and Neubig, 2017; Cheng et al., 2017; Rabinovich et al., 2017) or bottom-up Guu et al., 2017; Goldman et al., 2018) . Nevertheless, when instructions are long and complex and reward is sparse, the model may never find enough correct programs, and training will fail.

In this paper, we propose a novel search algorithm for mapping a sequence of natural language instructions to a program, which extends the standard beam-search in two ways. First, we capitalize on the target world state being available at training time and train a critic network that given the language instructions, current world state, and target world state estimates the expected future reward for each search state. In contrast to traditional beam search where states representing partial programs are scored based on their likelihood only, we also consider expected future reward, leading to a more targeted search at training time. Second, rather than search in the space of programs, we search in a more compressed execution space, where each state is defined by a partial program's execution result.

We evaluated our method on the SCONE dataset, which includes three different domains where long sequences of 5 instructions are mapped to programs. We show that while standard beam search gets stuck in local optima and is unable to discover good programs for many examples, our model is able to bootstrap, improving final performance by 20 points on average. We also perform extensive analysis and show that both value-based search as well as searching in execution space contribute to the final performance. Our code and data are available at http://gitlab.com/ tau-nlp/vbsix-lang2program.

## 2 Background

Mapping instructions to programs invariably involves a context, such as a database or a robotic environment, in which the program (or logical form) is executed. The goal is to train a model given a training set {(x (j) = (c (j) , u (j) ), y (j) )} N j=1 , where c is the context, u is a sequence of natural language instructions, and y is the target state of the environment after following the instructions, which we refer to as denotation. The model is trained to map the instructions u to a program z such that executing z in the context c results in the denotation y, which we denote by z c = y. Thus, the program z is a latent variable we must search for at both training and test time. When the sequence of instructions is long, search becomes hard, particularly in the early stages of training.

Recent work tackled the training problem using variants of reinforcement learning (RL) (Suhr and Artzi, 2018; Liang et al., 2018) or maximum marginal likelihood (MML) (Guu et al., 2017; Goldman et al., 2018) . We now briefly describe MML training, which we base our training procedure on, and outperformed RL in past work under comparable conditions (Guu et al., 2017) .

We denote by π θ (•) a model, parameterized by θ, that generates the program z token by token from left to right. The model π θ receives the context c, instructions u and previously predicted tokens z 1...t−1 , and returns a distribution over the next program token z t . The probability of a program prefix is defined to be:

p θ (z 1...t | u, c) = t i=1 π θ (z i | u, c, z 1...i−1 ).

The model is trained to maximize the MML objective:

J M M L = log (c,u,y) p θ (y | c, u) = (c,u,y) log( z p θ (z | u, c) • R(z)),

where R(z) = 1 if z c = y, and 0 otherwise (For brevity, we omit c and y from R(•)). Typically, 1. The person in an orange hat moves to the left of the person in a red hat 2. He then disappears 3. Then a person in orange appears on the far right X: y: Figure 1 : Example from the SCENE domain in SCONE (3 utterances), where people with different hats and shirts are located in different positions. The example contains (from top to bottom): an initial world (the people's properties, from left to right: a blue shirt with an orange hat, a blue shirt with a red hat, and a green shirt with a yellow hat), a sequence of natural language instructions, and a target world (the people's properties, from left to right: blue shirt with a red hat, green shirt with a yellow hat, and an orange shirt).

the MML objective is optimized with stochastic gradient ascent, where the gradient for an example (c, u, y) is:

∇ θ J M M L = z q(z) • R(z)∇ θ log p θ (z|c, u) q(z) := R(z) • p θ (z|c, u) z R(z) • p θ (z|c, u)

.

The search problem arises because it is impossible to enumerate the set of all programs, and thus the sum over programs is approximated by a small set of high probability programs, which have high weights q(•) that dominate the gradient. Search is commonly done with beam-search, an iterative algorithm that builds an approximation of the highest probability programs according to the model. At each time step t, the algorithm constructs a beam B t of at most K program prefixes of length t. Given a beam B t−1 , B t is constructed by selecting the K most likely continuations of prefixes in B t−1 , according to p θ (z 1..t |•). The algorithm runs L iterations, and returns all complete programs discovered.

In this paper, our focus is the search problem that arises at training time when training from denotations, i.e., finding programs that execute to the right denotation. Thus, we would like to focus on scenarios where programs are long, and standard beam search fails. We next describe the SCONE dataset, which provides such an opportunity. Then he disappears Figure 2 : An illustration of a state transition in SCONE. π θ predicted the action token MOVE. Before the transition (left), the command history is empty and the program stack contains the arguments 3 and 1, which were computed in previous steps. Note that the state does not include the partial program that computed those arguments. After transition (right), the executor popped the arguments from the program stack and applied the action MOVE 3 1: the man in position 3 (a person with a yellow hat and a red shirt) moved to position 1 (to the left of the person with the blue shirt), and the action is added to the execution history with its arguments. Since this action terminated the command, π θ advanced to the next utterance.

The SCONE dataset Long et al. (2016) presented the SCONE dataset, where a sequence of instructions needs to be mapped to a program consisting of a sequence of executable commands. The dataset has three domains, where each domain includes several objects (such as people or beakers), each with different properties (such as shirt color or chemical color). SCONE provides a good environment for stress-testing search algorithms because a long sequence of instructions needs to be mapped to a program. Figure 1 shows an example from the SCENE domain. Formally, the context in SCONE is a world specified by a list of positions, where each position may contain an object with certain properties. A formal language is defined to interact with the world. The formal language contains constants (e.g., numbers and colors), functions that allow to query the world and refer to objects and intermediate computations, and actions, which are functions that modify the world state. Each command is composed of a single action and several arguments constructed recursively from constants and functions. E.g., the command MOVE(HASHAT(YELLOW), LEFTOF(HASSHIRT(BLUE))), contains the action MOVE, which moves a person to a specified position.

The person is computed by HASHAT(YELLOW), which queries the world for the position of the person with a yellow hat, and the target position is computed by LEFTOF(HASSHIRT(BLUE)). We refer to Guu et al. (2017) for a full description of the language.

Our goal is to train a model that given an initial world w 0 and a sequence of natural language utterances u = (u 1 , . . . , u M ), will map each utterance u i to a command z i such that applying the program z = (z 1 , . . . , z M ) on w 0 will result in the target world, i.e., z w 0 = w M = y.

## 3 Markov Decision Process Formulation

To present our algorithm, we first formulate the problem as a Markov Decision Process (MDP), which is a tuple (S, A, R, δ). To define the state set S, we assume all program prefixes are executable, which can be easily done as we show below. The execution result of a prefixz in the context c, denoted by z ex c , contains its denotation and additional information stored in the executor. Let Z pre be the set of all valid programs prefixes. The set of states is defined to be S = {(x, z ex c ) | z ∈ Z pre }, i.e., the input paired with all possible execution results.

The action set A includes all possible program tokens 1 , and the transition function δ : S×A → S is computed by the executor. Last, the reward R(s, a) = 1 iff the action a ends the program and leads to a state δ(s, a) where the denotation is equal to the target y. The model π θ (•) is a parameterized policy that provides a distribution over the program vocabulary at each time step.

## Scone As An Mdp

We define the partial execution result z ex c in SCONE, as described by Guu et al. (2017) . We assume that SCONE's formal language is written in postfix notations, e.g., the instruction MOVE(HASHAT(YELLOW), LEFTOF(HASSHIRT(BLUE))) is written as YEL-LOW HASHAT BLUE HASSHIRT LEFTOF MOVE.

With this notation, a partial program can be executed left-to-right by maintaining a program stack, ψ. The executor pushes constants (YELLOW) to ψ, and applies functions (HASHAT) by popping their arguments from ψ and pushing back the computed result. Actions (MOVE) are applied by popping arguments from ψ and performing the action in the current world.

To handle references to previously executed commands, SCONE's formal language includes functions that provide access to actions and arguments in previous commands. To this end, the executor maintains an execution history, h i = (e 1 , . . . , e i ), a list of executed actions and their arguments. Thus, the execution result of a program prefix is z ex

w 0 = (w i−1 , ψ, h i−1 )

. We adopt the model from Guu et al. (2017) (architecture details in appendix A): The policy π θ observes ψ and u i , the current utterance being parsed, and predicts a token. When the model predicts an action token that terminates a command, the model moves to the next utterance (until all utterances have been processed). The model uses functions to query the world w i−1 and history h i−1 . Thus, each MDP state in SCONE is a pair s = (u i , z ex w 0 ). Figure 2 illustrates a state transition in the SCENE domain. Importantly, the state does not store the full program's prefix, and many different prefixes may lead to the same state. Next, we describe a search algorithm for this MDP.

## 4 Searching In Execution Space

Model improvement relies on generating correct programs given a possibly weak model. Standard beam-search explores the space of all program token sequences up to some fixed length. We propose two technical contributions to improve search: (a) We simplify the search problem by searching for correct executions rather than correct programs; (b) We use the target denotation at training time to better estimate partial program scores in the search space. We describe those next.

## 4.1 Reducing Program Search To Execution Search

Program space can be formalized as a directed tree

T = (V T , E T )

, where vertices V T are program prefixes, and labeled edges E T represent prefixes' continuations: an edge e = (z,z ) labeled with the token a, represents a continuation of the prefix z with the token a, which yields the prefixz . The Figure 2 . Since multiple prefixes have the same execution result (e.g., YELLOW HASHAT and RED HASSHIRT), the execution space is smaller.

root of the graph represents the empty sequence. Similarly, Execution space is a directed graph G = (V G , E G ) induced from the MDP described in Section 3. Vertices V G represent MDP states, which express execution results, and labeled edges E G represent transitions. An edge (s 1 , s 2 ) labeled by token a means that δ(s 1 , a) = s 2 . Since multiple programs have the same execution result, execution space is a compressed representation of program space. Figure 3 shows a few commands in both program and execution space. Execution space is smaller, and so searching in it is easier.

Each path in execution space represents a different program prefix, and the path's final state represents its execution result. Program search can therefore be reduced to execution search: given an example (c, u, y) and a model π θ , we can use π θ to explore in execution space, discover correct terminal states, i.e. states corresponding to correct full programs, and extract paths leading to those states. As the number of paths may be exponential in the size of the graph, we can use beam-search to extract the most probable correct programs (according to the model) in the discovered graph.

Our approach is similar to the DPD algorithm , where CKYstyle search is performed in denotation space, followed by search in a pruned space of programs. However, DPD was used without learning, and the search was not guided by a trained model, which is a major part of our algorithm.

## 4.2 Value-Based Beam Search In Execution Space

We propose Value-based Beam Search in eXecution space (VBSIX), a variant of beam

Algorithm 1 Program Search with VBSIX 1: function PROGRAMSEARCH(c, u, y, π θ , V φ ) 2: G, T ← VBSIX(c, u, y, π θ , V φ )

3:

Z ← Find paths in G that lead to states in T with beam search 4:

Return Z 5: function VBSIX(c, u, y, π θ , V φ ) 6:

T ← ∅, P ← {} init terminal states and DP chart 7:

s0 := The empty program parsing state 8:

B0 ← {s0}, G = ({s0}, ∅)

Init beam and graph 9:

P0[s0] ← 1

The probability of s0 is 1 10:

for t ∈ [1 . . . L] do

## 11:

Bt ← ∅

## 12:

for s ∈ Bt−1, a ∈ A do 13:

s ← δ(s, a)

14:

Add edge (s, s ) to G labeled with a

## 15:

if s is correct terminal then 16:

T ← T ∪ {s }

17:

else 18

:

Pt[s ] ← Pt[s ] + Pt−1[s] • π θ (a|s)

19:

Bt ← Bt ∪ {s }

## 20:

Sort Bt by AC-SCORER(•)

## 21:

Bt ← Keep the top-K states in Bt 22:

Return G, T 23: function AC-SCORER(s, Pt, V φ , y)

24:

Return Pt[s] + V φ (s, y)

search modified for searching in execution space, that scores search states with a value-based network. VBSIX is formally defined in Algorithm 1, which we will refer to throughout this section. Standard beam search is a breadth-first traversal of the program space tree, where a fixed number of vertices are kept in the beam at every level of the tree. The selection of vertices is done by scoring their corresponding prefixes according to p θ (z 1...t | u, c). VBSIX applies the same traversal in execution space (lines 10-21). However, since each vertex in the execution space represents an execution result and not a particular prefix, we need to modify the scoring function.

Let s be a vertex discovered in iteration t of the search. We propose two scores for ranking s. The first is the actor score, the probability of reaching vertex s after t iterations 2 according to the model π θ . The second and more novel score is the valuebased critic score, an estimate of the state's expected reward. The AC-Score is the sum of these two scores (lines 23-24).

The actor score, p t θ (s), is the sum of probabilities of all prefixes of length t that reach s (rather than the probability of one prefix as in beam search). VBSIX approximates p t θ (s) by performing the summation only over paths that reach s via states in the beam B t−1 , which can be done efficiently with a dynamic programming (DP) chart P t [s] that keeps actor score approximations in each iteration (line 18). This lower-bounds 2 We score paths in different iterations independently to avoid bias for shorter paths. An MDP state that appears in multiple iterations will get a different score in each iteration. the true p t θ (s) since some prefixes of length t that reach s might have not been discovered.

Contrary to standard beam-search, we want to score search states also with a critic score E p θ [R(s)], which is the sum of the suffix probabilities that lead from s to a correct terminal state:

E p θ [R(s)] = τ (s) p θ (τ (s) | s) • R(τ (s)),

where τ (s) are all possible trajectories starting from s and R(τ (s)) is the reward observed when taking the trajectory τ (s) from s. Enumerating all trajectories τ (s) is intractable and so we will approximate E p θ [R(s)] with a trained value network V φ (s, y), parameterized by φ. Importantly, because we are searching at training time, we can condition V φ on both the current state s and target denotation y. At test time we will use π θ only, which does not need access to y.

Since the value function and DP chart are used for efficient ranking, the asymptotic run-time complexity of VBSIX is the same as standard beam search (O(K •|A|•L)). The beam search in Line 3, where we extract programs from the constructed execution space graph, can be done with a small beam size, since it operates over a small space of correct programs. Thus, its contribution to the algorithm complexity is negligible. Figure 4 visualizes scoring and pruning with the actor-critic score in iteration t. Vertices in B t are discovered by expanding vertices in B t−1 , and each vertex is ranked by the AC-scorer. The highlighted vertex score is 0.6, a sum of the actor score (0.5) and the critic score (0.1). The actor score is the sum of its prefixes ((0.05 + 0.55) • 0.7 + 0.01 • 0.08 = 0.5) and the critic score is a value network estimate for the sum of probabilities of outgoing trajectories reaching correct terminal states (0.02 + 0.08 = 0.1). Only the top-K states are kept in the beam (K = 4 in the figure).

VBSIX leverages execution space in a number of ways. First, since each vertex in execution space compactly represents multiple prefixes, a beam in VBSIX effectively holds more prefixes than in standard beam search. Second, running beam-search over a graph rather than a tree is less greedy, as the same vertex can surface back even if it fell out of the beam.

The value-based approach has several advantages as well: First, evaluating the probability of outgoing trajectories provides look-ahead that is missing from standard beam search. Second (and most importantly), V φ is conditioned on y, which π θ doesn't have access to, which allows finding correct programs with low model probability, that π θ can learn from. We note that our two contributions are orthogonal: the critic score can be used in program space, and search in execution space can be done with actor-scores only.

## 5 Training

We train the model π θ and value network V φ jointly (Algorithm 2). π θ is trained using MML over discovered correct programs (Line 4, Algorithm 1). The value network is trained as follows: Given a training example (c, u, y), we generate a set of correct programs Z pos with VB-SIX. The value network needs negative examples, and so for every incorrect terminal state z neg found during search with VBSIX we create a single program leading to z neg . We then construct a set of training examples D v , where each example ((s, y), l) labels states encountered while generating programs z ∈ Z with the probability mass of correct programs suffixes that extend it, i.e., l = zt p θ (z t...|z| ), where z t ranges over all z ∈ Z and t ∈ [1 . . . |z|] Finally, we train V φ to minimize the log-loss objective:

((s,y),l)∈Dv l• logV φ (s,y)+(1−l)(log(1−V φ (s,y))) .

Similar to actor score estimation, labeling examples for V φ is affected by beam-search errors: the labels lower bound the true expected reward. However, since search is guided by the model, those programs are likely to have low probability. Moreover, estimates from V φ are based on multiple examples, compared to probabilities in the DP Algorithm 2 Actor-Critic Training 1: procedure TRAIN() 2:

Initialize θ and φ randomly 3:

while π θ not converged do

## 4:

(x := (c, u), y) ← select random example 5:

Zpos ← PROGRAMSEARCH(c, u, y, π θ , V φ )

## 6:

Zneg ← programs leading to incorrect terminal states

## 7:

Dv ← BUILDVALUEEXAMPLES((Zpos ∪ Zneg), c, y)

## 8:

Update φ using Dv, update θ using (x, Zpos, y) 9: function BUILDVALUEEXAMPLES(Z, c, u, y) 10:

for z ∈ Z do 11: for t ∈ [1 . . . |z|] do 12: s ← z1...t ex c 13: L[s] ← L[s] + p θ (z t...|z| | c, u) • R(z)

14:

Dv ← {((s, y), L[s])} s∈L 15:

Return Dv chart, and are more robust to search errors.

Neural network architecture: We adapt the model proposed by Guu et al. (2017) for SCONE. The model receives the current utterance u i and program stack ψ, and returns a distribution over the next token. Our value network receives the same input, but also the next utterance u i+1 , the world state w i and target world state y, and outputs a scalar. Appendix A provides a full description.

## 6.1 Experimental Setup

We evaluated our method on the three domains of SCONE with the standard accuracy metric, i.e., the proportion of test examples where the predicted program has the correct denotation. We trained with VBSIX, and used standard beam search (K = 32) at test time for programs' generation. Each test example contains 5 utterances, and similar to prior work we reported the model accuracy on all 5 utterances as well as the first 3 utterances. We ran each experiment 6 times with different random seeds and reported the average accuracy and standard deviation.

In contrast to prior work on SCONE (Long et al., 2016; Guu et al., 2017; Suhr and Artzi, 2018) , where models were trained on all sequences of 1 or 2 utterances, and thus were exposed during training to all gold intermediate states, we trained from longer sequences keeping intermediate states latent. This leads to a harder search problem that was not addressed previously, but makes our results incomparable to previous results 3 . In SCENE and TANGRAM, we used the first 4 and 5 utterances as examples. In ALCHEMY, we used the first utterance and 5 utterances. Training details To warm-start the value network, we trained it for a few thousand steps, and only then start re-ranking with its predictions. Moreover, we gain efficiency by first returning K 0 (=128) states with the actor score, and then re-ranking with the actor-critic score, returning K(=32) states. Last, we use the value network only in the last two utterances of every example since we found it has less effect in earlier utterances where future uncertainty is large. We used the Adam optimizer (Kingma and Ba, 2014) and fixed GloVe embeddings (Pennington et al., 2014) for utterance words.

Baselines We evaluated the following training methods (Hyper-parameters are in appendix B): 1. MML: Our main baseline, where search is done with beam search and training with MML. We used randomized beam-search, which addsgreedy exploration to beam search, which was proposed by Guu et al. (2017) and performed better 4 . 2. EXPERT-MML: An alternative way of using the target denotation y at training time, based on imitation learning (Daume et al., 2009; Ross et al., 2011; Berant and Liang, 2015) , is to train an expert policy π expert θ , which receives y as input in addition to the parsing state, and trains with the MML objective. Then, our policy π θ is trained using programs found by π expert θ . The intuition is that the expert can use y to find good programs that the policy π θ can train from. 3. VBSIX: Our proposed training algorithm.

We also evaluated REINFORCE, where Monte-Carlo sampling is used as search strategy (Williams, 1992; Sutton et al., 1999) . We followed the implementation of Guu et al. (2017) , who performed variance reduction with a constant baseline and added -greedy exploration. We found that REINFORCE fails to discover any correct programs to bootstrap from. Table 1 reports test accuracy of VBSIX compared to the baselines. First, VBSIX outperforms all baselines in all cases. MML is the strongest baseline, but even with an increased beam (K = 64), VBSIX (K = 32) surpasses it by a large margin (more than 20 points on average). On top of the improvement in accuracy, in ALCHEMY and TAN-GRAM the standard deviation of VBSIX is lower than the other baselines across the 6 random seeds, showing the robustness of our model. EXPERT-MML performs worse than MML in all cases. We hypothesize that using the denotation y as input to the expert policy π expert θ results in many spurious programs, i.e., they are unrelated to the utterance meaning. This is since the expert can learn to perform actions that take it to the target world state while ignoring the utterances completely. Such programs will lead to bad generalization of π θ . Using a critic at training time eliminates this problem, since its scores depend on π θ .

## 6.2 Results

Ablations We performed ablations to examine the benefit of our two technical contributions (a) execution space (b) value-based search. Table 2 presents accuracy on the validation set when each component is used separately, when both of them are used (VBSIX), and when none are used (beam-search). We find that both contributions are important for performance, as the full system achieves the highest accuracy in all domains. In SCENE, each component has only a slight advantage over beam-search, and therefore both are required to achieve significant improvement. However, in ALCHEMY and TANGRAM most of the gain is due to the value network. We also directly measured the hit accuracy at training time, i.e., the proportion of training examples where the beam produced by the search algorithm contains a program with the correct denotation, showing the effectiveness of search at training time. In Figure 5 , we report train hit accuracy in each training step, averaged across 6 random seeds. The graphs illustrate the performance of each search algorithm in every domain throughout training. The validation accuracy results are correlated with the improvement in train hit-accuracy.

## 6.3 Analysis

Execution Space We empirically measured two quantities that we expect should reflect the advantage of execution-space search. First, we measured the number of programs stored in the execution space graph compared to beam search, which holds K programs. Second, we counted the average number of states that are connected to correct terminal states in the discovered graph, but fell out of the beam during the search. The property reflects the gain from running search over a graph structure, where the same vertex can resurface. We preformed the analysis on VBSIX over 5-utterance training examples in all 3 domains. The following table summarizes the results:

We found the measured properties and the contribution of execution space in each domain are correlated, as seen in the ablations. Differences between domains are due to the different complexities of their formal languages. As the formal language becomes more expressive, the execution space is more compressed as each state can be reached in more ways. In particular, the formal language in SCENE contains more functions compared to the other domains, and so it benefits the most from execution-space search.

Value Network We analyzed the accuracy of the value network at training time by measuring, for each state, the difference between its expected reward (estimated from the discovered paths) and the value network prediction. Figure 6 shows the average difference in each training step for all encountered states (in blue), and for high reward states only (states with expected reward larger than 0.7, in orange). Those metrics are averaged across 6 runs with different random seeds. The accuracy of the value network improves during training, except when the policy changes substantially, in which case the value network needs to re-evaluate the policy. When the value network converges, the difference between its predictions and the expected reward is 0.15 − 0.2 on average. However, for high reward states the difference is higher (∼ 0.3). This indicates that the value network has a bias toward lower values, which is expected as most states lead to low rewards. Since VBSIX uses the value network as a beam-search ranker, the value network doesn't need to be exact as long as it assigns higher values to states with higher expected rewards. Further analysis is provided in appendix D. Figure 6 : The difference between the prediction of the value network and the expected reward (estimated from the discovered paths) during training. We report the average difference for all of the states (blue) and for the high reward states only (> 0.7, orange). The results are averaged over 6 runs with different random seeds (best viewed in color).

## 7 Related Work

Training from denotations has been extensively investigated (Kwiatkowski et al., 2013; Pasupat and Liang, 2015; Bisk et al., 2016) , with a recent emphasis on neural models (Neelakantan et al., 2016; Krishnamurthy et al., 2017) . Improving beam search has been investigated by proposing specialized objectives (Wiseman and Rush, 2016) , stopping criteria , and using continuous relaxations (Goyal et al., 2018) . Bahdanau et al. (2017) and Suhr and Artzi (2018) proposed ways to evaluate intermediate predictions from a sparse reward signal. Bahdanau et al. (2017) used a critic network to estimate expected BLEU in translation, while Suhr and Artzi (2018) used edit-distance between the current world and the goal for SCONE. But, in those works stronger supervision was assumed: Bahdanau et al. (2017) utilized the gold sequences, and Suhr and Artzi (2018) used intermediate worlds states. Moreover, intermediate evaluations were used to compute gradient updates, rather than for guiding search.

Guiding search with both policy and value networks was done in Monte-Carlo Tree Search (MCTS) for tasks with a sparse reward (Silver et al., 2017; T. A. and and Barber, 2017; Shen et al., 2018) . In MCTS, value network evaluations are refined with backup updates to improve policy scores. In this work, we gain this advantage by using the target denotation. The use of an actor and a critic is also reminiscent of A * where states are scored by past cost and an admissible heuristic for future cost (Klein and Manning, 2003; Pauls and Klein, 2009; lee et al., 2016) . In semantic parsing, Misra et al. (2018) recently proposed a critic distribution to improve the policy, which is based on prior domain knowledge (that is not learned).

## 8 Conclusions

In this work, we propose a new training algorithm for mapping instructions to programs given denotation supervision only. Our algorithm exploits the denotation at training time to train a critic network used to rank search states on the beam, and performs search in a compact execution space rather than in the space of programs. We evaluated on three different domains from SCONE, and found that it dramatically improves performance compared to strong baselines across all domains.

VBSIX is applicable to any task that supports graph-search exploration. Specifically, for tasks that can be formulated as an MDP with a deterministic transition function, which allow efficient execution of multiple partial trajectories. Those tasks include a wide range of instruction mapping (Branavan et al., 2009; Vogel and Jurafsky, 2010; Anderson et al., 2018) and semantic parsing tasks (Dahl et al., 1994; Iyyer et al., 2017; Yu et al., 2018) . Therefore, evaluating VBSIX on other domains is a natural next step for our research.

## Relu

Att.

Softmax z t w i RelU Sigmoid v w M u i+1

BiLSTM Figure 7 : The model proposed by Guu et al. (2017) (top) , and our value network (bottom).

states are embedded by concatenating embeddings of SCONE elements. The inputs are concatenated and fed to a feed-forward network, followed by a sigmoid layer that outputs a scalar. Guu et al. (2017) . In all experiments learning rate was 0.001 and mini-batch size was 8. We explicitly define the following hyper-parameters which are not selfexplanatory: 1. Training steps: The number of training steps taken. 2. Sample size: Number of samples drawn from p θ in REINFORCE 3. Baseline: A constant subtracted from the reward for variance reduction. 4. Execution beam size: K in Algorithm 1. 5. Program bea size: Size of beam in line 3 of Algorithm 1. 6. Value ranking start step:

## B Hyper-Parameters

Step when we start ranking states using the critic score. 7. Value re-rank size: Size of beam K 0 returned by the actor score before re-ranking with the actorcritic score.

## C Prior Work

In prior work on SCONE, models were trained on sequences of 1 or 2 utterances, and thus were exposed during training to all gold intermediate states (Long et al., 2016; Guu et al., 2017; Suhr and Artzi, 2018) . Fried et al. (2018) assumed ac-cess to the full annotated logical form. In contrast, we trained from longer sequences, keeping the logical form and intermediate states latent. We report the test accuracy as reported by prior work and in this paper, noting that our results are an average of 6 runs, while prior work reports the median of 5 runs. Naturally, our results are lower compared to prior work that uses much stronger supervision. This is because our setup poses a hard search problem at training time, and also requires overcoming spuriousness -the fact that even incorrect programs sometimes lead to high reward.

## D Value Network Analysis

We analyzed the ability of the value network to predict expected reward. The reward of a state depends on two properties, (a) connectivity: whether there is a trajectory from this state to a correct terminal state, and (b) model likelihood: the probability the model assigns to those trajectories. We collected a random set of 120 states in the SCENE domain from, where the real expected reward was very high (> 0.7), or very low (= 0.0) and the value network predicted well (less than 0.2 deviation) or poorly (more than 0.5 deviation). For ease of analysis we only look at states from the final utterance.

To analyze connectivity, we looked at states that cannot reach a correct terminal state with a single action (since states in the last utterance can perform one action only, the expected reward is 0). Those are states where either their current and target world differ in too many ways, or the stack content is not relevant to the differences between the worlds. We find that when there are many differences between the current and target world, the value network correctly estimates low expected reward in 87.0% of the cases. However, when there is just one mismatch between the current and target world, the value network tends to ignore it and erroneously predicts high reward in 78.9% of the cases.

To analyze whether the value network can predict the success of the trained policy, we consider states from which there is an action that leads to the target world. While it is challenging to fully interpret the value network, we notice that the network predicts a value that is > 0.5 in 86.1% of the cases where the number of people in the world is no more than 2, and a value that is < 0.5 in System SCENE ALCHEMY TANGRAM REINFORCE Training steps = 22.5k

Training steps = 31.5k Training steps = 40k Sample size = 32

Sample size = 32 Sample size = 32 = 0.2 = 0.2 = 0.2 Baseline = 10 −5 Baseline = 10 −2 Baseline = 10 82.1% of the cases where the number of people in the world is more than 2. This indicates that the value network believes more complex worlds, involving many people, are harder for the policy.

Decoding is constrained to valid program tokens only.

For completeness, we show the performance on these datasets from prior work in Appendix C.

We did not include meritocratic updates(Guu et al., 2017), since it performed worse in initial experiments.