Probabilistic Neural Programs


Abstract

We present probabilistic neural programs, a framework for program induction that permits flexible specification of both a computational model and inference algorithm while simultaneously enabling the use of deep neural networks. Probabilistic neural programs combine a computation graph for specifying a neural network with an operator for weighted nondeterministic choice. Thus, a program describes both a collection of decisions as well as the neural network architecture used to make each one. We evaluate our approach on a challenging diagram question answering task where probabilistic neural programs correctly execute nearly twice as many programs as a baseline model.

1 Introduction

In recent years, deep learning has produced tremendous accuracy improvements on a variety of tasks in computer vision and natural language processing. A natural next step for deep learning is to consider program induction, the problem of learning computer programs from (noisy) input/output examples. Compared to more traditional problems such as object recognition, which require making only a single decision, program induction is difficult because it requires making a sequence of decisions and possibly learning control flow concepts such as loops and if statements.

Prior work on program induction has described two general classes of approaches. First, in the noise-free setting, program synthesis approaches pose program induction as completing a program "sketch," which is a program containing nondeterministic choices ("holes") to be filled by the learning algorithm [13]. Probabilistic programming languages generalize this approach to the noisy setting by permitting the sketch to specify a distribution over these choices as a function of prior parameters and further to condition this distribution on data, thereby training a Bayesian generative model to execute the sketch correctly [6]. Second, neural abstract machines define continuous analogues of Turing machines or other general-purpose computational models by "lifting" their discrete state and computation rules into a continuous representation [9, 11, 7, 12]. Both of these approaches have demonstrated success at inducing simple programs from synthetic data but have yet to be applied to practical problems.

We observe that there are (at least) three dimensions along which we can characterize program induction approaches:

1. Computational model - what is the abstract machine being programmed, and is it general-purpose or specialized to the task?
2. Machine learning - what kinds of models, such as deep neural networks, can be used to make the machine's decisions?
3. Inference - how does the approach reason about the many possible executions of the machine?

Neural abstract machines conflate some of these dimensions: they naturally support deep learning, but commit to a particular computational model and approximate inference algorithm. These choices are suboptimal as (1) the bias/variance trade-off suggests that training a more expressive computational model will require more data than a less expressive one suited to the task at hand, and (2) recent work has suggested that discrete inference algorithms may outperform continuous approximations [5]. In contrast, probabilistic programming supports the specification of different (possibly task-specific) computational models and inference algorithms, including discrete search and continuous approximations. However, these languages are restricted to generative models and cannot leverage the power of deep neural networks.

We present probabilistic neural programs, a framework for program induction that permits flexible specification of the computational model and inference algorithm while simultaneously enabling the use of deep neural networks. Our approach builds on computation graph frameworks [1, 3] for specifying neural networks by adding an operator for weighted nondeterministic choice that is used to specify the computational model. Thus, a program sketch describes both the decisions to be made and the architecture of the neural network used to score these decisions. Importantly, the computation graph interacts with nondeterminism: the scores produced by the neural network determine the weights of nondeterministic choices, while the choices determine the network's architecture. As with probabilistic programs, various inference algorithms can be applied to a sketch. Furthermore, a sketch's neural network parameters can be estimated using stochastic gradient descent from either input/output examples or full execution traces.

We evaluate our approach on a challenging diagram question answering task, which recent work has demonstrated can be formulated as learning to execute a certain class of probabilistic programs. On this task, we find that the enhanced modeling power of neural networks improves accuracy.

Figure 1: Probabilistic neural programs defining a multilayer perceptron with a computation graph (left) and applying it to create a probability distribution over program executions (right).

2 Probabilistic Neural Programs

Probabilistic neural programs build on computation graph frameworks for specifying neural networks by adding an operator for nondeterministic choice. We have developed a Scala library for probabilistic neural programming that we use to illustrate the key concepts below.
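The following is a cleaned-up rendering of the multilayer perceptron sketch from the left panel of Figure 1, reconstructed from the extracted code fragment. It assumes the library exposes a monadic Pnp type whose for-comprehensions build a computation graph, with param looking up named parameters and standard arithmetic and tanh operations on graph nodes. The second definition, which feeds the network's scores to choose as described in the text, is our own illustrative addition; the exact signature of choose is an assumption, not part of the recovered fragment.

```scala
// Multilayer perceptron sketch (Figure 1, left), reconstructed from the
// extracted fragment. `v` is an input node of the computation graph.
def mlp(v: CgNode): Pnp[CgNode] =
  for {
    w1 <- param("w1")              // first-layer weights
    b1 <- param("b1")              // first-layer bias
    h1 = ((w1 * v) + b1).tanh      // hidden layer
    w2 <- param("w2")              // second-layer weights
    b2 <- param("b2")              // second-layer bias
    out = (w2 * h1) + b2           // output scores
  } yield {
    out
  }

// Illustrative use of the scores to weight a nondeterministic choice
// (Figure 1, right); `choose` here is a hypothetical rendering of the
// operator described in the text.
def answer(v: CgNode): Pnp[Boolean] =
  for {
    scores <- mlp(v)
    label  <- choose(Array(true, false), scores)  // weighted nondeterministic choice
  } yield {
    label
  }
```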

Figure 2: A food web diagram with annotations generated from a computer vision system (left), along with related questions and their associated program sketches (right), e.g., "Are rabbits a tertiary consumer?" paired with the sketch tertiary-consumer(rabbits).

Applying choose within a sketch yields an object representing a function from neural network parameters to a probability distribution over values. The log probability of a value is proportional to the sum of the scores of the choices made in the execution that produced it. Performing (approximate) inference over this object (in this case, using beam search) produces an explicit representation of the distribution. Multiple nondeterministic choices can be combined to produce more complex sketches; this capability can be used to define complex computational models, including general-purpose models such as Turing machines. The library also has functions for conditioning on observations.
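The statement about log probabilities can be written compactly (notation ours, not the paper's): for an execution $e$ that makes choices $c_1, \ldots, c_n$, each scored by the neural network as $s_\theta(c_i)$, the induced distribution over executions is

$$ p(e \mid \theta) = \frac{1}{Z(\theta)} \exp\left( \sum_{i=1}^{n} s_\theta(c_i) \right), $$

so that $\log p(e \mid \theta)$ equals the sum of the choice scores along $e$, up to the normalizing constant $Z(\theta)$.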

Although various inference algorithms may be applied to a program sketch, in this work we use a simple beam search over executions. This approach accords with the recent trend in structured prediction to combine greedy inference or beam search with powerful non-factoring models [2, 10, 4] . The beam search maintains a queue of partial program executions, each of which is associated with a score. Each step of the search continues each execution until it encounters a call to choose, which adds zero or more executions to the queue for the next search step. The lowest scoring executions are discarded to maintain a fixed beam width. As an execution proceeds, it may generate new computation graph nodes; the search maintains a single computation graph shared by all executions to which these nodes are added. The search simultaneously performs the forward pass over these nodes as necessary to compute scores for future choices.
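As a minimal, self-contained sketch of this search (ours, not the library's implementation; PartialExec and expand are hypothetical stand-ins, and the shared computation graph is omitted for brevity):

```scala
// A partial execution carries a score and, once finished, a value.
case class PartialExec[A](value: Option[A], score: Double)

// `expand` runs an execution up to its next call to choose, returning zero or
// more successors; a finished execution should be returned unchanged so that
// it stays on the beam.
def beamSearch[A](
    init: PartialExec[A],
    expand: PartialExec[A] => Seq[PartialExec[A]],
    beamWidth: Int,
    maxSteps: Int
): Seq[PartialExec[A]] = {
  var beam: Seq[PartialExec[A]] = Seq(init)
  for (_ <- 0 until maxSteps) {
    beam = beam
      .flatMap(expand)          // continue each execution to its next choose
      .sortBy(e => -e.score)    // rank executions by score
      .take(beamWidth)          // discard the lowest-scoring executions
  }
  beam.filter(_.value.isDefined) // finished executions approximate the distribution
}
```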

The neural network parameters are trained to maximize the log-likelihood of correct program executions using stochastic gradient descent. Each training example consists of a pair of program sketches representing an unconditional and a conditional distribution. The gradient computation is similar to that of a loglinear model with neural network factors: it first performs inference on both the conditional and unconditional distributions to estimate the expected counts associated with each nondeterministic choice, and these counts are then backpropagated through the computation graph to update the network parameters.
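In this view the gradient takes the familiar log-linear form (again, notation ours): writing $\bar{n}_c^{\mathrm{cond}}$ and $\bar{n}_c^{\mathrm{uncond}}$ for the expected number of times choice $c$ is taken under the beam-approximated conditional and unconditional distributions, the gradient of the log-likelihood $\mathcal{L}$ of the observed outcome is approximately

$$ \nabla_\theta \mathcal{L} \approx \sum_{c} \left( \bar{n}_c^{\mathrm{cond}} - \bar{n}_c^{\mathrm{uncond}} \right) \nabla_\theta\, s_\theta(c), $$

and these count differences are what gets backpropagated through the score nodes of the computation graph.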

3 Diagram Question Answering With Probabilistic Neural Programs

We consider the problem of learning to execute program sketches in a food web computational model using visual information from a diagram. This problem is motivated by recent work [8], which has demonstrated that diagram question answering can be formulated as translating natural language questions to program sketches in this model, then learning to execute these sketches. Figure 2 shows some example questions from this work, along with the accompanying diagram that must be interpreted to determine the answers.

Table 1: Results from applying probabilistic neural programs to the execution of sketches from the FOODWEBS dataset. "choose accuracy" indicates average accuracy at each decision, whereas "execution accuracy" indicates whether the entire program was executed correctly. (The body of the table was not extracted; please refer to the original document.)

The computational model includes functions, organism and eat, whose values must be predicted from the diagram, along with higher-level functions, e.g., for reasoning about population changes, that call organism and eat to extract information from the diagram. [8] has a more thorough description of the theory; our goal is to learn to make the choices in this theory.
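To illustrate what such a higher-level function looks like (our own simplification, with made-up names, not the definitions used in [8]): once the organism and eat choices have been made, a query like tertiary-consumer(rabbits) reduces to deterministic reasoning over the predicted eating relation.

```scala
// Illustrative only. Assumes the predicted food web is acyclic; the
// trophic-level logic below is our own simplification of the theory in [8].
def eatsSomething(o: String, eats: Set[(String, String)]): Boolean =
  eats.exists { case (eater, _) => eater == o }

def trophicLevel(o: String, eats: Set[(String, String)]): Int =
  if (!eatsSomething(o, eats)) 0 // producers eat nothing
  else 1 + eats.collect { case (eater, prey) if eater == o => trophicLevel(prey, eats) }.max

// A tertiary-consumer(x) style query checks the trophic level of an organism.
def tertiaryConsumer(o: String, eats: Set[(String, String)]): Boolean =
  trophicLevel(o, eats) >= 3
```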

We consider three models for learning to make the choices for both organism and eat: a non-neural baseline (LOGLINEAR) and two probabilistic neural models (2-LAYER PNP and MAXPOOL PNP). All three learn models for organism and eat using the outputs of a computer vision system trained to detect organisms, text, and the arrow relationships between them. [8] defines a set of hand-engineered features heuristically created from the outputs of this vision system. LOGLINEAR and 2-LAYER PNP use only these features; the difference between them is simply the greater expressivity of a two-layer neural network. However, one of the major strengths of neural models is their ability to learn latent feature representations automatically, so our third model also uses the raw outputs of the vision system that were not turned into features. The architecture of MAXPOOL PNP reflects this and contains additional input layers that max-pool over detected relationships between objects and their confidence scores, as sketched below. The expectation is that our neural modeling of nondeterminism will learn better latent representations than the manually defined features provide.
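A minimal sketch of the max-pooling input (ours, not the paper's code; the real model operates on computation graph nodes rather than raw arrays, and the feature layout is an assumption): each detected relationship contributes a small confidence/feature vector, and element-wise max pooling collapses a variable number of detections into a fixed-size vector that is concatenated with the hand-engineered features.

```scala
// Element-wise max over a variable number of per-detection feature vectors,
// producing a fixed-size vector even when no detections are present.
def maxPool(detections: Seq[Array[Double]], dim: Int): Array[Double] =
  if (detections.isEmpty) Array.fill(dim)(0.0)
  else detections.reduce((a, b) => a.zip(b).map { case (x, y) => math.max(x, y) })

// Concatenate the hand-engineered features with the pooled detection features.
def modelInput(
    handEngineered: Array[Double],
    detectionVectors: Seq[Array[Double]],
    detectionDim: Int
): Array[Double] =
  handEngineered ++ maxPool(detectionVectors, detectionDim)
```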

4 Experiments

We evaluate probabilistic neural programs on the FOODWEBS dataset introduced by [8]. This dataset contains a training set of approximately 2,900 programs and a test set of approximately 1,000 programs. These programs are human-annotated gold-standard interpretations of the questions in the dataset, which corresponds to assuming that the translation from questions to programs is perfect. We train our probabilistic neural programs using correct execution traces of each program, which are also provided in the dataset.

We evaluate our models using two metrics. First, execution accuracy measures the fraction of programs in the test set that are executed completely correctly by the model. This metric is challenging because correctly executing a program requires making a long sequence of choose decisions correctly: our 1,000 test programs contain over 35,000 decisions, so executing a program correctly means making, on average, 35 consecutive choose decisions without a mistake. Second, choose accuracy measures the accuracy of each decision independently, assuming all previous decisions were made correctly.

Table 1 compares the accuracies of our three models on the FOODWEBS dataset. The improvement between the baseline (LOGLINEAR) and the probabilistic neural program (2-LAYER PNP) is due to the neural network's enhanced modeling power. Although the choose accuracy does not improve by a large margin, the improvement translates into large gains in whole-program correctness. Finally, as expected, the inclusion of lower-level features (MAXPOOL PNP), which is not possible in LOGLINEAR, significantly improves performance. Note that this task requires performing computer vision, so no model is expected to achieve 100% accuracy.
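To see why small gains in choose accuracy compound, note that under a simplifying assumption of independent decisions, a program with 35 choices executes correctly with probability roughly $p^{35}$ when each decision is correct with probability $p$; for example, $0.95^{35} \approx 0.17$ while $0.97^{35} \approx 0.34$, so two points of choose accuracy roughly double execution accuracy. (These numbers are illustrative and are not the values in Table 1.)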

5 Conclusion

We have presented probabilistic neural programs, a framework for program induction that permits flexible specification of computational models and inference algorithms while simultaneously enabling the use of deep learning. A program sketch describes a collection of nondeterministic decisions to be made during execution, along with the neural architecture to be used for scoring these decisions. The network parameters of a sketch can be trained from data using stochastic gradient descent. We demonstrate that probabilistic neural programs improve accuracy on a diagram question answering task which can be formulated as learning to execute program sketches in a domain-specific computational model.