
Parsing Algebraic Word Problems into Equations


  • Rik Koncel-Kedziorski
  • Hannaneh Hajishirzi
  • Ashish Sabharwal
  • Oren Etzioni
  • S. Ang
  • Transactions of the Association for Computational Linguistics
  • 2015


This paper formalizes the problem of solving multi-sentence algebraic word problems as that of generating and scoring equation trees. We use integer linear programming to generate equation trees and score their likelihood by learning local and global discriminative models. These models are trained on a small set of word problems and their answers, without any manual annotation, in order to choose the equation that best matches the problem text. We refer to the overall system as Alges. We compare Alges with previous work and show that it covers the full gamut of arithmetic operations whereas Hosseini et al. (2014) only handle addition and subtraction. In addition, Alges overcomes the brittleness of the Kushman et al. (2014) approach on single-equation problems, yielding a 15% to 50% reduction in error.

1 Introduction

Grade-school algebra word problems are brief narratives (see Figure 1) . A typical problem first describes a partial world state consisting of characters, entities, and quantities. Next it updates the condition of an entity or explicates the relationship between entities. Finally, it poses a question about a quantity in the narrative.

Figure 1: Example problem and solution

An ordinary child has to learn the required algebra, but will easily grasp the narrative utilizing extensive world knowledge, large vocabulary, wordsense disambiguation, coreference resolution, mastery of syntax, and the ability to combine individual sentences into a coherent mental model. In contrast, the challenge for an NLP system is to "make sense" of the narrative, which may refer to arbitrary activities like renting bikes, collecting coins, or eating cookies. Previous work coped with the open-domain aspect of algebraic word problems by relying on deterministic state transitions based on verb categorization (Hosseini et al., 2014) or by learning templates that cover equations of particular forms . We have discovered, however, that both approaches are brittle, particularly as training data is scarce in this domain, and the space of equations grows exponentially with the number of quantities mentioned in the math problem.

An ordinary child has to learn the required algebra, but will easily grasp the narrative by utilizing extensive world knowledge, a large vocabulary, word-sense disambiguation, coreference resolution, mastery of syntax, and the ability to combine individual sentences into a coherent mental model. In contrast, the challenge for an NLP system is to "make sense" of the narrative, which may refer to arbitrary activities like renting bikes, collecting coins, or eating cookies. Previous work coped with the open-domain aspect of algebraic word problems by relying on deterministic state transitions based on verb categorization (Hosseini et al., 2014) or by learning templates that cover equations of particular forms (Kushman et al., 2014). We have discovered, however, that both approaches are brittle, particularly as training data is scarce in this domain, and the space of equations grows exponentially with the number of quantities mentioned in the math problem.

We introduce ALGES, which maps an unseen multi-sentence algebraic word problem into a set of possible equation trees. Figure 1 shows an equation tree alongside the word problem it represents.

ALGES generates the space of trees via Integer Linear Programming (ILP), which allows it to constrain the space of trees to represent type-consistent algebraic equations satisfying as many desirable properties as possible. ALGES learns to map spans of text to arithmetic operators, to combine them given the global context of the problem, and to choose the "best" tree corresponding to the problem. The training set for ALGES consists of unannotated algebraic word problems and their solutions. Solving the equation represented by such a tree is trivial. ALGES is described in detail in Section 4.

ALGES is able to solve word problems with single-variable equations like the one in Figure 1. In contrast to Hosseini et al. (2014), ALGES covers +, −, *, and /. The work of Kushman et al. (2014) has broader scope, but we show that it relies heavily on overlap between training and test data. When that overlap is reduced, ALGES is 15% to 50% more accurate than this system.

Our contributions are as follows: (1) We formalize the problem of solving multi-sentence algebraic word problems as that of generating and ranking equation trees; (2) we show how to score the likelihood of equation trees by learning discriminative models trained from a small number of word problems and their solutions, without any manual annotation; and (3) we demonstrate empirically that ALGES has broader scope than the system of Hosseini et al. (2014) and overcomes the brittleness of the method of Kushman et al. (2014).

Table 1. Not extracted; please refer to original document.
Table 2: Rules for reordering Qsets.
Table 3: ILP notation for candidate equations model
Table 4: Decreasing template overlap: Accuracy of ALGES versus the template-based method on singleequation algebra word problems. The first column corresponds to the SINGLEEQ dataset, and the other columns are for subsets with decreasing template overlap.
Table 5: Decreasing lexical overlap: Accuracy of ALGES versus the template-based method on single-equation algebra word problems. The first column corresponds to the SINGLEEQ dataset, and the other columns are for subsets with decreasing lexical overlap.

2 Previous Work

Our work is related to situated semantic interpretation, which aims to map natural language sentences to formal meaning representations. More closely related is work on language grounding, whose goal is the interpretation of a sentence in the context of a world representation. However, while most previous work considered individual sentences in isolation, solving word problems often requires reasoning across the multi-sentence discourse of the problem text. Recent efforts in the math domain have studied number word problems, logic puzzle problems, arithmetic word problems, algebra word problems, and geometry word problems. We discuss in more detail below two pioneering works closely related to our own.

Hosseini et al. (2014) solve elementary addition and subtraction problems by learning verb categories. They ground the problem text to a semantics of entities and containers, and decide whether quantities are increasing or decreasing in a container based on the learned verb categories. While relying only on verb categories works well for + and −, modeling * or / requires going beyond verbs. For instance, "Tina has 2 cats. John has 3 more cats than Tina. How many cats do they have together?" and "Tina has 2 cats. John has 3 times as many cats as Tina. How many cats do they have together?" have identical verbs, but the indicated operations (+ and * respectively) differ. ALGES makes use of a richer semantic representation, which facilitates deeper learning and a wider scope of application, solving problems involving the +, −, /, and * operators (see Table 6).

Kushman et al. (2014) introduce a general method for solving algebra problems. This work can align a word problem to a system of equations with one or two unknowns. They learn a mapping from word problems to equation templates using global and local features from the problem text.
However, the large space of equation templates makes it challenging for this model to learn to find the best equation directly, as a sufficiently similar template may not have been observed during training. Instead, our method maps word problems to equation trees, taking advantage of a richer representation of quantified nouns and their properties, as well as the recursive nature of equation trees. These allow ALGES to use a bottom-up approach to learn the correspondence between spans of text and arithmetic operators (corresponding to intermediate nodes in the tree). ALGES then scores equations using the global structure of the problem to produce the final result.

Table 6: Accuracy of ALGES compared to verb categorization method.

Our work is also related to research on using ILP to enforce global constraints in NLP applications. Most previous work utilizes ILP as an inference procedure to find the best global prediction over initially trained local classifiers. Similarly, we use ILP to enforce global and domain-specific constraints. We, however, use ILP to form candidate equations, which are then used to generate training data for our classifiers. Our work is also related to parser re-ranking, where a re-ranker model attempts to improve the output of an existing probabilistic parser. Similarly, the global equation model in ALGES attempts to re-rank equations based on global problem structure.

3 Setup And Problem Definition

Given numeric quantities V and an unknown x whose value is the answer we seek, an equation over V and x is any valid mathematical expression formed by combining elements of V ∪ {x} using binary operators from O = {+, −, *, /, =} such that x appears exactly once. When each element of V appears at most once in the equation, it may naturally be represented as an equation tree where each operator is a node with edges to its two operands. T denotes the set of all equation trees over V and x.

Problem Formulation. We address the problem of solving grade-school algebra word problems that map to single equations. Solving such a word problem w amounts to selecting an equation tree t representing the mathematical computation implicit in w. Figure 1 shows an example of w with quantities underlined, and the corresponding tree t. Formally, we use a joint probability distribution p(t, w) that defines how "well" an equation tree t ∈ T captures the mathematical computation expressed in w. Given a word problem w as input, our goal is to compute t̃ = arg max_{t ∈ T} p(t|w).
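As a concrete illustration (not part of the original system), an equation tree can be represented as nested tuples and evaluated bottom-up for a candidate value of the unknown; the representation and names below are hypothetical.

```python
# Minimal sketch (not the ALGES implementation): an equation tree as nested
# tuples, evaluated recursively. Leaves are numbers or the unknown 'x'.
import operator

OPS = {'+': operator.add, '-': operator.sub,
       '*': operator.mul, '/': operator.truediv}

def evaluate(tree, x):
    """Recursively evaluate an equation-tree node for a given value of x."""
    if tree == 'x':
        return x
    if isinstance(tree, (int, float)):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def satisfies(lhs, rhs, x):
    """Check whether the equation lhs = rhs holds for this value of x."""
    return evaluate(lhs, x) == evaluate(rhs, x)

# The Figure 1 problem corresponds to 375 = 7*x + 4, with solution x = 53.
lhs, rhs = 375, ('+', ('*', 7, 'x'), 4)
print(satisfies(lhs, rhs, 53))  # True
```

Checking a candidate answer against a tree in this way is what makes weak supervision from answers alone possible later in the paper.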

Figure 1 (problem text): "On Monday, 375 students went on a trip to the zoo. All 7 buses were filled and 4 students had to travel in cars. How many students were in each bus?"

An exhaustive enumeration over T quickly becomes impractical as problem complexity increases and n = |V ∪ {x}| grows. Specifically,

|T| > h(n) = n! (n−1)! (n−1) 2^(n−4), with h(4) = 432, h(6) > 1.7M, and h(8) > 22B,

making it infeasible to learn to find t̃ directly, as a sufficiently similar tree may not have been observed during training. Instead, our method first generates syntactically valid equation trees, and then uses a bottom-up approach to score equations with a local model trained to map spans of text to math operators, and a global model trained for coherence of the entire equation w.r.t. the global problem text structure.
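The lower bound h(n) is easy to check numerically; a quick sketch:

```python
# Verify the stated lower bound h(n) on the number of equation trees
# over n = |V ∪ {x}| leaves.
from math import factorial

def h(n):
    return factorial(n) * factorial(n - 1) * (n - 1) * 2 ** (n - 4)

print(h(4))          # 432
print(h(6) > 1.7e6)  # True  (h(6) = 1,728,000)
print(h(8) > 22e9)   # True  (h(8) ≈ 2.28e10)
```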

4 Overview of the Approach

Figure 2 gives an overview of our method, also detailed in Figure 3. In order to build equation trees, we use a compact representation for each node called a Quantified Set, or Qset, to model natural language text quantities and their properties (e.g., '375 students', '7 buses'). Qsets are used for tracking and combining quantities when learning the correspondence between equation trees and text.

Figure 2: An overview of the process of learning for a word problem and its Qsets.
Figure 3. Not extracted; please refer to original document.

Definition 1. Given a math word problem w, let S be the set of all possible spans of text in w, let φ denote the empty span, and let S_φ = S ∪ {φ}. A Qset for w is either a base Qset or a compound Qset. A base Qset is a tuple (ent, qnt, adj, loc, vrb, syn, ctr) with:

• ent ∈ S: entity or quantity noun (e.g., 'student');

• qnt ∈ R ∪ {x}: number or quantity (e.g., 4 or x);

Learning (word problems W, corresponding solutions L):

1. For each word problem w_i with solution ℓ_i:
   (a) S ← Base Qsets obtained by Grounding text w_i and Reordering the resulting Qsets (Section 5)
   (b) T_i ← Top M type-consistent equation tree candidates generated by ILP(w_i) (Section 6)
   (c) T̂_i ← Subset of T_i that yields the correct numerical solution ℓ_i
   (d) Add to Tr_local features ⟨s_1, s_2⟩ with label op for each operator op combining Qsets s_1, s_2 in trees in T̂_i
   (e) Add to Tr_global features ⟨w, t⟩, labeled positive for each t ∈ T̂_i and negative for each t ∈ T_i \ T̂_i
2. L_local ← Train a local Qset relationship model on Tr_local (Section 7.1)
3. G_global ← Train a global equation model on Tr_global (Section 7.2)
4. Output the local and global models (L_local, G_global)

Inference (word problem w, local Qset relationship model L_local, global equation model G_global):

1. S ← Base Qsets obtained by Grounding text w and Reordering the resulting Qsets (Section 5)
2. T ← Top M type-consistent equation tree candidates generated by ILP(w) (Section 6)
3. t* ← arg max_{t_i ∈ T} ∏_{t_j ∈ t_i} L_local(t_j|w) × G_global(t_i|w), scoring each tree t_i ∈ T based on Equation 1
4. ℓ ← Numeric solution to w obtained by solving equation tree t* for the unknown
5. Output (t*, ℓ)

Figure 3: Overview of our method for solving algebraic word problems.

• adj ⊆ S_φ: adjectives for ent in w;

• loc ∈ S_φ: location of ent (e.g., 'in the drawer');

• vrb ∈ S_φ: governing verb for ent (e.g., 'fill');

• syn: syntactic and positional information for ent (e.g., 'buses' is in subject position);

• ctr ⊆ S_φ: containers of ent (e.g., 'Bus' is a container for the 'students' Qset).

A property equal to φ indicates that the optional property is unspecified. A compound Qset is formed by combining two Qsets with a non-equality binary operator, as discussed in Section 5.
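Definition 1 maps naturally onto a small record type. The sketch below is illustrative only; field names follow the definition, but the class and example values are hypothetical, not the paper's implementation.

```python
# Illustrative sketch of a base Qset per Definition 1.
# None plays the role of the empty span φ for optional properties.
from dataclasses import dataclass
from typing import Union

@dataclass
class Qset:
    ent: str                   # entity noun, e.g. 'student'
    qnt: Union[float, str]     # number, or 'x' for the unknown
    adj: tuple = ()            # adjectives modifying ent
    loc: str = None            # location, e.g. 'in the drawer'
    vrb: str = None            # governing verb, e.g. 'fill'
    syn: str = None            # syntactic/positional information
    ctr: tuple = ()            # containers of ent

# Two base Qsets suggested by the Figure 1 problem:
buses = Qset(ent='bus', qnt=7, vrb='filled')
target = Qset(ent='student', qnt='x', ctr=('bus',))
print(target.ctr[0] == buses.ent)  # True: a container match, a cue for *
```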

Qsets can be further combined with the equality operator to yield a semantically augmented equation tree. The example in Figure 2 has four base Qsets extracted from the problem text. Each possible equation tree corresponds to a different recursive combination of these four Qsets.

Given w, ALGES first extracts a list of n base Qsets S = {s 1 , . . . , s n } (Section 5). It then uses an ILP-based optimization method to combine extracted Qsets into a list of type-consistent candidate equation trees (Section 6). Finally, ALGES uses discriminative models to score each candidate equation, using both local and global features (Section 7).

Specifically, the recursive nature of our representation (inspired by Semantically Augmented Parse Trees (Ge and Mooney, 2005), adapted to equational logic) allows us to decompose the likelihood function p(t|w) into local scoring functions for each internal node of t, followed by scoring the root node:

p(t|w) ∝ [ ∏_{t_j ∈ t} L_local(t_j|w) ] × G_global(t|w)    (1)

where the local function L_local(t_j|w) scores the likelihood of the subtree t_j, modeling pairwise Qset relationships, while the global function G_global(t|w) scores the likelihood of the root of t, modeling the equation in its entirety.
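Equation 1 amounts to multiplying one local score per internal (operator) node by a single global score for the whole tree. A minimal sketch with stand-in scoring functions (the real L_local and G_global are the trained SVM models described in Section 7):

```python
# Sketch of the Equation-1 scorer over nested-tuple equation trees.
# L_local and G_global below are toy stand-ins, not the trained models.

def internal_nodes(tree):
    """Yield every operator node of a nested-tuple equation tree."""
    if isinstance(tree, tuple):
        yield tree
        _, left, right = tree
        yield from internal_nodes(left)
        yield from internal_nodes(right)

def score(tree, L_local, G_global):
    s = G_global(tree)                  # global score for the whole tree
    for node in internal_nodes(tree):
        s *= L_local(node)              # one local score per operator node
    return s

# Toy scorers: the local model slightly prefers '*' over '+'.
L_local = lambda node: {'*': 0.9, '+': 0.8}.get(node[0], 0.5)
G_global = lambda tree: 0.7
t = ('+', ('*', 7, 'x'), 4)             # the tree for 7*x + 4
print(round(score(t, L_local, G_global), 3))  # 0.504
```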

Learning. ALGES learns in a weakly supervised fashion, using word problems w_i and only their correct answers ℓ_i (not the corresponding equation trees) as training data {(w_i, ℓ_i)}_{i=1,...,N}. We ground each w_i into ordered Qsets and generate a list of type-consistent candidate training equations T_i that yield the correct answer ℓ_i.

We build a local discriminative model L_local to score the likelihood that a math operator op ∈ O can correctly combine two Qsets s_1 and s_2 based on their semantics and intertextual relationships. For example, in Figure 2 this model learns that * has a high likelihood score for '7 buses' and 'x students'. The training data consists of feature vectors ⟨s_1, s_2⟩ labeled with op, derived from the equation trees that yield the correct solution.

We also build a global discriminative model that scores equation trees based on the global problem structure: G_global = ψᵀ f_global(w, t), where f_global represents global features of w and t, and ψ are parameters to be learned. The training data consists of feature vectors ⟨w, t⟩ for equation trees that yield the correct solution as positive examples, and the rest as negatives (Figure 2). The details of the learning and inference steps are described in Section 7.

5 Grounding And Combining Qsets

We discuss how word problem text is grounded into an ordered list of Qsets. A Qset is a compact representation of the properties of a quantity as described in a single sentence. The use of Qsets facilitates the building of semantically augmented equation trees. Additionally, by tracking certain properties of text quantities, ALGES can resolve pronominal references or elided nouns to properties of previous Qsets. It can also combine information about quantities referenced in different sentences into a single semantic structure for further use.

Grounding. ALGES translates the text of the problem w into interrelated base Qsets {s_1, ..., s_n}, each associated with a quantity in the problem text w. The properties of each Qset (Definition 1) are extracted from the dependency parse relations in the sentence where the quantity is mentioned, according to the rules described in Table 1.

Additionally, ALGES assigns a single target Qset s_x corresponding to the question sentence. The properties of the target Qset are also extracted according to the rules in Table 1. In particular, the qnt property is set to the unknown, ent is set to the noun appearing after the words what, many, or much in the target sentence, and the other properties are extracted as listed in Table 1.

Table 7: Ablation study of each component of ALGES.
Table 8: Accuracy of local classifier in predicting the correct operator between two Qsets and ablating feature sets.

Reordering. In order to reduce the space of possible equation trees, ALGES reorders Qsets {s_1, ..., s_n} according to semantic and textual information and enforces a constraint that Qsets can only combine with adjacent Qsets in the equation tree. In Figure 2, the target Qset corresponding to the unknown (x 'students') is moved from its textual location at the end of the problem and placed adjacent to the Qset with entity 'buses'. This move is triggered by the relationship between the target entity 'student' and its container 'bus', which is quantified by each in the last sentence. In addition to this container-match rule, we employ three other rules to move the target Qset (Table 2):

1. Move Qset s_i to immediately after Qset s_j if the container of s_i is the entity of s_j and is quantified by each.
2. Move the target Qset to the front of the list if the question statement includes the keywords start or begin.
3. Move the target Qset to the end of the list if the question statement includes the keywords left, remain, or finish.
4. Move the target Qset to the textual location of an intermediate reference with the same ent if its qnt property is the determiner some.

For each quantity mentioned in the text, the properties (qnt, ent, ctr, adj, vrb, loc) of the corresponding Qset are extracted as follows (Table 1):

1. qnt (quantity) is a numerical value or determiner found in the problem text, or a variable.
2. ent (entity) is a noun related to the qnt in the dependency parse tree. If qnt is a numerical value, ent is the noun related by the num, number, or prep_of relations. If qnt is a determiner, ent is the noun related via the det relation. When such a noun does not exist due to parse failure or pragmatic recoverability, ent is the noun closest to qnt in the same sentence, or the ent associated with the most recent Qset.
3. ctr (container) is the subject of the verb governing ent, except in two cases: when this subject is a pronominal reference, ctr is set to the ctr of the closest previous Qset; if ent is related to another Qset whose qnt is one of each, every, a, an, per, or one, ctr is set to the ent of that Qset.
4. adj (adjectives) is a list of adjectives related to ent by the amod relation.
5. vrb (verb) is the governing verb, related to ent by the nsubj or dobj relation.
6. loc (location) is a noun related to ent by the prep_on, prep_in, or prep_at relations.

Combining. For op = +, the properties of either Qset a or b suffice to define the compound Qset c; ALGES always forms c using the properties of b in these situations. For op = −, the properties of the left operand a define the resultant set, as evidenced by the subtraction operations present in the first problem in Table 9: to determine the stickers in Luke's possession, we need to track the stickers related to the left Qset with the verb 'got'. For op = *, the Qset relationship is captured by the container and entity properties: the Qset whose properties are preserved after multiplication has the other's entity as its container. In Figure 2, the 'bus' Qset is the container of 'students'; when these are combined with the * operator, the result is of entity type 'student'. For op = /, we use the properties of the left operand, to encourage a distinction between division and multiplication.
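The keyword-driven target moves (rules 2 and 3 in Table 2) can be sketched as a simple list transformation. This is an illustration only, not the original code, and the default placement at the end of the list is an assumption.

```python
# Sketch of Table-2 reordering rules 2 and 3: move the target Qset to the
# front or the end of the Qset list based on question-statement keywords.
def reorder(qsets, target, question):
    """qsets: ordered list without the target; returns the full ordering."""
    words = question.lower().split()
    if any(k in words for k in ('start', 'begin')):
        return [target] + qsets   # rule 2: the unknown is the initial state
    if any(k in words for k in ('left', 'remain', 'finish')):
        return qsets + [target]   # rule 3: the unknown is the final state
    return qsets + [target]       # assumed default placement

print(reorder(['9 books', '4 books'], 'x books',
              'How many books are left ?'))
# ['9 books', '4 books', 'x books']
```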

Table 9. Not extracted; please refer to original document.

6 Generating Equation Trees With Ilp

We use an ILP optimization model to generate equation trees involving n base Qsets. These equation trees are used in both the learning and inference steps. ALGES generates an ordered list of the M most desirable candidate equations for a given word problem w using an ILP that models global considerations such as type consistency and appropriately low expression complexity. To facilitate generation of equation trees, we represent them in parenthesis-free postfix (reverse Polish) notation, where a binary operator immediately follows the two operands it operates on (e.g., a b c + * x =).

Given a word problem w with n base Qsets (cf. Table 3 for notation), we build an optimization model ILP(w) over the space of postfix equations E = e_1 e_2 ... e_L of length L involving k numeric constants, k̄ = n − k unknowns, r possible binary operators, and q "types" of Qsets, where type corresponds to the entity property of Qsets and determines which binary relationships are permitted between two given Qsets. For single-variable equations over binary operators O, k̄ = 1, r = |O| = 5, and L = 2n − 1. For brevity, define m = n + r and let [j] denote {1, ..., j}.

Expression E can be evaluated by considering e_1, e_2, ..., e_L in order, pushing non-operator symbols onto a stack σ and, for operator symbols, popping the top two elements of σ, applying the operator to them, and pushing the result back onto σ. The stack depth of e_i is the stack size after e_i has been processed this way.

Constraints and Objective Function. Constraints in ILP(w) include syntactic validity, type consistency, and domain-specific simplicity considerations. We describe them briefly here, leaving details to the Appendix. The objective function minimizes the sum of the weights of violated soft constraints.
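The stack-based evaluation and the stack-depth quantity d_i can be sketched directly:

```python
# Sketch: evaluate a postfix expression and record the stack depth d_i
# after each symbol, the quantity the ILP constraints reason about.
OPS = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
       '*': lambda a, b: a * b, '/': lambda a, b: a / b}

def eval_postfix(symbols):
    stack, depths = [], []
    for s in symbols:
        if s in OPS:                      # operator: pop two, push result
            b, a = stack.pop(), stack.pop()
            stack.append(OPS[s](a, b))
        else:                             # operand: push
            stack.append(s)
        depths.append(len(stack))         # stack depth d_i
    return stack[0], depths

# a b c + *  with a=2, b=3, c=4  evaluates to  2 * (3 + 4) = 14.
value, depths = eval_postfix([2, 3, 4, '+', '*'])
print(value, depths)  # 14 [1, 2, 3, 2, 1]
```

Note that the final depth d_L = 1, exactly the syntactic-validity condition the ILP enforces below.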

Below, (H) denotes hard constraints, (W) weighted soft constraints, and (P) post-processing steps.

Definitional Constraints (H): Constraints over indicator variables c_i, u_i, and o_i ensure they represent their intended meaning, including the invariant c_i + u_i + o_i = 1. For stack depth variables, we add d_1 = 1 and d_i = d_{i−1} − 2o_i + 1 for i > 1.

Syntactic Validity (H): Validity of the postfix expression is enforced easily through the constraints o_1 = 0 and d_L = 1. In addition, we add x_L = m and x_i < m for i < L to ensure that equality occurs exactly once and as the top-level operator.

Operand Access (H): The second operand of an operator symbol e_i is always e_{i−1}. Its first operand, however, is defined instead by the stack-based evaluation process. ILP(w) encodes it using an alternative characterization: the first operand of e_i is e_j iff j ≤ i−2 and j is the largest index such that d_j = d_i.
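This characterization of the first operand can be verified against the stack-depth sequence; a sketch (0-based indices, so the condition j ≤ i−2 becomes j < i−1):

```python
# Sketch: recover each operator's first operand index from stack depths,
# using the rule "largest j <= i-2 with d_j = d_i" (here 0-based).
def depths_of(symbols, operators={'+', '-', '*', '/'}):
    d, cur = [], 0
    for s in symbols:
        cur += 1 if s not in operators else -1  # d_i = d_{i-1} + 1 - 2*o_i
        d.append(cur)
    return d

def first_operand(i, depths):
    """Index j of the first operand of the operator at index i."""
    return max(j for j in range(i - 1) if depths[j] == depths[i])

symbols = ['a', 'b', 'c', '+', '*']   # a * (b + c)
d = depths_of(symbols)
print(first_operand(3, d), first_operand(4, d))  # 1 0  ('b' and 'a')
```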

Type Consistency (W): Suppose T_1 and T_2 are the types of the two operands of an operator o, whose type is T_o. Addition and subtraction preserve the type of their operands, i.e., if o is + or −, then T_o = T_1 = T_2. Multiplication inherits the type of one of its operands, and division inherits the type of its first operand; in both cases, the two operands must be of different types. Formally, if o is *, then T_o ∈ {T_1, T_2} and T_1 ≠ T_2; if o is /, then T_o = T_1 ≠ T_2.
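These typing rules can be expressed as a small checker. A sketch under the stated rules; for * we simplify by returning the second operand's type, whereas ALGES picks the operand whose container matches the other's entity.

```python
# Sketch of the type-consistency rules: result type of combining two
# operand types t1, t2 (Qset entity types) under operator op.
def result_type(op, t1, t2):
    """Return the result type, or None if the combination is inconsistent."""
    if op in ('+', '-'):
        return t1 if t1 == t2 else None   # types must match and are preserved
    if op == '*':
        # operand types must differ; which one is inherited depends on the
        # container/entity relation -- simplified here to t2
        return t2 if t1 != t2 else None
    if op == '/':
        return t1 if t1 != t2 else None   # inherits the first operand's type
    raise ValueError(op)

print(result_type('+', 'student', 'student'))  # 'student'
print(result_type('*', 'bus', 'student'))      # 'student'
print(result_type('+', 'bus', 'student'))      # None (type-inconsistent)
```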

Domain Considerations (H,W): We add a few domain-specific constraints based on patterns observed in a small subset of the questions. These include an upper bound on the stack depth, which helps avoid overly complex expressions unsuitable for grade-school algebra, and reduced redundancy, e.g., disallowing the numeric constant 0 as an operand of + or − or as the second operand of /.

Symmetry Breaking (H,W): If a commutative operator is preceded by two numeric constants (e.g., ab+), we require the constants to respect their Qset ordering. Every other pair of constants that disrespects its Qset ordering incurs a small penalty.

Negative and Fractional Answers (P): Rather than imposing non-negativity as a complex constraint in ILP(w), we filter out candidate expressions yielding a negative answer as a post-processing step. Similarly, when all numeric constants are integers, we filter out expressions yielding a fractional answer, again based on typical questions in our datasets.

7 Learning

Our goal is to learn a scoring function that identifies the best equation tree t* corresponding to an unseen word problem w. Since our dataset consists only of problem–solution pairs {(w_i, ℓ_i)}_{i=1,...,N}, training our scoring models requires producing equation trees matching ℓ_i. For every training instance (w_i, ℓ_i), we use ILP(w_i) to generate M type-consistent equation tree candidates T_i. To train our local model (Section 7.1), we filter out trees from T_i that do not evaluate to ℓ_i, extract all (s_1, s_2, op) triples from the remaining trees, and use feature vectors capturing (s_1, s_2), labeled with op, as training data (see Figure 2). For the global model, we use as training data a subset of T_i with an equal number of correct and incorrect equation trees (Section 7.2). Once trained, we use Equation 1 to combine these models to compute a score for each candidate equation tree generated for an unseen word problem at inference time (see Figure 3).

7.1 Local Qset Relationship Model

We train a local model of a probability distribution over the math operators that may be used to combine a pair of Qsets. The idea is to learn the correspondence between spans of texts and math operators by examining such texts and the Qsets of the involved operands. Given Qsets s 1 and s 2 , the local scoring function scores the probability of each op ∈ {+, −, * , /}, i.e.,

L_local = θᵀ f_local(s_1, s_2)

where f local is a feature vector for s 1 and s 2 . Note that either Qset may be a compound (the result of a combine procedure). The goal is to learn parameters θ by maximizing the likelihood of the operators between every two Qsets that we observe in the training data. We model this as a multi-class SVM with an RBF kernel.

Features. Given the richness of the textual possibilities for indicating a math operation, the features are designed over semantic and intertextual relationships between Qsets, as well as domain-specific lexical features. The feature vector includes three main feature categories (Table 4). First, lexical features capture whether math keywords such as 'add' and 'times' appear in the vicinity of the Qset reference in the text. Also, following Hosseini et al. (2014), we include a vector that captures the distance between the verbs associated with each Qset and a small collection of verbs found to be useful in categorizing arithmetic operations in that work, based upon their Lin similarity (Lin, 1998). Second, relationships between Qsets are described w.r.t. the various Qset properties described in Section 4. These include binary features such as whether one Qset's container property matches the other Qset's entity (a strong indicator of multiplication).
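A few of these features are simple binary indicators over Qset properties and the problem text. An illustrative sketch, with hypothetical feature names and Qsets represented as plain dicts (not the paper's exact feature set):

```python
# Illustrative sketch of a few local features over two Qsets a, b.
# Feature names are hypothetical.
def local_features(a, b, text):
    return {
        # relationship feature: a's container matching b's entity cues *
        'ctr_ent_match': a.get('ctr') == b.get('ent'),
        'ent_match': a.get('ent') == b.get('ent'),
        # lexical features: math keywords appearing in the problem text
        'has_times': 'times' in text,
        'has_together': 'together' in text,
    }

buses = {'ent': 'bus', 'qnt': 7}
target = {'ent': 'student', 'qnt': 'x', 'ctr': 'bus'}
feats = local_features(target, buses, 'How many students were in each bus ?')
print(feats['ctr_ent_match'], feats['ent_match'])  # True False
```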

Third, target quantity features check the matching between the target Qset and the current Qset as well as math keywords in the target sentence.

7.2 Global Equation Model

We also train a global model that scores equation trees based on the global structure of the tree and the problem text. The global model scores the compatibility of the tree with the soft constraints introduced in Section 6 as well as its correspondence with the problem text. We use a discriminative model:

G_global = ψᵀ f_global(w, t), where f_global are the features capturing trees and their correspondences with the problem text. We train a global classifier to relate these features through the parameters ψ.

Features f global are explained in Table 4 . They include the number of violated soft constraints in the ILP, the probabilities of the left and right subtrees of the root as provided by the local model, and global lexical features. Additionally, the three local feature sets are applied to the left and right Qsets.

7.3 Inference

For an unseen problem w, we first extract base Qsets from w. The goal is to find the most likely equation tree with minimum violation of hard and soft constraints. Using ILP(w) over these Qsets, we generate M candidate equation trees ordered by the sum of the weights of the constraints they violate. We compute the likelihood score given by Eqn. (1) for each candidate equation tree t, use this as an estimate of p(t|w), and return the candidate tree t* with the highest score. In Eqn. (1), the score of t is the product of the likelihood scores given by the local classifier for each operator in t and the Qsets over which it operates, multiplied by the likelihood score given by the global classifier for the correctness of t. If the resulting equation provides the correct answer for w, we consider inference successful.

8 Experiments

This section reports on three experiments: a comparison of ALGES with Kushman et al. (2014)'s template-based method, a comparison of ALGES with Hosseini et al. (2014)'s verb-categorization method, and ablation studies. The experiments are complicated by the fact that ALGES is limited to single equations, and the verb-categorization method can only handle single equations without multiplication or division. Our main experimental result is an improvement over the template-based method on single-equation algebra word problems. We further show that the template-based method depends on lexical and template overlap between its training and test sets; when these overlaps are reduced, its accuracy drops sharply. In contrast, ALGES is quite robust to changes in lexical and template overlap (see Tables 4 and 5).

Experimental Setup. We use the Stanford dependency parser in CoreNLP 3.4 (De Marneffe et al., 2006) to obtain syntactic information used for grounding and feature computation. For the ILP model, we use CPLEX 12.6.1 (IBM ILOG, 2014) to generate the top M = 100 equation trees with a maximum stack depth of 10, aborting exploration upon hitting 10K feasible solutions or 30 seconds. We use Python's SymPy package to solve equations for the unknown. For the local and global models, we use the LIBSVM package to train SVM classifiers (Chang and Lin, 2011) with RBF kernels that return likelihood estimates as the score.

Dataset. This work deals with grade-school algebra word problems that map to single equations of varying length. Every equation may involve multiple math operations, including multiplication, division, subtraction, and addition over non-negative rational numbers and one variable. The data is gathered from the http://math-aids.com, http://k5learning.com, and http://ixl.com websites, plus a subset of the data from Kushman et al. (2014) that maps word problems to single equations. We refer to this dataset as SINGLEEQ (see Table 9 for example problems). The SINGLEEQ dataset consists of 508 problems, 1,117 sentences, and 15,292 words.

Baselines. We compare our method with the template-based method (Kushman et al., 2014) and the verb-categorization method (Hosseini et al., 2014). For the template-based method, we use the fully supervised setting, providing equations for each training example.

8.1 Comparison With Template-Based Method

We first compare ALGES with the template-based method over SINGLEEQ. We evaluate both systems on the number of correct answers provided and report the average of a 5-fold cross validation. ALGES achieves 72% accuracy whereas the template-based method achieves 67% accuracy, a 15% relative reduction in errors (first columns of Tables 4 and 5). This result is statistically significant with a p-value of 0.018 under a paired t-test.

Lexical Overlap. By further analyzing SINGLEEQ, we noted that there is substantial overlap between the content words (common noun, adjective, adverb, and verb lemmas) in different problems. For example, many problems ask for the total number of seashells collected by two people on a beach, with only the names of the people and the number of seashells that each found changed. To analyze the effect of this repetition on the learning methods evaluated, we define a lexical overlap parameter as the total number of content words in a dataset divided by the number of unique content words. The two "seashell problems" have a high lexical overlap.
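The lexical overlap parameter defined above can be sketched in a few lines. Note the content-word extraction here is a simplification (whitespace tokens minus a tiny stopword list), not the POS-based lemma extraction the paper implies; the example sentences are illustrative:

```python
# Sketch of the lexical overlap parameter: total content-word tokens
# divided by the number of unique content-word types. Stopword list
# and tokenization are simplified stand-ins for POS-based lemmas.
STOP = {"a", "the", "of", "on", "did", "how", "many", "in", "and", "by"}

def lexical_overlap(problems):
    tokens = []
    for text in problems:
        for raw in text.lower().split():
            w = raw.strip(".,?!")
            if w and w not in STOP and not w.isdigit():
                tokens.append(w)
    return len(tokens) / len(set(tokens))

# Two near-duplicate "seashell problems" yield a high overlap value.
problems = [
    "Joan found 6 seashells on the beach . How many seashells did Joan find ?",
    "Sam found 6 seashells on the beach . How many seashells did Sam find ?",
]
print(lexical_overlap(problems))
```

A value near 1 indicates little repetition across problems; the near-duplicate pair above scores well over 2.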

Template Overlap. We also noted that many problems in SINGLEEQ can be solved using the same template, or equation tree structure above the leaf nodes. For example, a problem which corresponds to the equation (9 * 3) + 7 and a different problem that maps to (4 * 5) + 2 share the same template. We introduce a template overlap parameter defined as the average number of problems with the same template in a dataset.
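One simple way to realize the template notion above is to mask the leaf numbers of each equation and compare what remains. The equation strings and the regex-based masking are illustrative assumptions, not the paper's implementation:

```python
# Sketch of the template overlap parameter: two problems share a
# template when their equations match after masking leaf numbers.
import re
from collections import Counter

def template(equation):
    # Replace every number with a placeholder, keeping tree structure.
    return re.sub(r"\d+(\.\d+)?", "N", equation)

equations = ["(9 * 3) + 7", "(4 * 5) + 2", "(8 - 2) / 3"]
templates = Counter(template(e) for e in equations)
# Average number of problems per template:
overlap = len(equations) / len(templates)
print(overlap)  # 3 problems over 2 templates -> 1.5
```

Here "(9 * 3) + 7" and "(4 * 5) + 2" both mask to "(N * N) + N" and so share a template, while "(8 - 2) / 3" does not.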

Results. In our data, template overlap and lexical overlap co-vary. To demonstrate the brittleness of the template-based method simply, we picked three subsets of SINGLEEQ where both parameters were substantially lower than in SINGLEEQ and recorded the relative performance of the template-based method and of ALGES in Tables 4 and 5. The data used in both tables is the same, but the tables are separated for readability. The first column reports results for the SINGLEEQ dataset, and the other columns report results for the subsets with decreasing template and lexical overlaps. The subsets consist of 254, 127, and 63 questions respectively. We see that as the lexical overlap drops from 4.3 to 2.5 and as the template overlap drops from 10.4 to 2.1, the relative advantage of ALGES over the template-based method grows from 15% to 50%.

While the template-based method is able to solve a wider range of problems than ALGES, its accuracy falls off significantly when faced with fewer repeated templates or less spurious lexical overlap between problems (from 0.67 to 0.26). The accuracy of ALGES also declines, from 0.72 to 0.63 across the table, which warrants further investigation. In future work, we also need to investigate additional settings for the two parameters and to attempt to "break" their co-variance. Nevertheless, we have uncovered an important brittleness in the template-based method and have shown that ALGES is substantially more robust.

8.2 Comparison With Verb-Categorization

The verb-categorization method learns to solve addition and subtraction problems, while ALGES is capable of solving multiplication and division problems as well. We compare against their method over our dataset as well as the dataset provided by that work, here referred to as ADDSUB. ADDSUB consists of addition and subtraction word problems with the possibility of irrelevant distractor quantities in the problem text. The verb categorization method uses rules for handling irrelevant information. An example rule is to remove a Qset whose adjective is not consistent with the adjective of the target Qset. We augment ALGES with rules introduced in this method for handling irrelevant information in ADDSUB.
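The adjective-consistency rule quoted above can be sketched concretely. The `Qset` class here is a minimal stand-in (entity, value, adjective) for the paper's richer Qset representation, and the example problem is invented for illustration:

```python
# Sketch of the irrelevant-quantity rule: drop any Qset whose adjective
# conflicts with the adjective of the target Qset. "Qset" is a minimal
# stand-in for the paper's representation, not its actual definition.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Qset:
    entity: str
    value: float
    adjective: Optional[str] = None

def filter_irrelevant(qsets, target):
    kept = []
    for q in qsets:
        if (target.adjective and q.adjective
                and q.adjective != target.adjective):
            continue  # adjective clashes with the question's target Qset
        kept.append(q)
    return kept

# "Tom has 5 red marbles and 3 blue marbles. How many red marbles ...?"
qsets = [Qset("marble", 5, "red"), Qset("marble", 3, "blue")]
target = Qset("marble", 0, "red")
print(filter_irrelevant(qsets, target))  # keeps only the red-marble Qset
```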

Results, reported in Table 6, show comparable accuracy between the two methods on the Hosseini et al. (2014) data. Our method shows a significant improvement over theirs on the SINGLEEQ dataset due to the presence of multiplication and division operators; 40% of the problems in our dataset include these operators.


                       ADDSUB   SINGLEEQ
ALGES                   0.77     0.72
Verb-categorization     0.78     0.48
Error reduction          --      53%

Table 6: Accuracy of ALGES compared to the verb-categorization method.

8.3 Ablation Study

In order to determine the effect of various components of our system on its overall performance, we perform the following ablations:

No Local Model: Here, we test our method absent the local information (Section 7.1). That is, we generate equations using all ILP constraints, and score trees solely on information provided by the global model:

p(t | w) ∝ G_global(w, t).

No Global Model: Here, we test our method without the global information (Section 7.2). That is, we generate equations using only the hard constraints of ILP and score trees solely on information provided by the local model:

p(t | w) ∝ ∏_{t_i ∈ t} L_local(w, t_i).
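Taken together, the two ablations above each drop one factor from a combined tree score. A toy sketch, with placeholder scoring functions standing in for the trained SVM likelihood estimates:

```python
# Sketch of tree scoring: the full model multiplies the global tree
# score by a local score for each internal node; disabling a flag
# reproduces the corresponding ablation. score_local/score_global
# are toy stand-ins for the trained SVM likelihood estimates.
def score_tree(tree_nodes, score_local, score_global,
               use_local=True, use_global=True):
    score = 1.0
    if use_global:
        score *= score_global(tree_nodes)
    if use_local:
        for node in tree_nodes:
            score *= score_local(node)
    return score

local = lambda node: 0.9          # toy per-node likelihood
glob = lambda nodes: 0.8          # toy whole-tree likelihood
nodes = ["(9*3)", "(9*3)+7"]      # internal nodes of an equation tree
print(score_tree(nodes, local, glob))                   # full model
print(score_tree(nodes, local, glob, use_local=False))  # "No Local Model"
print(score_tree(nodes, local, glob, use_global=False)) # "No Global Model"
```

The scores are unnormalized; only their ranking over the top-M candidate trees matters.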

No Qset Reordering: We test our method without the deterministic Qset reordering rules outlined in Section 5. Instead, we allow the ILP to choose the top M equations regardless of order.

Results in Table 7 show that each component of ALGES contributes to its overall performance on the SINGLEEQ corpus. We find that both the global and local models contribute significantly to the overall system, demonstrating the value of a bottom-up approach to building equation trees.

Importance of Features. We also evaluate the accuracy of the local Qset relationship model (Section 7.1) on the task of predicting the correct operator for a pair of Qsets s1, s2 over the SINGLEEQ dataset using a 5-fold cross validation. Table 8 shows the value of each feature group used in the local classifier, and thus the importance of details of the Qset representation.

8.4 Qualitative Examples and Error Analysis

Table 9 shows some examples of problems solved by our method. We analyzed 72 errors made by ALGES on the SINGLEEQ dataset.

((20 + ((12 + 20) − 8)) − 5) = x

Maggie bought 4 packs of red bouncy balls, 8 packs of yellow bouncy balls, and 4 packs of green bouncy balls. There were 10 bouncy balls in each package. How many bouncy balls did Maggie buy in all?
x = (((4 + 8) + 4) * 10)

Sam had 79 dollars to spend on 9 books. After buying them he had 16 dollars. How much did each book cost?
79 = ((9 * x) + 16)

Fred loves trading cards. He bought 2 packs of football cards for $2.73 each, a pack of Pokemon cards for $4.01, and a deck of baseball cards for $8.95. How much did Fred spend on cards?
((2 * 2.73) + (4.01 + 8.95)) = x

Table 9: Examples of problems solved by ALGES together with the returned equation.

We group these errors into several categories. Parsing errors cause a wrong grounding into the designed representation: for example, the parser treats 'vanilla' as a noun modified by the number '19', leading our system to treat 'vanilla', rather than 'cupcake', as the entity of a Qset. A portion of errors are attributable to grounding and ordering issues: for instance, the system fails to correctly distinguish between the sets of wheels, and so does not get the movement-triggering container relationships right. Semantic limitations are another source of errors; for example, ALGES does not model the semantics of 'three consecutive numbers'. A fourth category covers errors caused by a lack of world knowledge (e.g., that a 'week' corresponds to '7 days'). Finally, ALGES is not able to infer quantities that are not explicitly mentioned in the text; for example, the number of people may need to be inferred by counting the proper names in the problem.

9 Conclusion

In this work we have outlined a method for solving grade-school algebra word problems. We have empirically demonstrated the value of our approach against state-of-the-art word problem solving techniques. Our method grounds quantity references, utilizes type-consistency constraints to prune the search space, learns which algebraic operators are indicated by text, and ranks equations according to a global objective function. ALGES is a hybrid of the previous template-based and verb-categorization state-based methods for solving such problems. By learning correspondences between text and mathematical operators, we extend the method of state updates based on verb categories. By learning to re-rank equation trees using a global likelihood model, we extend the method of mapping word problems to equation templates.

Different components of ALGES can be adapted to other domains of language grounding that require cross-sentence reasoning. Future work involves extending ALGES to solve higher grade math word problems including simultaneous equations. This can be accomplished by extending the variable grounding step to allow multiple variables, and training the global equation model to recognize which quantities belong to which equation. The code and data for ALGES are publicly available.

Problems involving simultaneous equations require combining multiple equation trees, one per equation.

These reordering rules are intentionally minimal, but they provide some gain over both preserving the textual order of quantities and treating ordering as a soft constraint. See Table 7.

These hyper-parameters were chosen based on experimentation with a small subset of the questions. A more systematic choice may improve overall performance.

Figure 4: Features used for local and global models, for left Qset A and right Qset B
Table 10: Examples of different error categories and relative frequencies. Sources of errors are underlined.