Question Decomposition with Dependency Graphs

Matan Hasson
Jonathan Berant
ArXiv
2021
View in Semantic Scholar

Abstract

QDMR is a meaning representation for complex questions, which decomposes questions into a sequence of atomic steps. While stateof-the-art QDMR parsers use the common sequence-to-sequence (seq2seq) approach, a QDMR structure fundamentally describes labeled relations between spans in the input question, and thus dependency-based approaches seem appropriate for this task. In this work, we present a QDMR parser that is based on dependency graphs (DGs), where nodes in the graph are words and edges describe logical relations that correspond to the different computation steps. We propose (a) a nonautoregressive graph parser, where all graph edges are computed simultaneously, and (b) a seq2seq parser that uses gold graph as auxiliary supervision. We find that a graph parser leads to a moderate reduction in performance (0.47→0.44), but to a 16x speed-up in inference time due to the non-autoregressive nature of the parser, and to improved sample complexity compared to a seq2seq model. Second, a seq2seq model trained with auxiliary graph supervision has better generalization to new domains compared to a seq2seq model, and also performs better on questions with long sequences of computation steps.

1 Introduction

Training neural networks to reason over multiple parts of their inputs across modalities such as text, tables, and images, has been a focal point of interest in recent years (Antol et al., 2015; Pasupat and Liang, 2015; Johnson et al., 2017; Suhr et al., 2019; Welbl et al., 2018; Talmor and Berant, 2018; Yang et al., 2018; Hudson and Manning, 2019; Dua et al., 2019; Chen et al., 2020; Hannan et al., 2020; Talmor et al., 2021) . The most common way to check whether a model is capable of complex reasoning, is to pose in natural language a complex question, which requires performing multiple steps of computation over the input.

To address the need for better understanding of complex questions, recently proposed QDMR, a meaning representation where complex questions are represented through a sequence of simpler atomic executable steps (see Fig. 1 ), and the final answer is the answer to the final step. QDMR has been shown to be useful for multi-hop question answering (QA) and also for improving interpretability in visual QA (Subramanian et al., 2020) .

Figure 1: An example question with its corresponding QDMR structure (top), dependency graph over the question tokens (bottom), and an intermediate logical form (LF) used for the QDMR→DG conversion and for evaluation.

State-of-the-art QDMR parsers use the typical sequence-to-sequence (seq2seq) approach. However, it is natural to think of QDMR as a dependency graph over the input question tokens. Consider the example in Fig. 1 . The first QDMR step selects the span "Indiana Jones". Then, the next step uses a PROJECT operation to find the "movies" of Indiana Jones, and then next step uses another PROJECT operation to find the date when the movies were "released". Such relations can naturally be represented as labeled edges over the relevant question tokens, as shown in Fig. 1 , bottom.

In this work, we propose to use the dependency graph view of QDMR to improve QDMR parsing. We describe a conversion procedure that automatically maps QDMR structures into dependency graphs, using a structured intermediate logical form representation (Fig 1, middle) . Once we have graph supervision for every example, we train a dependency graph parser, in the spirit of Dozat and Manning (2018) , where we predict a labeled relation for every pair of question tokens, representing the logical relation between the tokens. Unlike seq2seq models, this is a non-autoregressive parser, which decodes the entire output structure in a single step.

A second method to exploit the graph supervision is to train a seq2seq model, but have an auxiliary loss term where the graph is decoded from the encoder representations. Towards that end, we propose a Latent-RAT encoder, which uses relationaware transformer (Shaw et al., 2018) to explicitly represent the relation between every pair of input tokens. Relation-aware transformer has been shown to be useful for encoding graph structures in the context of semantic parsing .

Last, we propose a new evaluation metric, LF-EM, for QDMR parsing, which is based on the aforementioned intermediate logical form, and show it correlates better with human judgements compared to existing metrics.

We find that our graph parser leads to a small reduction in LF-EM compared to seq2seq models (0.47→0.44), but is 16x faster due to its nonautoregressive nature. Moreover, our graph parser has better sample complexity and outperforms the seq2seq model when trained on 10% of the data or less. When training a seq2seq model along with the auxiliary graph supervision, we find that the parser obtains similar performance when trained on the entire dataset (0.471 LF-EM), but substantially improves performance when generalizing to new domains. Moreover, The Latent-RAT parser performs better on examples with a large number of computation steps.

Our code is available at https://github.

2 Overview

The core of this work is to examine the utility of a dependency graph (DG) representation for QDMR. We propose conversion procedures that enable training and evaluating with DGs (see Fig. 2 ). First, we convert gold QDMR structures into logical forms (LF), where each computation step in QDMR is represented with a formal operator, properties and arguments ( §3.1). Then, we obtain gold DGs by projecting the logical forms onto the question tokens ( §4). Once we have question-DG pairs, we can train a graph parser. At test time, QDMRs and DGs are converted into LFs for evaluation. We propose a new evaluation metric over LFs ( §3.2), and show it is more robust to semantic-preserving changes compared to prior metrics. Our proposed parsers are in §5. On top of standard seq2seq models, we describe (a) a graph parser, and (b) a multi-task model, where the encoder of the seq2seq model is also trained to predict the DG.

Table 1: LF operators, properties and arguments (partial list, see Table 6 for full list).

Table 2: Normalized EM and LF-EM on the development and test sets of BREAK.

Table 3: Domain Generalization. LF-EM on the development set per sub-domain when training on the entire training set (top), and when training on all domains except the target one (middle). The bottom section is the performance drop from the full setup to the DomGen setup.

Table 4: Development set LF-EM as a function of the size of the training set.

Table 5: Error classes and their frequency over a sample of 30 random errors. Model names were shortened from CopyNet+BERT, BiaffineGP and Latent-RAT.

3 Qdmr Logical Forms

QDMR is a text-based meaning representation focused on representing the meaning of complex questions. It is based on a decomposition of questions into a sequence of simpler executable steps (Fig. 1) , where each step corresponds to a SQL-inspired operator (Table 6 in Appendix §A.1). We briefly review QDMR and then define a logical form (LF) representation based on these operations. We use the LFs both for map-1. return census groups 2. return #1 that is Pacific islander 3. return #1 that is African American 4. return size of #2 5. return size of #3 6. return which is lowest of #4 , #5

Table 6: LF operators, properties and arguments. Each QDMR step can be mapped to one of the above operators, where its LF consists of its operator, properties and arguments. The example column shows an example for such LF.

Vq = {. . . group, groups, . . . small, smaller, smallest, . . . } V 5 ref = {#1, #2, #3, #4} Vconst = Vop ∪ Vstore ∪ Vaux Vop = {difference, sum, lowest, highest, for each, .

. . } Vstore = {population, size, elevation, flights, price, date . . . } Vaux = {a, is, are, of, that, the, with, was, did, to . . . } Figure 3 : QDMR annotation vocabularies. Each example is annotated with a lexicon that consists of: V q , the question tokens and their inflections; V i ref , references to previous steps; V const , constant terms including operational terms (V op ), domain-specific words that are not in the question, such as size (V store ); and auxiliary words like prepositions (V aux ). Boldface indicates words used in the QDMR structure.

Figure 3: QDMR annotation vocabularies. Each example is annotated with a lexicon that consists of: Vq , the question tokens and their inflections; Viref, references to previous steps; Vconst, constant terms including operational terms (Vop), domain-specific words that are not in the question, such as size (Vstore); and auxiliary words like prepositions (Vaux). Boldface indicates words used in the QDMR structure.

ping QDMRs to DGs, and also to fairly evaluate the output of parsers that output either QDMRs directly or DGs.

3.1 Definitions

QDMR Definition Given a question with n tokens, q = q 1 , . . . , q n , its QDMR is a sequence of m steps s 1 , . . . , s m , where step s i conceptually maps to a single logical operator

o i ∈ O. A step, s i , is a sequence of n i tokens s i = s i 1 , . . . s i n i , where token s i

j is either a question token ∈ V q (or some inflection of it), a word from a constant predefined lexicon ∈ V const , or a reference token

∈ V i ref = {#1, . . . , #(i − 1)},

referring to a previous step. Fig. 3 shows an example for a question and its QDMR structure.

QDMR Logical Form (LF) Given a QDMR S = q; s 1 , . . . , s m , its logical form is a sequence of logical form steps Z = q; z 1 , . . . , z m . The LF step z i , corresponding to s i , is a triplet z i = o i , ρ i , A i where o i ∈ O is the logical operator; ρ i ∈ PROP(o i ) are operator-specific properties; and A i is a dictionary of arguments, mapping an operator-specific argument η ∈ ARG(o i ) to a span τ from the QDMR step s i . For convenience, we denote z i with the string Table 6 for full list).

o i [ρ i ](η i 1 = τ i 1 , . . . ).

pendix provides the full list. Imposing more structure on QDMR through the LF is useful (a) for evaluation ( §3.2), where LFs detect differenet QDMR phrasings that have identical meaning, and (b) for creating DGs, since the operators, properties and arguments, will be used to define the labels of edges in the DG (see Fig 1) .

QDMR→LF We convert QDMRs to LFs with a rule-based method, extending the procedure for detecting operators from to also find properties and arguments. To detect properties we use a lexicon, see details in Table 7 (Appendix §A.1), and in our public implementation.

Table 7: Property lexicon. Tokens for detecting the properties of a QDMR step, for creating its logical form.

3.2 Lf-Based Evaluation (Lf-Em)

The official evaluation metric for QDMR 1 is normalized EM (NormEM), where the predicted and gold QDMR structures are normalized using a rulebased procedure, and then exact string match is computed between the two normalized QDMRs. Since in this work we convert both QDMRs ( §3.1) and DGs ( §4) to LFs, we propose LF-EM, a LFbased evaluation metric, and show that it correlates better with notions of semantic equivalence. LF-EM essentially involves computing exact match between the predicted and gold LFs, using the LF described above. To further capture semantic equivalences, we perform a few more normalization steps. We briefly describe these normalizations, and give the full description in §A.2.

Given a logical form Z, we transform each step to a normalized form, and the final textual rep- resentation is given by representing each step as described in §3: OPERATOR[property](arg=. . . ; . . . ). We apply the following steps ( Fig. 4

Figure 4: An illustration of LF normalization. Normalization is done on the LF Z, and we present QDMR steps for ease of reading.

):

Remove and normalize tokens Each LF step includes a list of tokens in its arguments. In this normalization step, we remove lexical items, such as "max", which are used to detect the operator and property (Table 7 in §A.1), as those are already represented outside the arguments. In addition, we remove words from a stop word list (V aux , see Fig. 3 ). Finally, we use a synonym list to represent words in such a list with a single representative (countries→country).

Merge Steps QDMR annotations sometime vary in their granularity. For example one example might contain 'return metal objects', while another might have 'return objects; return #1 that are metal'. This is especially common in FILTER and PROJECT steps. We merge chains of FILTER steps, as well as FILTER or PROJECT steps that follow a SELECT step. See details in §A.2.

Reorder steps QDMR describes a directed acyclic graph of computation steps, and there are multiply ways to order the steps (Fig. 4) . We recursively compute the layer of each step as layer(s) = max {layer(s ref 1 ), . . . } + 1, where the maximization is over all steps s refers to. We then re-order steps by layer and then lexicographically.

We manually evaluate the metrics normalized EM and LF-EM on 50 random development set ex-amples using predictions from the CopyNet-BERT model (see §6). We find that both (binary) metrics have perfect precision: they only assign credit when indeed the QDMR reflects the correct question decomposition, as judged by the authors. However, LF-EM covers more examples, where the LF-EM on this sample is 52.0, while normalized EM is 40.0. Thus, LF-EM provides a tighter lower bound on the performance of a QDMR parser and is a better metric for QDMR parsing.

4 From Lfs To Dependency Graphs

Given a QDMR decomposition S = q; s 1 , . . . , s m , we construct a dependency graph G = N , E , where the nodes N correspond to question tokens, and the edges E describe the logical operations, resulting in a graph with the same meaning as S.

The LF→DG procedure is shown in Fig. 5 and consists of the following steps:

Figure 5: Dependency graph creation: (1) Token alignment: align question tokens and QDMR step tokens. (2) Logical Form: extract the LF of the QDMR. (3) SDG extraction: induced from the LF and the alignment. (4) DG Creation: convert the SDG to a DG.

• Token alignment: align each token in the question to a token in a QDMR step. • Spans Dependency Graph (SDG) extraction: construct a graph where each node corresponds to a list of tokens in a QDMR step, and edges describe the dependencies between the steps. • Dependency Graph (DG) extraction: convert the SDG to a DG over the question tokens.

Here, we add span edges for tokens that are in the same step, and deal with some representation issues (see §4.3). Because we convert predicted DGs to LFs for evaluation, the LF→DG conversion must be invertible. We now describe the details of the LF→DG conversion. Our conversion succeeds in 97.12% of the BREAK dataset .

4.1 Token Alignment

We denote the question tokens by q = q 1 . . . q n and the ith QDMR step tokens by ∀i ∈

[1..m], s i = s i 1 . . . s i n i . An alignment is defined by M = {(q i , s k j ) | q i ≈ s k j ; i ∈ [1..n], k ∈ [1..m], j ∈ [1..n k ]},

where by t ≈ t we mean t, t are either identical or equivalent. Roughly speaking, these equivalences are based on BREAK annotation lexicon ( Fig. 3 ) -in particular. the inflections of the question tokens V q (e.g , "object" and "objects"), and equivalence classes on top of the constant lexicon V const (e.g , "biggest" and "longest"). See Table 8 in Appendix §A.2 for more details. To find the best alignment M , we formulate an optimization problem in the form of an Integer Linear Program (ILP) and use a standard ILP solver. 2 The full details are given as a part of our open source implementation. The objective function uses several heuristics to assign a high score to an alignment the has the following properties ( Fig. 6 ):

Table 8: BREAK Equivalence Classes. (1) Modifications - the same modifications of the question tokens that were used for creating BREAK annotation lexicon (e.g plural/singular form, nounify adjectives, lemmatize adjectives, lemmatize verbs); (2) Operational equivalence induced from properties lexicon; (3) Manually-defined Synonyms lexicon; We mostly retrieve the final equivalence classes by merging classes that share some tokens.

• Minimalism: Aligning each question token to at most one QDMR step token and vice versa is preferable. • Exact Match: Aligning a question token to a QDMR token that is identical is preferable. • Sequential Preference: Aligning long sequences from the question to a single step is preferable (when a step has a reference token (#1), we take into account the tokens in the referenced step, see Fig. 6 , top right). • Steps Coverage: Covering more steps is preferable.

Figure 6: Heuristics for token alignment. Potential tokens for alignment colored, where the preferable choice according to the heuristic is underlined. On the top left, the second occurrence of “water” is preferred in QDMR step #1 due to the adjacent word “buffalo”. On the top right, the second occurrence of “name” is preferred in QDMR step #5, because this step refers to #4 that contains the word “drivers”.

4.2 Spans Dependencies Extraction

Given the QDMR, LF, and alignment M , we construct the Span Dependency Graph (SDG). Each QDMR step is a node labeled by a list of tokens (spans). The list of tokens is computed with the alignment M , where given a QDMR step s k , the list contains all question tokens q i , such that (q i , s k j ) ∈ M , where s k j is a word in s k . The list is ordered according to the position in the question.

Edges in the SDG are computed using reference tokens. If step s i has a reference token to step s j , Figure 6 : Heuristics for token alignment. Potential tokens for alignment colored, where the preferable choice according to the heuristic is underlined. On the top left, the second occurrence of "water" is preferred in QDMR step #1 due to the adjacent word "buffalo". On the top right, the second occurrence of "name" is preferred in QDMR step #5, because this step refers to #4 that contains the word "drivers".

we add an edge (s i , s j ) (we abuse notation and refer to SDG nodes and QDMR steps with the same notation). Each edge has a tag, which is a triple consisting of the operator o i of the source node s i , the property ρ i of the source node, and the named argument η i ref that contains the reference token. For readability we denote the tag triplet

tag(i, j) = o i , ρ i , η i ref by o i -η i ref [ρ i ].

4.3 Sdg→Dg

We construct a DG by projecting the SDG on the question tokens. This is done by: (a) For each SDG node and its list of tokens, add edges between the tokens from left-to-right with a new span tag (black edges in Fig 5) ; (b) use the rightmost word in every span as its representative for the edges between different spans.

However, this transformation is non-trivial for two reasons. First, some SDG nodes do not align to any question token, Second, some question tokens align to multiple SDG nodes, which does not allow the DG to be converted back to an SDG unambiguously for evaluation. We now explain how we resolve such representation issues, mostly based on adding more tokens to the input question. Fig. 7 illustrates the different types of challenges and our proposed solution.

Domain-specific concepts QDMR annotators were allowed to use a small number of tokens that are pragmatically assumed to exist in the domain (V store in Fig. 3 ). For example, when annotating ATIS questions (Hemphill et al., 1990 ), the word "flight" is allowed to be used in the QDMR structure even if it does not appear in the question, since this is a flight-reservation domain. We concatenate all the words in V store to the end of each question after a special separator token, which allows token alignment ( §4.1) to map such QDMR steps to a question word (Fig. 7, top) .

Figure 7: Representation Issues. Projecting the SDG over the question token (DG) is not always trivial. We solve this by concatenating special tokens to the question.

Empty SDG nodes some steps only contains tokens that are not in the question (e.g, "Number of #2" in Fig. 7 bottom) , and thus their list of tokens in the SDG node is empty. In this case, we cannot ground the SDG node in the question. Therefore we add a constant number of dummy tokens, [DUM] , which are used to ground such SDG nodes.

Single tokens to multiple SDG nodes A single question token can be aligned to multiple SDG nodes. Recall the tokens of each SDG nodes are connected with a chain of span edges. This leads to cases where two chains that pass through the same question token cannot be distinguished when converting the DG back to an SDG for evaluation. We solve this by concatenating a constant number of special [DUP] tokens that conceptually duplicate another token by referring to it with a new duplicate tag. Now, each span chain uses a different copy of the shared token by referring to the [DUP] instead of the original one.

5 Models

Once we have methods to convert QDMRs to DGs and LFs, and DGs to LFs, we can evaluate the advantages and disadvantages of standard autoregressive decoders compared to graph-based parsers. We describe three models: (a) An autoregressive parser, (b) a graph parser, (c) an autoregressive parser that is trained jointly with a graph parser in a multi-task setup. For a fair comparison, all models have the same BERT-based encoder (Devlin et al., 2019) .

CopyNet-BERT (baseline) This autoregressive QDMR parser is based on the CopyNet baseline from Wolfson et al. (2020), except we replace the BiLSTM encoder with a transformer initialized with BERT. The model encodes the question q and then decodes the QDMR S step-by-step and tokenby-token.

The QDMR decoder is an LSTM (Hochreiter and Schmidhuber, 1997) augmented with a copy mechanism (Gu et al., 2016) , where at each time step the model either decodes a token from the vocabulary or a token from the input. Since the input is tokenized with word pieces, we average word pieces that belong to a single word to get word representations, which enables word copy-ing. Training is dones with standard maximum likelihood.

Biaffine Graph Parser (BiaffineGP) The biaffine graph parser takes as input the quesiton q augmented with the special tokens described in §4.3 and predicts the DG by classifying for every pair of tokens whether there is an edge between them and the label of the edge. The model is based on the biaffine dependency parser of Dozat and Manning (2018) , except here we predict a DAG and not a tree, so each node can have more that one outgoing edge. Let H = h 1 , . . . , h |H| be the sequence of representations output by the BERT encoder. The biaffine parser uses four 1-hidden layer feed-forward networks over each contextualized token representation h l :

h edge-head l = FF edge-head (h l ), h edge-dep l = FF edge-dep (h l ), h label-head l = FF label-head (h l ), h label-dep l = FF label-dep (h l ).

The probability of an edge from token i to token j is given by σ(h

edge-dep i W edge h edge-head j )

, where W edge is a parameter matrix. Similarly, the score of an edge labeled by the tag t from token i to token j is given by s

t ij = h label-dep i W t h label-head j

, where W t is the parameter matrix for this tag. We then compute a distribution over the set of tags T with softmax(s 1 ij , . . . , s

|T | ij )

. Training is done with maximum likelihood both on the edge probabilities and label probabilities. Inference is done by taking all edges with edge probability > 0.5 and then labeling those edges according to the most probable tag.

There is no guarantee that the biaffine parser will output a valid DG. For example, if an SDG node has an outgoing edge labeled with filter-sub and another labeled with project-sub, we cannot tell if the operator is FILTER or PROJECT. This makes parsing fail, which occurs in 1.83% of the cases. To create a SDG, we first use the span edges to contruct SDG nodes with lists of tokens, and then add edges between SDG nodes by projecting the edges between tokens to edges between the SDG nodes. To prevent cases where parsing fails, we can optionally apply an ILP that takes the predicted probabilities as input, and outputs a valid DG. The exact details are given in our open source implementation. Figure 8 : Latent-RAT architecture. The encoder hidden states are used to represent the relations between the question tokens, r ij . Then, these representations are used for both (1) direct prediction of the dependency (tag) between the tokens (graph-head); and (2) augment the encodings via RAT layers (seq2seq-head). The subnetworks for r K ij , r V i,j are symetric, and represented by the r ij .

Figure 8: Latent-RAT architecture. The encoder hidden states are used to represent the relations between the question tokens, rij . Then, these representations are used for both (1) direct prediction of the dependency (tag) between the tokens (graph-head); and (2) augment the encodings via RAT layers (seq2seq-head). The subnetworks for rKij , r V i,j are symetric, and represented by the rij .

Multi-task Latent-RAT Encoder In this model, our goal is to improve the sequence-to-sequence parser by providing more information to the encoder using the DG supervision. Our model will take the question q (with special tokens as before) as input, and predict both the graph G directly and the QDMR structure S with a decoder.

We would like the information on relations between tokens to be part of the transformer encoder, unlike the biaffine parser that uses separate feedforward networks for that, so that the decoder can take advantage of this information. To accomplish that, we use RAT transformer layers (Shaw et al., 2018; , which explicitly represent relations between tokens, and have been shown to be useful for encoding graphs over input tokens in the context of semantic parsing.

RAT layers inject information on the relation between tokens inside the transformer self-attention mechanism (Vaswani et al., 2017) . Specifically, the similarity score e ij computed using queries and keys is given by:

e ij ∝ h i W Q (h j W K + r K ij ) T ,

where W Q , W K are the query and key parameter matrices and the only change is the term r K ij , which represents the relation between the tokens i and j. Similarly, the relation between tokens is also considered when summing over self-attention values:

H j=1 α ij (x j W V + r V ij ),

where W V is the value parameter matrix, α ij is the attention distribution and the only change is the term r V ij . Unlike prior work where the terms r K ij , r V ij were learned parameters, here we want these vectors to (a) be a function of the contextualized representation and (b) be informative for classifying the dependency label in the gold graph. By learning latent representations from which the gold graph can be decoded, we will provide useful information for the sequence-to-sequence decoder. Specifically, given a RAT layer with representations h i , h j for tokens i and j, we represent relations and compute a loss in the following way (see Figure 8 ):

r K ij = F F K (h i − h j ), S K = R K W out + b K ∈ R n×n×|T | , Loss K = CE(S K ).

F F K is a 1-hidden layer feed-forward network, R K ∈ R (n×n)×d transformer is a concatenation of all r K ij for all pairs of tokens, W out ∈ R d transformer ×|T | is a projection matrix that provides a score for all possible labels (including the NONE label).

We compute an analogous loss Loss V for r V ij and the final graph loss is Loss K + Loss V over all RAT layers. To summarize, by performing multi-task training with this graph loss we push the transformer to learn representations r ij that are informative of the gold graph, and can then be used by the decoder to output better QDMR structures.

6 Experiments

We now describe our empirical evaluation of the models described above.

6.1 Experimental Setup

We build our models in AllenNLP (Gardner et al., 2018) , and use BERT-base (Devlin et al., 2019) to produce contextualized token representations in all models. We train with the Adam optimizer (Kingma and Ba, 2015). Our Latent-RAT model include 4 RAT layers, each with 8 heads. Full details on hyperparameters and training procedure in Appendix §A.3.

We examine the performance of our models in three setups:

• Standard: We use the official BREAK dataset.

• Sample Complexity (SC): We examine the performance of models with decreasing amounts of training data. The goal is to test which model has better sample complexity.

• Domain Generalization (DomGen): We train on 7 out of 8 sub-domains in BREAK and test on the remaining domain, for each target domain.

The goal is to test which model generalizes better to new domains.

As an evaluation metric, we use LF-EM and also the official BREAK metric, normalized EM, when reporting test results on BREAK. Table 2 compares the performance of the different models ( §5) to each other and to the top entries on the BREAK leaderboard. As expected, initializing CopyNet with BERT dramatically improves test performance (0.388→0.47). The Latent-RAT sequence-tosequence model achieves similar performance (0.471), and the biaffine graph parser, BiaffineGP, is slightly behind with an LF-EM of 0.44 (but has faster inference, as we show below). Adding an ILP layer on top of BiaffineGP to eliminate constraint violations in the output graph improves performance to 0.454.

Standard Setup

While our proposed models do not significantly improve performance in the LF-EM setup, we will see next that they improve domain generalization and sample complexity. Moreover, since BiaffineGP is a non-autoregressive model that predicts all output edges simultaneously, it dramatically reduces inference time.

Last, the top entry on the BREAK leaderboard uses BART (Lewis et al., 2020) , a pre-trained seq2seq model (we use a pre-trained encoder only), which leads to a state-of-the-art LF-EM of 0.496. Table 3 shows LF-EM on each of BREAK's sub-domains when training on the entire dataset (top), when training on all domains but the target domain (middle), and the relative drop compared to the standard setup (bottom).

Domain Generalization

The performance of BiaffineGP and Latent-RAT is higher compared to CopyNET+BERT in the DomGen setup. In particular, the performance of Latent-RAT is the best in 7 out of 8 sub-domains, and the performance of BiaffineGP is the best in the last domain. Moreover, Latent-RAT outperforms CopyNet+BERT in all sub-domains. We also observe that the performance drop is lower for BiaffineGP and Latent-RAT compared to Copy-Net+BERT. Overall, this shows that using graphs as a source of supervision leads to better domain generalization. Table 4 shows model performance as a function of the size of the training data. While the LF-EM of BiaffineGP is lower given the full training set (Table 2) , when the size of the training data is small it substantially outperforms other models, improving performance by 3-4 LF-EM points given 1%-10% of the data. With 20%-50% of the data Latent-RAT and Copy-Net+BERT have comparable performance.

Sample Complexity

Inference time for the graph parser The graph parser, BiaffineGP, is a non-autoregressive model that predicts all output edges simultaneously, as opposed to a sequence-to-sequence model that decodes a single token at each step. We measure the average runtime of the forward pass for both BiaffineGP and CopyNet+BERT and find that Bi-affineGP has an average runtime of 0.08 seconds, compared to 1.306 seconds of CopyNet+BERT -a 16x speed-up.

6.3 Analysis

Model agreement Figure 9 shows model agreement between CopyNet+BERT, BiaffineGP, and Latent-RAT on the development set. Roughly 60% of the examples are predicted correctly by one of the models, indicating that enembling the three models could result in further performance improvement. The agreement of Latent-RAT with Copy-Net+BERT (5.5%) and BiaffineGP (4.27%) is greather than the overlap between CopyNet+BERT and BiaffineGP, perhaps since it is a hybrid of a seq2seq and graph parser. Moreover, in 3.83% of the examples, Latent-RAT is the only model with a correct prediction.

Figure 9: Model agreement in terms of LF-EM on the development set. Each slice gives the fraction of examples predicted correctly by a subset of models.

Length Analysis

We compared the average LF-EM of models for each possible number of steps in the QDMR structure (Fig. 10) . We observe that CopyNet+BERT outperforms Latent-RAT when the number of steps is small, but once the number of steps is ≥ 5, Latent-RAT outperforms Copy-Net+BERT, showing it is handles complex decompositions better, and in agreement with the tendency of sequence-to-sequence models to struggle with long output sequences.

Figure 10: LF-EM on the development set per number of steps. We compute for each example its (gold) number of steps, and calculate the average LF-EM per bin.

Error analysis We randomly sampled 30 errors from each model and manually analyzed them. Table 5 describes the error classes for each model, and Appendix A.4 provides examples for these classes. Each example can have more than one error cateogry.

For all models, the largest error category is actually cases where the prediction is correct but not captured by the LF-EM metric: 70% for Copy-Net+BERT, 40% for BiaffineGP and 73.3% for Latent-RAT. This shows that the performance of current QDMR parsers is actually quite good, but capturing this with an automatic evaluation is challenging. Cases where the output is correct include:

• Equivalent Solutions: the prediction is logically equivalent to the gold structure.

• Elaboration Level: the model prediction is more/less granular compared to the gold structure, but the prediction is correct.

• Redundancy: additional information is predicted/omitted that does not effect the computation. For example the second occurrence of "yard's' in , "2. return yards of #1; 3. #1 where #2 is lower than 10 yards".

• Wrong Gold -cases where the predication is more accurate than the gold decomposition. The main classes of errors are: • Missing Information: missing steps, missing ref-

erences or missing tokens that affect the result of the computation.

• Additional Steps: duplicate steps or additional steps that change the result of the computation.

• Wrong Global Structure: The computation described by the predicted structure is wrong (for example, addition instead of subtraction).

• Wrong Step Structure: incoherent structure of a particular step that cannot be mapped to a proper structure. • Out of Vocabulary: seq2seq models sometimes predict tokens that are not related to the question nor the decomposition. For example, "rodents" in a question about flowers.

7 Conclusion

In this work, we propose to represent QDMR structures with a dependency graph over the input tokens, and propose a graph parser and a seq2seq model that uses graph supervision as an auxiliary loss. We show that a graph parser is 16x faster than a seq2seq model, and that it exhibits better sample coplexity. Moreover, using graphs as auxiliary supervision improves out-of-domain generalization and leads to better performance on questions that represent a long sequence of computational steps. Last, we propose a new evaluation metric for QDMR parsing and show it better corresponds to human intuitions.

In future work, we will examine ensemble models that take into account the complementary nature of graph parsers and seq2seq parser to further improve performance on QDMR parsing.

A Appendices

A.1 QDMR LF Table 6 shows the different operators, their properties and examples of LFs. Table 7 shows terms that are used to identify the QDMR step operator's properties. We use the same lexicon from BREAK for detecting operators, extended with some specifications for numeric properties such as equals 0.

A.2 Lf-Based Evaluation (Lf-Em)

In §3.2 we described a LF-based evaluation metric. Given a logical form of a QDMR, Z, the metric transforms it to a normalized form Z norm in the following way: (1) Normalize the steps by removing unnecessary tokens and replace equivalents; (2) Merge steps with their referrer; and (3) Reorder the steps in a consistent order. Now we describe these steps more formally.

Normalize Steps Let z = o, ρ, A be a logical form of step s, where

A = { η a , τ a | η a ∈ PROP o , τ a ⊆ s} |A| a=1

is the named-arguments set. Note here τ is a set of tokens instead of subsequence of s. A normalization transformation is a function T o,ρ : s → {∅} ∪ V q ∪ V store mapping each token to an equivalent token out of the allowed vocabulary or to ∅ for removal. We denote by T o,ρ (A) := { η, {T o,ρ (τ 1 ), . . . , T o,ρ (τ |τ | )} | η, τ ∈ A}, i.e, applying a transformation on the named-arguments set is defined by applying it on each of the arguments tokens. The final normalized form of l is given by applying multiple transformations on the arguments, removes the property lexicon entries of properties ρ. This information is already given by ρ and therefore is not needed to take a part in the arguments. Table 7 shows the lexicon entries for each property.

z norm = o, ρ, T rep o,ρ • T aux o,ρ • T prop o,ρ (A) .

T aux o,ρ removes uninformative tokens, such as prepositions. These V aux were added in the first place to allow continuous fluent sentences to be written.

T rep o,ρ maps a token to its representative token. Recall BREAK samples where annotated based on an allowed vocabulary, which is built on top of the question tokens variations and some additional ones. We define equivalence classes of break equivalent tokens and set a representative for each class. Table 8 shows some examples for these classes.

Merge Steps

The level of elaboration varies between annotators, leading to implicit steps, i.e, steps that are contained in other steps. This is especially common in FILTER and PROJECT steps. Therefore we offer a merging mechanism for group-

, η src , o dst , o out , f ρ , f η

where o src , η src are the referrer step operator and argument name with the observed reference, o dst is the referred step operator, o out is the merged step operator and f ρ : PROP o src × PROP o dst → PROP o out , f η : ARG o src ,ρ src ∪ ARG o dst ,ρ dst → ARG o out ,ρ out are mappings for the merged step properties and named arguments. The values for the arguments, τ , are induced by merging the values of the arguments names that are mapped to the merged argument name. In particular, we use the following merging rules:

• project-sub → select = project Collapse select step that is referred by a project step as its subject.

o src , η src , o dst , o out = project, sub, select, project f ρ (ρ src , ρ dst ) = ρ src f η (a) = a, if a ∈ ARG src sub, if a ∈ ARG dst

• filter-sub → select = filter Collapse select step that is referred by a filter step as its subject.

• filter-sub → filter = filter Collapse filter step that is referred by another filter step as its subject. This rule deals with filter chains, when a sequence of filters referred by each other with sub argument, any order of them has the same meaning.

Reorder Execution Graph In some cases there are multiple possible sequential orderings for the same execution graph, for example for parallel execution branches (Fig. 4) . We reorder the graph by first splitting the steps into layers where each layer may refer to previous layers only, and then order within a layer lexicographically. Formally, let REF(z i ) ∈ [m] \ {i} be the references of step z i . We define the degree (layer) of z i by:

d(z i ) :=    0, if REF(s) = ∅ max j∈REF(z i ) d(z j ) + 1, otherwise

Since QDMR execution graph is a DAG, d(•) is well defined. Let

∆ d = {i | d(z i ) = d} for d ∈ [0..m]. rnk d(z i ) (z i ) is the alphabet rank of z i in ∆ d(z i ) , where z i textual representation is of the form o i [ρ i 1 , . . . , ρ i |ρ i | ](η i 1 = τ i 1 , . . . , η i |A i | = τ i |A i | )

. The properties ρ i 1 , . . . , ρ i |ρ i | are sorted, and so the arguments η i 1 = τ i 1 , . . . , η i |A i | = τ i |A i | first by the names η and second by the values τ . The textual representation of an argument value τ consists of alphabet ordered token, references first. Finally, the total rank of a step is given by d(z i ), rnk d(z i ) (z i ) , i.e primary order by degree and secondary order by in-layer rank.

A.3 Experiments Parameters

CopyNet-BERT The LSTM decoder has hidden size 768. We use a batch size of 32 and train for up to 25 epochs (∼35k steps) with beam search of size 5.

Biaffine Graph Parser The POS embeddings are of size 100. The four FFNs consist of 3-layers with hidden size 300 and use ELU activation function. We use dropout of rate 0.6 on the contextualized encodings, and of rate 0.3 on the FF representations. We use a batch size of 32 and train for up to 80 epochs (∼111k steps).

Latent RAT We stack 4 relation-aware selfattention layers on top of the contextualized encodings, each with 8 heads and dropout with rate 0.1. The FFNs for relation representation uses 3-layers with hidden size of 96, ReLU activation function and dropout rate of 0.1. We tie the layers, and multiply the graph loss by 100. The rest is identical to the CopyNet-BERT configuration.

Optimization We used the Adam optimizer (Kingma and Ba, 2015) with the hyperparameters. The learning rate changes during training according to slanted triangular schema, in which it linearly increases from 0 to lr for the first warmup steps = 0.06 • max steps, and afterwards linearly decreases back to 0. We use learning rate of 1 • 10 −3 , and a separate learning rate of 5 • 10 −5 for the BERTbased encoder.

A.4 Error Analysis Examples

Some examples for each error class from §6.3. The gold decompositions are given on left, and the predictions are on the right.

https://leaderboard.allenai.org/ break/submissions/public

https://developers.google.com/ optimization