## Abstract

QDMR is a meaning representation for complex questions, which decomposes questions into a sequence of atomic steps. While stateof-the-art QDMR parsers use the common sequence-to-sequence (seq2seq) approach, a QDMR structure fundamentally describes labeled relations between spans in the input question, and thus dependency-based approaches seem appropriate for this task. In this work, we present a QDMR parser that is based on dependency graphs (DGs), where nodes in the graph are words and edges describe logical relations that correspond to the different computation steps. We propose (a) a nonautoregressive graph parser, where all graph edges are computed simultaneously, and (b) a seq2seq parser that uses gold graph as auxiliary supervision. We find that a graph parser leads to a moderate reduction in performance (0.47→0.44), but to a 16x speed-up in inference time due to the non-autoregressive nature of the parser, and to improved sample complexity compared to a seq2seq model. Second, a seq2seq model trained with auxiliary graph supervision has better generalization to new domains compared to a seq2seq model, and also performs better on questions with long sequences of computation steps.

## 1 Introduction

Training neural networks to reason over multiple parts of their inputs across modalities such as text, tables, and images, has been a focal point of interest in recent years (Antol et al., 2015; Pasupat and Liang, 2015; Johnson et al., 2017; Suhr et al., 2019; Welbl et al., 2018; Talmor and Berant, 2018; Yang et al., 2018; Hudson and Manning, 2019; Dua et al., 2019; Chen et al., 2020; Hannan et al., 2020; Talmor et al., 2021) . The most common way to check whether a model is capable of complex reasoning, is to pose in natural language a complex question, which requires performing multiple steps of computation over the input.

To address the need for better understanding of complex questions, recently proposed QDMR, a meaning representation where complex questions are represented through a sequence of simpler atomic executable steps (see Fig. 1 ), and the final answer is the answer to the final step. QDMR has been shown to be useful for multi-hop question answering (QA) and also for improving interpretability in visual QA (Subramanian et al., 2020) .

State-of-the-art QDMR parsers use the typical sequence-to-sequence (seq2seq) approach. However, it is natural to think of QDMR as a dependency graph over the input question tokens. Consider the example in Fig. 1 . The first QDMR step selects the span "Indiana Jones". Then, the next step uses a PROJECT operation to find the "movies" of Indiana Jones, and then next step uses another PROJECT operation to find the date when the movies were "released". Such relations can naturally be represented as labeled edges over the relevant question tokens, as shown in Fig. 1 , bottom.

In this work, we propose to use the dependency graph view of QDMR to improve QDMR parsing. We describe a conversion procedure that automatically maps QDMR structures into dependency graphs, using a structured intermediate logical form representation (Fig 1, middle) . Once we have graph supervision for every example, we train a dependency graph parser, in the spirit of Dozat and Manning (2018) , where we predict a labeled relation for every pair of question tokens, representing the logical relation between the tokens. Unlike seq2seq models, this is a non-autoregressive parser, which decodes the entire output structure in a single step.

A second method to exploit the graph supervision is to train a seq2seq model, but have an auxiliary loss term where the graph is decoded from the encoder representations. Towards that end, we propose a Latent-RAT encoder, which uses relationaware transformer (Shaw et al., 2018) to explicitly represent the relation between every pair of input tokens. Relation-aware transformer has been shown to be useful for encoding graph structures in the context of semantic parsing .

Last, we propose a new evaluation metric, LF-EM, for QDMR parsing, which is based on the aforementioned intermediate logical form, and show it correlates better with human judgements compared to existing metrics.

We find that our graph parser leads to a small reduction in LF-EM compared to seq2seq models (0.47→0.44), but is 16x faster due to its nonautoregressive nature. Moreover, our graph parser has better sample complexity and outperforms the seq2seq model when trained on 10% of the data or less. When training a seq2seq model along with the auxiliary graph supervision, we find that the parser obtains similar performance when trained on the entire dataset (0.471 LF-EM), but substantially improves performance when generalizing to new domains. Moreover, The Latent-RAT parser performs better on examples with a large number of computation steps.

Our code is available at https://github.

## 2 Overview

The core of this work is to examine the utility of a dependency graph (DG) representation for QDMR. We propose conversion procedures that enable training and evaluating with DGs (see Fig. 2 ). First, we convert gold QDMR structures into logical forms (LF), where each computation step in QDMR is represented with a formal operator, properties and arguments ( §3.1). Then, we obtain gold DGs by projecting the logical forms onto the question tokens ( §4). Once we have question-DG pairs, we can train a graph parser. At test time, QDMRs and DGs are converted into LFs for evaluation. We propose a new evaluation metric over LFs ( §3.2), and show it is more robust to semantic-preserving changes compared to prior metrics. Our proposed parsers are in §5. On top of standard seq2seq models, we describe (a) a graph parser, and (b) a multi-task model, where the encoder of the seq2seq model is also trained to predict the DG.

## 3 Qdmr Logical Forms

QDMR is a text-based meaning representation focused on representing the meaning of complex questions. It is based on a decomposition of questions into a sequence of simpler executable steps (Fig. 1) , where each step corresponds to a SQL-inspired operator (Table 6 in Appendix §A.1). We briefly review QDMR and then define a logical form (LF) representation based on these operations. We use the LFs both for map-1. return census groups 2. return #1 that is Pacific islander 3. return #1 that is African American 4. return size of #2 5. return size of #3 6. return which is lowest of #4 , #5

Vq = {. . . group, groups, . . . small, smaller, smallest, . . . } V 5 ref = {#1, #2, #3, #4} Vconst = Vop ∪ Vstore ∪ Vaux Vop = {difference, sum, lowest, highest, for each, .

. . } Vstore = {population, size, elevation, flights, price, date . . . } Vaux = {a, is, are, of, that, the, with, was, did, to . . . } Figure 3 : QDMR annotation vocabularies. Each example is annotated with a lexicon that consists of: V q , the question tokens and their inflections; V i ref , references to previous steps; V const , constant terms including operational terms (V op ), domain-specific words that are not in the question, such as size (V store ); and auxiliary words like prepositions (V aux ). Boldface indicates words used in the QDMR structure.

ping QDMRs to DGs, and also to fairly evaluate the output of parsers that output either QDMRs directly or DGs.

## 3.1 Definitions

QDMR Definition Given a question with n tokens, q = q 1 , . . . , q n , its QDMR is a sequence of m steps s 1 , . . . , s m , where step s i conceptually maps to a single logical operator

o i ∈ O. A step, s i , is a sequence of n i tokens s i = s i 1 , . . . s i n i , where token s i

j is either a question token ∈ V q (or some inflection of it), a word from a constant predefined lexicon ∈ V const , or a reference token

∈ V i ref = {#1, . . . , #(i − 1)},

referring to a previous step. Fig. 3 shows an example for a question and its QDMR structure.

QDMR Logical Form (LF) Given a QDMR S = q; s 1 , . . . , s m , its logical form is a sequence of logical form steps Z = q; z 1 , . . . , z m . The LF step z i , corresponding to s i , is a triplet z i = o i , ρ i , A i where o i ∈ O is the logical operator; ρ i ∈ PROP(o i ) are operator-specific properties; and A i is a dictionary of arguments, mapping an operator-specific argument η ∈ ARG(o i ) to a span τ from the QDMR step s i . For convenience, we denote z i with the string Table 6 for full list).

o i [ρ i ](η i 1 = τ i 1 , . . . ).

pendix provides the full list. Imposing more structure on QDMR through the LF is useful (a) for evaluation ( §3.2), where LFs detect differenet QDMR phrasings that have identical meaning, and (b) for creating DGs, since the operators, properties and arguments, will be used to define the labels of edges in the DG (see Fig 1) .

QDMR→LF We convert QDMRs to LFs with a rule-based method, extending the procedure for detecting operators from to also find properties and arguments. To detect properties we use a lexicon, see details in Table 7 (Appendix §A.1), and in our public implementation.

## 3.2 Lf-Based Evaluation (Lf-Em)

The official evaluation metric for QDMR 1 is normalized EM (NormEM), where the predicted and gold QDMR structures are normalized using a rulebased procedure, and then exact string match is computed between the two normalized QDMRs. Since in this work we convert both QDMRs ( §3.1) and DGs ( §4) to LFs, we propose LF-EM, a LFbased evaluation metric, and show that it correlates better with notions of semantic equivalence. LF-EM essentially involves computing exact match between the predicted and gold LFs, using the LF described above. To further capture semantic equivalences, we perform a few more normalization steps. We briefly describe these normalizations, and give the full description in §A.2.

Given a logical form Z, we transform each step to a normalized form, and the final textual rep- resentation is given by representing each step as described in §3: OPERATOR[property](arg=. . . ; . . . ). We apply the following steps ( Fig. 4

## ):

Remove and normalize tokens Each LF step includes a list of tokens in its arguments. In this normalization step, we remove lexical items, such as "max", which are used to detect the operator and property (Table 7 in §A.1), as those are already represented outside the arguments. In addition, we remove words from a stop word list (V aux , see Fig. 3 ). Finally, we use a synonym list to represent words in such a list with a single representative (countries→country).

Merge Steps QDMR annotations sometime vary in their granularity. For example one example might contain 'return metal objects', while another might have 'return objects; return #1 that are metal'. This is especially common in FILTER and PROJECT steps. We merge chains of FILTER steps, as well as FILTER or PROJECT steps that follow a SELECT step. See details in §A.2.

Reorder steps QDMR describes a directed acyclic graph of computation steps, and there are multiply ways to order the steps (Fig. 4) . We recursively compute the layer of each step as layer(s) = max {layer(s ref 1 ), . . . } + 1, where the maximization is over all steps s refers to. We then re-order steps by layer and then lexicographically.

We manually evaluate the metrics normalized EM and LF-EM on 50 random development set ex-amples using predictions from the CopyNet-BERT model (see §6). We find that both (binary) metrics have perfect precision: they only assign credit when indeed the QDMR reflects the correct question decomposition, as judged by the authors. However, LF-EM covers more examples, where the LF-EM on this sample is 52.0, while normalized EM is 40.0. Thus, LF-EM provides a tighter lower bound on the performance of a QDMR parser and is a better metric for QDMR parsing.

## 4 From Lfs To Dependency Graphs

Given a QDMR decomposition S = q; s 1 , . . . , s m , we construct a dependency graph G = N , E , where the nodes N correspond to question tokens, and the edges E describe the logical operations, resulting in a graph with the same meaning as S.

The LF→DG procedure is shown in Fig. 5 and consists of the following steps:

• Token alignment: align each token in the question to a token in a QDMR step. • Spans Dependency Graph (SDG) extraction: construct a graph where each node corresponds to a list of tokens in a QDMR step, and edges describe the dependencies between the steps. • Dependency Graph (DG) extraction: convert the SDG to a DG over the question tokens.

Here, we add span edges for tokens that are in the same step, and deal with some representation issues (see §4.3). Because we convert predicted DGs to LFs for evaluation, the LF→DG conversion must be invertible. We now describe the details of the LF→DG conversion. Our conversion succeeds in 97.12% of the BREAK dataset .

## 4.1 Token Alignment

We denote the question tokens by q = q 1 . . . q n and the ith QDMR step tokens by ∀i ∈

[1..m], s i = s i 1 . . . s i n i . An alignment is defined by M = {(q i , s k j ) | q i ≈ s k j ; i ∈ [1..n], k ∈ [1..m], j ∈ [1..n k ]},

where by t ≈ t we mean t, t are either identical or equivalent. Roughly speaking, these equivalences are based on BREAK annotation lexicon ( Fig. 3 ) -in particular. the inflections of the question tokens V q (e.g , "object" and "objects"), and equivalence classes on top of the constant lexicon V const (e.g , "biggest" and "longest"). See Table 8 in Appendix §A.2 for more details. To find the best alignment M , we formulate an optimization problem in the form of an Integer Linear Program (ILP) and use a standard ILP solver. 2 The full details are given as a part of our open source implementation. The objective function uses several heuristics to assign a high score to an alignment the has the following properties ( Fig. 6 ):

• Minimalism: Aligning each question token to at most one QDMR step token and vice versa is preferable. • Exact Match: Aligning a question token to a QDMR token that is identical is preferable. • Sequential Preference: Aligning long sequences from the question to a single step is preferable (when a step has a reference token (#1), we take into account the tokens in the referenced step, see Fig. 6 , top right). • Steps Coverage: Covering more steps is preferable.

## 4.2 Spans Dependencies Extraction

Given the QDMR, LF, and alignment M , we construct the Span Dependency Graph (SDG). Each QDMR step is a node labeled by a list of tokens (spans). The list of tokens is computed with the alignment M , where given a QDMR step s k , the list contains all question tokens q i , such that (q i , s k j ) ∈ M , where s k j is a word in s k . The list is ordered according to the position in the question.

Edges in the SDG are computed using reference tokens. If step s i has a reference token to step s j , Figure 6 : Heuristics for token alignment. Potential tokens for alignment colored, where the preferable choice according to the heuristic is underlined. On the top left, the second occurrence of "water" is preferred in QDMR step #1 due to the adjacent word "buffalo". On the top right, the second occurrence of "name" is preferred in QDMR step #5, because this step refers to #4 that contains the word "drivers".

we add an edge (s i , s j ) (we abuse notation and refer to SDG nodes and QDMR steps with the same notation). Each edge has a tag, which is a triple consisting of the operator o i of the source node s i , the property ρ i of the source node, and the named argument η i ref that contains the reference token. For readability we denote the tag triplet

tag(i, j) = o i , ρ i , η i ref by o i -η i ref [ρ i ].

## 4.3 Sdg→Dg

We construct a DG by projecting the SDG on the question tokens. This is done by: (a) For each SDG node and its list of tokens, add edges between the tokens from left-to-right with a new span tag (black edges in Fig 5) ; (b) use the rightmost word in every span as its representative for the edges between different spans.

However, this transformation is non-trivial for two reasons. First, some SDG nodes do not align to any question token, Second, some question tokens align to multiple SDG nodes, which does not allow the DG to be converted back to an SDG unambiguously for evaluation. We now explain how we resolve such representation issues, mostly based on adding more tokens to the input question. Fig. 7 illustrates the different types of challenges and our proposed solution.

Domain-specific concepts QDMR annotators were allowed to use a small number of tokens that are pragmatically assumed to exist in the domain (V store in Fig. 3 ). For example, when annotating ATIS questions (Hemphill et al., 1990 ), the word "flight" is allowed to be used in the QDMR structure even if it does not appear in the question, since this is a flight-reservation domain. We concatenate all the words in V store to the end of each question after a special separator token, which allows token alignment ( §4.1) to map such QDMR steps to a question word (Fig. 7, top) .

Empty SDG nodes some steps only contains tokens that are not in the question (e.g, "Number of #2" in Fig. 7 bottom) , and thus their list of tokens in the SDG node is empty. In this case, we cannot ground the SDG node in the question. Therefore we add a constant number of dummy tokens, [DUM] , which are used to ground such SDG nodes.

Single tokens to multiple SDG nodes A single question token can be aligned to multiple SDG nodes. Recall the tokens of each SDG nodes are connected with a chain of span edges. This leads to cases where two chains that pass through the same question token cannot be distinguished when converting the DG back to an SDG for evaluation. We solve this by concatenating a constant number of special [DUP] tokens that conceptually duplicate another token by referring to it with a new duplicate tag. Now, each span chain uses a different copy of the shared token by referring to the [DUP] instead of the original one.

## 5 Models

Once we have methods to convert QDMRs to DGs and LFs, and DGs to LFs, we can evaluate the advantages and disadvantages of standard autoregressive decoders compared to graph-based parsers. We describe three models: (a) An autoregressive parser, (b) a graph parser, (c) an autoregressive parser that is trained jointly with a graph parser in a multi-task setup. For a fair comparison, all models have the same BERT-based encoder (Devlin et al., 2019) .

CopyNet-BERT (baseline) This autoregressive QDMR parser is based on the CopyNet baseline from Wolfson et al. (2020), except we replace the BiLSTM encoder with a transformer initialized with BERT. The model encodes the question q and then decodes the QDMR S step-by-step and tokenby-token.

The QDMR decoder is an LSTM (Hochreiter and Schmidhuber, 1997) augmented with a copy mechanism (Gu et al., 2016) , where at each time step the model either decodes a token from the vocabulary or a token from the input. Since the input is tokenized with word pieces, we average word pieces that belong to a single word to get word representations, which enables word copy-ing. Training is dones with standard maximum likelihood.

Biaffine Graph Parser (BiaffineGP) The biaffine graph parser takes as input the quesiton q augmented with the special tokens described in §4.3 and predicts the DG by classifying for every pair of tokens whether there is an edge between them and the label of the edge. The model is based on the biaffine dependency parser of Dozat and Manning (2018) , except here we predict a DAG and not a tree, so each node can have more that one outgoing edge. Let H = h 1 , . . . , h |H| be the sequence of representations output by the BERT encoder. The biaffine parser uses four 1-hidden layer feed-forward networks over each contextualized token representation h l :

h edge-head l = FF edge-head (h l ), h edge-dep l = FF edge-dep (h l ), h label-head l = FF label-head (h l ), h label-dep l = FF label-dep (h l ).

The probability of an edge from token i to token j is given by σ(h

edge-dep i W edge h edge-head j )

, where W edge is a parameter matrix. Similarly, the score of an edge labeled by the tag t from token i to token j is given by s

t ij = h label-dep i W t h label-head j

, where W t is the parameter matrix for this tag. We then compute a distribution over the set of tags T with softmax(s 1 ij , . . . , s

|T | ij )

. Training is done with maximum likelihood both on the edge probabilities and label probabilities. Inference is done by taking all edges with edge probability > 0.5 and then labeling those edges according to the most probable tag.

There is no guarantee that the biaffine parser will output a valid DG. For example, if an SDG node has an outgoing edge labeled with filter-sub and another labeled with project-sub, we cannot tell if the operator is FILTER or PROJECT. This makes parsing fail, which occurs in 1.83% of the cases. To create a SDG, we first use the span edges to contruct SDG nodes with lists of tokens, and then add edges between SDG nodes by projecting the edges between tokens to edges between the SDG nodes. To prevent cases where parsing fails, we can optionally apply an ILP that takes the predicted probabilities as input, and outputs a valid DG. The exact details are given in our open source implementation. Figure 8 : Latent-RAT architecture. The encoder hidden states are used to represent the relations between the question tokens, r ij . Then, these representations are used for both (1) direct prediction of the dependency (tag) between the tokens (graph-head); and (2) augment the encodings via RAT layers (seq2seq-head). The subnetworks for r K ij , r V i,j are symetric, and represented by the r ij .

Multi-task Latent-RAT Encoder In this model, our goal is to improve the sequence-to-sequence parser by providing more information to the encoder using the DG supervision. Our model will take the question q (with special tokens as before) as input, and predict both the graph G directly and the QDMR structure S with a decoder.

We would like the information on relations between tokens to be part of the transformer encoder, unlike the biaffine parser that uses separate feedforward networks for that, so that the decoder can take advantage of this information. To accomplish that, we use RAT transformer layers (Shaw et al., 2018; , which explicitly represent relations between tokens, and have been shown to be useful for encoding graphs over input tokens in the context of semantic parsing.

RAT layers inject information on the relation between tokens inside the transformer self-attention mechanism (Vaswani et al., 2017) . Specifically, the similarity score e ij computed using queries and keys is given by:

e ij ∝ h i W Q (h j W K + r K ij ) T ,

where W Q , W K are the query and key parameter matrices and the only change is the term r K ij , which represents the relation between the tokens i and j. Similarly, the relation between tokens is also considered when summing over self-attention values:

H j=1 α ij (x j W V + r V ij ),

where W V is the value parameter matrix, α ij is the attention distribution and the only change is the term r V ij . Unlike prior work where the terms r K ij , r V ij were learned parameters, here we want these vectors to (a) be a function of the contextualized representation and (b) be informative for classifying the dependency label in the gold graph. By learning latent representations from which the gold graph can be decoded, we will provide useful information for the sequence-to-sequence decoder. Specifically, given a RAT layer with representations h i , h j for tokens i and j, we represent relations and compute a loss in the following way (see Figure 8 ):

r K ij = F F K (h i − h j ), S K = R K W out + b K ∈ R n×n×|T | , Loss K = CE(S K ).

F F K is a 1-hidden layer feed-forward network, R K ∈ R (n×n)×d transformer is a concatenation of all r K ij for all pairs of tokens, W out ∈ R d transformer ×|T | is a projection matrix that provides a score for all possible labels (including the NONE label).

We compute an analogous loss Loss V for r V ij and the final graph loss is Loss K + Loss V over all RAT layers. To summarize, by performing multi-task training with this graph loss we push the transformer to learn representations r ij that are informative of the gold graph, and can then be used by the decoder to output better QDMR structures.

## 6 Experiments

We now describe our empirical evaluation of the models described above.

## 6.1 Experimental Setup

We build our models in AllenNLP (Gardner et al., 2018) , and use BERT-base (Devlin et al., 2019) to produce contextualized token representations in all models. We train with the Adam optimizer (Kingma and Ba, 2015). Our Latent-RAT model include 4 RAT layers, each with 8 heads. Full details on hyperparameters and training procedure in Appendix §A.3.

We examine the performance of our models in three setups:

• Standard: We use the official BREAK dataset.

• Sample Complexity (SC): We examine the performance of models with decreasing amounts of training data. The goal is to test which model has better sample complexity.

• Domain Generalization (DomGen): We train on 7 out of 8 sub-domains in BREAK and test on the remaining domain, for each target domain.

The goal is to test which model generalizes better to new domains.

As an evaluation metric, we use LF-EM and also the official BREAK metric, normalized EM, when reporting test results on BREAK. Table 2 compares the performance of the different models ( §5) to each other and to the top entries on the BREAK leaderboard. As expected, initializing CopyNet with BERT dramatically improves test performance (0.388→0.47). The Latent-RAT sequence-tosequence model achieves similar performance (0.471), and the biaffine graph parser, BiaffineGP, is slightly behind with an LF-EM of 0.44 (but has faster inference, as we show below). Adding an ILP layer on top of BiaffineGP to eliminate constraint violations in the output graph improves performance to 0.454.

## Standard Setup

While our proposed models do not significantly improve performance in the LF-EM setup, we will see next that they improve domain generalization and sample complexity. Moreover, since BiaffineGP is a non-autoregressive model that predicts all output edges simultaneously, it dramatically reduces inference time.

Last, the top entry on the BREAK leaderboard uses BART (Lewis et al., 2020) , a pre-trained seq2seq model (we use a pre-trained encoder only), which leads to a state-of-the-art LF-EM of 0.496. Table 3 shows LF-EM on each of BREAK's sub-domains when training on the entire dataset (top), when training on all domains but the target domain (middle), and the relative drop compared to the standard setup (bottom).

## Domain Generalization

The performance of BiaffineGP and Latent-RAT is higher compared to CopyNET+BERT in the DomGen setup. In particular, the performance of Latent-RAT is the best in 7 out of 8 sub-domains, and the performance of BiaffineGP is the best in the last domain. Moreover, Latent-RAT outperforms CopyNet+BERT in all sub-domains. We also observe that the performance drop is lower for BiaffineGP and Latent-RAT compared to Copy-Net+BERT. Overall, this shows that using graphs as a source of supervision leads to better domain generalization. Table 4 shows model performance as a function of the size of the training data. While the LF-EM of BiaffineGP is lower given the full training set (Table 2) , when the size of the training data is small it substantially outperforms other models, improving performance by 3-4 LF-EM points given 1%-10% of the data. With 20%-50% of the data Latent-RAT and Copy-Net+BERT have comparable performance.

## Sample Complexity

Inference time for the graph parser The graph parser, BiaffineGP, is a non-autoregressive model that predicts all output edges simultaneously, as opposed to a sequence-to-sequence model that decodes a single token at each step. We measure the average runtime of the forward pass for both BiaffineGP and CopyNet+BERT and find that Bi-affineGP has an average runtime of 0.08 seconds, compared to 1.306 seconds of CopyNet+BERT -a 16x speed-up.

## 6.3 Analysis

Model agreement Figure 9 shows model agreement between CopyNet+BERT, BiaffineGP, and Latent-RAT on the development set. Roughly 60% of the examples are predicted correctly by one of the models, indicating that enembling the three models could result in further performance improvement. The agreement of Latent-RAT with Copy-Net+BERT (5.5%) and BiaffineGP (4.27%) is greather than the overlap between CopyNet+BERT and BiaffineGP, perhaps since it is a hybrid of a seq2seq and graph parser. Moreover, in 3.83% of the examples, Latent-RAT is the only model with a correct prediction.