Break It Down: A Question Understanding Benchmark

Authors

  • Tomer Wolfson
  • Mor Geva
  • Ankit Gupta
  • Matt Gardner
  • Yoav Goldberg
  • Daniel Deutch
  • Jonathan Berant
  • TACL, 2020

Abstract

Understanding natural language questions entails the ability to break down a question into the requisite steps for computing its answer. In this work, we introduce a Question Decomposition Meaning Representation (QDMR) for questions. QDMR constitutes the ordered list of steps, expressed through natural language, that are necessary for answering a question. We develop a crowdsourcing pipeline, showing that quality QDMRs can be annotated at scale, and release the Break dataset, containing over 83K pairs of questions and their QDMRs. We demonstrate the utility of QDMR by showing that (a) it can be used to improve open-domain question answering on the HotpotQA dataset, (b) it can be deterministically converted to a pseudo-SQL formal language, which can alleviate annotation in semantic parsing applications. Last, we use Break to train a sequence-to-sequence model with copying that parses questions into QDMR structures, and show that it substantially outperforms several natural baselines.

1 Introduction

Recently, increasing work has been devoted to models that can reason and integrate information from multiple parts of an input. This includes reasoning over images (Antol et al., 2015; Johnson et al., 2017; Suhr et al., 2019; Hudson and Manning, 2019), paragraphs (Dua et al., 2019), documents (Welbl et al., 2018; Talmor and Berant, 2018), tables (Pasupat and Liang, 2015), and more. Question answering (QA) is commonly used to test the ability to reason, where a complex natural language question is posed, and is to be answered given a particular context (text, image, etc.). Although questions often share structure across tasks and modalities, understanding the language of complex questions has thus far been dealt with within each task in isolation. Consider the questions in Figure 1, all of which express operations such as fact chaining and counting. Additionally, humans can take a complex question and break it down into a sequence of simpler questions even when they are unaware of what or where the answer is. This ability, to compose and decompose questions, lies at the heart of human language (Pelletier, 1994) and allows us to tackle previously unseen problems. Thus, better question understanding models should improve performance and generalization in tasks that require multi-step reasoning or that do not have access to substantial amounts of data.

Figure 1: Questions over different sources share a similar compositional structure. Natural language questions from multiple sources (top) are annotated with the QDMR formalism (middle) and deterministically mapped into a pseudo-formal language (bottom).

In this work, we propose question understanding as a standalone language understanding task. We introduce a formalism for representing the meaning of questions that relies on question decomposition, and is agnostic to the information source. Our formalism, Question Decomposition Meaning Representation (QDMR), is inspired by database query languages (SQL; SPARQL), and by semantic parsing (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Clarke et al., 2010), in which questions are given full meaning representations.

We express complex questions via simple ("atomic") questions that can be executed in sequence to answer the original question. Each atomic question can be mapped into a small set of formal operations, where each operation either selects a set of entities, retrieves information about their attributes, or aggregates information over entities. While this has been formalized in knowledge-base (KB) query languages (Chamberlin and Boyce, 1974) , the same intuition can be applied to other modalities, such as images and text. QDMR abstracts away the context needed to answer the question, allowing in principle to query multiple sources for the same question.

In contrast to semantic parsing, QDMR operations are expressed through natural language, facilitating annotation at scale by non-experts. QDMR serves as the formalism for creating BREAK, a question decomposition dataset of 83,978 questions over ten datasets and three modalities. BREAK is collected via crowdsourcing, with a user interface that allows us to train crowd workers to produce quality decompositions ( §3). Validating the quality of annotated structures reveals 97.4% to be correct ( §4).

We demonstrate the utility of QDMR in two setups. First, we consider open-domain QA over multi-hop questions from the HOTPOTQA dataset. Combining QDMR structures in BREAK with an RC model (Min et al., 2019b) improves F1 from 43.3 to 52.4 (§5). Second, we show that decompositions in BREAK possess high annotation consistency, which indicates that annotators produce high-quality QDMRs (§4.3). In §6 we discuss how these QDMRs can be used as a strong proxy for full logical forms in semantic parsing.

We use BREAK to train a neural QDMR parser that maps questions into QDMR representations, based on a sequence-to-sequence model with copying (Gu et al., 2016) . Manual analysis of generated structures reveals an accuracy of 54%, showing that automatic QDMR parsing is possible, though still far from human performance ( §7).

To conclude, our contributions are:

• Proposing the task of question understanding and introducing the QDMR formalism for representing the meaning of questions (§2)
• The BREAK dataset, which consists of 83,978 examples sampled from 10 datasets over three distinct information sources (§3)
• Showing how QDMR can be used to improve open-domain question answering (§5), as well as alleviate the burden of annotating logical forms in semantic parsing (§6)
• A QDMR parser based on a sequence-to-sequence model with a copying mechanism (§7)

The BREAK dataset, models and entire codebase are publicly available at: https://allenai.github.io/Break/.

2 Question Decomposition Formalism

In this section we define the QDMR formalism for domain agnostic question decomposition.

QDMR is primarily inspired by SQL (Codd, 1970; Chamberlin and Boyce, 1974). However, while SQL was designed for relational databases, QDMR also aims to capture the meaning of questions over unstructured sources such as text and images. Thus, our formalism abstracts away from SQL by assuming an underlying "idealized" KB, which contains all entities and relations expressed in the question. This abstraction keeps QDMR unrestricted to a particular modality: its operators can be executed against text and images as well, while allowing, in principle, multiple modalities to be queried for the same question.

Table 1: The 13 operator types of QDMR steps. Listed are the natural language template used to express the operator, the operator signature, and an example question that uses the query operator in its decomposition.

QDMR Definition Given a question $x$, its QDMR is a sequence of $n$ steps, $s = \langle s_1, \ldots, s_n \rangle$, where each step $s_i$ corresponds to a single query operator $f_i$ (see Table 1). A step $s_i$ is a sequence of tokens, $s_i = (s_i^1, \ldots, s_i^{m_i})$,

where a token $s_i^k$ is either a word from a predefined lexicon $L_x$ (details in §3) or a reference token, referring to the result of a previous step $s_j$, where $j < i$. The last step, $s_n$, returns the answer to $x$.

Decomposition Graph QDMR structures can be represented as a directed acyclic graph (DAG), used for evaluating QDMR parsing models (§7.1). Given a QDMR $s = \langle s_1, \ldots, s_n \rangle$, each step $s_i$ is a node in the graph, labeled by its sequence of tokens and index $i$. Edges in the graph are induced by reference tokens to previous steps. Node $s_i$ is connected by an incoming edge $(s_j, s_i)$ if $\mathrm{ref}[s_j] \in (s_i^1, \ldots, s_i^{m_i})$, that is, if one of the tokens in $s_i$ is a reference to $s_j$. Figure 2 displays a sequence of QDMR steps, represented as a DAG.
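As a concrete illustration (a minimal sketch, not the paper's code), the decomposition graph can be built directly from the "#k" reference tokens appearing in the steps; the example below uses the QDMR of Figure 2.

```python
import re
from typing import Dict, List

REF = re.compile(r"#(\d+)")

def qdmr_to_dag(steps: List[str]) -> Dict[int, List[int]]:
    """Build a decomposition DAG: edges[j] lists the steps that consume step j's result.

    Steps are 1-indexed, as in QDMR reference tokens ("#1", "#2", ...).
    """
    edges = {i: [] for i in range(1, len(steps) + 1)}
    for i, step in enumerate(steps, start=1):
        for match in REF.finditer(step):
            j = int(match.group(1))
            assert j < i, "a step may only reference earlier steps"
            edges[j].append(i)  # the result of step j flows into step i
    return edges

# The QDMR of Figure 2 ("Return the keywords which have been contained
# by more than 100 ACL papers."):
steps = [
    "papers",
    "#1 in ACL",
    "keywords of #2",
    "number of #2 for each #3",
    "#3 where #4 is more than 100",
]
print(qdmr_to_dag(steps))
# {1: [2], 2: [3, 4], 3: [4, 5], 4: [5], 5: []}
```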

Figure 2: QDMR of the question "Return the keywords which have been contained by more than 100 ACL papers.", represented as a decomposition graph. Its steps are: #1 "papers" (SELECT); #2 "#1 in ACL" (FILTER); #3 "keywords of #2" (PROJECT); #4 "number of #2 for each #3" (GROUP); #5 "#3 where #4 is more than 100" (COMPARATIVE).

QDMR Operators A QDMR step corresponds to one of 13 query operators. We designed the operators to be expressive enough to represent the meaning of questions from a diverse set of datasets (§3). QDMR assumes an underlying KB, $K$, which contains all of the entities and relations expressed in its steps. A relation, $r$, is a function mapping two arguments to a truth value; we write $r(e, o)_K \equiv \mathrm{true}$ when the relation holds between $e$ and $o$ in $K$. Operators take as arguments sets of entities ($S_e$), sets of objects ($S_o$), sets of numbers ($S_n$), and natural language phrases $w$, representing entities and relations in $K$. We assume the existence of grounding functions that map a phrase $w$ to concrete constants in $K$. Table 2 describes the aforementioned constructs.

Table 2: Functions used for grounding natural language phrases in numerical operators or KB entities.
• agg: given a phrase $w_{agg}$ describing an aggregate operation, denotes the corresponding operation: max, min, count, sum, or avg.
• sup: given $w_{sup}$ describing a superlative, denotes the corresponding function: arg max or arg min.
• com: given $w_{com}$ describing a comparison, denotes the corresponding relation out of $<, \leq, >, \geq, =, \neq$.
• ari: given $w_{ari}$ describing an arithmetic operation, denotes the corresponding operation out of $+, -, *, /$.
• $\mathrm{ground}_K^e(w)$: given a natural language phrase $w$, returns the set of corresponding KB entities, $S_e$.
• $\mathrm{ground}_K^r(w)$: given a natural language phrase $w$, returns the corresponding KB relation, $r$.

In addition, we define the function $\mathrm{map}_K(S_e, S_o)$, which maps each entity $e \in S_e$ to the set of corresponding objects from $S_o$. Each $o \in S_o$ corresponds to an $e \in S_e$ by being contained in the result of a sequence of PROJECT and GROUP operations applied to $e$:

$\mathrm{map}_K(S_e, S_o) = \{\langle e, o\rangle \mid e \in S_e,\ o \in S_o,\ o \in op_k \circ \cdots \circ op_1(e)\}.$

We now formally define each QDMR operator and provide concrete examples in Table 1; a toy execution sketch over a handful of operators follows after the list.

• SELECT: Computes the set of entities in $K$ corresponding to $w$: $\mathrm{select}(w) = \mathrm{ground}_K^e(w)$.
• FILTER: Filters a set of objects so that it follows the condition expressed by $w$:

$\mathrm{filter}(S_o, w) = S_o \cap \{o \mid r(e, o)_K \equiv \mathrm{true}\},$

where $r = \mathrm{ground}_K^r(w)$ and $e = \mathrm{ground}_K^e(w)$.
• PROJECT: Computes the objects that relate to the input entities $S_e$ with the relation expressed by $w$: $\mathrm{proj}(w, S_e) = \{o \mid r(e, o)_K \equiv \mathrm{true},\ e \in S_e\},$

where $r = \mathrm{ground}_K^r(w)$.
• AGGREGATE: The result of applying an aggregate operation:

$\mathrm{aggregate}(w_{agg}, S_o) = \{agg(S_o)\}.$

• GROUP: Receives a set of "keys", $S_e$, and a set of corresponding "values", $S_o$. It outputs a set of numbers, each corresponding to a key $e \in S_e$. Each number results from applying the aggregate operation $w_{agg}$ to the subset of values corresponding to $e$:

$\mathrm{group}(w_{agg}, S_o, S_e) = \{agg(V_o(e)) \mid e \in S_e\},$

where $V_o(e) = \{o \mid \langle e, o\rangle \in \mathrm{map}_K(S_e, S_o)\}$.
• SUPERLATIVE: Receives an entity set $S_e$ and a number set $S_n$. Each number $n \in S_n$ is the result of a mapping from an entity $e \in S_e$. It returns the subset of $S_e$ for which the corresponding number is highest or lowest, as indicated by $w_{sup}$:

$\mathrm{super}(S_e, S_n, w_{sup}) = \{sup(\mathrm{map}_K(S_e, S_n))\}.$

• COMPARATIVE: Receives an entity set $S_e$, a number set $S_n$, and a number $n'$. Each $n \in S_n$ is the result of a mapping from an $e \in S_e$. It returns the subset of $S_e$ for which the comparison with $n'$, represented by $w_{com}$, holds:

$\mathrm{comparative}(S_e, S_n, w_{com}, n') = \{e \mid \langle e, n\rangle \in \mathrm{map}_K(S_e, S_n),\ com(n, n') \equiv \mathrm{true}\}.$

• UNION: Denotes the union of object sets:

$\mathrm{union}(S_o^1, S_o^2) = S_o^1 \cup S_o^2.$

• DISCARD: Denotes the set difference of two object sets:

$\mathrm{discard}(S_o^1, S_o^2) = S_o^1 \setminus S_o^2.$

• INTERSECTION: Computes the intersection of its entity sets and returns all objects that relate to the entities in the intersection with the relation expressed by $w$:

$\mathrm{intersect}(w, S_e^1, S_e^2) = \{o \mid e \in S_e^1 \cap S_e^2,\ r(e, o)_K \equiv \mathrm{true},\ r = \mathrm{ground}_K^r(w)\}.$

• SORT: Orders a set of entities according to a corresponding set of numbers. Each number $n_i$ is the result of a mapping from entity $e_i$:

$\mathrm{sort}(S_e, S_n) = \{\langle e_{i_1}, \ldots, e_{i_m}\rangle \mid \langle e_{i_j}, n_{i_j}\rangle \in \mathrm{map}_K(S_e, S_n),\ n_{i_1} \leq \cdots \leq n_{i_m}\}.$

• BOOLEAN: Returns whether the relation expressed by $w$ holds between the input objects:

$\mathrm{boolean}(S_o^1, w, S_o^2) = \{r(o_1, o_2)_K\},$ where $r = \mathrm{ground}_K^r(w)$ and $S_o^1, S_o^2$ are singleton sets containing $o_1, o_2$, respectively.

• ARITHMETIC: Computes the application of an arithmetic operation:

$\mathrm{arith}(w_{ari}, S_n^1, S_n^2) = \{ari(n_1, n_2)\},$ where $S_n^1, S_n^2$ are singleton sets containing $n_1, n_2$, respectively.
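To show how these operators compose, the following is a minimal sketch (not from the paper's codebase) of executing a few of them against a toy "idealized" KB. The KB contents, the encoding of relations as (entity, object) pairs, and the direct function calls (rather than parsing step templates) are simplifications invented for this example; it runs the QDMR of Figure 2, with the comparative threshold lowered to 1 so the toy KB yields a non-empty answer.

```python
from collections import defaultdict

# A toy "idealized" KB: each relation is a set of (entity, object) pairs.
KB = {
    "in": {("p1", "ACL"), ("p2", "ACL"), ("p3", "EMNLP")},
    "keywords of": {("p1", "parsing"), ("p2", "parsing"), ("p2", "QA")},
}
ENTITIES = {"papers": {"p1", "p2", "p3"}}

def select(w):
    """SELECT: ground a phrase to the set of KB entities it denotes."""
    return set(ENTITIES[w])

def filter_(objects, rel, value):
    """FILTER: keep the objects o for which rel(o, value) holds in the KB."""
    return {o for o in objects if (o, value) in KB[rel]}

def project(rel, entities):
    """PROJECT: the objects related to the input entities through rel."""
    return {o for (e, o) in KB[rel] if e in entities}

def group_count(rel, values, keys):
    """GROUP(count): for each key, count the values mapped to it through rel."""
    counts = defaultdict(int)
    for (v, k) in KB[rel]:
        if v in values and k in keys:
            counts[k] += 1
    return dict(counts)

def comparative_gt(keys, numbers, threshold):
    """COMPARATIVE(>): keys whose corresponding number exceeds the threshold."""
    return {k for k in keys if numbers.get(k, 0) > threshold}

# Executing the QDMR of Figure 2 step-by-step:
s1 = select("papers")                    # {'p1', 'p2', 'p3'}
s2 = filter_(s1, "in", "ACL")            # {'p1', 'p2'}
s3 = project("keywords of", s2)          # {'parsing', 'QA'}
s4 = group_count("keywords of", s2, s3)  # {'parsing': 2, 'QA': 1}
s5 = comparative_gt(s3, s4, 1)           # {'parsing'}
print(s5)
```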

Question: "The actress that played Pearl Gallagher on the TV sitcom Different Strokes is also the voice of an animated character that debuted in what year?" High-level QDMR:

1. Return the actress that played Pearl Gallagher on the TV sitcom Different Strokes
2. Return the animated character that #1 is the voice of
3. Return the year that #2 debuted in

Figure 3: Example of a high-level QDMR. Step #1 merges together SELECT and multiple FILTER steps.

High-level Decompositions In QDMR, each step corresponds to a single logical operator. In certain contexts, a less granular decomposition might be desirable, where sub-structures containing multiple operators could be collapsed into a single node. This can be easily achieved in QDMR by merging certain adjacent nodes in its DAG structure. When examining existing RC datasets (Dua et al., 2019), we observed that long spans in the question often match long spans in the text, due to existing practices of generating questions via crowdsourcing. In such cases, decomposing the long spans into multiple steps and having an RC model process each step independently increases the probability of error. Thus, to promote the usefulness of QDMR for current RC datasets and models, we introduce high-level QDMR by merging the following operators (a small merging sketch follows after the list):

• SELECT + PROJECT on named entities: For the question, "What is the birthdate of Jane?" its high-level QDMR would be "return the birthdate of Jane" as opposed to the more granular, "return Jane; return birthdate of #1".

• SELECT + FILTER: Consider the first step of the example in Figure 3 . It contains both a SELECT operator ("return actress") as well as two FILTER conditions ("that played...", "on the TV sitcom...").

• FILTER + GROUP + COMPARATIVE: Certain high-level FILTER steps contain implicit grouping and comparison operations. E.g., "return yard line scores in the fourth quarter; return #1 that both teams scored from".

Step #2 contains an implicit GROUP of team per yard line and a COMPARATIVE returning the lines where exactly two teams scored.
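As an illustration of the SELECT + FILTER merge (a simplification invented for this example, not the paper's merging procedure), the sketch below collapses a leading chain of FILTER steps into a single high-level step and renumbers the remaining references; references to any of the merged steps are all remapped to the merged step, which is itself a simplification. It reproduces the high-level form of Figure 3 from a granular decomposition.

```python
import re
from typing import List

def collapse_select_filter_chain(steps: List[str]) -> List[str]:
    """Collapse a leading SELECT + FILTER chain into a single high-level step.

    Handles the common pattern where step 1 selects a set and each following
    step has the form '#<previous step> <condition>'; the chain is merged into
    one step and the remaining references are renumbered accordingly.
    """
    merged = steps[0]
    k = 1  # number of leading steps merged so far
    while k < len(steps):
        m = re.fullmatch(rf"#{k}\s+(.*)", steps[k])
        if m is None:
            break
        merged = f"{merged} {m.group(1)}"
        k += 1
    if k == 1:
        return list(steps)
    rest = []
    for s in steps[k:]:
        # every reference #j shifts down by (k - 1); #1..#k all map to #1
        s = re.sub(r"#(\d+)", lambda r: f"#{max(1, int(r.group(1)) - (k - 1))}", s)
        rest.append(s)
    return [merged] + rest

print(collapse_select_filter_chain(
    ["actress", "#1 that played Pearl Gallagher",
     "#2 on the TV sitcom Different Strokes",
     "the animated character that #3 is the voice of",
     "the year that #4 debuted in"]))
# ['actress that played Pearl Gallagher on the TV sitcom Different Strokes',
#  'the animated character that #1 is the voice of',
#  'the year that #2 debuted in']
```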

We provide both granular and high-level QDMRs for a random subset of RC questions (see Table 3 ). The concrete utility of high-level QDMR to open-domain QA is presented in §5.

Table 3: The QA datasets in BREAK. Lists the number of examples in the original dataset and in BREAK. Numbers of high-level QDMRs are denoted by high.

3 Data Collection

Our annotation pipeline for generating BREAK consisted of three phases. First, we collected complex questions from existing QA benchmarks. Second, we crowdsourced the QDMR annotation of these questions. Finally, we validated worker annotations in order to maintain their quality.

Question Collection Questions in BREAK were randomly sampled from ten QA datasets over the following tasks (Table 3) :

• Semantic Parsing: Mapping natural language utterances into formal queries, to be executed on a target KB (Price, 1990; Zelle and Mooney, 1996; Yu et al., 2018).
• Reading Comprehension (RC): Questions that require understanding of a text passage by reasoning over multiple sentences (Talmor and Berant, 2018; Dua et al., 2019; Abujabal et al., 2019).
• Visual Question Answering (VQA): Questions over images that require both visual and numerical reasoning skills (Johnson et al., 2017; Suhr et al., 2019).

All questions collected were composed by human annotators. HOTPOTQA questions were all sampled from the hard split of the dataset.

QDMR Annotation A key question is whether it is possible to train non-expert annotators to produce high-quality QDMRs. We designed an annotation interface (Figure 4), where workers are first given explanations and examples on how to identify and phrase each of the operators in Table 1. Then, workers decompose questions into a list of steps, where they are only allowed to use words from a lexicon $L_x$, which contains: (a) words appearing in the question (or their automatically computed inflections), (b) words from a small predefined list of 66 function words, such as 'if', 'on', and 'for each', or (c) reference tokens that refer to the results of a previous step. This ensures that the language used by workers is consistent across examples, while being expressive enough for the decomposition. Our annotation interface presents workers with the question only, so they are agnostic to the original modality of the question. The efficacy of this process is explored in §4.2.
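The following sketch shows one way such a per-question lexicon could be assembled; the function-word list below is a small invented subset (the actual list has 66 entries), and the automatically computed inflections are omitted.

```python
import re
from typing import Set

# Illustrative subset of the predefined function words (the full list has 66).
FUNCTION_WORDS = {
    "if", "on", "for", "each", "of", "in", "is", "are", "where", "that",
    "number", "highest", "lowest", "more", "less", "than", "both", "and", "or",
}

def question_lexicon(question: str, max_steps: int = 20) -> Set[str]:
    """Closed lexicon L_x for decomposing `question`:
    (a) tokens of the question (inflections omitted in this sketch),
    (b) a predefined list of function words,
    (c) reference tokens #1..#max_steps pointing at previous steps.
    """
    question_tokens = set(re.findall(r"\w+", question.lower()))
    references = {f"#{i}" for i in range(1, max_steps + 1)}
    return question_tokens | FUNCTION_WORDS | references

def is_valid_step(step: str, lexicon: Set[str]) -> bool:
    """Check that every token of an annotated step is drawn from the lexicon."""
    tokens = re.findall(r"#\d+|\w+", step.lower())
    return all(t in lexicon for t in tokens)

lex = question_lexicon(
    "Return the keywords which have been contained by more than 100 ACL papers")
print(is_valid_step("number of #2 for each #3", lex))      # True
print(is_valid_step("papers published before 2010", lex))  # False: tokens outside L_x
```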

Figure 4: User interface for decomposing a complex question that uses a closed lexicon of tokens.

We used Amazon Mechanical Turk to crowdsource QDMR annotation. In each task, workers decomposed a single question and were paid $0.40, which amounts to an average pay of $12 per hour. Overall, we collected 83,978 examples using 64 distinct workers. The dataset was partitioned into train/development/test sets following the partitions of the original datasets. During partitioning, we made sure that development and test samples do not share the same context.

Worker Validation

To ensure worker quality, we initially published qualification tasks, open to all workers from the United States. The task required workers to carefully review the annotation instructions and decompose 10 example questions. The examples were selected so that each QDMR operation appears in at least one of their decompositions (Table 1). In total, 64 workers were able to correctly decompose at least 8 examples and were qualified as annotators. To validate worker performance over time, we conducted random validations of annotations. Over 9K annotations were reviewed by experts throughout the annotation process. Only workers that consistently produced correct QDMRs for at least 90% of their tasks were allowed to continue as annotators.

4 Dataset Analysis

This section examines the properties of collected QDMRs in BREAK and analyzes their quality.

4.1 Quantitative Analysis

Overall, BREAK contains 83,978 decompositions, including 60,150 QDMRs and 23,828 examples with high-level QDMRs, which are exclusive to text modalities. Table 3 shows that the data is proportionately distributed between questions over structured (DB) and unstructured (text, image) modalities.

The distribution of QDMR operators is presented in Table 4, detailing the prevalence of each query operator (we automatically compute this distribution, as explained in §4.3). SELECT and PROJECT are the most common operators. Additionally, at least 10% of QDMRs contain operators such as GROUP and COMPARATIVE, which entail complex reasoning, in contrast to high-level QDMRs, where such operations are rare. This distinction sheds light on the reasoning types required for answering RC datasets (high-level QDMR) compared to more structured tasks (QDMR). Table 5 details the distribution of QDMR sequence length. Most decompositions in QDMR include 3-6 steps, while high-level QDMRs are much shorter, as a single SELECT often finds an entity described by a long noun phrase (see §2).

Table 4: Operator prevalence in BREAK. Lists the percentage of QDMRs where the operator appears.
Table 5: The distribution over QDMR sequence length.

4.2 Quality Analysis

We describe the process of estimating the correctness of the collected QDMR annotations. Similar to previous work (Yu et al., 2018; Kwiatkowski et al., 2019), we use expert judgements, where the experts had prepared the guidelines for the annotation task. Given a question and its annotated QDMR, $(q, s)$, the expert determines the correctness of $s$ using one of the following categories:

• Correct (C): If s constitutes a list of QDMR operations that lead to correctly answering q.

Question 1: "What color is the biggest sphere in the picture?" QDMR:

(1) Return spheres;

(2) Return #1 that is the biggest;

(3) Return color of #2. Expert Judgement: C. Correct, but not fully decomposed.

Step #2 should be broken down to: (2) Return size of #1; (3) Return #1 where #3 is highest. Question 2: "What is the full name of each car maker, along with its id and how many models it produces?" QDMR:

(1) Return car makers;

(2) Return models of #1;

(3) Return number of #2 for each #1; (4) Return full names of #1;

(5) Return ids of #1; (6) Return #4 , #5 , #3. Expert Judgement: C G . Correct and fully decomposed. Question 3: "Show the names and locations of institutions that are founded after 1990 and have the type Private." QDMR:

(1) Return institutions;

(2) Return #1 founded after 1990;

(3) Return types of #1; (4) Return #1 where #3 is Private;

(5) Return #2 , #4; (6) Return names of #5; (7) Return locations of #5; (8) return #6 , #7. Expert Judgement: I. Incorrect, as step #5 returns institutions that were either founded after 1990, or are Private. • Granular (C G ): If s is correct and none of its operators can be further decomposed. 5 • Incorrect (I): If s is in neither C nor C G .

Examples of these expert judgements are shown in Figure 5. To estimate expert judgement of correctness, we manually reviewed a random sample of 500 QDMRs from BREAK. We classified 93.8% of the samples as C_G and another 3.6% as C. Thus, 97.4% of the samples constitute a correct decomposition of the original question. Workers somewhat struggled with decomposing superlatives (e.g., "biggest sphere"), as evident from the first question in Figure 5. Collected QDMRs displayed similar estimates of C, C_G, and I, regardless of their modality (DB, text, or image).

Figure 5: Examples and justifications of expert judgement on collected QDMRs in BREAK.

Question 1: "What color is the biggest sphere in the picture?"
QDMR: (1) Return spheres; (2) Return #1 that is the biggest; (3) Return color of #2.
Expert Judgement: C. Correct, but not fully decomposed. Step #2 should be broken down to: (2) Return size of #1; (3) Return #1 where #3 is highest.

Question 2: "What is the full name of each car maker, along with its id and how many models it produces?"
QDMR: (1) Return car makers; (2) Return models of #1; (3) Return number of #2 for each #1; (4) Return full names of #1; (5) Return ids of #1; (6) Return #4, #5, #3.
Expert Judgement: C_G. Correct and fully decomposed.

Question 3: "Show the names and locations of institutions that are founded after 1990 and have the type Private."
QDMR: (1) Return institutions; (2) Return #1 founded after 1990; (3) Return types of #1; (4) Return #1 where #3 is Private; (5) Return #2, #4; (6) Return names of #5; (7) Return locations of #5; (8) Return #6, #7.
Expert Judgement: I. Incorrect, as step #5 returns institutions that were either founded after 1990, or are Private.

4.3 Annotation Consistency

As QDMR is expressed using natural language, it introduces variability into its annotations. We wish to validate the consistency of collected QDMRs, i.e., whether we can correctly infer the formal QDMR operator ($f_i$) and its arguments from each step ($s_i$). To infer these formal representations, we developed an algorithm that goes over the QDMR structure step-by-step and, for each step $s_i$, uses a set of predefined templates to identify $f_i$ and its arguments, as expressed in $s_i$. This results in an execution graph (Figure 2), where the execution result of a parent node serves as input to its child. Figure 1 presents the pseudo-formal language produced by our algorithm (lower box). Each node lists its operator (e.g., GROUP), its constant input, listed in brackets (e.g., count), and its dynamic input, which is the execution results of its parent nodes. Overall, 99.5% of QDMRs had all their steps mapped into pseudo-logical forms by our algorithm.

To evaluate the correctness of the mapping algorithm, we randomly sampled 350 logical forms and examined the structure of the formulas, assuming that words copied from the question correspond to entities and relations in an idealized KB (see §2). Of this sample, 99.4% of the examples had all of their steps, $s_i$, correctly mapped to the corresponding $f_i$. Overall, 93.1% of the examples were fully accurate logical forms, with errors being due to QDMRs that were either incorrect or not fully decomposed (I, C in §4.2). Thus, a rule-based algorithm can map more than 93% of the annotations into a correct formal representation. This shows our annotators produced consistent and high-quality QDMRs. Moreover, it suggests that non-experts can annotate questions with pseudo-logical forms, which can be used as a cheap intermediate representation for semantic parsers (Yih et al., 2016), as further discussed in §6.
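For illustration, a highly simplified sketch of such template matching is shown below; the patterns are invented stand-ins for the actual templates and cover only a handful of cases.

```python
import re

# A few illustrative step templates (the real algorithm uses a larger set).
TEMPLATES = [
    ("COMPARATIVE", re.compile(r"^#\d+ .*\b(more|less|higher|lower) than\b")),
    ("GROUP",       re.compile(r"\bfor each\b")),
    ("AGGREGATE",   re.compile(r"^(number|sum|average|highest|lowest) of #\d+$")),
    ("SUPERLATIVE", re.compile(r"^#\d+ where #\d+ is (highest|lowest)$")),
    ("FILTER",      re.compile(r"^#\d+ ")),
    ("PROJECT",     re.compile(r"\bof #\d+$")),
]

def step_operator(step: str) -> str:
    """Return the first matching operator template, defaulting to SELECT."""
    for op, pattern in TEMPLATES:
        if pattern.search(step):
            return op
    return "SELECT"

steps = ["papers", "#1 in ACL", "keywords of #2",
         "number of #2 for each #3", "#3 where #4 is more than 100"]
print([step_operator(s) for s in steps])
# ['SELECT', 'FILTER', 'PROJECT', 'GROUP', 'COMPARATIVE']
```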

5 QDMR for Open-Domain QA

A natural setup for QDMR is in answering complex questions that require multiple reasoning steps. We compare models that exploit question decompositions to baselines that do not. We use the open-domain QA ("full-wiki") setting of the HOTPOTQA dataset: given a question, the QA model retrieves the relevant Wikipedia paragraphs and answers the question using these paragraphs.

5.1 Experimental Setup

We compare BREAKRC, a model that utilizes question decomposition, to BERTQA, a standard QA model based on BERT, and present COMBINED, an approach that enjoys the benefits of both models.

BREAKRC Algorithm 1 describes the BREAKRC model which uses high-level QDMR structures for answering open-domain multi-hop questions. We assume access to an IR model and an RC model, and denote by ANSWER(•) a function that takes a question as input, runs the IR model to obtain paragraphs, and then feeds those paragraphs as context for an RC model that returns a distribution over answers.

Given an input QDMR, $s = \langle s_1, \ldots, s_n\rangle$, BREAKRC iterates over $s$ step-by-step and performs the following. First, we extract the operation (line 4) and the previous steps referenced by $s_i$ (line 5). Then, we compute the answer to $s_i$ conditioned on the extracted operator. For SELECT steps, we simply run the ANSWER(•) function. For PROJECT steps, we substitute the reference to the previous step in $s_i$ with its already computed answer, and then run ANSWER(•). For FILTER steps, we use a simple rule to extract a "normalized question", $\hat{s}_i$, from $s_i$ and get an intermediate answer ans_tmp with ANSWER($\hat{s}_i$). We then "intersect" ans_tmp with the referenced answer by multiplying the probabilities provided by the RC model and normalizing. (INTERSECTION steps are handled in a manner similar to FILTER; we omit the exact description for brevity.) For COMPARISON steps, we compare, with a discrete operation, the numbers returned by the referenced steps. The final answer is the highest-probability answer of step $s_n$.
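Below is a minimal Python sketch of this control flow (not the authors' implementation). `op_type` and `answer` are assumed callables standing in for the operator classifier and the IR + RC pipeline described above; only SELECT, PROJECT, and FILTER steps are handled, and the usage example at the end uses invented stub answer distributions.

```python
import re
from typing import Callable, Dict, List

AnswerDist = Dict[str, float]  # answer string -> probability

def break_rc(steps: List[str],
             op_type: Callable[[str], str],
             answer: Callable[[str], AnswerDist]) -> str:
    """Answer a multi-hop question from its high-level QDMR steps."""
    answers: List[AnswerDist] = []
    for step in steps:
        op = op_type(step)
        refs = [int(r) for r in re.findall(r"#(\d+)", step)]
        if op == "SELECT":
            ans = answer(step)
        elif op == "PROJECT":
            # substitute the referenced step with its highest-probability answer
            prev = answers[refs[0] - 1]
            ans = answer(step.replace(f"#{refs[0]}", max(prev, key=prev.get)))
        elif op == "FILTER":
            # answer a "normalized" question, then intersect with the referenced
            # answers by multiplying probabilities and renormalizing
            tmp = answer(re.sub(r"#\d+\s*", "", step))
            ref_ans = answers[refs[0] - 1]
            ans = {a: tmp.get(a, 0.0) * ref_ans.get(a, 0.0)
                   for a in set(tmp) | set(ref_ans)}
            norm = sum(ans.values()) or 1.0
            ans = {a: p / norm for a, p in ans.items()}
        else:
            # COMPARISON, BOOLEAN, UNION, ... are omitted from this sketch
            raise NotImplementedError(op)
        answers.append(ans)
    final = answers[-1]
    return max(final, key=final.get)  # highest-probability answer of the last step

# Toy usage with stub models (answer distributions are invented):
steps = ["company that Kyle York is the Chief Strategy Officer of",
         "corporation that acquired #1 in 2016"]
stub_answers = {
    steps[0]: {"Dyn": 0.9, "Oracle": 0.1},
    "corporation that acquired Dyn in 2016": {"Oracle Corporation": 0.8},
}
print(break_rc(steps,
               op_type=lambda s: "PROJECT" if "#" in s else "SELECT",
               answer=lambda q: stub_answers.get(q, {})))
# Oracle Corporation
```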

As our IR model we use bigram TF-IDF, proposed by Chen et al. (2017). Since the RC model is run on single-hop questions, we use the BERT-based RC model from Min et al. (2019b), trained solely on SQuAD (Rajpurkar et al., 2016).
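A rough sketch of such a bigram TF-IDF retriever, using scikit-learn rather than the original implementation and with invented toy paragraphs:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

paragraphs = [
    "Dyn is an Internet performance company; Kyle York served as its Chief Strategy Officer.",
    "Oracle Corporation acquired Dyn in 2016.",
    "The keyword 'parsing' appears in many ACL papers.",
]

# Uni- and bigram TF-IDF index over the paragraph collection.
vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
index = vectorizer.fit_transform(paragraphs)

def retrieve(query: str, k: int = 2):
    """Return the top-k paragraphs by TF-IDF cosine similarity to the query."""
    scores = cosine_similarity(vectorizer.transform([query]), index)[0]
    ranked = scores.argsort()[::-1][:k]
    return [(paragraphs[i], float(scores[i])) for i in ranked]

print(retrieve("company that Kyle York is the Chief Strategy Officer of"))
```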

Algorithm 1 BREAKRC

1: procedure BREAKRC(s : QDMR)
2:     ansrs ← []
3:     for s_i in s = s_1, ..., s_n do
4:         op ← OPTYPE(s_i)
5:         refs ← REFERENCEDSTEPS(s_i)
6:         if op is SELECT then
7:             ans ← ANSWER(s_i)
8:         else if op is FILTER then
9:             ŝ_i ← EXTRACTQUESTION(s_i)
10:            ans_tmp ← ANSWER(ŝ_i)
11:            ans ← INTERSECT(...

BERTQA Baseline As BREAKRC exploits question decompositions, we compare it with a model that does not. BERTQA receives as input the original natural language question, x. It uses the same IR model as BREAKRC to retrieve paragraphs for x. For a fair comparison, we set its number of retrieved paragraphs such that it is identical to BREAKRC (namely, 10 paragraphs for each QDMR step that involves IR). Similar to BREAKRC, retrieved paragraphs are fed to a pretrained BERT-based RC model (Min et al., 2019b) to answer x. In contrast to BREAKRC, which is trained on SQuAD, BERTQA is trained on the target dataset (HOTPOTQA), giving it an advantage over BREAKRC.

A COMBINED Approach Last, we present an approach that combines the strengths of BREAKRC and BERTQA. In this approach, we use the QDMR decomposition to improve retrieval only. Given a question x and its QDMR s, we run BREAKRC on s, but in addition to storing answers, we also store all the paragraphs retrieved by the IR model. We then run BERTQA on the question x and the top-10 paragraphs retrieved by BREAKRC, sorted by their IR ranking. This approach resembles that of Qi et al. (2019). The advantage of COMBINED is that we do not need to develop an answering procedure for each QDMR operator separately, which involves different discrete operations such as comparison and intersection. Instead, we use BREAKRC to retrieve contexts, and an end-to-end approach to learn how to answer the question directly. This can often handle operators not implemented in BREAKRC, like BOOLEAN and UNION.
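A corresponding sketch of the COMBINED approach under the same assumptions; for simplicity, each QDMR step is issued directly as a retrieval query here, whereas the paper's variant runs BREAKRC and collects the paragraphs it retrieves along the way.

```python
from typing import Callable, List, Tuple

def combined(question: str,
             steps: List[str],
             retrieve: Callable[[str], List[Tuple[str, float]]],
             bert_qa: Callable[[str, List[str]], str],
             top_k: int = 10) -> str:
    """Use the QDMR only to gather paragraphs, then answer end-to-end."""
    pool: List[Tuple[str, float]] = []
    for step in steps:                        # one IR query per decomposition step
        pool.extend(retrieve(step))
    pool.sort(key=lambda pair: pair[1], reverse=True)  # sort by IR score
    contexts, seen = [], set()
    for paragraph, _ in pool:                 # keep the top-k distinct paragraphs
        if paragraph not in seen:
            seen.add(paragraph)
            contexts.append(paragraph)
        if len(contexts) == top_k:
            break
    return bert_qa(question, contexts)        # the reader sees the original question
```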

Dataset To evaluate our models, we use all 2,765 QDMR-annotated examples of the HOTPOTQA development set found in BREAK. PROJECT and COMPARISON type questions account for 48% and 7% of the examples, respectively. Table 6 shows model performance on HOTPOTQA. We report EM and F1 using the official HOTPOTQA evaluation script. IR measures the percentage of examples in which the IR model successfully retrieved both of the "gold paragraphs" necessary for answering the multi-hop question. To assess the potential utility of QDMR, we report results for BREAKRC_G, which uses gold QDMRs, and BREAKRC_P, which uses QDMRs predicted by a COPYNET parser (§7.2).

Table 6: Open-domain QA results on HOTPOTQA.

5.2 Results

Retrieving paragraphs with decomposed questions substantially improves the IR metric from 46.3 to 59.2 (BREAKRC_G), or 52.5 (BREAKRC_P). This leads to substantial gains in EM and F1 for COMBINED_G (43.3 to 52.4) and COMBINED_P (43.3 to 49.3). The EM and F1 of BREAKRC_G are only slightly higher than BERTQA, since BREAKRC does not handle certain operators, such as BOOLEAN steps (9.4% of the examples).

The majority of questions in HOTPOTQA combine SELECT operations with either PROJECT (also called "bridge" questions), COMPARISON, or FILTER. PROJECT and COMPARISON questions (Figure 6) were shown to be less susceptible to reasoning shortcuts, i.e., they necessitate multi-step reasoning (Chen and Durrett, 2019; Jiang and Bansal, 2019; Min et al., 2019a). In Table 7 we report BREAKRC results on these question types, where it notably outperforms BERTQA.

Figure 6: Examples of PROJECT and COMPARISON questions in HOTPOTQA (high-level QDMR).
Table 7: Results on PROJECT and COMPARISON questions from HOTPOTQA development set.

Ablations In BREAKRC, multiple IR queries are issued, one at each step. To examine whether these multiple queries were the cause of the performance gains, we built IR-NP, a model that issues multiple IR queries, one for each noun phrase in the question. Similar to COMBINED, the question and the union of retrieved paragraphs are given as input to BERTQA. We observe that COMBINED substantially outperforms IR-NP, indicating that the structure of QDMR, rather than multiple IR queries, has led to the improved performance. To test whether QDMR is better than a simple rule-based decomposition algorithm, we developed a model that decomposes a question by applying a set of predefined rules over the dependency tree of the question (full details in §7.2). COMBINED and BREAKRC were compared to COMBINED_R and BREAKRC_R, which use the rule-based decompositions. We observe that QDMR led to substantially higher performance when compared to the rule-based decompositions.

6 QDMR for Semantic Parsing

As QDMR structures can be easily annotated at scale, a natural question is how far they are from fully executable queries (which are known to be expensive to annotate). As shown in §4.3, QDMRs can be mapped to pseudo-logical forms with high precision (93.1%) by extracting formal operators and arguments from their steps. The pseudo-logical form differs from an executable query in the lack of grounding of its arguments (entities and relations) in KB constants. This stems from the design of QDMR as a domain-agnostic meaning representation (§2). QDMR abstracts away from a concrete KB schema by assuming an underlying "idealized" KB, which contains all of its arguments.

Thus, QDMR can be viewed as an intermediate representation between a natural language question and an executable query. Such intermediate representations have already been discussed in prior work on semantic parsing. Kwiatkowski et al. (2013) and Choi et al. (2015) used underspecified logical forms as an intermediate representation. Guo et al. (2019) proposed a two-stage approach, separating learning an intermediate text-to-SQL representation from the actual mapping to schema items. Works in the database community have particularly targeted the mapping of intermediate query representations into DB-grounded queries, using schema mapping and join path inference (Androutsopoulos et al., 1995; Baik et al., 2019). We argue that QDMR can be used as an easy-to-annotate representation in such semantic parsers, bridging between natural language and full logical forms.

7 QDMR Parsing

We now present evaluation metrics and models for mapping questions into QDMR structures.

Task Definition Given a question $x$, we wish to map it to its QDMR steps, $s = \langle s_1, \ldots, s_n \rangle$. One can frame this as a sequence-to-sequence problem, where $x$ is mapped to a string representing its decomposition. We add a special separating token, SEP, and define the target string to be $s_1^1, \ldots, s_1^{m_1}, \mathrm{SEP}, s_2^1, \ldots, s_2^{m_2}, \mathrm{SEP}, \ldots, s_n^{m_n}$, where $m_1, \ldots, m_n$ are the numbers of tokens in the decomposition steps.
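A minimal sketch of this linearization and its inverse; the surface form chosen for the SEP token here is arbitrary and only serves the example.

```python
from typing import List

SEP = "@@SEP@@"  # arbitrary surface form for the separator token in this sketch

def steps_to_target(steps: List[str]) -> str:
    """Linearize QDMR steps into the seq2seq target string s_1 SEP s_2 SEP ... s_n."""
    return f" {SEP} ".join(steps)

def target_to_steps(target: str) -> List[str]:
    """Recover the list of steps from a (possibly model-generated) target string."""
    return [step.strip() for step in target.split(SEP) if step.strip()]

steps = ["papers", "#1 in ACL", "keywords of #2",
         "number of #2 for each #3", "#3 where #4 is more than 100"]
target = steps_to_target(steps)
assert target_to_steps(target) == steps
print(target)
```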

7.1 Evaluation Metrics

We wish to assess the quality of a predicted QDMR, $\hat{s}$, relative to a gold standard, $s$. Figure 7 lists various properties by which question decompositions may differ, such as granularity (e.g., steps 1-3 of decomposition 1 are merged into the first step of decomposition 2), ordering (e.g., the last two steps are swapped), and wording (e.g., using "from" instead of "on"). While such differences do not affect the overall semantics, the second decomposition can be further decomposed. To measure such variations, we introduce two types of evaluation metrics. Sequence-based metrics treat the decomposition as a sequence of tokens, applying standard text generation metrics. As such metrics ignore the QDMR graph structure, we also employ graph-based metrics that compare the predicted graph $G_{\hat{s}}$ to the gold QDMR graph $G_s$ (see §2).

Question: "Show me all the flights from Atlanta to Baltimore on any airline on Thursday"
Decomposition 1:
1. Return flights
2. Return #1 from Atlanta
3. Return #2 to Baltimore
4. Return #3 on Thursday
5. Return #4 from any airline
Decomposition 2:
1. Return flights from Atlanta to Baltimore
2. Return #1 on any airline
3. Return #2 on Thursday

Figure 7: Differences in granularity, step order, and wording between two decompositions.

Figure 8: Graph edit operations between the graphs of the two QDMRs in Figure 7.

Sequence-based scores, where higher values are better, are denoted by ⇑. Graph-based scores, where lower values are better, are denoted by ⇓.

• Exact Match ⇑: Measures exact match between $s$ and $\hat{s}$, either 0 or 1.
• SARI ⇑ (Xu et al., 2016): SARI is commonly used in tasks such as text simplification. Given $s$, we consider the sets of added, deleted, and kept n-grams when mapping the question $x$ to $s$. We compute these three sets for both $s$ and $\hat{s}$, using the standard of up to 4-grams, then average (a) the F1 for added n-grams between $s$ and $\hat{s}$, (b) the F1 for kept n-grams, and (c) the precision for the deleted n-grams.
• Graph Edit Distance (GED) ⇓: A graph edit path is a sequence of node and edge edit operations (addition, deletion, and substitution), where each operation has a predefined cost. GED computes the minimal-cost graph edit path required for transitioning from $G_s$ to $G_{\hat{s}}$ (and vice versa), normalized by $\max(|G_s|, |G_{\hat{s}}|)$. Operation costs are 1 for insertion and deletion of nodes and edges. The substitution cost of two nodes $u, v$ is set to $1 - \mathrm{Align}(u, v)$, where $\mathrm{Align}(u, v)$ is the ratio of aligned tokens between these steps (a sketch follows below).
• GED+ ⇓: Comparing the QDMR graphs in Figure 8, we consider the splitting and merging of graph nodes. We implement GED+, a variant of GED with additional operations that merge (split) a set of nodes (a node), based on the A* algorithm (Hart et al., 1968). Due to its exponential worst-case complexity, we compute GED+ only for graphs with up to 5 nodes, covering 75.2% of the examples in the development set of BREAK.
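The following sketch computes a normalized GED with networkx; the token-overlap substitution cost is a simplified stand-in for the alignment-based Align(u, v), and the example compares the two decompositions of Figure 7.

```python
import re
import networkx as nx

def qdmr_graph(steps):
    """DAG with one node per step, labeled by its tokens; edges follow #references."""
    g = nx.DiGraph()
    for i, step in enumerate(steps, start=1):
        g.add_node(i, tokens=set(step.lower().split()))
        for ref in re.findall(r"#(\d+)", step):
            g.add_edge(int(ref), i)
    return g

def subst_cost(u_attrs, v_attrs):
    """1 - ratio of shared tokens between the two steps (0 means identical)."""
    u, v = u_attrs["tokens"], v_attrs["tokens"]
    return 1.0 - len(u & v) / max(len(u | v), 1)

def normalized_ged(gold_steps, pred_steps):
    g_gold, g_pred = qdmr_graph(gold_steps), qdmr_graph(pred_steps)
    ged = nx.graph_edit_distance(
        g_gold, g_pred,
        node_subst_cost=subst_cost,
        node_del_cost=lambda _: 1.0, node_ins_cost=lambda _: 1.0,
        edge_del_cost=lambda _: 1.0, edge_ins_cost=lambda _: 1.0,
    )
    return ged / max(len(g_gold), len(g_pred))

gold = ["flights", "#1 from Atlanta", "#2 to Baltimore",
        "#3 on Thursday", "#4 from any airline"]
pred = ["flights from Atlanta to Baltimore", "#1 on any airline", "#2 on Thursday"]
print(normalized_ged(gold, pred))
```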

7.2 QDMR Parsing Models

We present models for QDMR parsing, built over AllenNLP (Gardner et al., 2017) .

• COPY: A model that copies the input question x, without introducing any modifications.

• RULEBASED: We defined 12 decomposition rules, to be applied over the dependency tree of the question, augmented with coreference relations. A rule is a regular expression over the question dependency tree, which invokes a decomposition operation when matched (Table 8). E.g., the rule for relative clauses (relcl) breaks the question at the relative pronoun "that", while adding a reference to the preceding part of the sentence. A full decomposition is obtained by recursively applying the rules until no rule is matched.
• SEQ2SEQ: A sequence-to-sequence neural model with a 5-layer LSTM encoder and attention at decoding time.

Table 8: The decomposition rules of RULEBASED. Rules are based on dependency labels, part-of-speech tags and coreference edges. Text fragments used for decomposition are in boldface.

• S2SDYNAMIC: SEQ2SEQ with a dynamic output vocabulary restricted to the closed set of tokens $L_x$ available to crowd workers (see §3; a masking sketch follows after this list).
• COPYNET: SEQ2SEQ with an added copy mechanism that allows copying tokens from the input sequence (Gu et al., 2016).
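A sketch of the decoding-time restriction used by S2SDYNAMIC, written with PyTorch; the vocabulary and lexicon below are toy stand-ins, and the masking is applied to the logits of a single decoding step before the softmax.

```python
import torch

vocab = ["<pad>", "return", "papers", "keywords", "of", "in", "ACL", "#1", "#2", "london"]
token_to_id = {t: i for i, t in enumerate(vocab)}

def restrict_to_lexicon(logits: torch.Tensor, allowed_tokens) -> torch.Tensor:
    """Mask every vocabulary entry that is not in the question's closed lexicon L_x."""
    mask = torch.full_like(logits, float("-inf"))
    allowed_ids = [token_to_id[t] for t in allowed_tokens if t in token_to_id]
    mask[..., allowed_ids] = 0.0
    return logits + mask

logits = torch.randn(len(vocab))             # decoder scores for one time step
lexicon = {"return", "papers", "keywords", "of", "in", "ACL", "#1", "#2"}
probs = torch.softmax(restrict_to_lexicon(logits, lexicon), dim=-1)
print(probs[token_to_id["london"]].item())   # 0.0: "london" is outside L_x
```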

Table 9: Performance of QDMR parsing models on the development and test set. GED+ is computed only for the subset of QDMR graphs with up to 5 nodes, covering 66.1% of QDMRs and 97.6% of high-level data.

7.3 Results

Table 9 presents model performance on BREAK. Neural models outperform the RULEBASED baseline and perform reasonably well, with COPYNET obtaining the best scores across all metrics. This can be attributed to most of the tokens in a QDMR parse being copied from the original question.

Error analysis To judge the quality of predicted QDMRs, we sampled 100 predictions of COPYNET (Table 10), half of them being high-level QDMRs. For standard QDMR, 24% of the sampled predictions were an exact match, with an additional 30% being fully decomposed and semantically equivalent to the gold decompositions. E.g., in the first row of Table 10, the gold decomposition first discards the cylinders, then counts the remaining objects. Instead, COPYNET opted to count both groups, then subtract the number of cylinders from the number of objects. This illustrates how different QDMRs may be equivalent. For high-level examples (from RC datasets), as questions are often less structured, they require a deeper semantic understanding from the decomposition model. Only 8% of the predictions were an exact match, with an additional 46% being semantically equivalent to the gold. The remaining 46% were erroneous predictions (see Table 10).

Table 10: Manual error analysis of the COPYNET model predictions. Lower examples are of high-level QDMRs.

• Question: "How many objects other than cylinders are there?" Gold: (1) objects; (2) cylinders; (3) #1 besides #2; (4) number of #3. Prediction: (1) objects; (2) cylinders; (3) number of #1; (4) number of #2; (5) difference of #3 and #4. Analysis: sem. equiv. (30%).
• Question: "Where is the youngest teacher from?" Gold: (1) teachers; (2) the youngest of #1; (3) where is #2 from. Prediction: (1) youngest teacher; (2) where is #1. Analysis: incorrect (46%).
• Question: "Kyle York is the Chief Strategy Officer of a company acquired by what corporation in 2016?" Gold: (1) company that Kyle York is the Chief Strategy Officer of; (2) corporation that acquired #1 in 2016. Prediction: (1) company that Kyle York is the Chief Strategy Officer of; (2) corporation in 2016 that #1 was acquired by. Analysis: sem. equiv.
• Question: "Dayton's Devils had a cameo from the 'MASH' star who played what role on the show?" Gold: (1) MASH star that Dayton's Devils had a cameo from; (2) role that #1 played on the show. Prediction: (1) the MASH that Dayton's Devils had a cameo; (2) what role on the show star of #1 played. Analysis: incorrect (46%).

8 Related Work

Question decomposition Recent work on QA through question decomposition has focused mostly on single modalities (Gupta and Lewis, 2018; Guo et al., 2019; Min et al., 2019b). QA using neural modular networks has been suggested for both KBs and images by Andreas et al. (2016) and Hu et al. (2017). Question decomposition over text was proposed by Talmor and Berant (2018), however over a much more limited set of questions than in BREAK. Iyyer et al. (2017) have also decomposed questions to create a "sequential question answering" task. Their annotators viewed a web table and performed actions over it to retrieve the cells that constituted the answer. Conversely, we provided annotators only with the question, as QDMR is agnostic to the original context. An opposite annotation cycle to ours was presented in Cheng et al. (2018). The authors generate sequences of simple questions which crowdworkers paraphrase into a compositional question. Questions in BREAK are composed by humans, and are then decomposed to QDMR.

Semantic formalism annotation Labeling corpora with a semantic formalism has often been reserved for expert annotators (Dahl et al., 1994; Zelle and Mooney, 1996; Abend and Rappoport, 2013; Yu et al., 2018). Recent work has focused on cheaply eliciting quality annotations from non-experts through crowdsourcing (He et al., 2016; Iyer et al., 2017). FitzGerald et al. (2018) facilitated non-expert annotation by introducing a formalism expressed in natural language for semantic role labeling. This mirrors QDMR, as both are expressed in natural language.

Relation to other formalisms QDMR is related to Dependency-based Compositional Semantics (Liang et al., 2013), as both focus on question representations. However, QDMR is designed to facilitate annotations, while DCS is centered on paralleling syntax. Domain-independent intermediate representations for semantic parsers were proposed by Kwiatkowski et al. (2013) and Reddy et al. (2016). As there is no consensus on the ideal meaning representation for semantic parsing, representations are often chosen based on the particular execution setup: SQL is used for relational databases (Yu et al., 2018), SPARQL for graph KBs (Yih et al., 2016), while other ad-hoc languages are used based on the task at hand. We frame QDMR as an easy-to-annotate formalism that can be potentially converted to other representations, depending on the task. Last, AMR (Banarescu et al., 2013) is a meaning representation for sentences. Instead of representing general language, QDMR represents questions, which are important for QA systems, and for probing models for reasoning.

9 Conclusion

In this paper, we presented a formalism for question understanding. We have shown that it is possible to train crowd workers to produce such representations with high quality at scale, and created BREAK, a benchmark for question decomposition with over 83K decompositions of questions from 10 datasets and 3 modalities (DB, images, text). We presented the utility of QDMR for both open-domain question answering and semantic parsing, and constructed a QDMR parser with reasonable performance. QDMR proposes a promising direction for modeling question understanding, which we believe will be useful for multiple tasks in which reasoning is probed through questions.

A system could potentially answer "Name the political parties of the most densely populated country", by retrieving "the most densely populated country" using a database query, and "the political parties of #1" via an RC model.

The sequence of operations $op_1, \ldots, op_k$ is traced using the references to previous steps in the QDMR structure.

Except for COMPLEXWEBQUESTIONS (CWQ), where annotators paraphrased automatically generated questions.

Regarding the three merged operators of high-level QDMRs ( §2), the first two operators are treated as SELECT, while the third is considered a FILTER.

For high-level QDMRs, the merged operators ( §2) are considered to be fully decomposed.

Issuing an IR query over each "content word" in the question, instead of each noun phrase, led to poor results.