Text Modular Networks: Learning to Decompose Tasks in the Language of Existing Models


Abstract

We propose a general framework called Text Modular Networks (TMNs) for building interpretable systems that learn to solve complex tasks by decomposing them into simpler ones solvable by existing models. To ensure solvability of simpler tasks, TMNs learn the textual input-output behavior (i.e., language) of existing models through their datasets. This differs from prior decomposition-based approaches which, besides being designed specifically for each complex task, produce decompositions independent of existing sub-models. Specifically, we focus on Question Answering (QA) and show how to train a next-question generator to sequentially produce sub-questions targeting appropriate sub-models, without additional human annotation. These sub-questions and answers provide a faithful natural language explanation of the model's reasoning. We use this framework to build ModularQA, a system that can answer multi-hop reasoning questions by decomposing them into sub-questions answerable by a neural factoid single-span QA model and a symbolic calculator. Our experiments show that ModularQA is more versatile than existing explainable systems for the DROP and HotpotQA datasets, is more robust than state-of-the-art blackbox (uninterpretable) systems, and generates more understandable and trustworthy explanations compared to prior work.

1 Introduction

A natural way to solve more complex tasks such as multi-hop reasoning (Yang et al., 2018; Khashabi et al., 2018a; Khot et al., 2020) and numerical reasoning (Dua et al., 2019) is by decomposing them into simpler problems that have already been solved, e.g., single-fact QA (Rajpurkar et al., 2016). Besides being able to reuse existing simpler models, this approach allows developing an interpretable system that can explain its reasoning as a composition of these simpler sub-tasks, as illustrated in Figure 1.

Figure 1: MODULARQA learns to ask sub-questions to existing simple QA models, including a symbolic calculator, to finally arrive at the answer to a given complex question. Note that our approach did not rely on any decomposition annotations. The system learned to add "start to take a dip" in the question from the DROP dataset.

Current decomposition approaches (Talmor and Berant, 2018; Min et al., 2019a) mainly focus on developing algorithms that target specific question patterns (e.g., decomposing by predicting split points). However, such approaches are difficult to extend to new question patterns and cannot guarantee that the generated sub-questions are actually solvable by existing models. For instance, consider the question from the DROP dataset in Figure 1. The second sub-question here requires the introduction of a new phrase, "start to take a dip". This is beyond the scope of standard approaches that only rephrase question splits. Additionally, the final question needs to be posed to a symbolic calculator, which operates with a different question language.

To address these issues, we propose a general framework called Text Modular Networks (TMNs) where the system learns to decompose complex questions (of any form) into sub-questions that are answerable by existing QA models or symbolic modules (henceforth referred to as sub-models). 1 The core of this framework is a next-question generator that sequentially produces the next sub-question as well as the appropriate sub-model that should be capable of answering this sub-question. This sequence of sub-questions and answers provides an interpretable explanation of the model's neuro-symbolic reasoning, as illustrated in Figure 1.

A key insight behind our approach is that we can model the capabilities of existing sub-models by training a text-to-text model to generate the questions in the sub-model's training dataset (e.g., SQuAD), given appropriate cues. In our case, we train a BART model (Lewis et al., 2019) to generate questions given the context, answer, and vocabulary hints as cues. We then use these sub-task question models to generate sub-questions (and the appropriate sub-models) that could lead to the intermediate answers for each complex question (e.g., "Raymond S." and "American" in the HotpotQA example in Figure 1). The resulting sub-questions, by virtue of our training, are in the language (i.e., within-scope) of the corresponding sub-models.

These sub-question sequences can now be used to train the next-question generator to sequentially produce the next sub-question. TMNs use only this trained generator, along with existing QA models, to answer complex questions, without the need for answer supervision during inference. Importantly, since TMNs learn model-specific decompositions automatically, they can easily integrate different QA sub-models as blackboxes, irrespective of whether they are neural or symbolic.

We use the TMN framework to develop a modular system, MODULARQA, that explains its reasoning and generalizes across DROP (Dua et al., 2019) 2 and HotpotQA (Yang et al., 2018) using a neural factoid QA sub-model and a symbolic calculator. MODULARQA outperforms a baseline text-to-text model and even other modular networks that each target only one of these datasets.

Contributions. (1) A general framework called Text Modular Networks (TMNs) that decomposes complex questions into sub-questions answerable by existing models treated as blackboxes (neural or symbolic); (2) an implementation of this framework called MODULARQA 3 that learns to decompose multi-hop and discrete reasoning questions; (3) the first question decomposition approach for DROP 2 that also generalizes to HotpotQA; and (4) an interpretable model that can explain its reasoning in terms of a sequence of simple questions and answers.

2 Related Work

The idea of decomposing complex behavior in terms of simpler parts is long-standing in the AI literature. Minsky (1988) famously characterized human intelligence as interactions between simpler parts or agents (forming what he called a society of mind). Many early QA systems were also designed as combinations of distinct modules (Harabagiu et al., 2001; Moldovan et al., 2000). Traditionally, this idea has been used to compose outputs of lower-level language tasks to solve higher-level NLU tasks (Harabagiu and Hickl, 2006; Lally et al., 2012; Khot et al., 2017; Khashabi et al., 2018b). A distinct feature of much of this prior work is the use of pre-determined symbolic representations. This includes frameworks for mapping math word problems to a symbolic representation based on mathematical operations (Seo et al., 2015; Roy and Roth, 2018; Amini et al., 2019; Chen et al., 2020), as well as mapping questions to a pre-defined set of operations (Berant et al., 2013; Neelakantan et al., 2016), inter alia.

There have been different modular network architectures proposed to exploit compositionality (Rosenbaum et al., 2018; Kirsch et al., 2018). The closest models to our work are based on neural module networks (NMNs) (Andreas et al., 2016). NMNs are end-to-end modeling frameworks that compose task-specific neural modules where each module performs a simple operation. Two formulations of NMNs have been proposed for complex QA tasks: (1) Self-Assembling NMNs (Jiang and Bansal, 2019) for HotpotQA and (2) NMNs for DROP (Gupta et al., 2020). Both these approaches target exactly one dataset and need to learn the simple reasoning of the modules from the complex task. While these systems are also capable of producing explanations in terms of a semantic parse and the module attentions, they are harder to interpret than our simple text-based explanations.

Question decomposition has been pursued before for two different datasets: ComplexWebQuestions (Talmor and Berant, 2018) and HotpotQA (Yang et al., 2018).

Both approaches (Talmor and Berant, 2018; Min et al., 2019b) focus on directly training a model to produce the decomposition, similar to work on semantic parsing that produces latent logical forms (Berant et al., 2013). Along similar lines, Wolfson et al. (2020) collect question decompositions for these textual QA datasets, in addition to other semantic parsing datasets, with a fixed operation set. However, we argue that textual question answering may not have a fixed vocabulary of valid operations: the "operations" are defined by the sub-models used to solve the simple questions. Hence, we propose Text Modular Networks as an alternate framework that learns to generate question decompositions in the language of these models. Additionally, these approaches (Talmor and Berant, 2018; Min et al., 2019b) exploit explicit decomposition cues from the question to produce question splits, an approach not usable for DROP questions (as shown in Fig. 1). Perez et al. (2020) also used a text-to-text model to generate simple sub-questions for HotpotQA. However, they used similar questions from a large corpus of simple questions as training data for this model without trying to exactly capture the reasoning steps. Consequently, their final system mainly used these generated decompositions to collect evidence.

Our approach of training a model to produce the latent decomposition from weak/distant supervision bears resemblance to semantic parsing approaches (Berant et al., 2013; Krishnamurthy et al., 2017) that are trained to produce latent logical forms that can be executed over tables. We, however, diverge from this line of work since our inferred decompositions are in free-form language and not limited to any pre-specified set of operations. Iyyer et al. (2017) also proposed solving complex QA tasks using a sequence of sub-questions, but the sub-questions were model-agnostic and limited to QA over structured tables. Finally, various approaches (Chen et al., 2016; Buck et al., 2018; Nogueira and Cho, 2017) have also proposed reformulating questions for textual QA tasks. However, these approaches mostly focus on rephrasing simple questions and not decomposing the question itself.

3 Text Modular Networks

TMNs are a family of architectures consisting of modules that communicate through language learned from these modules, to accomplish a certain goal (e.g., answering a question). Figure 2 illustrates this general idea in the context of answering a DROP question. The core of our system is a next-question generator D, a component in charge of generating and distributing sub-tasks among sub-models A. The system alternates between using the next-question generator to produce the next question (NextGen) and using the corresponding sub-model to answer this question. Stated formally, solving a complex question qc is an alternating process between the following two steps. Generate the next question q_i for sub-model t_i:


$t_i, q_i = D(q_c, q_1, a_1, \ldots, q_{i-1}, a_{i-1})$

Find answer $a_i$ by posing $q_i$ to sub-model $t_i$:

$a_i = A_{t_i}(q_i, p)$

where $q_i$ is the $i$-th generated sub-question and $a_i$ is the answer produced by a sub-model $t_i$ based on a given context paragraph $p$. This simple iterative process ends when $q_{i+1}$ equals a special end-of-sequence symbol (denoted throughout as [EOQ]), with the final output answer being $a_i$.
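To make this alternation concrete, the following is a minimal Python sketch of the loop, assuming hypothetical `next_gen` (the next-question generator D) and `sub_models` (a map from sub-model name t to its QA function A_t) interfaces; these names are illustrative and not taken from the released code.

```python
def answer_complex_question(qc, paragraph, next_gen, sub_models, max_steps=10):
    """Alternate between generating the next sub-question and answering it.

    next_gen(qc, history) -> (sub_model_name, sub_question), with "[EOQ]" ending the chain
    sub_models[name](sub_question, paragraph) -> answer string
    """
    history = []                          # (sub_question, answer) pairs produced so far
    answer = None
    for _ in range(max_steps):
        sub_model, sub_question = next_gen(qc, history)
        if sub_question == "[EOQ]":       # the generator signals that the chain is complete
            break
        answer = sub_models[sub_model](sub_question, paragraph)
        history.append((sub_question, answer))
    return answer, history
```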

3.1 Design Principles

The two key design principles that motivate this design of Text Modular Networks are:

Figure 2: A sample inference chain for a DROP question. Through text-to-text interactions between the next-question generator D and existing QA models A, we can clearly see the interactive reasoning process of our system.

Decomposability. Decomposing complex tasks into sub-tasks that existing models can already solve is easy and even necessary. Ensuring solvability of the sub-tasks is a necessary step towards solving the more complex task that builds on top of them.

Exploiting existing models. Our architecture aims to decompose complex tasks into simpler sub-tasks, enabling the reuse of existing models (or the use of easy-to-define functions). As we show in later sections, we solve questions in DROP and HotpotQA, two multi-step reasoning datasets, by decomposing them in terms of SQuAD (an existing simpler dataset) and a symbolic calculator. This paradigm is in contrast to recent efforts to build increasingly complex monolithic models trained on large datasets for each new complex task (Fang et al., 2019; Xiao et al., 2019; Ran et al., 2019). While similar text-based decomposition architectures have also been proposed before (Talmor and Berant, 2018; Min et al., 2019b), the main characteristic of TMNs is learning to produce these decompositions in the language of these models. This design attribute makes Text Modular Networks extensible since a designer can plug in any QA model, even a non-differentiable symbolic model, to extend the scope of the overall system.

3.2 Building A Text Modular Network

The key challenge in building a Text Modular Network is developing the next-question generator model. Training this model requires a next-question prediction dataset where each example is a step in the iterative progression of sub-question generation. For instance, for our running DROP example, the expected output at the second step is: t_2, q_2 = SQuAD, "When did the services sector start to take a dip?". While it may be possible to do this by collecting task-specific datasets or designing a task-specific next-question generator (Min et al., 2019b; Talmor and Berant, 2018), our goal is to build a framework that can be easily extended to new complex QA tasks as well as to new sub-tasks. To achieve this, we present a general framework to generate the next-question training dataset with the following two properties:

Decomposition in the language of sub-models. To effectively communicate with each sub-model, the next-question generator has to generate sub-tasks in the language of each sub-model (i.e., decomposing the complex task into chunks that are understandable by each model). This means that our training dataset (and thereby the next-question generator) should use questions that an existing sub-model can handle, without having to manually specify them for each sub-model (Sec. 3.2.1).

Minimal supervision of decompositions. Just as we need to model the sub-tasks, the next-question generator also needs to model the reasoning needed in the complex task. We argue that for many problems there are practical ways to weakly supervise the reasoning chain without the need for manual and exhaustive annotation of the full decomposition. As we show in Sec. 3.2.2, it is possible to extract a set of hints (embodied in the questions and context paragraphs) to semi-automatically construct the training data for the next-question generator that captures the reasoning needed for the complex task.

Figure 3: Sample questions generated by our conditional question generator models. Given the context p: "...The sector decreased by 7.8 percent in 2002, before rebounding in 2003 with a 1.6 percent growth rate..." and answer a: 2002, our SQuAD-based generator produces G_S(p, a): "When did the services sector start to take a dip?", a valid factoid question answerable by a SQuAD QA model. Similarly, a symbolic generator can be used to produce a question answerable by a calculator.

3.2.1 Modeling QA Sub-Models

We identify minimal supervision hints, e.g., intermediate answers, to weakly specify each reasoning step (described shortly) for a complex QA task. These hints are then used to generate the sub-questions q_i in the "language" of existing sub-models at each step. To ensure these questions are answerable by existing sub-models, we train a text-to-text model on the sub-model's original task to generate a plausible q_i conditioned on these hints, e.g., a BART model trained to generate the question given the answer. We can view this utility as characterizing the question language of the sub-model. For example, such a model trained on the SQuAD dataset would produce factoid questions, i.e., the space of questions answerable by a model trained on this dataset.

While an unconditional question generation model can also capture the space of questions, it can generate a large number of possibly valid questions, making it hard to effectively train or use such a model. Instead, we scope the problem down to conditional generation of questions given hints z. For example, we could use the context p and answer a as input conditions to train a question generator model G : z → q, where z = ⟨p, a⟩. Figure 3 shows an example question from such a generator G_S trained on the SQuAD dataset.

Figure 4: An example decomposition generated for a DROP example using hints and sub-question generators G. These intermediate answers can be obtained by finding two dates/numbers and an operation that produces the final answer.
Paragraph p: "... The sector decreased by 7.8 percent in 2002, before rebounding in 2003 with a 1.6 percent growth rate ..."
Question qc: How many years did it take for the services sector to rebound? Answer a: 1
Hints → Sub-Questions:
a1 = 2003 → q1 = G_S(p, a1): In what year did the services sector rebound?
a2 = 2002 → q2 = G_S(p, a2): When did the services sector start to take a dip?
a3 = 1 → q3 = G_C(p, a3): diff(2003, 2002)
→ q4 = [EOQ]

3.2.2 Generating Training Data

We next describe how to use the sub-task question models to produce potential question-answer chains for a complex question. To generate using the sub-task question model, we need to define the input hints z at each step. Such hints can often be extracted from multi-hop datasets (as we show in Sec. 5) and are easier to annotate than model-specific decompositions. For example, in our running DROP example, while we do not have annotations for the exact decomposition, we can use cues in the question and the paragraph to derive hints (such as the answers for each sub-question). This is akin to approaches that generate weak supervision for semantic parsers (Berant et al., 2013), math QA systems (Roy and Roth, 2017) and textual NMNs (Jiang and Bansal, 2019). For example, under the definition of z = ⟨p, a⟩, we would need to provide the context and answer for each reasoning step. In our example question, we can derive the intermediate answers by finding the two dates whose difference leads to the final answer. 4 As shown in Fig. 4, this kind of simple weak supervision in combination with our sub-task question models can be used to produce the sequence of sub-questions needed to train our next-question generator.

At the end, we can now easily derive the training data needed for our next-question generator. For each question q_i generated using the sub-task question model G_{t_i}, we can create a training example for our next-question generator:

Input: $q_c, q_1, a_1, \ldots, q_{i-1}, a_{i-1}$
Output: $t_i, q_i$

Note that the cues are only needed for creating this training data for the next-question generator D. Hence they do not need to be carefully designed to have complete coverage or be noise-free. The model will learn to generalize over this noise and other linguistic variations.

4 MODULARQA System

We next describe a specific instantiation of the Text Modular Network framework: MODULARQA, a new QA system that works across HotpotQA and DROP. We focus here on the key pieces of this system that are independent of the end-task. We present details about our two end-tasks and the associated extraction of hints in the next section. More specific details of the system and model hyperparameters are provided in Appendix A.

4.1 QA Sub-Models, A

We use two QA models that are sufficient to cover a large space of questions in these two datasets:

• SQuAD model, A S : A RoBERTa-Large model trained on the entire SQuAD 2.0 dataset including the no-answer questions.

• Math calculator model, A C : A symbolic Python program that can perform key operations needed for DROP and HotpotQA (see Table 1 ).

Table 1: Set of operations handled by the symbolic calculator model A_C and the corresponding approach to generate such questions in G_C.

diff(X, Y, [Z]): A_C returns the absolute difference between X and Y; if Z ∈ {days, months, years} is specified, it finds the difference in Z units. G_C generates questions with all possible date/number pairs as X, Y, adding Z if it is mentioned in the question.

not(X): A_C returns 100 − X. G_C generates questions for every number ≤ 100 as X.

if_then(X > Y, Z, W): A_C returns Z if X is greater than Y, else W (X, Y can be numbers or dates). G_C generates questions with all possible date/number pairs as X and Y, using the pair of entities in the question as Z and W.

if_then(X != Y, Z, W): A_C returns Z if X is not the same as Y, else W. G_C generates questions with previous answers as X, Y; Z and W are set to "no" and "yes" respectively.
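The following is a minimal Python sketch of a calculator covering the operations in Table 1; argument parsing, date normalization, and unit handling in the released MODULARQA calculator may differ.

```python
from datetime import date

def diff(x, y, unit=None):
    """Absolute difference between two numbers or dates, optionally in days/months/years."""
    if isinstance(x, date) and isinstance(y, date):
        if unit == "years":
            return abs(x.year - y.year)
        if unit == "months":
            return abs((x.year - y.year) * 12 + (x.month - y.month))
        return abs((x - y).days)          # default to days for dates
    return abs(x - y)

def not_op(x):
    """Complement of a percentage: 100 - X."""
    return 100 - x

def if_then(condition, z, w):
    """Return z if the (already evaluated) comparison holds, else w."""
    return z if condition else w

# Examples matching the chains shown in Figure 4 and Table 4:
print(diff(date(1706, 1, 8), date(1705, 12, 25), "days"))   # 14
print(not_op(12.6))                                          # 87.4
print(if_then(12.2 < 6.1, "Irish", "Italian"))               # Italian
```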

4.2 Sub-Task Question Models

We define two sub-task question models corresponding to each of our QA sub-models. We train these conditional generators using the context, answer and an estimated question vocabulary v as hints, i.e., z = ⟨p, a, v⟩. Note that as the hints get more specific, the space of potential questions gets smaller, making it less likely to produce noisy training decompositions. However, it can be harder to obtain extremely specific hints for questions in the end-task; e.g., the exact words of the decomposed questions would be hard to obtain unless one knew the gold decomposition. We found the answer and estimated question vocabulary to be a reasonable signal for reducing the noise without being hard to obtain automatically.

4.2.1 SQuAD QA Model, G_S

We train a BART-Large model on the answerable subset of SQuAD 2.0 to build our sub-task question model for SQuAD. We use the gold paragraph and answer from the dataset as the input context and answer. For the estimated question vocabulary, we select essential words 5 from the gold questions (referred to as the function Φ from hereon) with additional irrelevant words sampled from other questions. 6

To train the text-to-text BART_S model, we use a simple concatenation of the passage, vocabulary and answer (with markers such as "H:" and "A:" to indicate each field) as the input sequence and the question as the output sequence. While a constrained-decoding approach (Hokamp and Liu, 2017; Hu et al., 2019a) could be used here to further promote the use of the vocabulary hints, this simple approach was effective and more generally applicable to other hints in our use-case. To use this model, we use nucleus sampling to generate k questions and filter out unanswerable or incorrectly answered questions, i.e., given q ∼ BART_S(p, a, v):

G_S(p, a, v) = {q | overlaps(A_S(p, q), a)}
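To illustrate this over-generate-and-filter step, here is a minimal Python sketch; the field markers, the `overlaps` test, and the `bart_sample`/`squad_qa` interfaces are simplifying assumptions rather than the released implementation.

```python
def build_generator_input(passage, vocab_hints, answer):
    # Concatenate the fields with simple markers; the exact marker strings may differ.
    return f"H: {' '.join(vocab_hints)} A: {answer} C: {passage}"

def overlaps(predicted, expected):
    # Loose token-overlap test between the sub-model's answer and the hinted answer.
    return bool(set(predicted.lower().split()) & set(expected.lower().split()))

def generate_filtered_questions(passage, answer, vocab_hints, bart_sample, squad_qa, k=5):
    """Sample k candidate questions from BART_S and keep those that A_S answers correctly.

    bart_sample(input_text, k) -> list of k sampled question strings (nucleus sampling)
    squad_qa(question, passage) -> predicted answer span ("" if unanswerable)
    """
    candidates = bart_sample(build_generator_input(passage, vocab_hints, answer), k)
    return [q for q in candidates if overlaps(squad_qa(q, passage), answer)]
```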

4.2.2 Math Calculator, G_C

Given the symbolic nature of this solver, rather than training a neural generator, we just generate all possible numeric questions given the context. Similar to the G_S model, we first generate potential questions (see Table 1) and then select the ones that lead to the expected answer using A_C:

G_C(p, a, Φ) = {q | A_C(p, q) = a}

Table 2 shows example complex questions and the hints derived for these questions. Given these input hints 7 and our sub-task question models, we can identify the next question for each step and the appropriate sub-model (based on the sub-task question model that produced this question). Note that since we sample these questions for each step, every complex question can lead to many such question chains. To improve the quality of the training data, we also filter out potentially noisy decompositions. 8

7 For simplicity, we don't show the input context here.
8 E.g., if an answer is unused or the vocabulary is too different. See Appendix A.3 for more details.

Table 2: Sample hints and the resulting generated questions for DROP and HotpotQA examples. The function Φ selects non-stopword words from the input question.

DROP, qc: How many years did it take the services sector to rebound after the 2002 decrease?
Step 1: a1: 2002, v1: Φ(qc)
Step 2: a2: 2003, v2: Φ(qc)
Step 3: a3: 1, v3: ["diff", "2002", "2003"]

HotpotQA example:
Step 1: p1: d1, a1: Raymond S Persi, v1: Φ(qc), q1: Who directed "...
Step 2: p2: d2, a2: American, v2: Φ(qc) + "Raymond S Persi"

4.3 Training Next-Question Generator

We train a BART-Large model on this generated training data to produce the next question (and QA system) given the complex question and previous question-answer pairs. For example, for the first question decomposition in Table 2, the model is trained on:

Input = QC: How many years did it take the services sector to rebound after the 2002 decrease? QI: (squad) When did the services sector take a decrease? A: 2002 QI: (squad) When did the services sector rebound? A: 2003 QS:
Output = (math) diff(2002, 2003)
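A minimal sketch of how such a training pair could be serialized, mirroring the QC/QI/A/QS markers shown above; the exact separators and casing are assumptions.

```python
def serialize_training_example(complex_q, history, next_sub_model, next_question):
    """Build one (input, output) pair for the next-question generator.

    history: list of (sub_model, sub_question, answer) triples generated so far.
    """
    parts = [f"QC: {complex_q}"]
    for sub_model, sub_q, ans in history:
        parts.append(f"QI: ({sub_model}) {sub_q} A: {ans}")
    parts.append("QS:")
    return " ".join(parts), f"({next_sub_model}) {next_question}"

# The training example shown above:
src, tgt = serialize_training_example(
    "How many years did it take the services sector to rebound after the 2002 decrease?",
    [("squad", "When did the services sector take a decrease?", "2002"),
     ("squad", "When did the services sector rebound?", "2003")],
    "math", "diff(2002, 2003)")
```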

4.4 Inference

For inference, we now only rely on the next-question generator and QA sub-models. Given a set of modules (in our case one D and different A_t), we describe inference as finding the best path through a directed graph where each vertex is one of the modules being called and the edges are the outputs of these modules. Each edge has an associated target module, e.g., the output sub-question of the next-question generator also provides the QA sub-model that can answer the sub-question. We sample multiple sub-questions from the next-question generator using nucleus sampling (Holtzman et al., 2020), each forming a different edge in our graph. We select the top-k answers from the QA sub-models, again each forming a new edge with the next-question generator as the target module. 9 An inference chain is any sequence of edges u = e_1, ..., e_n from a unique start node (the next-question generator with the input complex question) to a unique end node (reached when the next-question generator produces [EOQ]).

Figure 5: A sample inference chain scored by our approach for a negation DROP question. For each n0 question generated in the first step, we will explore n1 questions in the second step (and so on). We use our scoring function w to select the optimal inference chain+answer (and prune incomplete low-scoring chains).

Under this formulation, the inference problem can therefore be stated as a single-source shortest-path problem (Cormen et al., 2009):

$P^* = \arg\min_{P \in \mathcal{P}} \sum_{e_j \in P} w(e_j),$

where the goal is to find the minimal cost path P* (corresponding to an inference chain) parameterized by some weighting function w (see illustration in Figure 5). To approximate the full space of possible paths P, we perform a best-first search (Dijkstra, 1959). We define a monotonic scoring function w that scores each sub-question edge based on vocabulary overlap with the input question. 10 At the end, we also score the final chain (i.e., w(e_n)) using a RoBERTa model trained on randomly sampled chains (see App. A.5). We note that while our particular choice of w is simple and empirically motivated by our particular choice of datasets, 11 many such weighting functions could be used here, including functions based on additional learned components (e.g., using the neural shortest path approach of Richardson et al. (2018) or other graph-based structure prediction techniques (Deutsch et al., 2019)).
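For illustration, here is a minimal best-first search sketch over partial chains using a priority queue; the `next_gen_sample`, `sub_models`, and `chain_cost` interfaces are assumed stand-ins for the components described above, with `chain_cost` playing the role of the cumulative weight under w.

```python
import heapq
import itertools

def best_first_search(qc, paragraph, next_gen_sample, sub_models, chain_cost, max_pop=100):
    """Best-first search over partial inference chains.

    next_gen_sample(qc, chain) -> list of (sub_model_name, sub_question) candidates
    sub_models[name](question, paragraph) -> answer string
    chain_cost(qc, chain) -> cost of a partial chain (lower is better)
    A chain is a list of (sub_question, answer) pairs; "[EOQ]" marks completion.
    """
    tie = itertools.count()                      # tie-breaker so the heap never compares chains
    frontier = [(0.0, next(tie), [])]
    best_cost, best_chain = float("inf"), None
    for _ in range(max_pop):
        if not frontier:
            break
        cost, _, chain = heapq.heappop(frontier)
        for sub_model, sub_q in next_gen_sample(qc, chain):
            if sub_q == "[EOQ]":
                if chain and cost < best_cost:   # completed chain; its answer is the last one
                    best_cost, best_chain = cost, chain
                continue
            answer = sub_models[sub_model](sub_q, paragraph)
            new_chain = chain + [(sub_q, answer)]
            heapq.heappush(frontier, (chain_cost(qc, new_chain), next(tie), new_chain))
    return (best_chain[-1][1] if best_chain else None), best_chain
```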

5 Experimental Setup

We train and evaluate our system on two different QA datasets: DROP and HotpotQA. The DROP dataset contains questions that need discrete numeric reasoning over a single paragraph. HotpotQA is a dataset of multi-hop questions that need facts from two paragraphs to answer each question. While these are very different datasets, by decomposing them into a sequence of simple QA sub-tasks, we are able to develop a single QA system that works across both of them. We will first present the dataset-specific details of collecting the decomposition programs, followed by our results.

10 Same function used to remove noisy decompositions as described in Appendix A.3.
11 Experiments using model output scores (e.g., generator and QA scores) to weigh the edges in the graph led to suboptimal results.

5.1 DROP Dataset

We focus on three classes of questions that are within the scope of our QA system: difference, comparison, and negation. 12

Difference (How many days before X did Y happen?): We identify such questions based on the presence of terms indicating a difference operation, e.g., "days before". We consider all possible number and date pairs close to the mentions of X and Y as candidate intermediate answers. If the difference between the number pair matches the final answer, we assume them to be the intermediate answers. For hints, we use the key terms from the question for the first two questions, and use the two intermediate answers for the last question (and any unit if mentioned).

Comparison (Which event happened before: X or Y?): We identify such questions based on the presence of two compared entities, e.g., "...: X or Y". While we do not know the exact intermediate answers, they often involve a numeric comparison. So we assume that the intermediate answers should be the numbers or dates close to the mentions of the entities in the question. 13 Since there can be multiple dates/numbers that satisfy this criterion, we generate one set of hints for each potential pair of numbers/dates.

Negation (What percent is not X?): We identify such questions based on the phrasing "not X" and the presence of a number equal to 100 − a in the context (which forms the intermediate entity). For hints, we use the key terms from the question for the first question, and use the intermediate answer as the hint for the last question.

For further details, please refer to Appendix C. Note that such approaches of identifying within-scope questions and using heuristics to identify potential reasoning steps are commonly used to better guide models for such complex tasks. In our case, 14.4K training DROP questions are answerable with our system, which forms 18.7% of the complete dataset. 14 We similarly select 2973 Dev questions (from 9536 questions in the Dev set) that are further split to produce 601 Dev and 2371 Test questions.

5.2 HotpotQA Dataset

We consider both bridge and comparison questions from the HotpotQA dataset in the Distractor setting, where the input context has 10 Wikipedia documents including the two documents needed to answer the question. Since we rely on existing systems for QA, we only use 17% of the training dataset (15661 questions categorized as "hard" by the authors) to train our decomposition model. Since the test set is blind, we split the 7405 Dev questions (the entire Dev set) into 1481 Dev and 5924 Test questions.

There are generally two forms of reasoning needed for HotpotQA:

For bridge questions (e.g., Where was the 44th President born?), the model often needs to find an entity in one of the relevant documents, d_1, that links to the second Wikipedia document, d_2, containing the answer. For such questions, we assume that the title entity of the second paragraph, e_1, is the intermediate answer. 15 In this case, we set z_1 = ⟨p_1 = d_1, a_1 = e_1, v_1 = η(qc)⟩ for the first question. The second question would use this intermediate entity, so we add it to the vocabulary, i.e., z_2 = ⟨p_2 = d_2, a_2 = a, v_2 = η(qc) + e_1⟩. The function η extracts the key terms from the question relevant to the input context. 16 If the final answer is present in both the paragraphs, we assume it to be a conjunction question (e.g., who acted as X and directed Y?). For such questions, each sub-question would have the same answer, i.e., a_1 = a_2 = a.

14 Previous modular systems have targeted smaller subsets to develop models with interpretable reasoning.
15 Similar approaches have been used to produce reasoning chains for multi-hop datasets (Chen et al., 2019; Kundu et al., 2019; Xiao et al., 2019).

For comparison questions (e.g., who is older: X or Y?), we handle them exactly the same way as we did for DROP. For each potential pair of names/dates, we create one potential set of hints where v_i = ⟨d_i, e_i, η(qc)⟩.

Further details are deferred to Appendix C.

6 Results

We evaluate MODULARQA against a cross-dataset baseline architecture and, for completeness, also against modular systems targeting individual datasets. When possible, we train the compared model on our dataset; otherwise, we use the published model trained on its corresponding dataset (indicated throughout with †). We show that our general approach is comparable to these targeted approaches in terms of its quality, with more explainable text-based reasoning. Table 3 summarizes our quantitative results, 17 compared to both a cross-dataset baseline (Section 6.1) and dataset-targeted architectures (Section 6.2). Table 4 illustrates MODULARQA's ability to explain its reasoning via sub-questions (Section 6.3).

Table 3: Main results, using F1 score (see Appendix B for EM scores). TOP: Comparison to dataset-agnostic models. MODULARQA outperforms the BART baseline on both HotpotQA (bridge and comparison questions) and DROP datasets. BOTTOM: Comparison to dataset-specific architectures. † indicates we used the model trained on its target set. Compared to other modular and explainable systems (NMN-D and SNMN), MODULARQA is a single system that demonstrates similar performance while being applicable to multiple datasets. Compared to targeted blackbox systems (NumNet and Quark), MODULARQA lags behind but remains the only system that produces an explanation and generalizes to two datasets.

6.1 Cross-Dataset Architectures

MODULARQA is a general system that works across both the DROP and HotpotQA datasets. Since there does not exist any similar system that works across discrete reasoning (needed by DROP) and multi-document span-prediction (needed by HotpotQA) tasks, we use a language-generation model that can naturally handle both these tasks as our baseline. For a fair comparison, we train a single model across DROP and HotpotQA for both systems.

Specifically, we train a BART-Large model on these two datasets to generate the answer given the passage and the question. For DROP, we directly train the model to produce the answer (which can be a span or a number). For HotpotQA, the input context with multiple documents is too long for such models, so we train the model independently on each document to produce the answer (if present in the document) or "N/A" (if not). This is an extension of an existing BERT-based model for HotpotQA (Min et al., 2019a). During inference, we collect all the non-"N/A" answers across documents and select the most common answer span.

The top section of Table 3 summarizes cross-dataset results. We see that on the DROP dataset, the BART baseline is completely unable to learn the numeric reasoning needed for these question classes. While BART has some success with comparison questions, MODULARQA is able to handle all three classes well, with a close to 100% score on the 2-hop negation questions. The table also shows that BART, while being competitive on HotpotQA, is substantially outperformed by MODULARQA on both bridge and comparison questions. 18
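For illustration, the per-document answer aggregation described above for this baseline can be sketched as a simple majority vote over the non-"N/A" predictions; the actual tie-breaking may differ.

```python
from collections import Counter

def aggregate_answers(per_document_answers):
    """Pick the most common non-"N/A" answer predicted across the input documents."""
    valid = [a for a in per_document_answers if a != "N/A"]
    return Counter(valid).most_common(1)[0][0] if valid else "N/A"
```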

6.2 Dataset-Targeted Approaches

Although our emphasis is on cross-dataset models, for completeness, we also compare MODULARQA with two DROP-specific models: (1) NumNet+V2 (Ran et al., 2019), a state-of-the-art model built on top of RoBERTa for discrete reasoning, and (2) a Neural Module Network model (Gupta et al., 2020) specifically designed for a subset of DROP (referred to as NMN-D). NMN-D used techniques similar to MODULARQA to train its modules and provide auxiliary supervision for its target subset. Since the difference and negation questions are not within NMN-D's target set, we evaluate it only on comparison questions.

18 A DROP model such as NumNet could be applied to HotpotQA by converting it into a single-document reading comprehension task. We leave this for future work, noting that such a system still wouldn't produce explanations and would have to learn to reason from scratch.

As shown in the bottom section of Table 3 , MODULARQA achieves performance comparable to the state-of-the-art model, NumNet+V2, on DROP, while at the same time exhibiting three unique strengths that no other compared model possesses: being able to produce explanations, generalizing across different datasets, and re-using existing QA models.

Since the other modular system, NMN-D, focuses on a different subset of DROP than MOD-ULARQA, we report its score (79.1) only on the 496 test questions it shares with MODULARQA, denoted by * in the table. When evaluated also on this same subset, MODULARQA achieves an F1 score of 92.5 (not shown in the table), revealing a gap of 13 F1 points. In addition, MODULARQA produces textual explanations while being able to re-use existing QA systems.

We also compare MODULARQA with two HotpotQA-specific systems: (1) Quark (Groeneveld et al., 2020) , which additionally uses supporting fact annotations to first select relevant sentences and then trains a BERT-based QA system directly on it; and (2) an NMN model (Jiang and Bansal, 2019) specifically designed for HotpotQA (referred to as SNMN) with each module trained from scratch on the HotpotQA task.

Table 4: Sample reasoning explanations generated by MODULARQA for DROP and HotpotQA questions. The names of the sub-models can be inferred from the sub-questions and are excluded for simplicity. These explanations provide valuable insight into the reasoning used by MODULARQA for each question. As shown here, MODULARQA is able to learn the appropriate decompositions for these questions without any manually designed rules (e.g., "smaller" ⇒ x < y).

Example 1: How many days passed between the Sendling Christmas Day Massacre and the Battle of Aidenbach?
» Q: When was the Battle of Aidenbach? A: 8 January 1706
Q: When was the Sendling Christmas Massacre? A: 25 December 1705
Q: diff(8 January 1706, 25 December 1705, days) A: 14

Example 2: Which ancestral group is smaller: Irish or Italian?
» Q: How many of the group were Irish? A: 12.2
Q: How many Italian were there in the group? A: 6.1
Q: if_then(12.2 < 6.1, Irish, Italian) A: Italian

Example 3: How many percent of the national population does not live in Bangkok?
» Q: What percent of the national population lives in Bangkok? A: 12.6
Q: not(12.6) A: 87.4

As shown again in the bottom section of Table 3, MODULARQA performs comparably to the modular SNMN model even though the latter is trained on the complete HotpotQA dataset. While the Quark system does outperform both modular approaches, it is known that such end-to-end models often exploit artifacts present in the HotpotQA dataset to achieve high scores with just single-hop reasoning (Min et al., 2019a).

While we present the above comparisons with dataset-targeted models for completeness, it is important to note that our focus is on cross-dataset models that are able to produce an interpretable representation of the multi-hop reasoning procedure used to arrive at the answer (instead of only generating the answer or marking supporting facts in the given context).

6.3 Model Interpretability

We analyzed MODULARQA's explanation on 40 Dev questions (20 from each dataset) to verify its interpretability. Of these questions, MOD-ULARQA answered 28 questions correctly (avg. acc. of 70%). Among the cases answered correctly, MODULARQA used a valid reasoning chain to arrive at the answer in as many as 93% of the cases. This highlights the strong ability of our model to produce human understandable explanations of its reasoning.

Additionally, the use of a neural generator (as opposed to heuristics) to produce these questions results in grammatical questions in almost all of these cases (barring two). Among the error cases where MODULARQA produced an incorrect answer, 33% of the questions were out-of-scope for it (e.g., they required reasoning over multiple spans), while 25% of the questions in fact led to a correct reasoning chain but an incorrect answer, often due to the failure of a QA sub-model. Table 4 shows sample question decompositions generated by MODULARQA. We exclude the context and sub-models for simplicity. We can see that the model is able to take oddly phrased questions and create clean questions (example 4), handle yes/no questions (example 6), recognize the unit of comparison (example 1) and map the phrase "smaller" to the appropriate direction of comparison without any manual rules (example 2).

In contrast, the DecompRC (Min et al., 2019b ) system decomposes the fifth example into:

Q1: which writer of the sitcom maid marian and her merry men Q2: how many children's books has [answer] written ?

While these sub-questions may result in the correct answer, the question phrasing is not grammatical due to the splitting-based nature of the decomposition. Moreover, DecompRC sometimes passes certain questions directly to the QA system without any decomposition (e.g., it does not decompose Example 4). Not surprisingly, due to the nature of these questions, naïvely replacing our next-question generator in MODULARQA with the output of DecompRC results in a huge drop in performance.

The BREAK dataset (Wolfson et al., 2020), on the other hand, provides the following high-level logical form for the first example:
Q1: return when was the Sendling Christmas day Massacre
Q2: return when was the Battle of Aidenbach
Q3: return days between #1 and #2
Q4: return number of #3

While this meaning representation is more readable, it can introduce additional steps not needed to solve this task. Again, due to the difference in language, we cannot use the BREAK dataset (or a model trained on it) as-is with MODULARQA either.

Importantly, however, our approach does not preclude exploiting these decompositions as cues, that is, as z for each decomposition step. Combining these question-based decompositions with our generation framework is left for future work.

6.4 Dataset-Specific Parameters

Our approach has the advantage that a single model can be applied to both the DROP and HotpotQA datasets. Here we explore the impact of training models specific to each dataset.

Table 5: A single MODULARQA system can work across both tasks, performing comparably to dataset-specific models trained on each dataset.

As shown in Table 5 , MODULARQA with a single set of parameters is comparable to the specific models trained on each individual dataset. Even in this setting, the BART model is unable to learn the discrete reasoning needed by the DROP dataset, showing that our subset of DROP is not trivially solvable by a generative model.

7 Discussion

There are many natural applications of Text Modular Networks to other complex tasks. We next discuss challenges that might arise when using TMNs in these applications and potential mitigation strategies.

7.1 Non-Differentiable Reasoning

Unlike NMNs, our final system communicates between sub-task functions purely through text. While this enables us to easily combine symbolic functions with neural models, it also makes our system non-differentiable. As a result, our next-question generator and QA sub-models cannot be directly fine-tuned on the end task using standard gradient-based methods, though reinforcement learning (Williams, 1992) remains a promising alternative.

7.2 Better Models For Faster Search

Inference in the MODULARQA system involves sampling multiple possible sub-questions at each step and finding the optimal chain. Generating these questions and answering each question in the chain can result in slow inference (about 10 seconds per question). Further fine-tuning the next-question generator on the end-task may reduce the number of questions that need to be sampled (or one could simply perform greedy search).

7.3 Handling Multiple Spans

While in theory one should be able to implement TMNs to cover all types of questions, there are some question types currently not covered by our sub-models in MODULARQA. For example, there are many DROP questions that need a list of spans to be extracted and then counted ("How many touchdowns were scored by X in 2nd quarter?"). Due to the lack of any dataset tackling this simple problem, models were not designed to handle this simple task until recently (Hu et al., 2019b; Segal et al., 2019) .

Multiple spans are also challenging to extract as the intermediate answers for our sub-task question models. For example, if the answer to the previous question was "3", any set of 3 spans would qualify as valid intermediate answers. We believe these intermediate annotations are still easier to obtain than the decomposition itself. Such intermediate annotations have been collected for DROP recently and have been shown to improve the accuracy of other models as well (Dua et al., 2020) .

7.4 Boolean Questions

Compared to these multi-span questions, the space of intermediate answers is really small for Boolean questions. However, the space of questions even when given the answer is very large. For example, consider qc="Are both X and Y musicians?" and the final answer a=no. We know that the answer to one of the sub-questions in the decomposition should be "no" for the final answer to be "no" (by conjunction). However, if we were to generate any relevant Boolean sub-question with "no" as the answer, there is an impractically large space of such questions (even when constrained to the vocabulary: "X", "Y", "musicians"). As a result, most of the generated sub-questions can be very different from the ideal decomposition, e.g., we may generate the sub-question "Do musicians work for X?" that does not help answer the original question.

Similar issues have been observed in semantic parsing, where additional weak supervision was needed to handle Boolean questions. One promising way forward is to exploit the similarity between the two sub-questions in the ideal decomposition ("Is X a musician?" and "Is Y a musician?") to guide our sub-question generation.

8 Conclusion

We introduced Text Modular Networks, which provide a general-purpose framework that casts complex tasks as textual interaction between existing, simpler QA modules. Based upon this conceptual framework, we build MODULARQA, an instantiation of TMNs that can perform multi-hop and discrete numeric reasoning. Empirically, we demonstrate that MODULARQA is significantly stronger than a dataset-agnostic system, while being on-par with other modular dataset-specific approaches. Importantly, MODULARQA provides easy-to-interpret explanations of its reasoning. It is the first system that decomposes DROP questions into textual sub-questions and can be simultaneously applied to both DROP and HotpotQA. We leave several angles, such as covering more question types and more effective learning, as future work.

A Model Settings

Each BART model is trained with the same set of hyper-parameters -batch size of 64, learning rate of 5e-6, triangular learning rate scheduler with a warmup of 500 steps, and training over 5 epochs. Each RoBERTa model is trained with the same set of hyper-parameters but a smaller batch size of 16. We selected these parameters based on early experiments and did not perform any hyperparameter tuning thereafter. All the baseline models are trained with their default hyper-parameters provided by the authors.

We always used nucleus sampling to sample sequences from the BART models. To sample the sub-question using the SQuAD sub-question generator, we sampled 5 questions for each step with p=0.95 and max question length of 40. To sample the question decompositions during inference, we additionally set k=10 to reduce the noise in these questions.
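For reference, here is a sketch of the corresponding sampling call using the HuggingFace transformers API; loading the pretrained facebook/bart-large checkpoint is a placeholder for the fine-tuned generator, and treating the k above as a top-k cutoff is an assumption.

```python
from transformers import BartForConditionalGeneration, BartTokenizer

# Placeholder checkpoint; in practice the fine-tuned generator would be loaded here.
tokenizer = BartTokenizer.from_pretrained("facebook/bart-large")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-large")

def sample_sequences(input_text, num_samples=5, top_p=0.95, top_k=0, max_len=40):
    """Nucleus sampling of num_samples sequences (top_k=0 disables the top-k cutoff)."""
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True)
    outputs = model.generate(
        inputs["input_ids"],
        do_sample=True,
        top_p=top_p,
        top_k=top_k,
        max_length=max_len,
        num_return_sequences=num_samples,
    )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]
```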

A.1 Training the SQuAD Question Generator

We use the SQuAD 2.0 answerable questions to generate the training data for our SQuAD question generator. We use the nouns, verbs, numbers, proper nouns, adjectives and adverbs (POS tags = [NOUN, VERB, NUM, PROPN, ADJ, RB]) from the question to define the vocabulary hints (after filtering stop words). To simulate the noisy vocabulary, we also add distractor terms with similar POS tags from other questions from the same paragraph. We sample j ∈ [2, ..., 7] distractor terms for each question and add them to the vocabulary hints.
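A minimal sketch of this hint extraction using spaCy; the exact tag set (the paper's RB is approximated here by the coarse ADV tag) and the distractor-sampling details are assumptions.

```python
import random
import spacy

nlp = spacy.load("en_core_web_sm")
CONTENT_POS = {"NOUN", "VERB", "NUM", "PROPN", "ADJ", "ADV"}

def vocab_hints(question, other_questions, max_distractors=7):
    """Content words of the question plus a few distractor terms from other questions."""
    hints = [t.text for t in nlp(question)
             if t.pos_ in CONTENT_POS and not t.is_stop]
    pool = [t.text for q in other_questions for t in nlp(q)
            if t.pos_ in CONTENT_POS and not t.is_stop]
    j = random.randint(2, max_distractors)
    return hints + random.sample(pool, min(j, len(pool)))
```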

A.2 Generating Sub-Questions

For every step in the reasoning process, we generate 5 questions using nucleus sampling. We select the questions that the corresponding sub-model is able to answer correctly. For each such sub-question, we generate 5 questions in the next step (and so on). At the end, we select all the successful question chains (i.e., each sub-question was answered by the sub-model to produce the expected answer at each step).

A.3 Selecting Question Decompositions

It is possible that some of these sub-questions, while valid answerable questions, introduce other words mentioned in the paragraph. However, these may not be valid decompositions of the original question. E.g., for the complex question "When was the 44th US President born?", the sub-question may state "Who was the 44th President from Hawaii?". While this is a valid question with the expected intermediate answer, it introduces irrelevant words that the next-question generator could not learn to produce from the complex question.

To filter out such potentially noisy decompositions, we compute three statistics based on non-stopword overlap. We compute the proportion of new words introduced in a decomposition u = {..., q_i, a_i, ...} that were not in the input question or any of the previous answers, i.e.,

$\theta(u) = \frac{\left|\bigcup_i \{w \in q_i \mid \neg(w \in q_c \text{ or } \exists j < i\ w \in a_j)\}\right|}{\left|\{w \in q_c\}\right|}$

We also compute the proportion of words from the input question not covered by the decomposition as

$\mu(u) = \frac{\left|\{w \in q_c \mid \neg(\exists i\ w \in q_i)\}\right|}{\left|\{w \in q_c\}\right|}$

We also compute the number of answers ν that were not used in any subsequent question, i.e., the sub-question associated with such an answer is irrelevant:

$\nu(u) = \left|\{a_i \mid \nexists\ w \in a_i,\ j > i \text{ s.t. } w \in q_j\}\right|$

We only select the decompositions where θ < 0.3, µ < 0.3, θ + µ < 0.4 and ν = 0. To prevent a single question from dominating the training data, we select up to 50 decompositions for any input question. These hyper-parameters were selected early in the development and not fine-tuned for each dataset.
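A minimal sketch of these filters; the tokenization, stop-word list, and the restriction of ν to intermediate answers are simplifying assumptions.

```python
def content_words(text, stop_words=frozenset()):
    # Hypothetical non-stopword tokenizer; the released code may tokenize differently.
    return {w for w in text.lower().split() if w not in stop_words}

def keep_decomposition(qc, sub_questions, answers, stop_words=frozenset()):
    """Apply the theta/mu/nu filters with the thresholds described above."""
    q_words = content_words(qc, stop_words)
    denom = max(len(q_words), 1)
    # theta: proportion of new words introduced by the sub-questions
    new_words = set()
    for i, sub_q in enumerate(sub_questions):
        prev_answer_words = set().union(*(content_words(a, stop_words) for a in answers[:i]))
        new_words |= content_words(sub_q, stop_words) - q_words - prev_answer_words
    theta = len(new_words) / denom
    # mu: proportion of question words never used by any sub-question
    used = set().union(*(content_words(sub_q, stop_words) for sub_q in sub_questions))
    mu = len(q_words - used) / denom
    # nu: number of intermediate answers never referenced by a later sub-question
    nu = sum(1 for i, a in enumerate(answers[:-1])
             if not any(content_words(a, stop_words) & content_words(sub_q, stop_words)
                        for sub_q in sub_questions[i + 1:]))
    return theta < 0.3 and mu < 0.3 and theta + mu < 0.4 and nu == 0
```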

A.4 Inference Parameters

We sample n_i questions in the i-th question decomposition step. To ensure sufficient exploration of the search space, we initially sample a larger number of questions but scale them down every step for efficiency. Due to the pipeline nature of our system, it is difficult for our model to recover from any missed question early in the search. We set the number of sampled questions as n_i = N · r^i where N = 10 and r = 1/2. For the QA models, we select only the most likely answer.

To score each generated question, we again rely on the same word-overlap statistic used to filter decompositions. We only use the θ metric that captures the number of new words introduced in a question chain. The other two metrics are non-monotonic, i.e., they could go down depending on future questions and answers in the chain. At the end, we use a chain scorer (described next) to score each decomposition chain. While we use the θ metric to guide the search, we primarily rely on the chain score δ to select the right answer. As a result, the final score for a chain u is a weighted combination of these two metrics with higher emphasis on δ:

score(u) = θ(u) + λδ(u)

where λ = 10. δ can only be computed for a complete decomposition and is set to zero for the intermediate steps. Note that the higher this score, the worse the chain, i.e., we need to find the chain with the lowest score.

A.5 Chain Scorer

To train the scorer, we first collect positive and negative chains by running inference with just the θ metric. For every complete chain, we compute the F1 score of the final answer with the gold answer. If the F1 score exceeds a threshold (0.2 in our case), we assume this chain to be a positive example. We collect such positive and negative chain examples from the training set and then train a RoBERTa model to classify these chains. We use the RoBERTa model's predicted probability for the negative class as the score δ.

B Expanded Results

Table 6 expands upon the quantitative results in Table 3 and reports both F1 and EM (exact match) scores in each setting considered.

Table 6: Expanded version of the quantitative part of Table 3, reporting both F1 and EM scores in each case. The first three columns, as before, denote qualitative capabilities of each model: whether it can Explain its reasoning, Generalize well to multiple datasets, or Re-use existing QA models.

C Hints for Complex QA Tasks

To apply Text Modular Networks to any complex QA dataset, we need to be able to extract the hints needed by the sub-task question model. As mentioned earlier, these need not have full coverage or have 100% precision.

C.1 HotpotQA

The questions qc in HotpotQA have two supporting documents: d_1 and d_2. Additionally, they are partitioned into two classes: bridge and comparison questions.

C.1.1 Bridge Questions

There are two forms of bridge questions in HotpotQA:

Composition questions: These questions need to first find an intermediate entity e_1 that is referred to by a sub-question in HotpotQA. This intermediate entity points to the final answer through the second sub-question. Generally, this intermediate entity is the title entity of the document containing the answer. Say d_2 is the document containing the answer and d_1 is the other document. If we are able to find a span that matches the title of d_2 in d_1 and the answer only appears in d_2, we assume it to be a composition question. We set e_1 to the span that matches the title of d_2 in d_1. For the question vocabulary, we could use the terms from the entire question for both steps. Also, the second sub-question will use the answer of the first sub-question, so we add it to the vocabulary too. However, we can reduce some noise by removing the terms that exclusively appear in the other document. The final hints for this question are:

p_1 = d_1; a_1 = e_1; v_1 = ζ(qc, d_1, d_2)
p_2 = d_2; a_2 = a; v_2 = ζ(qc, d_2, d_1) + e_1

where ζ(q, d_1, d_2) indicates the terms in q, excluding those that appear in d_2 but not in d_1.
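A minimal sketch of this hint construction for composition questions, using simple case-insensitive substring matching for both the title span and the ζ filter; the released implementation may match spans differently.

```python
def zeta(question_terms, own_doc, other_doc):
    """Keep question terms unless they appear in the other document but not in this one."""
    return [t for t in question_terms
            if not (t.lower() in other_doc.lower() and t.lower() not in own_doc.lower())]

def composition_hints(qc_terms, d1, d2, title2, answer):
    """Hints for a composition question: d1 links to d2 via d2's title entity e1."""
    # Heuristic check: d1 mentions d2's title, and the answer appears only in d2.
    if title2.lower() not in d1.lower():
        return None
    if answer.lower() not in d2.lower() or answer.lower() in d1.lower():
        return None
    e1 = title2
    step1 = {"p": d1, "a": e1,     "v": zeta(qc_terms, d1, d2)}
    step2 = {"p": d2, "a": answer, "v": zeta(qc_terms, d2, d1) + [e1]}
    return [step1, step2]
```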

Conjunction questions: This class of questions does not have any intermediate entity but has two sub-questions with the same answer, e.g., "Who is a politician and an actor?". If the answer appears in both supporting paragraphs, we assume that it is a conjunction question. The hints for such questions are simple:

p_1 = d_1; a_1 = a; v_1 = ζ(qc, d_1, d_2)
p_2 = d_2; a_2 = a; v_2 = ζ(qc, d_2, d_1)

C.2 Comparison Questions

These questions compare a certain attribute between two entities/events mentioned in the question, e.g., "Who is younger: X or Y?". We identify the two entities e_1 and e_2 in such questions and find dates/numbers that are mentioned in the documents. For every number/date pair n_1, n_2 mentioned in documents d_1 and d_2 respectively, we create the following hints:

p_1 = d_1; a_1 = n_1; v_1 = ζ(qc, d_1, d_2)
p_2 = d_2; a_2 = n_2; v_2 = ζ(qc, d_2, d_1)
p_3 = φ; a_3 = a; v_3 = [if_then, n_1, n_2, e_1, e_2]

The final set of hints would be used by the calculator generator to create the questions: if_then(n_1 > n_2, e_1, e_2) and if_then(n_1 < n_2, e_1, e_2).

C.3 DROP

For the questions in DROP, we first identify the class of question that it may belong to and then generate the appropriate hints. Note that one question can belong to multiple classes, and we would generate multiple sets of hints in such cases. The questions qc in DROP have only one associated context p.

C.3.1 Difference Questions

We identify these questions based on the presence of terms indicating a difference, such as "before", "after", "win by", etc. We also check for two dates or numbers in the context such that their difference (in some unit) can lead to the final answer. If these conditions are satisfied, for every pair n_1, n_2 where the difference (in units u) can lead to the final answer, we generate the hints:
p_1 = p; a_1 = n_1; v_1 = Φ(qc)
p_2 = p; a_2 = n_2; v_2 = Φ(qc)
p_3 = φ; a_3 = a; v_3 = [diff, n_1, n_2, u]
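A minimal sketch of this hint extraction for difference questions, assuming the numbers have already been extracted from the context by some preprocessing step; the cue list and unit handling are illustrative.

```python
from itertools import combinations

DIFFERENCE_CUES = ("days before", "days after", "how many years", "win by")  # illustrative list

def difference_hints(qc, context, answer, numbers, phi_terms, unit=None):
    """Enumerate hint sets for a DROP difference question.

    numbers: numeric values extracted from the context (assumed preprocessing step)
    phi_terms: Phi(qc), the non-stopword terms of the question
    """
    if not any(cue in qc.lower() for cue in DIFFERENCE_CUES):
        return []
    hint_sets = []
    for n1, n2 in combinations(numbers, 2):
        if abs(n1 - n2) == answer:
            v3 = ["diff", n1, n2] + ([unit] if unit else [])
            hint_sets.append([
                {"p": context, "a": n1, "v": phi_terms},
                {"p": context, "a": n2, "v": phi_terms},
                {"p": None,    "a": answer, "v": v3},
            ])
    return hint_sets
```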

C.3.2 Comparison Questions

We identify these questions based on the presence of the pattern "ques: e_1 or e_2". We handle them in exactly the same way as HotpotQA. Since DROP contexts can have more dates and numbers, we select numbers and dates that are close to the entity mentions.
p_1 = p; a_1 = n_1; v_1 = Φ(ques) + e_1
p_2 = p; a_2 = n_2; v_2 = Φ(ques) + e_2
p_3 = φ; a_3 = a; v_3 = [if_then, n_1, n_2, e_1, e_2]

C.3.3 Negation Questions

We identify these questions purely based on the presence of ".* not .*". For such questions, we only need to find one number n_1 such that a = 100 − n_1. The hints are straightforward:
p_1 = p; a_1 = n_1; v_1 = Φ(qc)
p_2 = φ; a_2 = a; v_2 = [not, n_1]

1 TMNs, in fact, treat sub-models as blackboxes, and can thus use any model or function as a module.
2 We target 20% of DROP questions that need only numeric reasoning.

3 http://github.com/allenai/modularqa

4 Other systems on DROP and HotpotQA have used similar ideas for auxiliary supervision (Jiang and Bansal, 2019; Andor et al., 2019).

5 We select non-stopwords that are nouns, verbs, adjectives and adverbs.
6 More details in Appendix A.

9 k is set to 1 in our experiments.

12 The rest of the questions require a QA model that can return multiple answers. See discussion in Sec. 7.3.
13 A similar idea has been used to train NMNs for DROP.

16 We remove terms that do not appear in the corresponding context (e.g., d2) but do appear in the other context (e.g., d1).
17 We report F1 scores here. For completeness, exact match (EM) scores may be found in Table 6 in Appendix B.