Measuring and Reducing Non-Multifact Reasoning in Multi-hop Question Answering

Authors

Abstract

The measurement of true progress in multi-hop question answering has been muddled by the strong ability of models to exploit artifacts and other reasoning shortcuts. Models can produce the correct answer, and even independently identify the supporting facts, without necessarily connecting the information between the facts. This defeats the purpose of building multihop QA datasets. We make three contributions towards addressing this issue. First, we formalize this form of disconnected reasoning and propose contrastive support sufficiency as a better test of multifact reasoning. To this end, we introduce an automated sufficiency-based dataset transformation that considers all possible partitions of supporting facts, capturing disconnected reasoning. Second, we develop a probe to measure how much a model can cheat (via non-multifact reasoning) on existing tests and our sufficiency test. Third, we conduct experiments using a transformer-based model (XLNet), demonstrating that the sufficiency transform not only reduces the amount of non-multifact reasoning in this model by 6.5% but is also harder to cheat -- a non-multifact model sees a 20.8% (absolute) reduction in score compared to previous metrics.

1 Introduction

Multi-hop question answering broadly refers to the task of answering questions by taking multiple inference steps. We focus on the reading comprehension setting, which involves aggregating and synthesizing information from multiple facts present in the input context, which we refer to as multifact reasoning. Recent studies have shown that such multi-hop reading comprehension datasets have unintended biases and artifacts (e.g., shortcuts based on answer type) that allow models to find correct answers by simply exploiting reasoning shortcuts, without using interaction between any two supposedly necessary facts (Min et al., 2019; Chen and Durrett, 2019). Consequently, models can achieve high scores on multi-hop dataset leaderboards using mostly non-multifact reasoning strategies, which muddles the progress of multifact reasoning research.

Figure 1: Non-multifact reasoning. The question "Which country got independence when the cold war started?" is shown with its context facts, including "The war started in 1950.", "The cold war started in 1947.", and "France finally got its independence." Models can find the answer and the necessary supporting facts without any meaningful synthesis of information from the supporting facts, i.e., without any interaction between facts. E.g., a non-multifact model could identify the blue supporting fact ("The cold war started in 1947.") since it is the only fact mentioning the cold war. Independently, it could find the correct answer by selecting the only country getting independence with an associated time (India) and hence find the red supporting fact.

To address this shortcoming, we introduce a general-purpose characterization of non-multifact reasoning and a model-agnostic way to measure it.

What is non-multifact reasoning? Informally, it refers to arriving at an answer without a meaningful synthesis of information from all supporting facts, by exploiting biases or other artifacts in the dataset. Consider the simple multifact example in Figure 1, which requires a synthesis of information from two supporting facts mentioned in the input text (marked with a star). However, since only two countries are mentioned as getting independence, and of those only one with an associated notion of time, a model can take a shortcut to correctly answer this question, and even identify the fact containing the answer (the red fact), without having to connect with the information in the other supporting fact (the blue fact) via the year 1947. Similarly, the model can also identify the second supporting fact in isolation, since it is the only fact in the context mentioning the cold war.

This type of disconnected reasoning is undesirable as it allows a model to both identify supporting facts and correctly answer a multifact reasoning question without any meaningful synthesis of information in the supporting facts. More generally, for a question that requires k supporting facts, a model can get away with only connecting information within some subsets of the k facts, but not connecting information across these subsets.

How can we test multifact reasoning? Prior work introduced support identification-locating the facts that support the answer-as an additional test to deter such disconnected reasoning. However, as we described earlier, it is possible for models to independently locate supporting facts by using lexical overlap and cues such as answer type and question context. On the other hand, if it is not guaranteed that the input contains all of the necessary facts, then the model cannot simply rely on finding the most related facts independently-it becomes important to also check whether the chosen facts actually connect well with each other.

Based on this observation, we design a new contrastive support sufficiency test, which checks the behavior of a model under two contrasting types of input for each question: sufficient contexts, where the input has all the supporting facts, and insufficient contexts, where the input is missing some of the supporting facts.[1] The test asks the model to assert whether the given input context has all of the supporting facts. Additionally, since we do not know how a model is internally performing this disconnected reasoning, our test considers all partitions of the supporting facts. A model is deemed to pass the sufficiency test if it is able to correctly determine sufficiency for all contrastive instance groups. This forms our transformed dataset, which can notably be generated automatically from any support-annotated multifact dataset.

[1] Ignoring the contrastive aspect of our mechanism, this sufficiency check is similar in spirit to the notion of answerable questions in SQuAD 2.0, where models must decide if the text has the information needed to answer a question (Rajpurkar et al., 2018).

How can we measure non-multifact reasoning?

We can detect disconnected reasoning in a model by checking its behavior on partial contexts (containing only a subset of supporting facts). In particular, for all tests discussed so far including sufficiency, we can measure how much of a model's score on the test could have been achieved via disconnected reasoning. We formalize this notion into a condition for identifying non-multifact reasoning, and derive corresponding probes that can measure how much a model can cheat (via non-multifact reasoning) on the answer prediction, support identification, and sufficiency prediction tests.

We apply the test and probes to the HotpotQA dataset (Yang et al., 2018) and evaluate a strong QA model based on the XLNet transformer architecture (Yang et al., 2019) . The results demonstrate several benefits of our proposed Contrastive Support Sufficiency test: (i) it is a better estimate of progress on multi-hop reasoning, with model scores dropping by 8.6% (absolute) under this test; (ii) it reduces the amount of non-multifact reasoning in the XLNet QA model by 6.5% (measured by our probe); and (iii) it is less amenable to disconnected reasoning, with the score of a non-multifact model dropping by 20.8% under this test.

2 Related Work

Multi-hop Reasoning: Many multifact reasoning approaches have been proposed for HotpotQA and other multi-hop reasoning datasets (Mihaylov et al., 2018; Khot et al., 2020). These models use iterative fact selection (Nishida et al., 2019; Tu et al., 2020) or graph neural networks (Xiao et al., 2019; Fang et al., 2019; Tu et al., 2020) in an attempt to capture the interactions between the paragraphs. While these approaches have pushed the state of the art, it is unclear whether the underlying models are making any progress on the problem of multifact reasoning.

Identifying Dataset Artifacts: Several previous works have identified dataset artifacts in NLP datasets, which can be misused by models to cheat on the datasets. Gururangan et al. (2018) showed that natural language inference datasets have a bias that allows models to predict the entailment class from the hypothesis alone. Feng et al. (2018) showed that reading comprehension models often predict the same answer even after the question is reduced to only a couple of tokens. Recently, Sugawara et al. (2019) introduced ablations such as removing the word order information from reading comprehension datasets, and found that models still perform well after removing crucial information that would be required by humans to answer the question. Our NMF probing method can be thought of as ablating the information specifically related to the skill of multi-fact reasoning, showing how solvable the dataset is for a given model without multi-fact reasoning.

Specifically for multi-hop reasoning, recent works have shown that multi-hop QA datasets don't necessarily require multi-hop reasoning (Min et al., 2019; Chen and Durrett, 2019) or multi-hop support selection (Groeneveld et al., 2020). They show that models processing only one paragraph or sentence at a time can perform reasonably well on the HotpotQA dataset.

We, on the other hand, provide a model-agnostic, automatically generated probing dataset that can be applied to any model to estimate the performance attributable to non-multi-fact reasoning. Unlike prior work, we systematically analyze the cheatability of both answer prediction and support identification.

Mitigation of Dataset Artifacts: For reading comprehension tasks, several methods have been proposed to mitigate dataset shortcuts. Jia and Liang (2017) showed that adding adversarially created distracting text to the context prevents models from taking some of the superficial shortcuts. A similar method was also applied to HotpotQA to add automatically constructed distracting paragraphs (Jiang and Bansal, 2019) that would lead a non-multifact model to the wrong answer. We note that these text perturbation methods are complementary to our sufficiency-based partitions. Rajpurkar et al. (2018) proposed a mix of answerable and unanswerable questions to make models avoid superficial reasoning. Using unanswerable questions to make reading comprehension more comprehensive has been suggested as well. Unlike previous approaches with no support, we specifically focus on (automatically) creating unanswerable multi-hop questions by providing partial support. Apart from training more robust models, we use these partial support examples to also measure the amount of non-multifact reasoning in current models.

A few recent works (Gardner et al., 2020) have proposed evaluating NLP systems by generating closely similar (contrastive) examples of instances. Our grouped sets of instances can be thought of as automatically generated contrastive examples designed specifically to avoid non-multi-hop reasoning artifacts in datasets.

3 Tests For Multifact Reasoning

For the rest of this work, we assume a multifact reading comprehension setting where the input is a question Q along with a context C, which consists of a set of facts, containing all supporting facts F s ⊆ C with |F s| ≥ 2.

Good multifact reasoning meaningfully synthesizes information from all supporting facts to arrive at the answer. While we have an intuitive understanding of what a meaningful synthesis might mean, it depends on the semantics of the facts being combined and is difficult to define in a precise way. This makes it challenging to devise a measurable test of multifact reasoning. Nevertheless, previous work has identified two natural prediction tasks as approximate tests:

Answer Prediction Test: Since arriving at the correct answer supposedly requires synthesizing information from multiple facts, answer prediction is itself a test of multifact reasoning: Given (Q, C) as input, output A.

Support Identification Test: As noted earlier, models can often cheat their way out of the above test by using shortcuts to arrive at the correct answer, even without consulting all supporting facts. To catch such undesirable behaviour, having models identify all supporting facts has been proposed as a stronger test: Given (Q, C) as input, output F s.

However, dataset biases and artifacts allow models to pass even the second test without meaningfully combining information from all supporting facts. As we explain in this section, a key issue is the guarantee of sufficient information present in the provided context, i.e., F s is guaranteed to be always present in C. Models can often use this guarantee to reduce the task of support identification to that of identifying subsets of F s independently, without any meaningful interaction across these subsets, which defies the purpose of the test. We illustrate this issue using an example.

Consider the question in Figure 1. Here, a model can cheat on the Answer Prediction test by searching for a fact in C that mentions a country getting its independence. Doing this correctly is not completely straightforward as there are two such facts. However, the model can use the overall context of Q to prefer the one that also has a notion of time. This will correctly point to the red supporting fact, and hence India as the answer, without consulting any other fact. Second, the model can also cheat on the Support Identification test as follows. It can break Q up into simpler constituents, namely, Q1 about a country getting independence and Q2 about the year a war started. It can use Q2 to identify the blue fact as the preferred one that most closely matches the context of Q. This can be done without any reference to the other (red) supporting fact. This, per se, is not a problem: it is reasonable (in fact common) for a model to be able to locate supporting facts relevant to some constituents of a larger question without taking other supporting facts into account. What's not fine, however, is to be able to do this for all supporting facts. In this example, the model can also use Q1 and the overall context of Q as before to identify the red fact, without reference to the other (blue) supporting fact. This results in disconnected reasoning, where both of the supporting facts are identified without reference to the other.

We observe that the model was able to take the above shortcut in part because sufficient support is guaranteed to be present in C. This allowed the model to search for the most relevant fact (a ranking problem) for each constituent Q 1 and Q 2 independently, without taking into account the desired interaction that year 1947 is what makes the two facts connect together.

To understand this better, consider what would happen if sufficient support was not guaranteed. Suppose the blue fact is removed from C to obtain a partial context C ′. This is illustrated in the last row of Figure 2, in which the supporting blue fact is crossed out. We expect a good model to behave differently under contexts C and C ′. However, since the above cheating model did not rely on the blue fact to produce India as the answer (and also identify the red fact) with context C, it will continue to do the same for context C ′ as well. Further, for the second supporting fact, it would choose the next best matching fact, namely the light blue one, which also indicates the start of a war and thus appears to be a reasonable choice. This is illustrated on the bottom-right of Figure 2. Without considering interaction between the two identified facts (light blue and red), this model would not realize that the light blue fact does not fit well with the red fact because of the mismatch in the year (1950 vs. 1947), and the two together are thus insufficient to answer Q.

Figure 2: Transformation of a question for contrastive support sufficiency evaluation. Top-Left: An original instance, with annotation on the right denoting the red and blue supporting facts. Bottom-Left: Its transformation into a group of 3 instances, one with sufficient context and two with insufficient context, with annotation on the right denoting context sufficiency. Right: Behavior of good vs. bad models on the original vs. transformed dataset. A good multifact model would realize that the potentially relevant facts are not sufficient (do not connect), whereas a bad model would find potentially relevant facts and assume that there is sufficient information.

A disconnected reasoning model would thus continue to behave similarly under contexts C and C ′ , still producing the same answer and what it believes are two supporting facts, even if the context is insufficient. On the other hand, a proper multifact model would be able to recognize the lack of sufficient information and thus behave differently under C and C ′ . This observation motivates our new 'contrastive' test, described next.

3.1 Contrastive Support Sufficiency Test

The key idea behind our proposed test is the following: if a model cheats under the full context by identifying proper subsets F s1 and F s2 of F s independently, without any meaningful interaction across these subsets, then it should not be able to confidently tell whether F s1 itself provides sufficient information to answer Q. We can test for this by evaluating the model with only F s1 as the context and asking it to predict whether the provided context is sufficient to arrive at the answer. More generally, as long as facts in F s are not duplicated in C \ F s, we can use any context C ′ ⊂ C that contains F s1 but not F s2 as a support sufficiency test case. This is illustrated in the bottom two rows of Figure 2, where the 'red' fact and the 'blue' fact have been removed, respectively. Accordingly, our contrastive support sufficiency test[2] checks the behavior of a model under two contrasting conditions, sufficient context and insufficient context. When given (Q, C ′) as input for some context C ′ ⊆ C, the model must output 1 if F s ⊆ C ′, and output 0 otherwise (i.e., when F s \ C ′ is non-empty). We can create many insufficient context test cases, by excluding different subsets of F s. A model is deemed to pass the sufficiency test for Q if and only if it makes the correct sufficiency prediction for every sufficient and insufficient context test case thus generated for Q. Intuitively, passing the contrastive support sufficiency test for Q suggests that there is no bi-partition {F s1, F s2} of F s for which the model considers the two subsets in isolation from each other when operating on (Q, C). That is, when reasoning with any partition, it relies on at least one fact outside that partition. It must thus be combining information from all facts together.[3,4]

3.2 Contrastive Support Sufficiency Transform

Leveraging ideas behind the above test, we introduce an automated dataset transformation method, named Contrastive Support Sufficiency Transform, denoted T. This transformation can be applied to any multifact reading comprehension dataset D = {(Q i, C i; A i)}, i = 1, …, N, that has all necessary and sufficient supporting facts F i s annotated in the context C i of each Q i. For brevity, let q i denote the i-th instance, (Q i, C i; A i).

[2] For brevity, we sometimes abbreviate it as simply the sufficiency test in the rest of the paper.

[3] We say the test suggests this conclusion, rather than guarantees it, because the behavior of the model with partial context C ′ ⊂ C may not necessarily be qualitatively identical to its behavior with full context C.

[4] The test focuses on detecting cases where a model does not combine information from all facts in F s. Whether the manner in which the information is combined is semantically meaningful or interesting is beyond the scope of this test.

The result of the transformation is a dataset T(D) where each original q i is transformed into a group of sufficient and insufficient context instances T(q), as described shortly and illustrated in Figure 2. T(D) thus has N groups of instances. Further, for each performance metric m(⋅) of interest in the original dataset (e.g., answer accuracy, support identification accuracy, etc.), there is now a corresponding grouped metric m ∧ (⋅) that operates in a conjunctive fashion at the instance group level. For example, if m(q) captures answer accuracy for q and T(q) denotes the group of questions from transforming q, then m ∧ (q) = ∏ q ′ ∈ T(q) m(q ′) captures whether the entire group was answered correctly. Our probes in Section 5 will reveal a key advantage of this transformation: T(D) is less susceptible to shortcuts than D. This in turn incentivizes models to learn to perform proper multifact reasoning, resulting in a higher fraction of correctly answered questions being attributable to multifact reasoning (Section 6).
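To make the grouped metric concrete, here is a minimal Python sketch of lifting a per-instance metric m to the conjunctive group-level metric m ∧ described above; the function and field names are ours, not from the paper's code.

```python
from typing import Callable, Dict, List

def grouped_metric(group: List[Dict],
                   instance_metric: Callable[[Dict], float]) -> float:
    """Conjunctive metric m^(q): 1.0 only if every instance in the
    transformed group T(q) is handled correctly (0/1 per-instance scores)."""
    score = 1.0
    for instance in group:
        score *= instance_metric(instance)  # product acts as a logical AND for 0/1 scores
    return score

def dataset_score(groups: List[List[Dict]],
                  instance_metric: Callable[[Dict], float]) -> float:
    """Average of the grouped metric over all N groups in T(D)."""
    return sum(grouped_metric(g, instance_metric) for g in groups) / len(groups)
```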

To ensure the transformation does not introduce artifacts such as context length bias, we use the following procedure to create instance groups. Similar to the contrastive support sufficiency test, the transformation of q = (Q, C; A) ∈ D involves two types of instances, those with sufficient context and those with insufficient context. Let F s ⊂ C with |F s| ≥ 2 denote the set of sufficient supporting facts for answering Q, as before, and F d = C \ F s be the set of other (non-supporting or distracting) facts present in the context. Let F ′ d ⊂ F d be a fixed subset of |F d| − |F s| + 1 other facts in the context.[5] The significance of this set size, which is simply |F d| − 1 when |F s| = 2, will become apparent shortly. The transformed instance group T(q) for q includes a single sufficient context instance for (Q, C), created using C ′ = F s ∪ F ′ d as the input context and L suff = 1 as the output label. Since the context is sufficient to answer the question, we also expect the model to output the correct answer and supporting facts. Accordingly, we include these as output labels, to use for the Answer Prediction and Support Identification tests if desired. This results in the instance:

(Q, F s ∪ F ′ d ; L ans = A, L supp = F s, L suff = 1)    (1)

For |F s| = 2, this is illustrated as the 1st instance in the group at the bottom of Figure 2. Next, let F s1 ⊂ F s, F s1 ≠ φ, be any non-empty but proper (and hence insufficient) subset of all supporting facts; F s1 contains a single fact when |F s| = 2. Let F r1 ⊆ F d \ F ′ d be a fixed set of replacement facts chosen such that |F s1| + |F r1| = |F s|.[6] The transformed instance group T(q) includes the following insufficient context instances, created using F s1 ∪ F r1 ∪ F ′ d as the input context and L suff = 0 as the output label:[7]

(Q, F s1 ∪ F r1 ∪ F ′ d ; L suff = 0)  for all such F s1    (2)

Since there isn't sufficient information in the context, we do not care what the model outputs for the answer and supporting sentences. This yields 2^|F s| − 2 insufficient context instances, which is simply 2 when |F s| = 2, as illustrated by the 2nd and 3rd instances in the group at the bottom of Figure 2. Importantly, the choices of the set sizes above ensure that each insufficient context F s1 ∪ F r1 ∪ F ′ d has precisely the same number of facts as each sufficient context F s ∪ F ′ d, thereby avoiding any unintended context length bias.
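The construction just described can be written down directly. Below is a minimal sketch, assuming facts are plain strings, supporting facts are not duplicated among the distractors, and there are enough distractor facts to sample from; all helper and field names are ours.

```python
import random
from itertools import chain, combinations

def proper_nonempty_subsets(facts):
    """All subsets F_s1 with 0 < |F_s1| < |F_s|."""
    facts = list(facts)
    return chain.from_iterable(combinations(facts, k) for k in range(1, len(facts)))

def transform_instance(question, context, answer, supporting, seed=0):
    """Contrastive support sufficiency transform T(q) for a single instance q."""
    rng = random.Random(seed)
    f_s = list(supporting)
    f_d = [f for f in context if f not in f_s]            # distractor facts F_d
    f_d_fixed = rng.sample(f_d, len(f_d) - len(f_s) + 1)  # F'_d, fixed across the group
    remaining = [f for f in f_d if f not in f_d_fixed]    # pool for replacement facts

    # One sufficient-context instance (Eq. 1).
    group = [{"question": question, "context": f_s + f_d_fixed,
              "answer": answer, "support": f_s, "sufficient": 1}]

    # 2^|F_s| - 2 insufficient-context instances (Eq. 2), one per non-empty proper F_s1.
    for f_s1 in proper_nonempty_subsets(f_s):
        f_s1 = list(f_s1)
        # Replacement facts F_r1 keep the context length equal to the sufficient case.
        f_r1 = rng.sample(remaining, len(f_s) - len(f_s1))
        group.append({"question": question, "context": f_s1 + f_r1 + f_d_fixed,
                      "sufficient": 0})
    return group
```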

This turns each instance q ∈ D into a group T(q) of 2^|F s| − 1 transformed instances in T(D). Grouped metrics m ∧ (⋅) operate in a conjunctive fashion over these groups, capturing whether a model behaves as desired on the entire group.

[6] We choose F r1 uniformly at random from F d \ F ′ d. One could alternatively choose facts from F d \ F ′ d that are most similar to F s \ F s1, which is what F r1 is intended to replace; this generally results in more difficult instances. Ideally, we would like F s to be effectively indistinguishable from F s1 ∪ F r1.

[7] An alternative to using L suff = 0 to indicate support insufficiency is to use special symbols such as NA and NS individually as the answer and support labels L ans and L supp, respectively. By using a separate label, we allow the model to separate identifying potential answers/supports from sufficiency identification, as well as give us the ability to combine the sufficiency test with any new test.

4 Characterizing Non-Multifact Reasoning

To one extent or another, non-multifact reasoning models can get around any multifact reasoning test, including contrastive support sufficiency, as long as datasets contain artifacts and biases. We first develop a general characterization of when a model should be deemed to be performing non-multifact reasoning from the perspective of a given test. We then instantiate this characterization for the three tests discussed in Section 3. The resulting characterizations form the basis of our proposed non-multifact reasoning probes (Section 5) and experiments (Section 6).

Our characterization focuses on the multi-fact aspect of reasoning, that is, on detecting cases where a model does not synthesize information from all supporting facts. As noted earlier (cf. Footnote 4), this leaves open the possibility of combining information, but not in a semantically meaningful or interesting way (e.g., via simple word overlap). Characterizing such other undesirable aspects of multifact reasoning, while important, is left for future work.

As before, let q = (Q, C; A) be an instance in the dataset D, with supporting facts F s ⊆ C. Let f denote a multifact reasoning test and f (q) the output a model should produce when tested on input q. For example, when f is Support Identification, f (q) equals F s.

We say a model M performs non-multifact (NMF) reasoning on q from the perspective of a test f , if the following condition holds:

NMF condition: There exists a proper bi-partition[8] {F s1, F s2} of F s such that M correctly predicts the output f (q) on input q without a non-trivial synthesis of information across F s1 and F s2.

In practice, we use an equivalent condition:

NMF condition, reformulated: There exists a proper bi-partition {F s1, F s2} of F s such that the two outputs of M with input q modified to have C \ F s2 and C \ F s1 as contexts, respectively, can be trivially combined to produce the output f (q).

Figure 3 shows such bi-partitions and two examples of disconnected reasoning for a 3-hop reasoning question. It highlights the need for considering every bi-partition in the NMF condition. For instance, if we only consider partitions that separate the purple and yellow facts, then the model performing the lower example of disconnected reasoning would not be able to output the correct labels in any partition and would thus appear to not satisfy the NMF condition. We would therefore not be able to detect that it is doing non-multifact reasoning.

Figure 3: Generalization of disconnected reasoning to a 3-fact reasoning question ("What's the capital city of the country that got independence in the year the cold war started?"). As shown in the bottom half, a model could perform multifact reasoning on two disjoint partitions to answer this question. We consider such a model to be performing non-multifact reasoning as it does not use the entire chain of reasoning and relies on artifacts (specifically, it uses 1-fact and 2-fact reasoning, but not 3-fact reasoning). For each of the two examples, there exists a fact bi-partition (shown on the right) that we can use to detect such reasoning, as the model would continue to produce all the expected labels even under this partition.

The above NMF condition leaves open what constitutes a trivial combination, because it is unclear how to define it concretely in a test agnostic fashion. Instead, we next instantiate what constitutes a trivial combination for each of the three tests discussed earlier.

4.1 Non-Multifact Answer Prediction

For this test, we assume that the model, M , assigns a numerical score s(a) to its produced answer a. Suppose M outputs answer a 1 with score s(a 1 ) when the input context is C \ F s2 , and outputs a 2 with score s(a 2 ) when the input context is C \ F s1 . A trivial combination here is the max operator over the scores. Specifically, we say the NMF condition is met for the Answer Prediction test if arg max a∈{a 1 ,a 2 } s(a) is the correct answer A.
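For a single bi-partition, this check is a few lines of code. The sketch below assumes a hypothetical `model` callable that, given a question and a list of facts, returns its best answer together with a confidence score; this interface is our assumption for illustration.

```python
def nmf_answer_condition(model, question, context, f_s1, f_s2, gold_answer):
    """NMF condition for Answer Prediction on one proper bi-partition {F_s1, F_s2}.

    The trivial combination is a max over the two answer scores."""
    ctx_without_s2 = [f for f in context if f not in f_s2]  # C \ F_s2
    ctx_without_s1 = [f for f in context if f not in f_s1]  # C \ F_s1

    a1, s1 = model(question, ctx_without_s2)
    a2, s2 = model(question, ctx_without_s1)

    best_answer = a1 if s1 >= s2 else a2
    # True => the correct answer is reachable without connecting F_s1 and F_s2.
    return best_answer == gold_answer
```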

4.2 Non-Multifact Support Identification

The trivial combination here corresponds to the set union operator over the supporting facts identified by the model. Specifically, we say the NMF condition is met for the Support Identification test if the model outputs F s1 when the input context is C \ F s2 , and outputs F s2 when the input context is C \ F s1 , so that the union of the produced supporting facts is precisely F s .
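Analogously, a sketch of the support-identification check, again assuming a hypothetical `model` that returns the set of facts it marks as supporting:

```python
def nmf_support_condition(model, question, context, f_s1, f_s2):
    """NMF condition for Support Identification on one bi-partition:
    the model recovers F_s1 from C \\ F_s2 and F_s2 from C \\ F_s1, so that
    the trivial union of its outputs is exactly F_s."""
    pred_1 = set(model(question, [f for f in context if f not in f_s2]))
    pred_2 = set(model(question, [f for f in context if f not in f_s1]))
    return pred_1 == set(f_s1) and pred_2 == set(f_s2)
```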

4.3 Non-Multifact Support Sufficiency

Suppose that given any context C ′ ⊆ C of facts, M can correctly determine whether C ′ contains F s1, irrespective of whether or not F s2 ⊆ C ′; and vice versa for F s2. This implies it can correctly determine whether the context contains F s1 and F s2 in an independent fashion, without considering any interaction between these two subsets. This is undesirable, as M can then use 'and' as the trivial combination operator over the two decisions to correctly determine whether F s ⊆ C ′.

In particular, we say that the NMF condition is met for the Contrastive Support Sufficiency test if the above behavior holds for at least the following three choices of C ′: C \ F s1, C \ F s2, and C \ (F s1 ∪ F s2). Intuitively, if M can correctly determine whether or not C ′ contains each of the two subsets of supporting facts for these three choices of C ′, then it should have sufficient information to do so correctly for any C ′ ⊆ C.

Figure 4: Proposed dataset transformation and probes for the case of |F s| = 2 supporting facts.

Original Dataset D ⇒ Question q = (Q, C; A) in D is assumed to be annotated with supporting facts {f 1, f 2}.

Probing Dataset P ans+supp (D) for the Answer Prediction and Support Identification tests ⇒ Probing question collection P ans+supp (q) has only one group, corresponding to the unique bi-partition {{f 1}, {f 2}}, containing:
1. (Q, {f 1} ∪ F ′ d ; L ans = A, L supp = {f 1})
2. (Q, {f 2} ∪ F ′ d ; L ans = A, L supp = {f 2})

Transformed Dataset T(D) for evaluating Contrastive Support Sufficiency ⇒ Transformed question group T(q) in T(D) is defined using a single replacement fact f r ∈ C \ {f 1, f 2}, and other non-supporting facts F ′ d = C \ {f 1, f 2, f r}:
1. (Q, {f 1, f 2} ∪ F ′ d ; L ans = A, L supp = {f 1, f 2}, L suff = 1)
2. (Q, {f 1, f r} ∪ F ′ d ; L suff = 0)
3. (Q, {f r, f 2} ∪ F ′ d ; L suff = 0)

Probing Dataset P ans+supp+suff (T(D)) for all three tests ⇒ Probing question collection P ans+supp+suff (T(q)) for the transformed question T(q) has only one group, corresponding to the unique bi-partition {{f 1}, {f 2}}, and is defined as:
1. (Q, {f 1} ∪ F ′ d ; L ans = A, L supp = {f 1}, L * suff = 0)
2. (Q, {f 2} ∪ F ′ d ; L ans = A, L supp = {f 2}, L * suff = 0)
3. (Q, {f r} ∪ F ′ d ; L * suff = −1)
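Putting the condition of Section 4.3 into code: the sketch below checks, for one bi-partition, whether a hypothetical `contains_pred(question, ctx, subset)` probe of the model's behavior gets the containment of F s1 and F s2 right on the three contexts listed above, in which case a trivial AND of the two decisions already solves sufficiency. The probe interface is our assumption.

```python
def nmf_sufficiency_condition(contains_pred, question, context, f_s1, f_s2):
    """NMF condition for Contrastive Support Sufficiency on one bi-partition.

    contains_pred(question, ctx, subset) -> bool: does the model behave as if
    `ctx` contains the partial support `subset`?  If both containment decisions
    are correct on all three contexts below, 'and'-ing them answers sufficiency."""
    f_s1, f_s2 = set(f_s1), set(f_s2)
    test_contexts = [
        [f for f in context if f not in f_s1],            # C \ F_s1
        [f for f in context if f not in f_s2],            # C \ F_s2
        [f for f in context if f not in (f_s1 | f_s2)],   # C \ (F_s1 U F_s2)
    ]
    for ctx in test_contexts:
        ctx_set = set(ctx)
        if contains_pred(question, ctx, f_s1) != f_s1.issubset(ctx_set):
            return False
        if contains_pred(question, ctx, f_s2) != f_s2.issubset(ctx_set):
            return False
    return True
```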

5 Measuring Non-Multifact Reasoning

For each of the multifact reasoning tests in Section 3, we now use the characterization of non-multifact reasoning in Section 4 to devise a probe that measures how much a model can score on that test using non-multifact reasoning. We refer to this number, relative to what the model scores normally on that test, as the NMF reasoning %, which provides a measure of the cheatability of the test for this model. Tests that are less cheatable are, of course, preferable. Importantly, the probe comprises an automatically generated probing dataset, on which a given model is trained and evaluated. Our experiments using these probes (cf. Section 6) will confirm the intuition that adding the Contrastive Support Sufficiency test to any existing test makes it notably harder to cheat (i.e., results in a smaller NMF reasoning %). As before, q = (Q, C; A) denotes an instance in a dataset D, annotated with supporting facts F s ⊆ C. Let P s be the set of all 2^(|F s|−1) − 1 proper bi-partitions {F s1, F s2} of F s. Following the notation in Section 3.2, F d = C \ F s denotes other (non-supporting) facts, F ′ d ⊂ F d is a fixed (uniformly sampled) subset of size |F d| − |F s| + 1, F r1 ⊆ F d \ F ′ d is a fixed (uniformly sampled) subset of replacement facts such that |F s1| + |F r1| = |F s|, and F r2 is defined analogously.

To construct the probing dataset P f (D) for a multifact reasoning test f, we convert each q ∈ D into a collection P f (q) of 2^(|F s|−1) − 1 groups of instances, with each group corresponding to one proper bi-partition of F s.[9] For a performance metric m(q) of interest in D, the probe uses a corresponding probe metric m f (q) that operates as a disjunction or max over a grouped metric m f (q; F s1) defined for each of the proper bi-partitions {F s1, _}[10] of F s:

m f (q) = max over all {F s1, _} ∈ P s of m f (q; F s1)    (3)

[9] The discussion here refers to D only for simplicity. We will also be constructing probing datasets similarly for the transformed dataset T(D).

[10] The second part of the bi-partition, by definition, is F s \ F s1 and is omitted for brevity both here and in the notation of the grouped metric m f (q; F s1).

The grouped metric, as we will explain shortly, is closely tied to the trivial combination operator discussed in Section 4 that non-multifact reasoning models can use to cheat on f . We next describe how P f (q) is constructed for the three tests f under consideration, and what the corresponding probe metric is.
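A sketch of Eq. (3): enumerate the 2^(|F s|−1) − 1 proper bi-partitions of F s and take the maximum of the grouped metric over them. Here `grouped_metric_for_partition` is a placeholder for the test-specific grouped metrics defined in the following subsections, and the `support` field name is our assumption.

```python
from itertools import combinations

def proper_bipartitions(f_s):
    """Yield all 2^(|F_s|-1) - 1 proper bi-partitions {F_s1, F_s2} of F_s."""
    f_s = list(f_s)
    seen = set()
    for k in range(1, len(f_s)):
        for f_s1 in combinations(f_s, k):
            f_s2 = tuple(f for f in f_s if f not in f_s1)
            key = frozenset([frozenset(f_s1), frozenset(f_s2)])
            if key not in seen:          # {A, B} and {B, A} are the same partition
                seen.add(key)
                yield list(f_s1), list(f_s2)

def probe_metric(q, grouped_metric_for_partition):
    """m_f(q): max of the grouped metric m_f(q; F_s1) over all proper bi-partitions."""
    return max(grouped_metric_for_partition(q, f_s1, f_s2)
               for f_s1, f_s2 in proper_bipartitions(q["support"]))
```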

5.1 Probing Answer Prediction Test

To measure how cheatable this test is (via non-multifact reasoning), we convert q into 2^(|F s|−1) − 1 groups, corresponding to all proper bi-partitions in P s. For each bi-partition {F s1, F s2}, the collection P ans (q) contains a group of two instances:

1. (Q, F s1 ∪ F r1 ∪ F ′ d ; L ans = A)
2. (Q, F s2 ∪ F r2 ∪ F ′ d ; L ans = A)

Models are expected to operate independently over instances within a group. Similar to our characterization of non-multifact reasoning from the perspective of this test (Section 4.1), we assume models assign a score s(a) to the answer a that they output. The grouped metric here, m ans (q; F s1), corresponds to the max operator: a model receives 1 point on the above group if the highest scoring answer it outputs (across the two instances in the group) is A.[11]

The overall probe metric m ans (q) captures a disjunction of this undesirable behavior across all possible proper bi-partitions of F s, and is defined using Eqn. (3) with f set to ans. In other words, a model gets a point on P ans (q) under this probe if it operates correctly (in the above sense) on the group in P ans (q) corresponding to at least one proper bi-partition of F s, indicating that it has the capacity to cheat on the Answer Prediction test for q.

5.2 Probing Support Identification Test

The probing dataset for this test works similarly to that for the Answer Prediction test above, where instance q is converted into a collection P supp (q) of 2^(|F s|−1) − 1 groups, each of the form:

1. (Q, F s1 ∪ F r1 ∪ F ′ d ; L supp = F s1)
2. (Q, F s2 ∪ F r2 ∪ F ′ d ; L supp = F s2)

The difference is that the output label is now a subset of the supporting facts. Models are again expected to operate independently over these two instances. The grouped metric here, m supp (q; F s1), corresponds to the set union operator from Section 4.2: a model receives 1 point on the above group if the union of the facts it outputs is F s, which here is equivalent to outputting the correct label for both instances in the group.

As before, the overall probe metric, m supp (q), follows from Eqn. 3, capturing a disjunction of undesirable behavior across all bi-partitions.

5.3 Probing Support Sufficiency Test

Recall that the Contrastive Support Sufficiency test is defined using the transformed dataset T(D). Consider the three instances in T(D) corresponding to the full support set F s and to partial support sets F s1 and F s2 that form a proper bi-partition of F s . These instances take the form:

1. (Q, F s ∪ F ′ d ; L ans = A, L supp = F s, L suff = 1)
2. (Q, F s1 ∪ F r1 ∪ F ′ d ; L suff = 0)
3. (Q, F s2 ∪ F r2 ∪ F ′ d ; L suff = 0)

By construction, they have the property that |F r1| = |F s2| and |F r2| = |F s1|. In this sense, F r1 can be viewed as a replacement for F s2, and similarly for F r2. As discussed in Section 4.3, a model can cheat on this test if it can correctly determine the presence of F s1 in the provided context without regard to F s2 (and similarly for the presence of F s2). To probe such behavior, for the bi-partition {F s1, F s2}, we create the following group of instances:

1. (Q, F s1 ∪ F ′ d ; L ans = A, L supp = F s1, L * suff = 0)
2. (Q, F r1 ∪ F ′ d ; L * suff = −1)
3. (Q, F s2 ∪ F ′ d ; L ans = A, L supp = F s2, L * suff = 0)
4. (Q, F r2 ∪ F ′ d ; L * suff = −1)

We use the notation L * suff here to highlight that this label is semantically different from L suff in the transformed dataset, in the sense that when L * suff = 0, the model during this probe is expected to produce the partial support and the answer, if it is present in the context. When not even partial support is present in the context, the output label is L * suff = −1 and we don't care what the model outputs as the answer or supporting facts. Note that the label semantics being different is not an issue, as the probing method involves training the model under consideration on the probe dataset.

Since the probing dataset here includes labels for all three tests, we refer to it as P ans+supp+suff (T(D)). By removing output labels for any tests not of interest (and removing resulting instances if they no longer have any output label), one can easily derive P suff (T(D)), P ans+suff (T(D)), etc., as alternative probing datasets.

The joint grouped metric here, denoted m ans+supp+suff (q; F s1), is the following: the model receives 1 point on the above group if it correctly predicts the L * suff label for all four instances, predicts the correct L supp label for the two partial-support instances (those with L * suff = 0), and predicts a score s(a) for the answers it outputs on those two instances such that the answer maximizing this score is the correct answer A. Joint grouped metrics such as m ans+suff (q; F s1) corresponding to the alternative probing datasets mentioned above are defined analogously.

As before, the overall probe metric, m suff (q), follows from Eqn. 3, capturing a disjunction of undesirable behavior across all bi-partitions.
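For completeness, a sketch of the joint grouped metric over the probe group above, assuming the model's output on each instance is a dict with its sufficiency prediction, predicted support set, answer, and answer score (an interface we invent for illustration); gold labels are carried on the instances themselves. The same code works when the two replacement instances coincide (as in the |F s| = 2 case of Figure 4).

```python
def joint_grouped_metric(preds, group, gold_answer):
    """m_ans+supp+suff(q; F_s1): 1.0 only if the model (a) predicts L*_suff correctly
    on every instance, (b) predicts the exact partial support on the instances with
    L*_suff = 0, and (c) its higher-scoring answer on those instances is gold."""
    if any(p["suff"] != g["suff_star"] for p, g in zip(preds, group)):
        return 0.0
    partial = [i for i, g in enumerate(group) if g["suff_star"] == 0]
    if any(set(preds[i]["support"]) != set(group[i]["support"]) for i in partial):
        return 0.0
    best = max(partial, key=lambda i: preds[i]["answer_score"])
    return 1.0 if preds[best]["answer"] == gold_answer else 0.0
```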

6 Experiments

To evaluate the amount of non-multifact reasoning and the impact of our Contrastive Support Sufficiency approach, we conduct experiments using an XLNet model (Yang et al., 2019) on the multi-hop reading comprehension dataset HotpotQA (Yang et al., 2018) . This transformer-based model (unlike several strong token-length limited BERT and RoBERTa QA models) takes the entire context of a question as input and is thus, in principle, capable of performing multifact reasoning.

We show that while this model achieves a high score on the Answer Prediction and Support Identification tests, most of this could have been achieved via non-multifact reasoning. In contrast, our proposed Contrastive Support Sufficiency test on our transformation of HotpotQA notably reduces the fraction of questions that can be answered by non-multifact reasoning.

6.1 Experimental Setup

Dataset, D: HotpotQA is a popular multi-hop question answering dataset with 100K questions. It has led to the development of many QA models (Nishida et al., 2019; Xiao et al., 2019; Tu et al., 2020; Fang et al., 2019). We focus on the distractor setting, where each question has a set of 10 associated paragraphs, of which two paragraphs were used to create the multifact question.

Apart from the answer span, each question is also annotated with these two supporting paragraphs as well as the supporting sentences within them. These two paragraphs, by construction, are necessary and sufficient to answer the question, and can thus be used to create our transformation and tests.

Table 1: Performance of XLNet-Base compared to other transformer models (of similar size) on HotpotQA. Our model is able to achieve scores better than BERT-Base models: QFE (Nishida et al., 2019) and DFGN (Xiao et al., 2019). It achieves comparable performance to recent models using RoBERTa and LongFormer (Beltagy et al., 2020). XLNet-Base: 57.7 | 71.9, 50.4 | 83.9.

Transformed Dataset, T(D):

To apply our Contrastive Support Sufficiency test, we must first transform the QA dataset using annotated supporting facts. We use the supporting paragraph annotation in HotpotQA to define F s, and then create the transformed dataset as described in Sec. 3.2 and illustrated in Figure 4. We prefer to use the paragraph annotations (rather than the sentence-level annotations) for the transformation for the following reason: since crowd-workers were asked to generate a multi-hop question from a pair of paragraphs, the two annotated paragraphs are guaranteed to have sufficient information to answer the question. As others have also noted, the sentence-level support annotation in the dataset is noisy, because of which the above property often does not hold at the sentence level. Moreover, naively dropping and adding sentences (instead of paragraphs) during our transformation can introduce other artifacts (e.g., missing sentences in a paragraph) that one would have to guard against.


Models:

We use pretrained language model transformers (Devlin et al., 2019; Yang et al., 2019), which have proven to be highly effective on a wide range of NLP tasks. However, the context of 10 paragraphs in HotpotQA is too long[12] for models such as BERT (Devlin et al., 2019) and RoBERTa, which are limited to inputs of a maximum length of 512 tokens. Instead, we use the XLNet model (Yang et al., 2019), which uses relative position embeddings and can theoretically be applied to a context of any length.

Table 2: Left: Score of the XLNet-Base model using different metrics with and without the sufficiency transform. Right: The corresponding % of non-multifact (NMF) reasoning as determined by our probe. Adding the sufficiency test always reduces the model score as well as the NMF reasoning % in the model.

We train two types of QA models from the XLNet transformer:

1. XLNet-Base (Full): This is the XLNet-Base model trained to predict the answer, the supporting sentences or paragraphs, and the sufficiency label, given all 10 context paragraphs as input text. As we show in Table 1, our model performs comparably to other existing models with similar model complexity.

2. XLNet-Base (Single-Fact): This is an XLNet-Base model that makes the answer and supporting sentence/paragraph prediction on each paragraph independently (model details in appendix).

Appendix A provides further details of the models, including their input/output representation and the training regime.

Metrics: Models on the HotpotQA dataset are evaluated against two key metrics: answer span prediction and support sentence selection. For each of these, Exact Match (EM) and F1 scores are used. Additionally, a Joint score (again via both EM and F1) is computed by averaging across the dataset the product of these scores for each example.

In this work, we use the Joint EM score as the metric, because of its ease of interpretation and use: A model gets one point if it gets all the labels exactly correct. We apply this Joint EM metric on combinations of four metrics: Answer Span (Ans), Supporting Sentence (Supp s ), Supporting Paragraph (Supp p ) and Contrastive Support Sufficiency (Suff p ) label. Each model was trained on all the supervision labels associated with an example even if the models were finally evaluated only on a subset of these metrics.
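A minimal sketch of this Joint EM scoring, with placeholder field names: a model earns one point on an example only if every label included in the metric is exactly correct, and the dataset score is the average of these 0/1 points.

```python
def joint_em(examples, fields=("answer", "support_paragraphs", "sufficiency")):
    """examples: list of dicts with 'pred' and 'gold' sub-dicts keyed by field name.
    Returns the fraction of examples on which all selected fields match exactly."""
    def exact(pred, gold):
        # set-valued labels (e.g., supporting paragraphs) compare order-insensitively
        if isinstance(gold, (set, list, tuple)):
            return set(pred) == set(gold)
        return pred == gold

    points = [all(exact(ex["pred"][f], ex["gold"][f]) for f in fields)
              for ex in examples]
    return sum(points) / len(points)
```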

6.2 Better Metric For Multi-Fact Reasoning

We first evaluate the impact of our proposed transformation on the current answer and support identification metrics. As shown in the top half of Table 2, adding the sufficiency test (Suff p) to any of the proposed metrics results in a reduction of the model score (indicated by ∆). Specifically, we observe a drop of 8.6 pts to 14.1 pts on the Ans + Supp p and Suff p metrics, respectively. Since our transformation should not be any harder for a true multi-hop model, this drop in score indicates that previous scores were over-estimating the amount of multi-hop reasoning in the current state-of-the-art model. By combining the sufficiency test with these existing metrics, we hence get a better estimate of the true progress on multi-fact reasoning.

Additionally, we can use our NMF probe to estimate the percentage of non-multifact reasoning in these models (computed as the ratio of the model score on the NMF probe P(D) to the score on the dataset D). Again, we see that a large percentage of the score with prior metrics can be achieved via non-multifact reasoning (as determined by our probe). Specifically, we can see that 81.6% of the answer prediction score can be achieved using non-multifact reasoning, and this number only goes down to 77% if we add the support paragraph prediction score (Ans + Supp p). However, by introducing the Suff p test to these metrics, the % of NMF reasoning goes down to 70.5%. These differences are slightly lower when we consider the sentence-level support prediction EM metric (Supp s), as it is a much harder metric to get exactly right. But we still observe the same trends as with the paragraph-level metrics.

6.3 Harder Test For NMF Models

We perform the same tests on the XLNet (Single-Fact) model to verify the ability of our tests to truly measure multi-hop reasoning and to be harder to answer via disconnected reasoning. We compare the scores of the full XLNet model to this single-fact model in Table 3. The scores of the single-fact model don't drop by much on the previous metrics (at maximum by 8.7 pts on Supp p), showing that a model completely incapable of performing multi-fact reasoning can achieve high scores on the answer + support metrics. On the other hand, the single-fact model performs much worse when the sufficiency test is added to our metrics. Specifically, we see a drop of 19.3 pts on the standard answer + support identification metric (Ans + Supp p) as compared to the full model, showing that our test is much harder for a non-multifact model.

Table 3: Scores of the XLNet-Base model (Full) and the XLNet Single-Fact model (SiFa), with and without the contrastive support sufficiency transform. Single-fact models are not much worse than Full under previously proposed metrics (m), but show a much larger drop under our proposed transform (m+Suffp).

Table 4: Model scores of the XLNet-Base (Single-fact) model on original vs. transformed datasets. The single-fact model does reasonably well on the original dataset, but its performance on our transformed dataset clearly highlights its poor multi-fact nature.

In Table 4, we can also see that the drop in the single-fact model scores from adding the Suff p test is twice that of the drop observed in the full XLNet model (Table 2), showing that our transformation would have a much larger impact on "cheating" models. Our NMF probe also showed that 99%+ of the m+Suff p scores of the single-fact XLNet were achieved by non-multifact reasoning, showing that our probe is also a reliable measure of this disconnected reasoning.

7 Closing Remarks

It is difficult to create large-scale multi-hop QA datasets without unintended artifacts that let models take reasoning shortcuts. It is also difficult to design models that do not exploit such shortcuts. We need effective ways to characterize how far models can go with non-multifact reasoning, and to design model-agnostic mechanisms that discourage models from resorting to non-multifact reasoning.

To this end, we introduced a new test, named Contrastive Support Sufficiency, that asks models to decide whether the input context has all facts necessary to answer a question, importantly in a contrastive setting involving groups of sufficient and insufficient context cases. Using this notion, we also developed a way to automatically transform support annotated datasets into extended datasets that rely more strongly on multifact reasoning, and devised probes that can quantify how much a model can score on a specific QA metric using non-multifact reasoning.

While our transformed dataset turned out to be a notably harder task for existing models, we believe it shouldn't be any harder for humans to recognize insufficient support (this remains to be verified by collecting human annotations). Even though we conducted most of our experiments using a QA model built upon XLNet-Base, we believe similar results will hold for larger transformers and other state-of-the-art models as well.

Our empirical evaluation demonstrated that the mechanisms we developed are more effective at measuring as well as reducing non-multifact reasoning than prior efforts, and thus get us closer to an improved estimate of the true progress in multifact reasoning.

A XLNet QA Model Details

A.1 XLNet-Base (Full)

We concatenate all 10 paragraphs together into one long context, with a special paragraph marker token [PP] at the beginning of each paragraph and a special sentence marker token at the beginning of each sentence in the paragraph. Lastly, the question is concatenated at the end of this long context. Apart from questions whose answer is a span in the context, HotpotQA also has comparison questions for which the answer is "yes" or "no" and is not contained in the context. So we also prepend a short answer text to the context to deal with both types of questions directly by answer span extraction. Concretely, the input is a single sequence starting with [CLS], followed by the marked-up context, and ending with the question. We generate logits for each paragraph and sentence by passing the marker tokens through a feedforward network. Supporting paragraphs and sentences are supervised with a binary cross entropy loss. Answer span extraction is done in the standard way (Devlin et al., 2019), where span start and span end logits are generated with a feedforward layer on each token and supervised with a cross entropy loss.
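For concreteness, a rough sketch of assembling such an input sequence; the [SS] and [QQ] marker names are borrowed from the single-fact variant in A.2, and the exact text prepended for yes/no questions is left as a placeholder since it is not specified here.

```python
def build_full_input(question, paragraphs, yes_no_prefix=""):
    """paragraphs: list of paragraphs, each a list of sentence strings.
    Builds one long sequence: [CLS], an optional yes/no answer text, every paragraph
    prefixed by [PP] and every sentence by [SS], with the question at the end."""
    pieces = ["[CLS]"]
    if yes_no_prefix:                      # placeholder for the prepended yes/no text
        pieces.append(yes_no_prefix)
    for paragraph in paragraphs:
        pieces.append("[PP]")
        for sentence in paragraph:
            pieces.extend(["[SS]", sentence])
    pieces.extend(["[QQ]", question])      # question concatenated at the end
    return " ".join(pieces)
```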

We use the first occurrence of the answer text among the supporting paragraphs as the correct span. This setting is very similar to recent work (Beltagy et al., 2020), and our results in Table 1 show that this model achieves comparable accuracy to other models with similar model complexity. We haven't done any hyperparameter (learning rate, number of epochs) tuning on the development set because of the expensive runs, which could explain the minor differences.

For the sufficiency classification, we use a feedforward layer on the [CLS] token and train it with a cross entropy loss. In our transformed dataset, because HotpotQA has K = 2 hops, there are twice as many instances with insufficient supporting information as instances with sufficient supporting information. So during training we balance the two by dropping half of the insufficient instances.

A.2 XLNet-Base (Single Fact)

To verify the validity of our tests, we also evaluate a variant of XLNet that is incapable of multifact reasoning. Specifically, we train an XLNet model that makes predictions one paragraph at a time (similar to Min et al. (2019)). Although such previous work showed that answer prediction is hackable, we adapt the approach to predict supporting facts and sufficiency as well.

Specifically, for each paragraph, we process the sequence [CLS] [PP] [SS] sent1,1 [SS] sent1,2 ... [QQ] q through the XLNet transformer. We then supervise the [PP] tokens for two tasks: identifying whether the paragraph is a supporting paragraph and identifying whether the paragraph has the answer span (for yes/no questions, both supporting paragraphs are supervised as having the answer). We then select the top-ranked paragraph for having the answer and generate the best answer span. Similarly, we select the top two ranked paragraphs as supporting and predict the corresponding supporting sentences. The answer span and supporting sentence logits are ignored when the paragraph, respectively, does not contain the answer or is not supporting. We train on four losses jointly: (i) ranking the answer-containing paragraph, (ii) ranking supporting paragraphs, (iii) predicting the answer from the answer-containing paragraph, and (iv) predicting supporting sentences from supporting paragraphs. We use binary cross entropy for the ranking of paragraphs, so there is no interaction between the paragraphs in this model. To get the sufficiency label, we apply a check based on the number of supporting paragraphs predicted.[13] For the original dataset, if |predicted(Supp p)| > 1, then the predicted sufficiency label is 1, otherwise 0. For the probing dataset, if |predicted(Supp p)| > 0, then the label is 0, otherwise −1.
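The sufficiency heuristic in the last step can be written directly from the rule above (a sketch; `predicted_supporting` stands for the set of paragraphs the single-fact model predicts as supporting):

```python
def sufficiency_from_support_count(predicted_supporting, probing=False):
    """Heuristic sufficiency label for the single-fact model (2-hop HotpotQA).

    Original/transformed data: sufficient (1) iff more than one supporting
    paragraph is predicted, else 0.  Probing data: 0 if any partial support is
    predicted, else -1 (not even partial support present)."""
    n = len(predicted_supporting)
    if probing:
        return 0 if n > 0 else -1
    return 1 if n > 1 else 0
```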

B Implementation And Model Training

Our implementation is based on AllenNLP (Gardner et al., 2017) and Huggingface Transformers (Wolf et al., 2019). We train for two epochs, checkpointing every 15K instances, and use a batch size of 32. We trained all transformer models with a learning rate of 0.00005 and linear decay without any warmup. The hyper-parameters were chosen as the default parameters used by Huggingface Transformers to reproduce BERT results on the SQuAD dataset.

[5] We choose F ′ d uniformly at random.

[8] {X, Y} is a proper bi-partition of a set Z if X ∪ Y = Z, X ∩ Y = φ, X ≠ φ, and Y ≠ φ.

[11] The two instances correspond to F s1 and F s2 = F s \ F s1.

[12] Average context length is 1,316 word-pieces.

[13] This heuristic exploits the fixed number of hops (= 2) and doesn't need any training on the sufficiency label. It's the only way one can predict the sufficiency label without any interaction across any of the facts.