WIQA: A dataset for “What if...” reasoning over procedural text

Abstract

We introduce WIQA, the first large-scale dataset of “What if...” questions over procedural text. WIQA contains a collection of paragraphs, each annotated with multiple influence graphs describing how one change affects another, and a large (40k) collection of “What if...?” multiple-choice questions derived from these. For example, given a paragraph about beach erosion, would stormy weather hasten or decelerate erosion? WIQA contains three kinds of questions: perturbations to steps mentioned in the paragraph; external (out-of-paragraph) perturbations requiring commonsense knowledge; and irrelevant (no effect) perturbations. We find that state-of-the-art models achieve 73.8% accuracy, well below the human performance of 96.3%. We analyze the challenges, in particular tracking chains of influences, and present the dataset as an open challenge to the community.

1 Introduction

Procedural text is common in language, but challenging to comprehend because it describes a dynamically changing world. While recent systems for procedural text comprehension can answer questions about what events happen, e.g., (Henaff et al., 2017), the extent to which they understand the influences between those events remains unclear.

One important test of understanding is to predict what would happen if a process were perturbed in some way, requiring understanding and tracing the chain of influences through a paragraph. However, to date there is no dataset available to help develop this capability. We aim to fill this gap with WIQA, the first large-scale dataset testing "What if..." reasoning over procedural text.

WIQA contains 40.7K questions, for 379 process paragraphs. To efficiently create the questions, crowdworkers created 2107 influence graphs (IGs) for the paragraphs, describing how one perturbation positively or negatively influences another. Questions were then derived from paths in the graphs, each asking how the change described in one node affects another. Each question is a templated, multiple-choice (MC) question of the form Does changeX result in changeY? (A) Correct (B) Opposite (C) No effect, where Opposite indicates a negative influence between changeX and changeY. To bound the task, perturbations are typically qualitative (e.g., "the wind is blowing harder"), and possible effects are restricted to changes to entities and events mentioned in the paragraph (e.g., "the waves are bigger"). Perturbations themselves include in-paragraph, out-of-paragraph, and irrelevant (no effect) changes. The WIQA task is to answer the questions, given the paragraph (but not the IG).

We first describe the task and how the dataset was constructed, and then present results from baselines and strong BERT-based models. We find that the best model is still 23% behind human performance, and the gap widens further for indirect and out-of-paragraph effects, illustrating that the dataset is hard. We present a detailed analysis showing WIQA is rich in linguistic and semantic phenomena. Our contributions are: (1) the new dataset, and (2) performance measures and an analysis of its challenges, to support research on counterfactual reasoning over procedural text.

2 Related Work

While there are several NLP datasets now available for procedural text understanding, e.g., (Kiddon et al., 2016; Weston et al., 2015), these have all targeted the task of tracking entity states throughout the text. WIQA takes the next step of asking how states might change if a perturbation were introduced.

Predicting the effects of qualitative change has been studied in the qualitative reasoning (QR) community, but primarily using formal models (Forbus, 1984; Weld and De Kleer, 2013). Similarly, counterfactual reasoning has been studied in the logic community (Lewis, 2013), but again using formal frameworks. In contrast, WIQA treats the task as a mixture of reading comprehension and commonsense reasoning, creating a new NLP challenge.

3 Dataset Construction

To efficiently generate questions, we first asked crowdworkers to create influence graphs (IGs) for each paragraph. We then created questions from paths in those IGs. We now describe this process.

Influence Graphs An influence graph G(V, E) for a procedural text T is an unweighted directed graph. Each vertex $v_i$ is labeled with one or more text strings, each describing a change to the original conditions described or assumed in T, such that all those changes have the same influence on a connected node $v_j$. Each edge is labeled with a polarity, + or -, indicating whether the influence is positive (causes/increases) or negative (prevents/reduces).

Indirect effects can be found by traversing G. It is useful to distinguish two kinds of nodes:
1. Out-of-para nodes: denoting events or changes to entities/events not mentioned in the paragraph, e.g., "during storms" in Figure 1.
2. In-para nodes: denoting events or changes to entities/events mentioned in the paragraph, e.g., "the wind is blowing harder" in Figure 1.
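For concreteness, here is a minimal sketch of how an influence graph and the sign of an indirect influence could be represented; the node labels, edges, and Python encoding are illustrative only and are not part of the released dataset.

```python
# Minimal sketch: an influence graph as a dict of signed edges, with the
# polarity of an indirect influence obtained by multiplying edge signs along
# a path. Node labels and edges are invented for illustration; this is not
# part of the released dataset.
edges = {
    ("the wind is blowing harder", "the waves are bigger"): +1,
    ("the waves are bigger", "more sand is carried away"): +1,
    ("more sand is carried away", "the beach shrinks faster"): +1,
    ("a seawall is built", "the waves hit the beach"): -1,
}

def path_polarity(path):
    """Product of edge polarities along a node path: +1 = correct, -1 = opposite."""
    sign = 1
    for u, v in zip(path, path[1:]):
        sign *= edges[(u, v)]
    return sign

# Indirect (2-hop) effect of stronger wind on sand removal:
print(path_polarity(["the wind is blowing harder",
                     "the waves are bigger",
                     "more sand is carried away"]))  # +1 (i.e., "correct")
```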

Figure 1: WIQA contains procedural paragraphs, crowdsourced influence graphs associated with them, and a large collection of “Does changeX result in changeY?” (what-if) questions, derived from the graphs.

Acquiring influence graphs For a source of paragraphs, we used the 377 training set paragraphs from the ProPara dataset. Multiple influence graphs were then crowdsourced for each. To do this, we use an influence graph template, shown in Figure 2. Workers were asked to populate this (hidden) template using a sequence of five questions, where the later questions were automatically constructed from their answers to the earlier questions. The first question asks the worker to supply an X and Y in: "If [X] occurs, it will have the intermediate effect [Y] resulting in accelerated_outcome" (where the accelerated outcome phrase was pre-authored for each paragraph). For X and Y, workers were asked to describe a change in some property/phenomenon mentioned in the paragraph; e.g., if a paragraph sentence $x_i$ is "Wind creates waves.", they may author an X saying "the wind is blowing harder" (Figure 1). The alignment of X and $x_i$, and whether X describes an increase or decrease of $x_i$, denoted by $d_X \in \{+, -\}$, was also recorded. This fills X and Y in Figure 2. Similar questions populate the remaining nodes (see Appendix). 2107 influence graphs were collected in this way.

Figure 2: The template used to acquire influence graphs

Generating Questions from Graphs Each path in a graph forms a "change→effect?" question, whose answer is either "correct" or "opposite" depending on the product of the polarities of the traversed edges. Questions are labeled with the number of edges traversed (1 = "1-hop", etc.). We also distinguish in-para and out-of-para questions depending on the type of node they originated from. We then created a third category of question, whose answer is "no effect", by selecting out-of-para changes from other paragraphs and asking for their effect on nodes in the current graph. Occasionally these changes did affect the selected node, resulting in an erroneous label of "no effect", but this was rare (and such cases were removed from the test partition, as we now describe).
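A rough sketch of this derivation, reusing the +1/-1 polarity encoding from the sketch above; the path, labels, and field names are illustrative and do not reflect the released WIQA schema.

```python
# Rough sketch of deriving one multiple-choice question and its label from an
# influence-graph path. The path, labels, and field names are illustrative,
# not the released WIQA schema.
def make_question(nodes, edge_polarities, source_type):
    """nodes: node labels along the path; edge_polarities: one +1/-1 per edge."""
    polarity = 1
    for p in edge_polarities:
        polarity *= p
    return {
        "question": f"Does '{nodes[0]}' result in '{nodes[-1]}'?",
        "choices": ["correct", "opposite", "no effect"],
        "answer": "correct" if polarity > 0 else "opposite",
        "hops": len(edge_polarities),
        "question_type": source_type,  # "in-para" or "out-of-para", from the perturbation node
    }

# A 2-hop question whose first edge is negative, so the answer is "opposite".
print(make_question(["a seawall is built", "the waves hit the beach", "sand is carried away"],
                    [-1, +1], source_type="out-of-para"))
```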

Using a separate crowdsourcing task, questions in the test set were filtered to improve the test set quality. First, five workers independently answered each question, given the paragraph. The inter-annotator agreement between workers, using Krippendorff's alpha, was moderately high (0.6). We then retained only questions with majority agreement (i.e., at least 3 out of 5 workers agreed), resulting in 88% of questions being retained.
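As a small illustration of the retention criterion (the worker labels below are invented, and this is not the actual filtering code):

```python
from collections import Counter

# Sketch of the majority-agreement filter: keep a test question only if at
# least 3 of its 5 independent worker answers agree. Illustrative code, not
# the actual crowdsourcing pipeline.
def keep_question(worker_answers, min_agreement=3):
    _, count = Counter(worker_answers).most_common(1)[0]
    return count >= min_agreement

print(keep_question(["correct", "correct", "opposite", "correct", "no effect"]))   # True
print(keep_question(["correct", "correct", "opposite", "opposite", "no effect"]))  # False
```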

Balancing the Dataset From the (many) questions thus generated, we (randomly) selected a subset that approximately balanced the numbers of (a) in-para, out-of-para, and no-effect questions, and (b) questions with each answer (correct, opposite, no-effect), resulting in 40,695 questions. Train, dev, and test partitions do not share paragraphs about the same topic. Statistics are shown in Table 1 .

Table 1: Dataset statistics

Explanations As each question is derived from a path in an IG, we can also generate explanations for each answer using that path. Although explanation is not part of the WIQA task, we create an explanation database to support a possible explanation task at a future date.

Consider a question "Does perturbation $q_p$ result in $q_e$?" with answer $d_e \in \{+, -\}$ (as shorthand for {correct, opposite}), created from an IG path:

$$q_p \xrightarrow{\;r_{pX}\;} X \xrightarrow{\;r_{XY}\;} Y \xrightarrow{\;r_{Ye}\;} q_e$$

Here, $r_{pX}$, $r_{XY}$, and $r_{Ye}$ denote the polarities (+/-) of the edges $q_pX$, $XY$, and $Yq_e$ in the IG, respectively. (As described earlier, the answer $d_e$ is the product of the polarities $r_{pX} \cdot r_{XY} \cdot r_{Ye}$.) To define an explanation in terms of the paragraph's sentences $x_1, ..., x_K$, we define the gold explanation $E_{gold}$ as the structure:

$$q_p \xrightarrow{\;d_i\;} x_i \xrightarrow{\;d_j\;} x_j \xrightarrow{\;d_e\;} q_e$$

where $x_i$ is the sentence corresponding to X, $x_j$ is the sentence corresponding to Y, and $d_i$, $d_j$, and $d_e$ denote the directions of influence (+/-, denoting {more, correct}/{less, opposite}). As workers already annotated the alignment between X and $x_i$ (similarly Y and $x_j$), we know $x_i$ and $x_j$. Similarly, as workers also annotated whether X describes an increase or decrease of $x_i$, denoted by $d_X \in \{+, -\}$, we can straightforwardly compute the directions of influence:

$$d_i = r_{pX} \cdot d_X \qquad d_j = d_i \;\;\text{(in-paragraph influence; footnote 2)} \qquad d_e = r_{pX} \cdot r_{XY} \cdot r_{Ye} = \text{the answer}$$

We can similarly generate explanations for answers derived from 1-hop and 2-hop paths.
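Since these are simple sign products, they can be computed directly; below is a minimal sketch with polarities encoded as +1/-1, where the variable names mirror the formulas above rather than any released code.

```python
# Sketch of the explanation-direction computation, with polarities encoded as
# +1 / -1. Variable names mirror the formulas above; this is not released code.
def explanation_directions(r_pX, r_XY, r_Ye, d_X):
    d_i = r_pX * d_X            # direction of influence at sentence x_i
    d_j = d_i                   # in-paragraph influences keep the same sign (footnote 2)
    d_e = r_pX * r_XY * r_Ye    # the answer: +1 = correct, -1 = opposite
    return d_i, d_j, d_e

# Example: a negative first edge flips the direction at x_i (and the answer).
print(explanation_directions(r_pX=-1, r_XY=+1, r_Ye=+1, d_X=+1))  # (-1, -1, -1)
```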

We generated a full database of explanations for all the questions with answer "correct" or "opposite" (for "no effect" answers, there is no explanation, as there is by definition no path of influence). We then removed the (occasional) explanation where worker annotations were contradictory (e.g., $j < i$) or had no majority decision for an annotation. This database is available for a possible future explanation task (given question + paragraph, produce the answer + explanation).

4.1 Models

We measured the performance of two baselines and three stronger learned models on WIQA, to understand how it stresses these models:

Majority predicts the most frequent label in the training dataset, namely correct.

Polarity is a rule-based baseline that assumes influences of the form "more X → more Y" (similarly for "less") are correct, and hence that "more X → less Y" are opposite. A small lexicon of positive ("more") and negative ("less") words is used to assign the more/less polarities. A random class label is chosen when assignments cannot be made.

Adaboost (Freund and Schapire, 1995) makes the 3-way classification using several bag-of-words features computed from the change and the effect.

Decomp-Attn applies the Decomposable Attention (DA) model of Parikh et al. (2016) to our task. The original DA model computes entailment, i.e., the confidence that a premise entails (or contradicts) a hypothesis. We recast WIQA as an entailment task where cause-effect becomes premise-hypothesis, and (correct/opposite/no-effect) correspond to (entailment/contradiction/neutral). We retrain the DA model on WIQA using this mapping.

BERT (Devlin et al., 2018) is a pre-trained transformer-based language model that has achieved state-of-the-art performance on many NLP tasks. We supply questions to BERT as [CLS] paragraph [SEP] question [SEP] answer-option, for each of the three options. The [CLS] token is then projected to a single logit and fed through a softmax layer across the three options, using cross-entropy loss, and the highest-scoring option is selected. We fine-tune BERT on the WIQA training data in this way. We also measure an ablated version where the paragraph is omitted (at both train and test time).

Human Performance was estimated by having three people independently answer the same 100 questions (with paragraphs) drawn randomly from the test set. Krippendorff's alpha (nominal metric) for these answers was 0.908, indicating high agreement (Krippendorff, 1970).
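As an illustration of the BERT setup described above, the following sketch uses the Hugging Face transformers API; this is an assumed implementation (the paper does not release its code), and concatenating the question and answer option into BERT's second segment is a simplification of the three-part input format.

```python
# Rough sketch of the BERT multiple-choice setup described above, using the
# Hugging Face transformers API (an assumed implementation; the paper does not
# release this code). Each option is scored independently and the three logits
# are softmaxed; concatenating question and option into BERT's second segment
# is a simplification of the three-part input format.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")
to_logit = nn.Linear(bert.config.hidden_size, 1)   # projects [CLS] to a single score

def score_question(paragraph, question, options=("correct", "opposite", "no effect")):
    option_logits = []
    for option in options:
        enc = tokenizer(paragraph, question + " " + option,
                        return_tensors="pt", truncation=True, max_length=256)
        cls = bert(**enc).last_hidden_state[:, 0]   # [CLS] representation
        option_logits.append(to_logit(cls))
    logits = torch.cat(option_logits, dim=-1)       # shape (1, 3)
    return logits.softmax(dim=-1)                   # train with cross-entropy vs. the gold option

probs = score_question("Wind creates waves. Waves erode the beach.",
                       "Does 'the wind is blowing harder' result in 'more erosion'?")
print(probs.argmax(dim=-1))                         # index of the predicted option
```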

5.1 Prediction Accuracy

The results (Table 2) provide several insights:
1. The dataset is hard. Our strongest model (73.8) is over 20 points behind human performance, suggesting WIQA poses significant challenges. Prediction of out-of-para effects is particularly challenging, 37 points behind human performance.
2. BERT already "knows" some change-effect knowledge. Even without the paragraph, and even though the test paragraphs are on topics unseen in training, BERT scores substantially above the baselines. This suggests BERT has some cause-effect information embedded in it.
3. Supplying the paragraph helps, resulting in a score 10 points higher and illustrating that WIQA contains questions that require understanding of the paragraph. This suggests more sophisticated reading strategies may further improve results.

Table 2: Comparing models on WIQA test partition

5.2 Predicting Indirect Effects

In-para and out-of-para questions were derived from chains of different lengths ("hops") in the influence graphs. Figure 3 shows how performance varies with respect to those lengths, and shows that "indirect" (2/3-hop) effects are harder to predict than "direct" (1-hop) effects. For example, it is easier to predict "cloudy day" results in "less sunshine" (direct) than "less photosynthesis" (indirect). This suggests that some form of reasoning along influence chains may be needed to predict indirect effects reliably, as those effects are less likely to be explicitly stated in corpora and embedded in pre-trained language models.

Figure 3: Accuracy of the best baselines drops as the number of hops increases, more quickly for the 'no para' version.

5.3 Consistency

Are the models making consistent predictions? If a model predicts both X→Y and Y→Z are correct, it should, if it were consistent, also predict X→Z is correct. To measure a model's transitivity consistency, for each influence graph we measure how often its indirect predictions (2/3-hop) are consistent with its 1-hop predictions (footnote 3). Similarly, we measure disjunctive consistency by how often its predictions for edges known to be opposite (e.g., X→Y and X→opp-effect-in-para in Figure 2) are indeed so (footnote 4). The results in Figure 4 illustrate that the models are far from consistent. This suggests that reasoning with global consistency constraints may improve results, e.g., (Ning et al., 2017).
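A minimal sketch of the transitivity check follows, with predictions encoded as +1/-1/0 for correct/opposite/no effect; the encoding is illustrative, not the paper's implementation.

```python
# Minimal sketch of the transitivity-consistency check: a 2-hop prediction is
# consistent if it equals the product of the model's two 1-hop predictions.
# Predictions are encoded as +1 / -1 / 0 for correct / opposite / no effect
# (an illustrative encoding, not the paper's implementation).
def transitively_consistent(pred_xy, pred_yz, pred_xz):
    return pred_xz == pred_xy * pred_yz

def transitivity_consistency(triples):
    """triples: (pred_xy, pred_yz, pred_xz) tuples collected over an influence graph."""
    consistent = sum(transitively_consistent(*t) for t in triples)
    return consistent / len(triples)

print(transitivity_consistency([(+1, -1, -1), (+1, +1, -1)]))  # 0.5
```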

Figure 4: The best models (red,yellow) make substantially less consistent predictions than humans (green).

5.4 Linguistic And Semantic Phenomena

We analyzed 200 descriptions of changes in 100 random questions, and observe the following challenging (overlapping) phenomena:

1. Qualitative Language: ≈65% of the change statements are expressed qualitatively, using a broad vocabulary of comparatives (e.g., more, fewer, smaller, larger, cooler, slower, higher, harder, decreased, hotter) or their corresponding adjectives (small, cool, etc.). In addition, whether the change is a positive or negative influence on the process is context-dependent ("more X" can be positive or negative, depending on X, and sometimes on the paragraph topic itself).

2. Commonsense (≈45%): Exogenous influences are (by definition) not stated in the paragraph, and so require substantial commonsense to understand, e.g., that "heavy rainfall" (out of para) negatively influences "more wild fires" (in para), or that "overfishing" (out of para) negatively influences "fish lay eggs" (in para).

3. Lexical matching (≈15%): Some in-para changes refer to paragraph entities using different terms, e.g., "insect" (para) ↔ "bee" (question), "becomes" ↔ "forms", "removes" ↔ "expels", complicating the alignment of questions with the paragraph.

4. Negation (≈6%): Negation occurs in about 6% of the changes, e.g., "drought does not occur", "soil is not fertile", "magma does not get larger".

5. Juxtaposed polarities (≈3%): Sometimes positive- and negative-related terms are juxtaposed (e.g., "much less", "increased deforestation", "less severe"), again challenging to process.

These all illustrate the diversity of linguistic and semantic challenges in WIQA.

6 Conclusion

An important test of understanding procedural text is whether the effects of perturbations to the process can be predicted. To that end, we have introduced WIQA, the first large-scale dataset for "what if" reasoning over text. While our experiments suggest language models have some built-in knowledge of influences, and some ability to identify influences in paragraphs, these capabilities are limited, producing predictions that are over 20 points worse than humans, often inconsistent, and particularly erroneous about indirect (multi-hop) effects. WIQA aims to improve this state of affairs, offering a new challenge and resource to the community. The dataset is available at http://data.allenai.org/wiqa/.


Footnote 2: Paragraph sentences always describe correct, not opposite, influences on later sentences; hence if $x_i$ is more/accelerated, $x_j$ will be too (similarly for less/decelerated).

Footnote 3: I.e., the polarity (+/-, for correct/opposite) of edge X→Z = the product of the polarities of the edges chaining from X to Z. As models can also predict "no effect", the random score is 1/3.
Footnote 4: Only edge pairs with labels +&- or -&+ are disjunctively consistent (of 9 possible labelings), hence random is 2/9.

Figure 5: At the start of the process to annotate an influence graph for a given paragraph, the annotators see a blank influence graph with the basic structure.
Figure 6: As the annotators answer the questions in Fig. 7, a partial influence graph emerges. The annotators found it useful to validate their answers by examining this emerging graph as they went.
Figure 7: The interface shown to the annotators on the Mechanical Turk platform. Given a paragraph (shown on a yellow background), the annotators answer the five questions and an influence graph emerges from their answers.
Figure 8: Influence graph for a paragraph from the topic evaporation
Figure 10: Influence graph for a paragraph from the topic lungs
Figure 11: Influence graph for a paragraph from the topic minerals
Table 3: As the BERT model (which has access to the paragraph in context) reads more paragraphs in context, its accuracy improves. Reading helps certain topics, such as pollination, blood, mountain, and evaporation, more than others.