
What-if I ask you to explain: Explaining the effects of perturbations in procedural text



Our goal is to explain the effects of perturbations in procedural text, e.g., given a passage describing a rabbit’s life cycle, explain why illness (the perturbation) may reduce the rabbit population (the effect). Although modern systems are able to solve the original prediction task well (e.g., illness results in less rabbits), the explanation task - identifying the causal chain of events from perturbation to effect - remains largely unaddressed, and is the goal of this research. We present QUARTET, a system that constructs such explanations from paragraphs, by modeling the explanation task as a multitask learning problem. QUARTET constructs explanations from the sentences in the procedural text, achieving ~18 points better on explanation accuracy compared to several strong baselines on a recent process comprehension benchmark. On an end task on this benchmark, we show a surprising finding that good explanations do not have to come at the expense of end task performance, in fact leading to a 7% F1 improvement over SOTA.

1 Introduction

Procedural text is common in natural language (in recipes, how-to guides, etc.) and finds many applications such as automatic execution of biology experiments (Mysore et al., 2019), cooking recipes (Bollini et al., 2012) and everyday activities (Yang and Nyberg, 2015). However, procedural text understanding in these settings remains a major challenge and requires two key abilities: (i) understanding the dynamics of the world inside a procedure by tracking entities and what events happen as the narrative unfolds, and (ii) understanding the dynamics of the world outside the procedure that can influence the procedure.

While recent systems for procedural text comprehension have focused on understanding the dynamics of the world inside the process, such as tracking entities and answering questions about what events happen (e.g., Henaff et al., 2017), the extent to which they understand the influences of outside events remains unclear. In particular, if a system fully understands a process, it should be able to predict what would happen if it was perturbed in some way due to an event from the outside world. Such counterfactual reasoning is particularly challenging because, rather than asking what happened (described in text), it asks about what would happen in an alternative world where the change occurred.

Figure 1: An example of the task. Given a procedural text, the task asks for the effect of a perturbation. The explanation includes two supporting sentences and their corresponding effects (more/less/no-effect) from the procedural text and how those steps will be affected (in pink) by the perturbation. In this example, the valid output explanation would be female rabbit is ill (leading to the opposite of) male and female rabbit mating (leading to the opposite of) female rabbit getting pregnant (leading to the opposite of) more rabbits

Recently, Tandon et al. (2019) introduced the WIQA dataset that contains such problems, requiring prediction of the effect of perturbations in a procedural text. They also presented several strong models on this task. However, it is unclear whether those high scores indicate that the models fully understand the described procedures, i.e., that the models have knowledge of the causal chain from perturbation to effect. To test this, Tandon et al. (2019) also proposed an explanation task. While the general problem of synthesizing explanations is hard, they proposed a simplified version in which explanations were instead assembled from sentences in the input paragraph and qualitative indicators (more/less/unchanged). Although they introduced this explanation task and dataset, they did not present a model to address it. We fill this gap by proposing the first solution to this explanation task.

arXiv:2005.01526v1 [cs.CL]

Table 1: Related work across different dimensions: - Whether the domain is procedural text and does the input contain perturbations. - Whether an explanation is generated (natural lang. or structured).

We present a model, QUARTET (QUAlitative Reasoning wiTh ExplanaTions), that takes as input a passage, a perturbation, and its qualitative effect. The output contains the qualitative effect and an explanation structure over the passage. See Figure 1 for an example. The explanation structure includes up to two supporting sentences from the procedural text, together with the qualitative effect of the perturbation on the supporting sentences (the qualitative effect is represented in pink in Figure 1). QUARTET models this qualitative reasoning task as a multitask learning problem to explain the effect of a perturbation. The main contributions of this work are:

• We present the first model that explains the effects of perturbations in procedural text. On a recent process comprehension benchmark, QUARTET generates better explanations compared to strong baselines.

• We also found a surprising secondary effect: although we trained the model to generate good explanations, scores on a downstream QA task also improved significantly over SOTA, by 7% absolute F1 (refer §6). Although improved QA was not the goal, it is an interesting result: prior work has found that optimizing for explanation often hurts QA. Ours is a useful datapoint illustrating that good explanations do not have to come at the expense of QA performance.

2 Related Work

Procedural text understanding: Machine reading has seen tremendous progress. With machines reaching human performance in standard QA benchmarks (Devlin et al., 2018; Rajpurkar et al., 2016), more challenging datasets have been proposed (Dua et al., 2019) that require background knowledge, commonsense reasoning (Talmor et al., 2019) and visual reasoning (Antol et al., 2015; Zellers et al., 2018). In the context of procedural text understanding, which has gained a considerable amount of attention recently, prior work (e.g., Henaff et al., 2017) addresses the task of tracking entity states throughout the text. Recently, Tandon et al. (2019) introduced the WIQA task to predict the effect of perturbations.

Understanding the effects of perturbations, specifically, qualitative change, has been studied using formal frameworks in the qualitative reasoning community (Forbus, 1984; Weld and De Kleer, 2013) and counterfactual reasoning in the logic community (Lewis, 2013). The WIQA dataset situates this task in terms of natural language rather than formal reasoning, by treating the task as a mixture of reading comprehension and commonsense reasoning. However, existing models do not explain the effects of perturbations.

Explanations: Despite large-scale QA benchmarks, high scores do not necessarily reflect understanding (Min et al., 2019). Current models may not be robust, or may exploit annotation artifacts (Gururangan et al., 2018). This makes explanations desirable for interpretation (Selvaraju et al., 2017).

Attention based explanation has been successfully used in vision tasks such as object detection (Petsiuk et al., 2018) because pixel information is explainable to humans. These and other token level attention models used in NLP tasks (Wiegreffe and Pinter, 2019) do not provide full-sentence explanations of a model's decisions.

Recently, several datasets with natural language explanations have been introduced, e.g., in natural language inference (Camburu et al., 2018), visual question answering (Park et al., 2018), and multi-hop reading comprehension (HotpotQA dataset). In contrast to these datasets, we explain the effects of perturbations in procedural text. HotpotQA contains explanations based on two sentences from a Wikipedia paragraph. Models on HotpotQA would not be directly applicable to our task and require substantial modification for the following reasons: (i) HotpotQA models are not trained to predict the qualitative structure (more or less of chosen explanation sentences in Figure 1). (ii) HotpotQA involves reasoning over named entities, whereas the current task focuses on common nouns and actions (models that work well on named entities need to be adapted to common nouns and actions (Sedghi and Sabharwal, 2018)). (iii) Explanation paragraphs in HotpotQA are not procedural, while the current input is procedural in nature with a specific chronological structure.

ears less protected → (MORE/+) sound enters the ear → (MORE/+) sound hits ear drum → (MORE/+) more sound detected
blood clotting disorder → (LESS/-) blood clots → (LESS/-) scab forms → (MORE/+) less scab formation
breathing exercise → (MORE/+) air enters lungs → (MORE/+) air enters windpipe → (MORE/+) oxygen enters bloodstream
squirrels store food → (MORE/+) squirrels eat more → (MORE/+) squirrels gain weight → (MORE/+) hard survival in winter
less trucks run → (LESS/-) trucks go to refineries → (LESS/-) trucks carry oil → (MORE/+) less fuel in gas stations
coal is expensive → (LESS/-) coal burns → (LESS/-) heat produced from coal → (LESS/-) electricity produced
legible address → (MORE/+) mailman reads address → (MORE/+) mail reaches destination → (MORE/+) on-time delivery
more water to roots → (MORE/+) root attract water → (MORE/+) roots suck up water → (LESS/-) plants malnourished
in a quiet place → (LESS/-) sound enters the ear → (LESS/-) sound hits ear drum → (LESS/-) more sound detected
eagle hungry → (MORE/+) eagle swoops down → (MORE/+) eagle catches mouse → (MORE/+) eagle gets more food

Table 2: Examples of our model’s predictions on the dev. set in the format: “$q_p \rightarrow d_i x_i \rightarrow d_j x_j \rightarrow d_e q_e$”. Supporting sentences $x_i$, $x_j$ are compressed, e.g., “the person has his ears less protected” → “ears less protected”.

Another line of work provides more structure and organization to explanations, e.g., using scene graphs in computer vision (Ghosh et al., 2019). For elementary science questions, Jansen et al. (2018) use a science knowledge graph. These approaches rely on a knowledge structure or graph, but knowledge graphs are incomplete and costly to construct for every domain (Weikum and Theobald, 2010).

There are trade-offs between unstructured and structured explanations. Unstructured explanations are abundantly available, while structured explanations need to be constructed and hence are less scalable (Camburu et al., 2018). On the other hand, evaluating structured explanations is simpler than evaluating free-form or generated unstructured explanations (Cui et al., 2018; Zhang et al., 2019). Our explanations have a qualitative structure over sentences in the paragraph. This retains the rich interpretability and simpler evaluation of structured explanations, and also leverages the large-scale availability of the sentences required for these explanations.

It is an open research problem whether explanation helps the end task. On the natural language inference task (e-SNLI), Camburu et al. (2018) observed that models generate correct explanations at the expense of good performance. On the Cos-E task, Rajani et al. (2019) recently showed that explanations help the end task. Our work extends along this line in a new task setting that involves perturbations and enriches natural language explanations with qualitative structure.

3 Problem Definition

We adopt the problem definition described in Tandon et al. (2019), and summarize it here.

Input:

1. Procedural text with steps $x_1 \ldots x_K$. Here, $x_k$ denotes step $k$ (i.e., a sentence) in a procedural text comprising $K$ steps.

2. A perturbation $q_p$ to the procedural text and its likely candidate effect $q_e$.

Output: An explanation structure that explains the effect of the perturbation $q_p$:

$q_p \rightarrow d_i x_i \rightarrow d_j x_j \rightarrow d_e q_e$

• $i$: step id for the first supporting sentence.

• $j$: step id for the second supporting sentence.

• $d_i \in \{+, -, \bullet\}$: how step id $i$ is affected.

• $d_j \in \{+, -, \bullet\}$: how step id $j$ is affected.

• $d_e \in \{+, -, \bullet\}$: how $q_e$ is affected.

See Figure 1 for an example of the task, and Table 2 for examples of explanations.

An explanation consists of up to two (i.e., zero, one or two) supporting sentences $i$, $j$ along with their qualitative directions $d_i$, $d_j$. If there is only one supporting sentence, then $j = i$. If $d_e = \bullet$, then $i = \emptyset$, $j = \emptyset$ (there is no valid explanation for no-effect).

While there can be potentially many correct explanation paths in a passage, the WIQA dataset consists of only one gold explanation considered best by human annotators. Our task is to predict that particular gold explanation.
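To make the output structure concrete, the following is a minimal Python sketch of a container for one explanation and a helper that prints it as a chain, loosely mirroring the layout of Table 2. The names `Explanation` and `render` are hypothetical and are not part of the WIQA release or of QUARTET; `None` stands in for Ø.

```python
from dataclasses import dataclass
from typing import List, Optional

# Qualitative directions from the problem definition: more, less, no effect.
MORE, LESS, NO_EFFECT = "+", "-", "•"


@dataclass
class Explanation:
    """Hypothetical container for q_p -> d_i x_i -> d_j x_j -> d_e q_e."""
    i: Optional[int]  # 1-based step id of the first supporting sentence (None = Ø)
    j: Optional[int]  # 1-based step id of the second supporting sentence (None = Ø)
    d_i: str          # how step i is affected
    d_j: str          # how step j is affected
    d_e: str          # how q_e is affected (the answer label)


def render(steps: List[str], q_p: str, q_e: str, expl: Explanation) -> str:
    """Format an explanation as a chain of arrows."""
    if expl.d_e == NO_EFFECT or expl.i is None:
        # No-effect questions carry no supporting sentences.
        return f"{q_p} → ({NO_EFFECT}) {q_e}"
    chain = [q_p, f"({expl.d_i}) {steps[expl.i - 1]}"]
    if expl.j is not None and expl.j != expl.i:
        # A second supporting sentence exists only when j differs from i.
        chain.append(f"({expl.d_j}) {steps[expl.j - 1]}")
    chain.append(f"({expl.d_e}) {q_e}")
    return " → ".join(chain)


steps = ["sound enters the ear", "sound hits ear drum"]
example = Explanation(i=1, j=2, d_i=MORE, d_j=MORE, d_e=MORE)
print(render(steps, "ears less protected", "more sound detected", example))
```

The usage lines reuse the first row of Table 2; the case $j = i$ collapses to a single supporting sentence in the rendered chain.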

Assumption: In a procedural text, steps $x_1 \ldots x_K$ are chronologically ordered and have a forward-flowing effect, i.e., if $j > i$ then more/increase of $x_i$ will result in more/increase of $x_j$. Prior work on procedural text makes a similar assumption. Note that this assumption does not hold for cyclic processes, and cyclic processes have already been flattened in the WIQA dataset. We make the following observations based on this forward-flow assumption.

a1: $i \le j$ (forward-flow order)

a2: $d_j = d_i$ (forward-flow assumption)

a3: For the WIQA task, $d_e$ is the answer label because it is the end node in the explanation structure.

a4: If $d_i = \bullet$ then answer label $= \bullet$ (since $q_p$ does not affect $q_e$, there is no valid explanation).

a5: $1 \le i \le K$; if $d_i = \bullet$, then $i = \emptyset$ (see a4)

a6: $i \le j \le K$; if $d_e = \bullet$, then $j = \emptyset$ (see a4)

This assumption reduces the number of predictions, removing $d_j$ and the answer label (see a2, a3). Given $x_1 \ldots x_K$, $q_p$, $q_e$, the model must predict four labels: $i$, $j$, $d_i$, $d_e$.
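The observations a1-a6 can be read as a validity check on a candidate explanation. The sketch below is one possible encoding, not the authors' code: the function name `is_consistent` is hypothetical, `None` stands for Ø, and treating $d_i$ as • whenever there is no supporting sentence is an assumption of this sketch rather than something the text states.

```python
from typing import Optional

NO_EFFECT = "•"


def is_consistent(i: Optional[int], j: Optional[int],
                  d_i: str, d_j: str, d_e: str, K: int) -> bool:
    """Check a candidate explanation against observations a1-a6."""
    if d_e == NO_EFFECT or d_i == NO_EFFECT:
        # a4-a6: a no-effect answer has no supporting sentences.
        # Assumption of this sketch: d_i and d_e are then both no-effect.
        return d_e == NO_EFFECT and d_i == NO_EFFECT and i is None and j is None
    if i is None or j is None:
        # An effect (+ or -) needs supporting sentences.
        return False
    if d_j != d_i:
        # a2: forward flow keeps the direction between the two steps.
        return False
    # a1, a5, a6: step ids are ordered and inside the paragraph.
    return 1 <= i <= j <= K


print(is_consistent(1, 2, "+", "+", "-", K=5))
print(is_consistent(2, 1, "+", "+", "-", K=5))
```

A decoder could apply a check of this kind to discard inconsistent label combinations before choosing the highest-scoring one.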

4 Quartet Model

We can solve the problem as a classification task, predicting four labels: $i$, $j$, $d_i$, $d_e$. If these predictions are performed independently, it requires several independent classifications and this can cause error propagation: prediction errors that are made in the initial stages cannot be fixed and can propagate into larger errors later on (Goldberg, 2017).

To avoid this, QUARTET predicts and explains the effect of $q_p$ as a multitask learning problem, where the representation layer is shared across different tasks. We apply the widely used parameter sharing approach, where a single representation layer is followed by task-specific output layers (Baxter, 1997). This reduces the risk of overfitting to a single task and allows decisions on $i$, $j$, $d_i$, $d_e$ to influence each other in the hidden layers of the network. We first describe our encoder and then the other layers on top; see Figure 2 for the model architecture.

Figure 2: QUARTET model. Input: Concatenated passage and question using standard BERT word-piece tokenization. Representation Layer: The input is encoded using BERT transformer. We obtain [CLS] and sentence level representations. Prediction: From the sentence level representation, we use an MLP to model the distributions for i and j (using attended sentence representation). From [CLS] representation, we use MLP for di (and dj , since di = dj) and de distributions. Output: Softmax to predict {i, j, di, dj , de}

Encoder: To encode $x_1 \ldots x_K$ and question $q$ we use the BERT architecture (Devlin et al., 2018), which has achieved state-of-the-art performance across several NLP tasks, where the question $q = q_p \oplus q_e$ ($\oplus$ stands for concatenation). We start with a byte-pair tokenization (Sennrich et al., 2015). These byte-pair tokens are passed through a 12-layered Transformer network, resulting in a contextualized representation for every byte-pair token. We write this contextualized representation as the vector $u = [u_1, \ldots, u_K, u_q]$, where $u_k$ denotes the encoding for $[x_k]$, and $u_q$ denotes the question encoding. Let $E_l$ be the embedding size resulting from the $l$-th transformer layer. In that $l$-th layer, $[u_1, \ldots, u_K] \in \mathbb{R}^{K \times E_l}$.

The hidden representations of all transformer layers are initialized with weights from a self-supervised pre-training phase, in line with contemporary research that uses pre-trained language models (Devlin et al., 2018).

To compute the final logits, we add a linear layer over the different transformer layers in BERT that are individual winners for individual tasks in our multitask problem. For instance, out of the total 12 transformer layers, lower layers (layer 2) are the best predictors for $[i, j]$ while upper layers (layers 10 and 11) are the best performing predictors for $[d_i, d_e]$. Zhang et al. (2019) found that the last layer is not necessarily the best performing layer. Different layers seem to learn complementary information because their fusion helps. Combining different layers by weighted averaging of the layers has been attempted with mixed success (Zhang et al., 2019). We observed the same trend for simple weighted transformation. However, we found that learning a linear layer over concatenated features from winning layers improves performance. This is probably because there is very different information encoded in a particular dimension across different layers, and the concatenation preserves it better than simple weighted averaging.

Classification tasks: To predict the first supporting sentence $x_i$, we obtain a softmax distribution $s_i \in \mathbb{R}^{K}$ over $[u_1, \ldots, u_K]$. From the forward-flow assumption made in the problem definition section earlier, we know that $i \le j$, making it possible to model this as a span prediction $x_{i:j}$. In line with standard span-based prediction models (Seo et al., 2017), we use an attended sentence representation $(s_i \odot [u_1, \ldots, u_K]) \oplus [u_1, \ldots, u_K] \in \mathbb{R}^{K \times 2E_l}$ to predict a softmax distribution $s_j \in \mathbb{R}^{K}$ to obtain $x_j$. Here, $\odot$ denotes element-wise multiplication and $\oplus$ denotes concatenation.
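The span-style prediction of $i$ and $j$ can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the QUARTET implementation: the scoring vectors `w_i` and `w_j` are hypothetical stand-ins for the MLPs in the paper, and $s_i \odot [u_1, \ldots, u_K]$ is interpreted here as scaling each sentence vector $u_k$ by its probability $s_i[k]$.

```python
import math
from typing import List, Tuple

Vector = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attended_representation(s_i: List[float], u: List[Vector]) -> List[Vector]:
    """Build (s_i ⊙ u) ⊕ u: scale each u_k by s_i[k], then append the original u_k."""
    return [[p * x for x in u_k] + list(u_k) for p, u_k in zip(s_i, u)]


def predict_span(u: List[Vector], w_i: Vector, w_j: Vector) -> Tuple[List[float], List[float]]:
    """Return the distributions s_i and s_j over the K sentences."""
    # Distribution over the first supporting sentence.
    s_i = softmax([sum(w * x for w, x in zip(w_i, u_k)) for u_k in u])
    # Each row of the attended representation has size 2E.
    attended = attended_representation(s_i, u)
    # Distribution over the second supporting sentence.
    s_j = softmax([sum(w * x for w, x in zip(w_j, a_k)) for a_k in attended])
    return s_i, s_j


u = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # K = 3 toy sentence encodings, E = 2
s_i, s_j = predict_span(u, w_i=[1.0, 0.0], w_j=[0.0, 0.0, 0.0, 1.0])
print(s_i, s_j)
```

Conditioning $s_j$ on $s_i$ through the attended representation is what lets the choice of the first sentence inform the choice of the second.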

For classification of $d_i$ (and $d_j$, since $d_i = d_j$), we use the representation of the first token (i.e., the [CLS] token $\in \mathbb{R}^{E_l}$) and a linear layer followed by softmax to predict $d_i \in \{+, -, \bullet\}$. Classification of $d_e$ is performed in exactly the same manner.

The network is trained end-to-end to minimize the sum of cross-entropy losses for the individual classification tasks $i$, $j$, $d_i$, $d_e$. At prediction time, we leverage assumptions (a4, a5, a6) to generate consistent predictions.
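The training objective described above, a sum of per-task cross-entropy losses, can be written out as a small sketch. The dictionary layout and the function names are hypothetical; a real implementation would compute this inside a deep-learning framework over batches.

```python
import math
from typing import Dict, List

TASKS = ("i", "j", "d_i", "d_e")


def cross_entropy(probs: List[float], gold: int) -> float:
    """Negative log-probability assigned to the gold class."""
    return -math.log(probs[gold])


def multitask_loss(predictions: Dict[str, List[float]], gold: Dict[str, int]) -> float:
    """Sum the cross-entropy losses of the four classification tasks."""
    return sum(cross_entropy(predictions[task], gold[task]) for task in TASKS)


# Toy example: every task head outputs a uniform distribution over two classes.
uniform = {task: [0.5, 0.5] for task in TASKS}
gold = {task: 0 for task in TASKS}
print(multitask_loss(uniform, gold))
```

Because the four terms share one encoder, the gradient of each term updates the same representation, which is how the tasks influence one another.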

5 Experiments

Dataset: We train and evaluate QUARTET on the recently published WIQA dataset¹ comprising 30,099 questions from 2107 paragraphs with explanations (23K train, 5K dev, 2.5K test). The perturbations $q_p$ are either linguistic variations of a passage sentence (17% of examples; called in-para questions) or require commonsense reasoning to connect to a passage sentence (41% of examples; called out-of-para questions). Explanations are supported by up to two sentences from the passage: 52.7% length 2, 5.5% length 1, 41.8% length 0. Length-zero explanations indicate that $d_e = \bullet$ (called no-effect questions), and ensure that random guessing on explanations gets a low score on the end task.


We evaluate on both explainability and the downstream end task (QA). For explainability, we define explanation accuracy as the average accuracy of the four components of the explanation: $acc_{expl} = \frac{1}{4} \sum_{c \in \{i, j, d_i, d_e\}} acc(c)$ and $acc_{qa} = acc(d_e)$ (by assumption a3). The QA task is measured in terms of accuracy.
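As a worked illustration of the two metrics, the sketch below computes $acc_{expl}$ and $acc_{qa}$ over lists of gold and predicted explanations. The dictionary keys and function names are hypothetical, and the two toy examples are made up for illustration only.

```python
from typing import Dict, List

COMPONENTS = ("i", "j", "d_i", "d_e")


def component_accuracy(gold: List[Dict], pred: List[Dict], key: str) -> float:
    """Fraction of examples whose predicted component matches the gold one."""
    return sum(g[key] == p[key] for g, p in zip(gold, pred)) / len(gold)


def explanation_accuracy(gold: List[Dict], pred: List[Dict]) -> float:
    """acc_expl: mean of the four component accuracies."""
    return sum(component_accuracy(gold, pred, c) for c in COMPONENTS) / len(COMPONENTS)


def qa_accuracy(gold: List[Dict], pred: List[Dict]) -> float:
    """acc_qa: accuracy on d_e, the answer label (assumption a3)."""
    return component_accuracy(gold, pred, "d_e")


gold = [{"i": 1, "j": 2, "d_i": "+", "d_e": "-"},
        {"i": None, "j": None, "d_i": "•", "d_e": "•"}]
pred = [{"i": 1, "j": 3, "d_i": "+", "d_e": "-"},
        {"i": None, "j": None, "d_i": "•", "d_e": "+"}]
print(explanation_accuracy(gold, pred), qa_accuracy(gold, pred))
```

In the toy data the component accuracies are 1, 0.5, 1 and 0.5, so $acc_{expl} = 0.75$ and $acc_{qa} = 0.5$.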

Hyperparameters: QUARTET fine-tunes BERT, allowing us to re-use the same hyperparameters as BERT with small adjustments in the recommended range (Devlin et al., 2018). We use the BERT-base-uncased version with a hidden size of 768. We use the standard Adam optimizer with a learning rate of 1e-05, weight decay of 0.01, and dropout of 0.2 across all the layers. We will publicly release the code.


We measure the performance of the following baselines (two non-neural and three neural).

• RANDOM: Randomly predicts one of the three labels $\{+, -, \bullet\}$ to guess $[d_i, d_e]$. Supporting sentences $i$ and $j$ are picked randomly from $|avg_{sent}|$ sentences.

• MAJORITY: Predicts the most frequent label (no effect, i.e., $d_e = \bullet$ in the case of the WIQA dataset).

• $q_e$ ONLY: Inspired by existing works (Gururangan et al., 2018), this baseline exploits annotation artifacts (if any) in the explanation dataset by retraining QUARTET using only $q_e$ while hiding the perturbation $q_p$ in the question.

• HUMAN upper bound (Krippendorff's alpha inter-annotator values on $[i, j, d_i]$) on explainability, as reported in Tandon et al. (2019).

• TAGGING: We can reduce our task to a structured prediction task. An explanation $i$, $j$, $d_i$, $d_e$ requires span prediction $x_{i:j}$ and labels on that span. So, for example, the explanation $i = 1$, $j = 2$, $d_i = +$, $d_j = -$ for input $x_1 \bullet x$ … Formulating as a sequence tagging task allows us to use any standard sequence tagging model such as CRF as a baseline. The decoder invalidates sequences that violate assumptions (a3-a6). To make the encoder strong and yet comparable to our model, we use exactly the same BERT encoder as QUARTET. For each sentence representation $u_k$, we predict a tag $\in T$. A CRF over these local predictions additionally provides global consistency. The model is trained end-to-end by minimizing the negative log likelihood from the CRF layer.

• BERT-NO-EXPL: State-of-the-art BERT model (Tandon et al., 2019) that only predicts the final answer $d_e$, but cannot predict the explanation.

• BERT-W/-EXPL: A standard BERT-based approach to the explanation task that predicts the explanation structure. This model minimizes only the cross-entropy loss of the final answer $d_e$, predicting an explanation that provides the best answer accuracy.

• QUARTET: our model described in §4 that optimizes for the best explanation structure.
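The two non-neural baselines in the list above are simple enough to sketch directly. This is an illustrative reading of their descriptions, not the authors' code: the function names are hypothetical, and `num_sentences` stands in for the average paragraph length from which $i$ and $j$ are drawn.

```python
import random
from collections import Counter
from typing import Dict, List

LABELS = ["+", "-", "•"]


def random_baseline(num_sentences: int, rng: random.Random) -> Dict:
    """RANDOM: guess d_i and d_e uniformly and pick supporting sentences at random."""
    return {
        "i": rng.randint(1, num_sentences),  # randint is inclusive on both ends
        "j": rng.randint(1, num_sentences),
        "d_i": rng.choice(LABELS),
        "d_e": rng.choice(LABELS),
    }


def majority_label(train_labels: List[str]) -> str:
    """MAJORITY: always predict the most frequent label seen in training."""
    return Counter(train_labels).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so repeated runs give the same guesses
guess = random_baseline(num_sentences=5, rng=rng)
print(guess, majority_label(["•", "+", "•", "-"]))
```

Note that the random guesses are not filtered by the forward-flow observations, so a guess may have $j < i$.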

5.1 Explanation Accuracy

QUARTET is also the best model on explanation accuracy. Table 3 shows the performance on $[i, j, d_i, d_e]$.

Table 3: Accuracy of the explanation structure ($i$, $j$, $d_i$, $d_e$). Overall explanation accuracy is $acc_{expl}$.

QUARTET also outperforms baselines on every component of the explanation. QUARTET performs better at predicting $i$ than $j$. This trend correlates with human performance: picking the second supporting sentence is harder because in a procedural text neighboring steps can have similar effects.

We found that the explanation dataset does not contain substantial annotation artifacts for the $q_e$ ONLY model to leverage ($q_e$ ONLY < MAJORITY). We also tried a simple bag-of-words and embedding-vector-based alignment between $q_p$ and $x_i$ in order to pick the most similar $x_i$. These baselines perform worse than random, showing that aligning $q_p$ and $x_i$ involves commonsense reasoning that these models cannot address.

6 Downstream QA

In this section, we investigate whether a good explanation structure leads to better question answering. QUARTET advocates explanations as a first class citizen from which an answer can be derived.

6.1 Accuracy On A QA Task

We compare against the existing SOTA on the WIQA no-explanation task. Table 4 shows that QUARTET improves over the previous SOTA BERT-NO-EXPL by 7%, achieving a new SOTA result. Both these models are trained on the same dataset². The major difference between BERT-NO-EXPL and QUARTET is that BERT-NO-EXPL solves only the QA task, whereas QUARTET solves explanations, and the answer to the QA task is derived from the explanation. Multi-tasking (i.e., explaining the answer) provides the gains to QUARTET. All the models get strong improvements over RANDOM and MAJORITY. The least performing model is TAGGING. The space of possible sequences of correct labels is large, and we believe that the current training data is sparse, so more training data might help. QUARTET avoids this sparsity problem because, rather than a sequence, it learns on four separate explanation components.

Table 5 presents the accuracy based on question types. QUARTET achieves large gains over BERT-NO-EXPL on the most challenging out-of-para questions. This suggests that QUARTET improves the alignment of $q_p$ and $x_i$ that involves some commonsense reasoning.

Table 4: QUARTET improves accuracy on the QA (end task) by 7% points.
Table 5: QUARTET improves accuracy over SOTA BERT-NO-EXPL across question types.

6.2 Correlation Between QA And Explanation

QUARTET not only improves QA accuracy but also the explanation accuracy. We find that QA accuracy ($acc_{d_e}$ in Table 3) is positively correlated (Pearson coeff. 0.98) with explanation accuracy ($acc_{expl}$). This shows that if a model is optimized for explanations, it leads to better performance on the end task. Thus, with this result we establish that (at least on our task) models can make better predictions when forced to generate a sensible explanation structure. An educational psychology study (Dunlosky et al., 2013) hypothesizes that student performance improves when they are asked to explain while learning. However, their hypothesis is not conclusively validated due to lack of evidence. Results in Table 3 hint that, at least on our task, machines that learn to explain ace the end task.

7 Error Analysis

We analyze our model's errors (marked in red) over the dev set, and observe the following phenomena.

1. Multiple Explanations:

As mentioned in Section 3, more than one explanation can be correct. 22% of the incorrect explanations were reasonable, suggesting that overall explanation accuracy scores might underestimate the explanation quality. The following example illustrates that while gathering firewood is appropriate when fire is needed for survival, one can argue that going to wilderness is less precise but possibly correct.

2. $i$, $j$ errors: Fig. 3 shows that predicted and gold distributions of $i$ and $j$ are similar. Here, sentence id = −1 indicates no effect. The model has learned from the data to never predict $j < i$ without any hard constraints. The model is generally good at predicting $i$, $j$ and in many cases when the model errs, the explanation seems plausible. Perhaps for the same underlying reason, the human upper bound is not high on $i$ (75.9%) and on $j$ (66.1%). We show an example where $i$, $j$ are incorrectly predicted (in red), but sound plausible.

The following example shows an instance where '−' is misclassified as '+'. It implies that there is more scope for improvement here.

Figure 3: Gold vs. predicted distribution of i & j resp.

Gold: less seeds fall to the ground → (OPP/-) seed falls to the ground → (OPP/-) seeds germinate → (MORE/+) fewer plants
Pred: less seeds fall to the ground → (OPP/-) seed falls to the ground → (OPP/-) seeds germinate → (OPP/-) fewer plants

4. in-para vs. out-of-para: The model performs better on in-para questions (typically, linguistic variations) than out-of-para questions (typically, commonsense reasoning). Also see empirical evidence of this in Table 5.

The model is challenged by questions involving commonsense reasoning, especially to connect $q_p$ with $x_i$ in out-of-para questions. For example, in the following passage, the model incorrectly predicts • (no effect) because it fails to draw a connection between sleep and noise:

Table 6: Confusion matrix for di (left) and de overall (right). (gold is on x-axis, predicted on y-axis.)

Pack up your camping gear, food. Drive to your campsite. Set up your tent. Start a fire in the fire pit. Cook your food in the fire. Put the fire out when you are finished. Go to sleep. Wake up ...
qp: less noise from outside
qe: you will have more energy

Analogous to $i$ and $j$, the model also makes more errors between labels '+' and '−' in out-of-para questions compared to in-para questions (39.4% vs 29.7%); see Table 7. Tandon et al. (2019) discuss that some in-para questions may involve commonsense reasoning similar to out-of-para questions. The following is an example of an in-para question where the model fails to predict $d_i$ correctly because it cannot find the connection between protected ears and amount of sound entering.

Table 7: Confusion matrix di for in-para & out-of-para

Gold: ears less protected → (MORE/+) sound enters the ear → (MORE/+) sound hits ear drum → (MORE/+) more sound detected
Pred: ears less protected → (OPP/-) sound enters the ear → (OPP/-) sound hits ear drum → (MORE/+) more sound detected

5. Injecting background knowledge: To study whether additional background knowledge can improve the model, we revisit the out-of-para question that the model failed on. The model fails to draw a connection between sleep and noise, leading to an incorrect (no effect) '•' prediction.

By adding the following relevant background knowledge sentence to the paragraph, "sleep requires quietness and less noise", the model was able to correctly shift probability mass from $d_e$ = '•' to '+'. This shows that providing commonsense through Web paragraphs and sentences is a useful direction.

Pack up your camping gear, food ... Sleeping requires quietness and less noise. Go to sleep. Wake up ...
qp: less noise from outside
qe: you will have more energy

8 Conclusion

Explaining the effects of a perturbation is critical, and we have presented the first system that can do this reliably. QUARTET not only predicts meaningful explanations, but also achieves a new state-of-the-art on the end task itself, leading to an interesting finding that models can make better predictions when forced to explain.

Our work opens up new directions for future research. 1.) can structured explanations also improve performance on other NLP tasks such as

¹ WIQA data is publicly available at http://data.allenai.org/wiqa/

² We used the same code and parameters as provided by the authors of WIQA-BERT. The WIQA with-explanations dataset has about 20% fewer examples than the WIQA without-explanations dataset [http://data.allenai.org/wiqa/]. This is because the authors removed about 20% of instances with incorrect explanations (e.g., where turkers did not reach an agreement). So we trained both QUARTET and WIQA-BERT on exactly the same vetted dataset. This helped to increase the score of WIQA-BERT by 1.5 points.