Abductive Commonsense Reasoning

Chandra Bhagavatula
Ronan Le Bras
Chaitanya Malaviya
Keisuke Sakaguchi
Ari Holtzman
Hannah Rashkin
Doug Downey
S. Yih
Yejin Choi
ICLR
2020

Abstract

Abductive reasoning is inference to the most plausible explanation. For example, if Jenny finds her house in a mess when she returns from work, and remembers that she left a window open, she can hypothesize that a thief broke into her house and caused the mess, as the most plausible explanation. While abduction has long been considered to be at the core of how people interpret and read between the lines in natural language (Hobbs et al., 1988), there has been relatively little research in support of abductive natural language inference and generation. We present the first study that investigates the viability of language-based abductive reasoning. We introduce a challenge dataset, ART, that consists of over 20k commonsense narrative contexts and 200k explanations. Based on this dataset, we conceptualize two new tasks -- (i) Abductive NLI: a multiple-choice question answering task for choosing the more likely explanation, and (ii) Abductive NLG: a conditional generation task for explaining given observations in natural language. On Abductive NLI, the best model achieves 68.9% accuracy, well below human performance of 91.4%. On Abductive NLG, the current best language generators struggle even more, as they lack reasoning capabilities that are trivial for humans. Our analysis leads to new insights into the types of reasoning that deep pre-trained language models fail to perform--despite their strong performance on the related but more narrowly defined task of entailment NLI--pointing to interesting avenues for future research.

1 Introduction

The brain is an abduction machine, continuously trying to prove abductively that the observables in its environment constitute a coherent situation.

-Jerry Hobbs, ACL 2013 Lifetime Achievement Award 1

Abductive reasoning is inference to the most plausible explanation for incomplete observations (Peirce, 1965a). Figure 1 illustrates an example. Given the incomplete observations about the world that O1: "Jenny cleaned her house and went to work, leaving the window just a crack open." and sometime later O2: "When Jenny returned home, she saw her house was a mess.", we can hypothesize different potential explanations and reason about which is the most likely. We can readily rule out H3 since it fails to justify the observation O2. While H1 and H2 are both plausible, the most likely explanation based on commonsense is H1, as H2 is somewhat implausible given O1.

Figure 1: Example of Abductive Reasoning. Given observations O1 and O2, the αNLI task is to select the most plausible explanatory hypothesis. Since the number of hypotheses is massive in any given situation, we make a simplifying assumption in our ART dataset to only choose between a pair of explanations.

Figure 1 text. O1: "Jenny cleaned her house and went to work, leaving the window just a crack open." O2: "When Jenny returned home, she saw that her house was a mess!" Explanations and annotations shown: "The thief got into the house through the window and made a mess." (Jenny left an insecure opening to her house.) "It was a breezy day and a large bird flew into the house." (The bird got stuck inside the house, flew around while trying to escape, and made a mess.) "At work, she opened her window and the wind blew her papers everywhere." (Fails to justify O2: although wind caused a mess, the event happened at Jenny's workplace.)

One crucial point Peirce makes about abductive reasoning is that abduction is "the only logical operation which introduces any new ideas", which contrasts with other types of inference such as entailment, which focuses on inferring only information that is already provided in the premise. Abductive reasoning has long been considered to be at the core of understanding narratives (Hobbs et al., 1988), reading between the lines (Norvig, 1987; Charniak & Shimony, 1990), reasoning about everyday situations (Peirce, 1965b; Andersen, 1973), and counterfactual reasoning (Pearl, 2002; Pearl & Mackenzie, 2018). Despite the broad recognition of its importance, however, the study of abductive reasoning in narrative text has very rarely appeared in the NLP literature, in large part because most previous work on abductive reasoning has focused on formal logic, which has proven to be too rigid to generalize to the full complexity of natural language.

In this paper, we present the first study to investigate the viability of language-based abductive reasoning. This shift from logic-based to language-based reasoning draws inspiration from a significant body of work on language-based entailment (Bowman et al., 2015; Williams et al., 2018b), language-based logic (Lakoff, 1970; MacCartney & Manning, 2007), and language-based commonsense reasoning (Mostafazadeh et al., 2016; Zellers et al., 2018). In particular, we investigate the use of natural language as the representation medium, and probe deep neural models on language-based abductive reasoning.

More concretely, we propose Abductive Natural Language Inference (αNLI) 2 as a novel reasoning task in narrative contexts. We formulate αNLI as a multiple-choice task in order to support easy and reliable automatic evaluation: given a context, the task requires choosing the more likely explanation from a pair of hypotheses. We also introduce a new challenge dataset, ART, that consists of 20K narratives accompanied by over 200K explanatory hypotheses. 3 We then establish comprehensive baseline performance based on state-of-the-art NLI and language models. The best baseline based on BERT achieves 68.9% accuracy, with a considerable gap compared to human performance of 91.4%. Our analysis leads to insights into the types of reasoning that deep pre-trained language models fail to perform, despite their strong performance on the closely related but different task of entailment NLI, pointing to future research directions.

2 Task Definition

We formulate αNLI as multiple choice problems consisting of a pair of observations as context and a pair of hypothesis choices. Each instance in ART is defined as follows:

• O1: The observation at time t1.

• O2: The observation at time t2 > t1.

• h+: A plausible hypothesis that explains the two observations O1 and O2.

• h−: An implausible (or less plausible) hypothesis for observations O1 and O2.

Given the observations and a pair of hypotheses, the αNLI task is to select the most plausible explanation (hypothesis).
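The instance format above can be sketched in a few lines of Python. This is only an illustration: the class name `AlphaNLIInstance`, the `choose` helper, and the word-overlap scorer are hypothetical and are not part of the dataset or of any model described in this paper. The example instance is the "Carl" example discussed in Section 3.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class AlphaNLIInstance:
    obs1: str       # O1, the observation at time t1
    obs2: str       # O2, the observation at time t2 > t1
    hyp_plus: str   # h+, the plausible explanation
    hyp_minus: str  # h-, the implausible (or less plausible) explanation


def tokens(text):
    """Lowercased alphabetic word tokens of a string, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def overlap_score(inst, hypothesis):
    """Toy scorer (an assumption, not a real model): word overlap with O1 and O2."""
    context = tokens(inst.obs1) | tokens(inst.obs2)
    return float(len(tokens(hypothesis) & context))


def choose(inst, score=overlap_score):
    """Return the hypothesis that the scorer rates as more plausible."""
    return max((inst.hyp_plus, inst.hyp_minus), key=lambda h: score(inst, h))


carl = AlphaNLIInstance(
    obs1="Carl went to the store desperately searching for flour tortillas for a recipe.",
    obs2="Carl left the store very frustrated.",
    hyp_plus="The store had corn tortillas, but not flour ones.",
    hyp_minus="The cashier was rude",
)
print(choose(carl))
```

Any function with the same signature as `overlap_score` could be passed as `score`, which keeps the evaluation loop independent of the model being probed.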

3 A Probabilistic Framework For αNLI

A distinct feature of the αNLI task is that it requires jointly considering all available observations and their commonsense implications, to identify the correct hypothesis. Formally, the αNLI task is to select the hypothesis h * that is most probable given the observations.

$$h^* = \arg\max_{h^i} P(H = h^i \mid O_1, O_2) \qquad (1)$$

Rewriting the objective using Bayes Rule, we have:

$$h^* = \arg\max_{h^i} P(O_2 \mid H = h^i, O_1)\, P(H = h^i \mid O_1) \qquad (2)$$
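The Bayes-rule step can be spelled out as follows. This is a sketch of the standard derivation, conditioning on O_1 throughout and writing h^i as shorthand for the event H = h^i (a notational assumption made here). The denominator does not depend on h^i, so maximizing the right-hand side over h^i selects the same hypothesis as maximizing the posterior.

```latex
\begin{align*}
P(h^i \mid O_1, O_2)
  &= \frac{P(O_2 \mid h^i, O_1)\, P(h^i \mid O_1)}{P(O_2 \mid O_1)} \\
  &\propto P(O_2 \mid h^i, O_1)\, P(h^i \mid O_1)
\end{align*}
```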

We formulate a set of probabilistic models for αNLI that make various independence assumptions on Equation 2, starting from a simple baseline that ignores the observations entirely and building up to a fully joint model. These models are depicted as Bayesian Networks in Figure 2.

Figure 2: Illustration of the graphical models described in the probabilistic framework. The “Fully Connected” model can, in theory, combine information from both available observations.

Hypothesis Only: Our simplest model makes the strong assumption that the hypothesis is entirely independent of both observations, i.e. (H ⊥ O1, O2), in which case we simply aim to maximize the marginal P(H).

First (or Second) Observation Only: Our next two models make weaker assumptions: that the hypothesis depends on only one of the first (O1) or second (O2) observation.

Linear Chain: Our next, more sophisticated model uses both observations, but considers each observation's influence on the hypothesis separately, i.e. it does not combine information across the observations. Formally, the model assumes that the three variables O1, H, O2 form a linear Markov chain, where the second observation is conditionally independent of the first, given the hypothesis (i.e. (O1 ⊥ O2 | H)). Under this assumption, we aim to maximize a somewhat simpler objective than Equation 2:

$$h^* = \arg\max_{h^i} P(O_2 \mid H = h^i)\, P(H = h^i \mid O_1), \quad \text{where } (O_1 \perp O_2 \mid h^i) \qquad (3)$$

Fully Connected: Finally, our most sophisticated model jointly models all three random variables as in Equation 2, and can in principle combine information across both observations to choose the correct hypothesis.

To help illustrate the subtle distinction between how the Linear Chain and Fully Connected models consider both observations, consider the following example. Let observation O1: "Carl went to the store desperately searching for flour tortillas for a recipe." and O2: "Carl left the store very frustrated.". Then consider two distinct hypotheses, an incorrect h1: "The cashier was rude" and the correct h2: "The store had corn tortillas, but not flour ones.". For this example, a Linear Chain model could arrive at the wrong answer, because it reasons about the observations separately. That is, taking O1 in isolation, both h1 and h2 seem plausible next events, albeit each a priori unlikely. And for O2 in isolation (i.e. in the absence of O1, as for a randomly drawn shopper), the h1 explanation of a rude cashier seems a much more plausible explanation of Carl's frustration than are the details of the store's tortilla selection. Combining these two separate factors leads the Linear Chain to select h1 as the more plausible explanation. It is only by reasoning about Carl's goal in O1 jointly with his frustration in O2, as in the Fully Connected model, that we arrive at the correct answer h2 as the more plausible explanation.
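The Carl example can be made numeric with a small sketch. All probabilities below are hypothetical values picked only to mirror the qualitative argument (both hypotheses equally unlikely after O1; a rude cashier explains frustration well for a random shopper; missing flour tortillas explains it well only once Carl's goal is known). They are not estimates from ART or from any model.

```python
# Hypothetical probabilities for the Carl example (illustrative assumptions only).
P_H_GIVEN_O1 = {"h1": 0.05, "h2": 0.05}     # P(h | O1): both a priori unlikely
P_O2_GIVEN_H = {"h1": 0.60, "h2": 0.10}     # P(O2 | h): O1 ignored (random shopper)
P_O2_GIVEN_H_O1 = {"h1": 0.60, "h2": 0.90}  # P(O2 | h, O1): Carl's goal is known


def linear_chain_choice():
    """Maximize P(O2 | h) * P(h | O1), the Linear Chain objective."""
    return max(P_H_GIVEN_O1, key=lambda h: P_O2_GIVEN_H[h] * P_H_GIVEN_O1[h])


def fully_connected_choice():
    """Maximize P(O2 | h, O1) * P(h | O1), the fully joint objective."""
    return max(P_H_GIVEN_O1, key=lambda h: P_O2_GIVEN_H_O1[h] * P_H_GIVEN_O1[h])


print("Linear Chain picks:", linear_chain_choice())
print("Fully Connected picks:", fully_connected_choice())
```

Under these assumed numbers the two objectives disagree only because the second factor is allowed to depend on O1 in the joint model.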

3.1 Probabilistic Model Implementation

In our experiments, we encode the different independence assumptions in the best performing neural network model. For the hypothesis-only and single-observation models, we can enforce the independencies by simply restricting the inputs of the model to only the relevant variables. On the other hand, the Linear Chain model takes all three variables as input, but we restrict the form of the model to enforce the conditional independence. Specifically, we learn a discriminative classifier:

$$P_{\text{Linear Chain}}(h \mid o_1, o_2) \propto e^{\phi(o_1, h) + \phi'(h, o_2)}$$

where φ and φ′ are neural networks that produce scalar values.
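A minimal sketch of how these restrictions might look in code is given below. The function names, the variant keys, and the ordering of inputs are assumptions made for illustration; the scorers `phi` and `phi_prime` stand in for the neural networks and can be any callables returning a scalar.

```python
import math


def variant_inputs(o1, o2, h, variant):
    """Return only the variables a given model variant is allowed to see."""
    allowed = {
        "hypothesis_only": (h,),
        "first_obs_only": (o1, h),
        "second_obs_only": (h, o2),
        "fully_connected": (o1, h, o2),
    }
    return allowed[variant]


def linear_chain_probs(o1, o2, hypotheses, phi, phi_prime):
    """Softmax over phi(o1, h) + phi_prime(h, o2) for each candidate h.

    No term ever sees o1 and o2 together, which is what enforces the
    conditional independence of the two observations given the hypothesis.
    """
    logits = [phi(o1, h) + phi_prime(h, o2) for h in hypotheses]
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Usage with stand-in scorers (hypothetical, not trained networks).
probs = linear_chain_probs(
    "o1", "o2", ["a", "b"],
    phi=lambda o1, h: 1.0 if h == "a" else 0.0,
    phi_prime=lambda h, o2: 0.0,
)
print(probs)
```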

4 ART: Abductive Reasoning In Narrative Text

ART is the first large-scale benchmark dataset for studying abductive reasoning in narrative texts. A major challenge in creating such a resource is avoiding annotation artifacts (unintentional patterns in the data that leak information about the target label) that several recent studies have reported on crowdsourced datasets (Gururangan et al., 2018; Poliak et al., 2018; Tsuchiya, 2018). Machine learning algorithms can exploit these annotation artifacts, allowing them to score highly on performance metrics without learning to generalize well to the task. To tackle this challenge, we crowdsource several human-written hypothesis choices (both plausible and implausible) for each instance and then apply an adversarial filtering (AF) algorithm to retain a challenging pair of hypotheses that are hard to distinguish between. Figure 3 shows some illustrative examples from ART. The first example is from the train set, while the others are from the dev set; the best model based on BERT fails to correctly predict the first two dev examples.

Figure 3: Some illustrative examples from the ART dataset. The first example is from the train set, while the others are from the dev set – the best model based on BERT fails to correctly predict the first two dev examples.

ART consists of ∼20K narrative contexts (pairs of observations O1, O2) with over 200K explanatory hypotheses. Table 1 shows some corpus-level statistics of the ART dataset. The main characteristics of the dataset remain similar across the train, dev and test splits. 4

Table 1: Some statistics summarizing the ART dataset. The train set includes all plausible and implausible hypotheses collected via crowdsourcing, while the dev and test sets include the hypotheses selected through the Adversarial Filtering algorithm.

4.1 Collecting Observations And Hypotheses

Observations: To obtain pairs of observations for ART, we use stories from the ROCStories dataset (Mostafazadeh et al., 2016). ROCStories is a large collection of short, manually curated five-sentence stories. It was designed to have a clear beginning and ending for each story, which naturally map to the first (O1) and second (O2) observations in ART.

Hypothesis Options: The plausible and implausible hypothesis options were crowdsourced on Amazon Mechanical Turk (AMT) in two separate tasks:

• Plausible Hypothesis Options: We presented O1 and O2 as narrative context to crowdworkers who were prompted to fill in "What happened in-between?" in natural language. The design of the task motivates the use of abductive reasoning to hypothesize likely explanations for the two given observations. We collected multiple plausible hypotheses H+ for each (O1, O2) pair in the dataset.

• Implausible Hypothesis Options: In this task, we presented workers with observations O1, O2 and one plausible hypothesis option h+ ∈ H+ collected from the previous task. Crowdworkers were instructed to make minimal edits (up to 5 words) to a given h+ to create implausible hypothesis variations for each plausible hypothesis. We collected multiple implausible hypotheses H− for each (O1, O2) pair in the dataset.

Both crowdsourcing tasks are complex and require creative writing. Along with the ART dataset, we will publicly release templates and the full set of instructions for all crowdsourcing tasks to facilitate future data collection and research in this direction.

4.2 Adversarial Filtering Of Hypotheses Choices

Given an observation pair and sets of plausible and implausible hypotheses (O1, O2, H+, H−), our adversarial filtering algorithm selects one plausible and one implausible hypothesis (O1, O2, h+, h−) such that h+ and h− are hard to distinguish between. We make three key improvements over the previously proposed Adversarial Filtering (AF) approach in Zellers et al. (2018):

1. Instead of a single positive sample, we exploit a pool H+ of positive samples to choose from (i.e. plausible hypotheses).

2. Instead of machine-generated distractors, the pool H− of negative samples (i.e. implausible hypotheses) is human-generated. Thus, the distractors share stylistic features of the positive samples as well as those of the context (i.e. observations O1 and O2), making the negative samples harder to distinguish from positive samples.

3. We use BERT, a large pre-trained language model, as the adversary and introduce a temperature parameter that controls the maximum number of instances that can be modified in each iteration of AF. In later iterations, fewer instances get modified, resulting in a smoother convergence of the AF algorithm (described in more detail below).

Algorithm 1 provides a formal description of our approach. In each iteration i, we train an adversarial model M_i on a random subset T_i of the data and update the validation set V_i to make it more challenging for M_i. For a pair (h+_k, h−_k) of plausible and implausible hypotheses for an instance k in the dataset, we denote δ = Δ_{M_i}(h+_k, h−_k).

Algorithm 1:

for each iteration i do
    t_i = t_e + (t_s − t_e) / (1 + e^{0.3 (i − 3n/4)})
    Randomly partition D_i into (T_i, V_i).
    Train model M_i on T_i.
    S_i = ∅, the selected hypotheses for V_i.
    for (h+_k, h−_k) ∈ V_i do
        Pick r uniformly at random in [0, 1].
        if r > t_i or Δ_{M_i}(h+_k, h−_k) < 0 then
            Add (h+_k, h−_k) to S_i.
        else
            Pick (h+, h−) ∈ H+_k × H−_k s.t. Δ_{M_i}(h+, h−) < Δ_{M_i}(h+_k, h−_k).
            Add (h+, h−) to S_i.
        end
    end
    D_{i+1} = T_i ∪ S_i
end
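The loop in Algorithm 1 can be sketched in Python as below. Several details are assumptions made for illustration: the 50/50 random partition, the random choice among harder pairs, the fallback of keeping the current pair when no harder pair exists, and the `train` callback, which stands in for fine-tuning the adversary and must return a function `delta(k, h_plus, h_minus)` that is positive when the model prefers the plausible hypothesis.

```python
import math
import random


def temperature(i, n, t_s=1.0, t_e=0.2):
    """Sigmoid schedule of Algorithm 1: close to t_s early, close to t_e late."""
    return t_e + (t_s - t_e) / (1.0 + math.exp(0.3 * (i - 3 * n / 4)))


def adversarial_filter(data, pools, train, n, rng):
    """data: {k: (h_plus, h_minus)}; pools: {k: (H_plus, H_minus)}."""
    for i in range(n):
        t_i = temperature(i, n)
        keys = list(data)
        rng.shuffle(keys)
        half = len(keys) // 2  # assumed split ratio
        train_keys, val_keys = keys[:half], keys[half:]
        delta = train({k: data[k] for k in train_keys})
        selected = {}
        for k in val_keys:
            h_plus, h_minus = data[k]
            current = delta(k, h_plus, h_minus)
            # Keep the pair with probability 1 - t_i, or if the model already fails.
            if rng.random() > t_i or current < 0:
                selected[k] = (h_plus, h_minus)
                continue
            # Otherwise swap in a pair that the adversary finds harder.
            harder = [(hp, hm) for hp in pools[k][0] for hm in pools[k][1]
                      if delta(k, hp, hm) < current]
            selected[k] = rng.choice(harder) if harder else (h_plus, h_minus)
        data = {**{k: data[k] for k in train_keys}, **selected}
    return data
```

Because `r > t_i` becomes more likely as `t_i` decays, fewer validation pairs are replaced in later iterations, which matches the role of the temperature described above.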

We ran AF for 50 iterations and the temperature t_i follows a sigmoid function, parameterized by the iteration number, between t_s = 1.0 and t_e = 0.2. Our final dataset, ART, is generated using BERT as the adversary in Algorithm 1. While our final dataset uses BERT as the adversary, preliminary experiments that used GPT as an adversary resulted in similar drops in performance of all models, including all BERT variants. We compare the results of the two adversaries in Table 2.

Model | Acc. (%), GPT as adversary | Acc. (%), BERT as adversary (ART)
InferSent (Conneau et al., 2017) | 50.9 | 50.8
ESIM+ELMo (Chen et al., 2017) | 58.2 | 58.8
Finetuning Pre-trained LMs
GPT-ft | 52.6 (0.9) | 63.1 (0.5)
BERT-ft [Hypothesis-Only] | 55.9 (0.7) | 59.5 (0.2)
BERT-ft [First Obs. Only] | 63.9 (0.8) | 63.5 (0.7)
BERT-ft [Second Obs. Only] | 68.1 (0.6) | 66.6 (0.2)
BERT-ft [Linear Chain] | 65.3 (1.4) | 68.9 (0.5)
BERT-ft [Fully Connected] | 72.0 (0.5) | 68.6 (0.5)
Human Performance | - | 91.4

Table 2: Performance of baselines and finetuned-LM approaches on the test set of ART. Test accuracy is reported as the mean of five models trained with random seeds, with the standard deviation in parenthesis.

5 Experiments And Results

We now present our evaluation of finetuned state-of-the-art pre-trained language models on the ART dataset, and several other baseline systems. Since αNLI is framed as a binary classification problem, we choose accuracy as our primary metric.

In spite of strong performance on several other NLP benchmark datasets, the best baseline model based on BERT achieves an accuracy of just 68.9% on ART compared to human performance of 91.4%. The large gap between human performance and that of the best system provides significant scope for development of more sophisticated abductive reasoning models. Our experiments show that introducing the additional independence assumptions described in Section 3 over the fully connected model tends to degrade system performance in general (see Table 2).

Human Performance We compute human performance through AMT. Each instance (two observations and two hypothesis choices) is shown to three workers who were prompted to choose the more plausible hypothesis choice. 5 We compute majority vote on the labels assigned which leads to a human accuracy of 91.4% on the ART test set.
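The majority-vote computation is simple enough to sketch directly. The function names and the toy votes below are hypothetical; with three workers and two choices per instance a tie cannot occur, so no tie-breaking rule is needed.

```python
from collections import Counter


def majority_label(votes):
    """Most frequent label among one instance's worker votes."""
    return Counter(votes).most_common(1)[0][0]


def human_accuracy(all_votes, gold):
    """Fraction of instances whose majority-vote label equals the gold label."""
    correct = sum(majority_label(v) == g for v, g in zip(all_votes, gold))
    return correct / len(gold)


# Toy data: four instances, three workers each, labels 1 or 2.
toy_votes = [[1, 1, 2], [2, 2, 2], [1, 2, 1], [2, 1, 1]]
toy_gold = [1, 2, 2, 1]
print(human_accuracy(toy_votes, toy_gold))
```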

Baselines We include baselines that rely on simple features to verify that ART is not trivially solvable due to noticeable annotation artifacts, observed on several crowdsourced datasets (Gururangan et al., 2018; Poliak et al., 2018; Tsuchiya, 2018). The accuracies of all simple baselines are close to chance performance on the task, indicating that the dataset is free of simple annotation artifacts.

Specifically, we train an SVM classifier and a bag-of-words model using GloVe embeddings. Both models achieve accuracies close to 50%. 6 An InferSent (Conneau et al., 2017) baseline that uses sentences embedded by max-pooling over Bi-LSTM token representations achieves only 50.8% accuracy. A model for the related but distinct task of entailment NLI (e.g. SNLI) forms a natural baseline for αNLI. We re-train the ESIM+ELMo (Chen et al., 2017; Peters et al., 2018) model as its performance on entailment NLI (88.9%) is close to state-of-the-art models (excluding pre-trained language models). This model only achieves an accuracy of 58.8%, highlighting that performing well on ART requires models to go far beyond the linguistic notion of entailment.

Pre-trained Language Models BERT (Devlin et al., 2018) and GPT (Radford, 2018) have recently been shown to achieve state-of-the-art results on several NLP benchmarks (Wang et al., 2018). We finetune both BERT-Large and GPT as suggested in previous work and we present each instance in its natural narrative order. BERT-ft (fully connected) is the best performing model achieving 68.9% accuracy, compared to GPT's 63.1%. 7 Our AF approach was able to reduce BERT performance from over 88% by 20 points.

Learning Curve and Dataset Size While there is enough scope for considerably scaling up the dataset based on ROCStories, the learning curve in Figure 4 shows that the performance of the best model plateaus after ∼10,000 instances. In addition, there is still a wide gap (∼23%) between the performance of the best model and human performance.

Table 2 also includes results of our experiments where GPT was used as the adversary. Notably, in this case, adversarially filtering the dataset brings down GPT performance under 53%. On the other hand, the best BERT model, which encodes the fully connected Bayesian network, performs significantly better than the BERT model that encodes the linear chain assumptions: 72% compared to 65%. Therefore, we use the BERT fully connected model as the adversary in ART. The gap between the linear chain and fully connected BERT models diminishes when BERT is used as an adversary, in spite of it being a more powerful model, which indicates that adversarial filtering disproportionately impacts the model used as the adversary. However, the dataset also becomes more difficult for the other models that were not used as adversaries. For example, before any filtering, BERT scores 88% and OpenGPT gets 80%, which is much higher than either model achieves in Table 2 when the other model is used for filtering. This result is a reasonable indicator, albeit not a guarantee, that ART will remain challenging for new models released in the future.

Figure 4: BERT learning curve on the dev set of ART. For each point on the x-axis, we fine-tune BERT with five random seeds. Human performance is 91.4%.

6 Analysis

Commonsense reasoning categories We investigate the categories of commonsense-based abductive reasoning that are challenging for current systems and the ones where the best model overperforms. While there have been previous attempts to categorize commonsense knowledge required for entailment (LoBue & Yates, 2011; Clark et al., 2007), crowdsourcing this task at scale with high fidelity and high agreement across annotators remains challenging. Instead, we aim to probe the model with soft categories identified by matching lists of category-specific keywords to the hypothesis choices. Table 3 shows the accuracy of the best model (BERT-ft) across various categories of commonsense knowledge. BERT-ft significantly underperforms on instances involving Numerical (56.8%) and Spatial (65.4%) commonsense. These two categories include reasoning about numerical quantities and the spatial location of agents and objects, and highlight some of the limitations of the language models. In contrast, it significantly overperforms on the Emotional category (72.6%) where the hypotheses exhibit strong textual cues about emotions and sentiments.

Table 3: BERT's performance and human evaluation on categories for 1,000 instances from the test set, based on commonsense reasoning domains (Numerical, Spatial, Emotional). The number in parenthesis indicates the size of the category.

Recovered row of Table 3: Emotional (84): human accuracy 86.9, BERT 72.6, difference 14.3. Other rows not extracted; please refer to original document.

Implausible transitions A model for an instance of the ART dataset should discard implausible hypotheses in the context of the two given observations. In narrative contexts, there are three main reasons for an implausible hypothesis to be labeled as such:

1. O1 → h−: h− is (remainder of list not extracted; please refer to original document.)

We analyze the prevalence of each of these reasons in ART. We design a crowdsourcing task in which we show the implausible option along with the narrative context O1, O2 and get labels for which transition (O1 → h−, h− → O2, or neither) in the narrative chain is broken. Table 4 shows the proportion of each category from a subset of 1,000 instances from the test set. While h− → O2 accounts for almost half of the implausible transitions in ART, all three categories are substantially present in the dataset. BERT performance on each of these categories indicates that the model finds it particularly hard when the narrative created by the incorrect hypothesis is plausible, but less plausible than the correct hypothesis. On that subset of the test set, the fully connected model performs better than the linear chain model where it is important to consider both observations jointly to arrive at the more likely hypothesis.

Table 4: Fraction of dataset for which a particular transition in the story is broken for the negative hypothesis, for 1,000 random instances from the test set.
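The keyword-based soft categories described under "Commonsense reasoning categories" can be sketched as follows. The keyword lists here are hypothetical placeholders, since the actual lists used for the analysis are not given in this section, and the helper names are illustrative only.

```python
import re

# Hypothetical keyword lists; the lists actually used are not given in this text.
CATEGORY_KEYWORDS = {
    "Numerical": {"one", "two", "three", "half", "dozen", "twice"},
    "Spatial": {"inside", "outside", "under", "above", "behind", "near"},
    "Emotional": {"happy", "sad", "angry", "frustrated", "upset", "rude"},
}


def soft_categories(hypotheses):
    """Categories whose keyword list matches any word of the hypothesis choices."""
    words = set()
    for h in hypotheses:
        words |= set(re.findall(r"[a-z]+", h.lower()))
    return {c for c, kws in CATEGORY_KEYWORDS.items() if words & kws}


def accuracy_by_category(instances):
    """instances: iterable of (hyp1, hyp2, is_correct) for model predictions."""
    totals, hits = {}, {}
    for hyp1, hyp2, is_correct in instances:
        for c in soft_categories([hyp1, hyp2]):
            totals[c] = totals.get(c, 0) + 1
            hits[c] = hits.get(c, 0) + int(is_correct)
    return {c: hits[c] / totals[c] for c in totals}
```

An instance can fall into several categories at once, or into none, which is why the categories are "soft" rather than a partition of the test set.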

7 Related Work

Cloze-Style Task vs. Abductive Reasoning Since abduction is fundamentally concerned with plausible chains of cause-and-effect, our work draws inspiration from previous works that deal with narratives such as script learning (Schank & Abelson, 1975; Rudinger et al., 2015). Rather than learning prototypical scripts or narrative chains, we instead reason about the most plausible events conditioned on observations. We make use of the ROCStories dataset (Mostafazadeh et al., 2016), which was specifically designed for the narrative cloze task. But, instead of reasoning about plausible event sequences, our task requires reasoning about plausible explanations for narrative omissions.

Entailment vs. Abductive Reasoning The formulation of αNLI is closely related to entailment NLI, but there are two critical distinctions that make abductive reasoning uniquely challenging. First, abduction requires reasoning about commonsense implications of observations (e.g., if we observe that the "grass is wet", a likely hypothesis is that "it rained earlier") which go beyond the linguistic notion of entailment (also noted by Josephson (2000)). Second, abduction requires non-monotonic reasoning about a set of commonsense implications collectively, to check the potential contradictions against multiple observations and to compare the level of plausibility of different hypotheses. This makes abductive reasoning distinctly challenging compared to other forms of reasoning such as induction and deduction (Shank, 1998). Perhaps more importantly, abduction is closely related to the kind of reasoning humans perform in everyday situations, where information is incomplete and definite inferences cannot be made.

Related Datasets Our new resource ART complements ongoing efforts in building resources for natural language inference (Dagan et al., 2006; MacCartney & Manning, 2009; Bowman et al., 2015; Williams et al., 2018a; Camburu et al., 2018). Existing datasets have mostly focused on textual entailment in a deductive reasoning set-up (Bowman et al., 2015; Williams et al., 2018a) and making inferences about plausible events (Maslan et al., 2015; Zhang et al., 2017). In their typical setting, these datasets require a system to deduce the logically entailed consequences of a given premise. In contrast, the nature of abduction requires the use of commonsense reasoning capabilities, with less focus on lexical entailment. While abductive reasoning has been applied to entailment datasets (Raina et al., 2005), it has been applied in a logical theorem-proving framework as an intermediate step to perform textual entailment, a fundamentally different task than αNLI.

2 Pronounced as alpha-NLI.

3 ART: Abductive Reasoning in narrative Text.

4 We include the full validation set in the Supplemental Data. We will publicly release the ART dataset upon acceptance.

5 Additional crowdsourcing details in the Appendix.

6 Details about training the SVM and BOW baselines are in the Appendix.

7 The input format for the GPT model and BERT variants is described in the Appendix.

Conclusion

We present the first study that investigates the viability of language-based abductive reasoning. We conceptualize and introduce Abductive Natural Language Inference (αNLI), a novel task focused on abductive reasoning in narrative contexts. The task is formulated as a multiple-choice question-answering problem. We also create and introduce a new challenge dataset, ART, which consists of 20,000 commonsense narratives accompanied by over 200,000 explanatory hypotheses. In our experiments, we establish comprehensive baseline performance on this new task based on state-of-the-art NLI and language models, which leads to 68.9% accuracy with a considerable gap with human performance (91.4%). Our analysis leads to new insights into the types of reasoning that deep pre-trained language models fail to perform, despite their strong performance on the closely related but different task of entailment NLI, pointing to interesting avenues for future research. We hope that ART will serve as a challenging benchmark for future research in language-based abductive reasoning.

Table 5: Input formats for GPT and BERT fine-tuning.