Thinking Like a Skeptic: Defeasible Inference in Natural Language

Rachel Rudinger
Vered Shwartz
Jena D. Hwang
Chandra Bhagavatula
Maxwell Forbes
R. Le Bras
Noah A. Smith
Yejin Choi
FINDINGS
2020

Abstract

Defeasible inference is a mode of reasoning in which an inference (X is a bird, therefore X flies) may be weakened or overturned in light of new evidence (X is a penguin). Though long recognized in classical AI and philosophy, defeasible inference has not been extensively studied in the context of contemporary data-driven research on natural language inference and commonsense reasoning. We introduce Defeasible NLI (abbreviated δ-NLI), a dataset for defeasible inference in natural language. Defeasible NLI contains extensions to three existing inference datasets covering diverse modes of reasoning: common sense, natural language inference, and social norms. From Defeasible NLI, we develop both a classification and generation task for defeasible inference, and demonstrate that the generation task is much more challenging. Despite lagging human performance, however, generative models trained on this data are capable of writing sentences that weaken or strengthen a specified inference up to 68% of the time.

1 Introduction

Commonsense reasoning tasks are frequently formulated in terms of soft inferences: what is likely or plausibly true given some context, rather than (or in addition to) what is necessarily true. Given a context such as "The drinking glass fell", it is common sense to infer that what likely happened next is "The drinking glass broke". However, with the addition of new information, this inference may be blocked or weakened. If, for example, we subsequently learn that "The glass fell onto a pile of laundry" or that "The glass was made of durable material", our original expectation that the glass will break is greatly diminished. This pattern of reasoning, in which an initially supported inference may subsequently be weakened or retracted in light of new evidence, is known as defeasible reasoning (Koons, 2017). To the extent, then, that commonsense and natural language inference systems must be able to reason about plausible or likely inferences, they must also be able to reason about the defeasibility of those inferences. While most contemporary resources and datasets for these tasks attempt to directly address the former, few provide the context to facilitate the latter mode of reasoning.

Figure 1: Examples from the δ-SNLI portion of the δ-NLI dataset. A neutral premise-hypothesis pair from SNLI is augmented with three update sentences that weaken the hypothesis (left, red) and three update sentences that strengthen it (right, blue).

Premise: Two men and a dog are standing among rolling green hills.

Hypothesis: The men are farmers.

Weakeners: They are wearing backpacks. One man is using his binoculars. The men are studying a tour map.

Strengtheners: The men are holding pitchforks. The men are facing their granary. The dog is a sheep dog.

Tasks like the Recognizing Textual Entailment (RTE) challenge (Dagan et al., 2005) or Stanford Natural Language Inference (SNLI) (Bowman et al., 2015) capture entailment relations between a fixed context (premise) and inference (hypothesis), but do not reveal how these relations may shift in light of new information about the context. Similarly, knowledge graphs for commonsense reasoning like ATOMIC (Sap et al., 2019) or ConceptNet (Havasi et al., 2007; Speer et al., 2017) encode inference rules about generic situations, but do not elaborate on possible exceptions to the applications of those rules.

In this work, we introduce Defeasible NLI (abbreviated δ-NLI, pronounced "delta-NLI"), a new dataset to study defeasible inference in natural language. 1 δ-NLI is a collection of extensions to three existing English-language inference datasets, covering a broad range of phenomena: natural language inference (SNLI (Bowman et al., 2015)), commonsense reasoning (ATOMIC (Sap et al., 2019)), and reasoning about social norms (SOCIAL-CHEM-101 (Forbes et al., 2020)). We refer to these subsections of the dataset as δ-SNLI, δ-ATOMIC, and δ-SOCIAL, respectively. We augment each resource by eliciting additional contextual information ("updates") that either strengthen or weaken a given inference (which we term "strengtheners" and "weakeners," respectively). An example is provided in Fig. 1. From these three augmented datasets, we are able to devise two tasks for defeasible inference: (1) a classification task for predicting whether a provided update sentence acts as a strengthener or a weakener; and (2) a generation task in which a premise-hypothesis pair is provided as input and an update sentence that weakens or strengthens the hypothesis must be generated as output.

Through experiments in which we fine-tune pretrained language models for both tasks, we demonstrate that the generative task is much more challenging than the classification task. While system performance approaches human-level agreement on the classification task, the gap between system and human performance on the generative task is still considerable. We perform an extensive analysis of the failure and success modes of the generative defeasible inference models.

Finally, we observe that, not only is the generative task more challenging than the classification task, but it has an additional meaningful interpretation, namely, a system's ability to "think like a skeptic." That is to say, informally, a human who is engaging in skeptical reasoning is considering the possible weaknesses of a given claim or argument in order to come up with examples or counterarguments that may undermine it; by analogy, the generative task we introduce here requires a model to come up with (rather than simply verify) examples of circumstances that undermine the given hypothesis.

1 Data available at https://github.com/rudinger/defeasible-nli

2 Background And Related Work

Defeasible reasoning is soft inference based on default assumptions to account for unknown facts, for example, "Tweety is a bird" entails that "Tweety flies", because birds usually fly. Such a conclusion is not deductively valid, and might be invalidated by new information such as "Tweety is a penguin" (Reiter, 1980; Lascarides and Asher, 1991). Defeasible reasoning is a type of nonmonotonic logic, as it contrasts with the monotonicity property of classical logic, according to which valid inferences cannot be defeated by adding additional information (Kraus et al., 1990). Defeasible reasoning has been studied in a range of fields, from logic to linguistics and artificial intelligence.

Classical AI. In early AI, defeasible reasoning was used as a solution to the "frame problem": it is impossible to list all the potential effects of actions without describing mundane and obvious effects (McCarthy and Hayes, 1969). McDermott and Doyle (1980) offered a formal account of the proof systems and model theories of nonmonotonic logics. Default logic (Reiter, 1980) was suggested as a nonmonotonic logic that specifies a set of default assumptions, i.e., predicates that are true unless specified otherwise (e.g., bird(X) → fly(X)). In circumscription (McCarthy, 1980), defaults are expressed in natural language ("a bird will fly if it is not abnormal"). Pollock (1987) outlined a system for defeasible reasoning based on several different types of warranted inferences. Finally, Levesque (1990) suggested a special 'all I know is ...' operator, e.g. "All I know is that Tweety is a bird" entails that "Tweety flies".

Linguistics. In semantics and pragmatics, a distinction is drawn between entailments and implicatures. Entailments are inferences which are necessarily true, arising from the semantics of an utterance (e.g., "Pat is a bachelor" entails "Pat is unmarried"). Linguistic utterances also invite unstated pragmatic inferences, or implicatures, which depend not only on the semantics of the utterance but also its conversational context (Grice, 1975). Implicatures are cancellable (defeasible), meaning they could be revoked in light of further evidence. For instance, the comment "that cake looks delicious" might invite the inference that the speaker is requesting a slice, until they clarify that they have a food allergy. Building on this notion of default assumptions, Lascarides and Asher (1993) proposed to interpret discourse relations by defining defeasible rules based on commonsense knowledge of typical causes and effects.

Natural Language Processing. Textual entailment was defined as a softer version of semantic entailment, doubly hedging it with "a human would typically think that the hypothesis is likely true" (see Section 3, Dagan et al., 2005). It gained tremendous popularity again 10 years later, with the release of the large-scale Stanford Natural Language Inference dataset (SNLI; Bowman et al., 2015), which facilitated training neural models and was followed by several other datasets of that nature (Williams et al., 2018; Nie et al., 2019). But, among other criticisms of the task, it has been shown that people generally don't agree on entailment annotations (Pavlick and Kwiatkowski, 2019), and new variants of the task have been proposed that shift away from categorical labels to ordinal or numeric values denoting plausibility (Zhang et al., 2017; Sakaguchi and Van Durme, 2018; Chen et al., 2020). In this paper we focus on the defeasibility of textual entailments, a less well-studied phenomenon in this context.

3 Definition

In this paper, we employ a working definition of defeasible inference that may be seen as an outgrowth of prior work. Dagan et al. (2005) introduced the following informal definition for the Recognizing Textual Entailment (RTE) task:

...textual entailment is defined as a directional relationship between pairs of text expressions, denoted by T, the entailing "Text", and H, the entailed "Hypothesis". We say that T entails H if, typically, a human reading T would infer that H is most likely true.

Similarly, the task of Natural Language Inference (NLI) seeks to determine whether a (one-directional) entailment relation exists between a premise sentence and a hypothesis sentence (MacCartney, 2009; Bowman et al., 2015).

While the RTE and NLI tasks treat entailment relations as fixed, in this work we seek to understand how the introduction of new information can dynamically and directionally affect the strength of inference. Thus, our working definition of defeasible inference extends the RTE and NLI task formulations to model the relationship between a premise, hypothesis, and a third update sentence:

Given premise P, a hypothesis H is defeasible if there exists an update U (consistent with P) such that a human would find H less likely to be true after learning U. Specifically, an update U is called a weakener if, given a premise P and hypothesis H, a human would most likely find H less likely to be true after learning U; if they would find H more likely to be true, then we call U a strengthener.

By introducing both strengtheners and weakeners, we generalize from defeasibility as a one-directional phenomenon (weakening only) to study the bi-directional phenomenon.
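The working definition above can be pictured as a small record type. The following is a minimal sketch only: the class and field names are hypothetical and are not taken from the released dataset, and the example instance reuses the drinking-glass sentences from the introduction.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class UpdateType(Enum):
    """The two update directions named in the definition."""
    STRENGTHENER = "strengthener"
    WEAKENER = "weakener"


@dataclass(frozen=True)
class DefeasibleInstance:
    """One (P, H, U) triple plus its update type (hypothetical field names)."""
    hypothesis: str
    update: str
    update_type: UpdateType
    # The premise is optional: social-norm items consist of a hypothesis only.
    premise: Optional[str] = None


# Example built from the sentences used in the introduction.
example = DefeasibleInstance(
    premise="The drinking glass fell.",
    hypothesis="The drinking glass broke.",
    update="The glass fell onto a pile of laundry.",
    update_type=UpdateType.WEAKENER,
)
```

Keeping the premise optional lets the same record cover all three portions of the dataset, at the cost of leaving it to the caller to decide how a missing premise is handled.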

4 Data Sources

We collect strengtheners and weakeners for three different types of data sources that illustrate the generality of the defeasible inference framework. Table 1 shows example strengtheners and weakeners collected for the various tasks, detailed below.

Table 1: Examples of strengtheners and weakeners collected for the δ-SNLI, δ-ATOMIC, and δ-SOCIAL portions of the Defeasible NLI dataset.

Natural Language Inference

The SNLI dataset (Bowman et al., 2015) is a large-scale human-labeled dataset created for natural language inference. It is a collection of 570K crowdsourced English premise-hypothesis sentence pairs, each hypothesis manually classified as entailment, contradiction, or neutral with respect to its premise. The neutral pairs are of central interest in this work. In SNLI, neutral premise-hypothesis pairs are those in which the hypothesis is neither entailed nor contradicted by the premise (see Figure 1 for an example), leaving room for strengthening or weakening the statement if the appropriate conditions are provided. In our dataset we include 10K neutral premise and hypothesis pairs, as well as a small subset of instances that lacked annotation consensus. 2

Commonsense Knowledge Graph

The ATOMIC knowledge graph is a collection of 877K textual commonsense descriptions for inferential knowledge (Sap et al., 2019). The data was collected through crowdsourcing if-then knowledge about events and their commonsense relations to other events and states (relation targets). In ATOMIC, an event involving a PersonX is linked to multiple relation targets via relation types like xAttr (attribute of PersonX). For example, if "PersonX adopts a cat", then PersonX might take a subsequent action (xEffect; "buy cat litter"), be seen as having a particular persona (xAttr; "as seeking companionship"), or have a particular mental state as a result (xReact; "feels loved"). While these relations capture commonsense inferences that are plausible or even very likely, their likelihood could be dampened with additional context, e.g., in the above case "PersonX needs a barn cat for their mice problem". Thus, for the purposes of this study, we cast events as the premise and the relation targets as the defeasible hypotheses. In particular, we extract a total of 24K event (premise) and relation target (hypothesis) pairs. We limit the relation targets to six of nine relations corresponding to the explicit agent or PersonX in the event. The other three relations, which concern 'others' who may or may not be explicit participants in the event, are excluded.

2 Instances that were labeled '-' in SNLI.

Figure 2: Confusion matrix of human validation. Rows: the original update type for which updates were elicited. Columns: the update type that workers categorized them into during the validation step. Cells: percent of assignment into each category. “None” indicates no agreement between the annotators.

Statements Of Social Norms

The SOCIAL-CHEM-101 dataset of social norms (henceforth, Social Norms) compiles a collection of 292K crowdsourced natural language statements about commonsense social judgments made given everyday situations (Forbes et al., 2020). These statements represent generic commonsense hypotheses about social behaviors and their acceptability that are held as norms in a society. However, such normative judgments can also be strengthened or weakened given appropriate context. For example, a norm like "It is good to respect your parents" might be weakened in certain contexts (e.g., "Your parents are abusive and hurtful towards you") and strengthened in others (e.g., "Your parents want what's right for you"). In other words, we consider this set of norms of social behavior as hypotheses capable of being strengthened or weakened. For our dataset, we randomly extract 10K statements of social norms.

5 Data Collection

Our data collection is performed via crowdsourcing ( §5.1) and consists of two steps: update sentence elicitation ( §5.2) and validation ( §5.3).

5.1 Crowdsourcing

We carry out both the elicitation and validation steps via crowdsourcing on Amazon Mechanical Turk. To ensure the quality of our annotations, we have workers take a paid qualification test to assess their ability to follow instructions and to produce reasonable strengtheners and weakeners. The qualification test contains 6 manually selected premise-hypothesis pairs from SNLI whose hypotheses range from easy to difficult to defeat. We then manually evaluate their responses for quality and adherence to the guidelines. The 230 workers that provided acceptable updates (both strengtheners and weakeners) to a minimum of four test questions were selected to participate in the data collection tasks. Based on the feedback received from our worker pool, we updated the instructions with further clarifications and examples as necessary. Workers were paid over $15 per hour on average for all annotation tasks.

5.2 Update Sentence Elicitation

To collect update sentences for data sourced from SNLI and ATOMIC, we provide workers with a premise-hypothesis pair as prompt for which they are required to generate two free-text sentences: a strengthener and a weakener that will increase or decrease, respectively, the likelihood of the hypothesis being true. For the collection of updates for the Social Norms data, the workers are given the hypothesis and asked to provide two free-text sentences: a strengthener that supports the socionormative assumption made in the hypothesis ("especially if...") and a weakener that undermines such assumption ("unless..."). Each elicitation HIT is performed by five workers.

In both cases, we provide the workers with the option to specify that a hypothesis cannot be updated. In order to prevent workers from creating incorrect or trivial updates, we require that the update does not contradict the premise, repeat or rephrase any of the premise or hypothesis, or simply negate the hypothesis. 3 We also instruct workers to avoid writing sentences that involve making stereotyped or prejudicial assumptions about people based on their identities (see §8 for additional information).

5.3 Validation

In order to evaluate the validity of human annotations, we ask crowd workers to rate the collected strengtheners and weakeners with respect to the original premise-hypothesis pairs. The rating is on a 5-point Likert scale ranging from "weakens a lot" to "strengthens a lot" with a middle response category of "neutral" for those updates that have no update effect. Each validation HIT is annotated by three workers. The annotations yielded inter-annotator agreement with Krippendorff's α = 0.62, 0.67, 0.69 for SNLI, ATOMIC and Social Norms, respectively (Krippendorff, 1980). Figure 2 shows the results of the validation step. As is evident, workers in the validation step successfully identified the intended update type of elicited updates, indicating the high quality of the elicited updates. In general, strengtheners showed higher agreement than weakeners.
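One way the three Likert ratings per update could be collapsed into a single validated category is sketched below. The integer encoding of the scale (-2 to 2) and the strict-majority rule are assumptions made for illustration; the text does not specify the aggregation procedure, only that a "None" outcome denotes no agreement between annotators.

```python
from collections import Counter
from typing import Sequence


def collapse(rating: int) -> str:
    """Map a Likert rating to a coarse category.

    Assumed encoding: -2 = "weakens a lot", 0 = "neutral", 2 = "strengthens a lot".
    """
    if rating not in (-2, -1, 0, 1, 2):
        raise ValueError(f"rating out of range: {rating}")
    if rating < 0:
        return "weakener"
    if rating > 0:
        return "strengthener"
    return "neutral"


def aggregate(ratings: Sequence[int]) -> str:
    """Return the category picked by a strict majority of raters, else 'none'."""
    if not ratings:
        raise ValueError("at least one rating is required")
    counts = Counter(collapse(r) for r in ratings)
    label, votes = counts.most_common(1)[0]
    # A strict majority means more than half of all raters chose this category.
    return label if votes * 2 > len(ratings) else "none"
```

Under this rule, two workers choosing different degrees of weakening still count as agreeing on "weakener", since only the direction of the rating is compared.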

The size of each dataset is given in Table 2 . We assign instances into train, development, and test sets based on their split in the original datasets.

Table 2: Number of unique P-H pairs, strengtheners (S) and weakeners (W) in each section of the δ-NLI dataset.

6 Defeasible Inference Tasks

We formulate two tasks: a discriminative defeasible inference task ( §6.1) and a generative defeasible inference task ( §6.2).

6.1 Defeasible Inference As Classification

We pose a binary classification task for defeasible inference: given a hypothesis H, an optional premise P, and an update U, the goal is to determine the update type, i.e., whether U weakens or strengthens H. That is, given an input tuple (P, H, U), output a label in the set {STRENGTHENER, WEAKENER}.

To establish baseline performance, we fine-tune the transformer pretrained language model RoBERTa-base, which performs well in classification tasks, with a standard cross entropy loss, using the Transformers library (Wolf et al., 2019). We concatenate the sentences P, H, and U (separated by a special token) as input to RoBERTa, and select the best training run over five trials, run for two epochs each. Further training details are provided in the appendix. Following the hypothesis-only baseline suggested by Poliak et al. (2018), we also report the performance of versions of the model with partial inputs, i.e., (∅, H, U) or (∅, ∅, U).
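The full and partial input configurations can be illustrated with a small helper that builds the concatenated string. This is a sketch under stated assumptions: the separator string and the function name are hypothetical, since the text only says that the sentences are separated by "a special token", and tokenization and model code are omitted.

```python
from typing import Optional

# Assumed separator; the source does not name the special token used.
SEP = " </s> "


def build_input(premise: Optional[str], hypothesis: Optional[str], update: str) -> str:
    """Concatenate the available sentences in the order P, H, U.

    Passing None for the premise (and optionally the hypothesis) yields the
    partial-input baselines (∅, H, U) and (∅, ∅, U).
    """
    parts = [s for s in (premise, hypothesis, update) if s]
    return SEP.join(parts)


full = build_input("The drinking glass fell.", "The drinking glass broke.",
                   "The glass fell onto a pile of laundry.")
no_premise = build_input(None, "The drinking glass broke.",
                         "The glass fell onto a pile of laundry.")
update_only = build_input(None, None, "The glass fell onto a pile of laundry.")
```

Dropping the missing fields entirely, rather than inserting an empty placeholder, is one of several reasonable choices; the appendix referenced above would be the place to confirm what was actually done.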

Results. Table 3 displays the classification accuracy on each task. For the models which have access to the full input (P, H, U), accuracy is very close to human performance on each dataset. This suggests that discriminating between strengtheners and weakeners is a comparatively easy task for a strong pretrained language model like RoBERTa. For this reason, we primarily focus on the much more challenging task of generating strengtheners and weakeners, as described in the following subsection.

Table 3: Accuracy (%) on the test set of each classification task.

The ease of the classification task is partly explained by annotation artifacts (Gururangan et al., 2018; Poliak et al., 2018; Tsuchiya, 2018), which are a known limitation of some text datasets directly elicited from human annotators. (See §8 for a full discussion of these limitations.) To check, we train a baseline with access to only the update as input. While this baseline performs 15 to 20 points above the uninformed majority baselines (indicating the presence of annotation artifacts), it is still 13 to 15 points below the fully-informed models.

Interestingly, removing only the premise (but not hypothesis) from the input only slightly decreases overall accuracy. This suggests most of the necessary signal is present in the hypothesis and update. See §7 for further discussion.

6.2 Generative Defeasible Inference

In the generative defeasible task, given a hypothesis H, an optional premise P, and a required update type (weakener or strengthener), the goal is to generate an update U that satisfies this constraint, i.e., weakens or strengthens H.
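The caption of Table 4 gives the input format for the generative models as [premise] p [hypo] h [type], with the premise segment omitted for Social Norms and for the hypothesis-only models. A minimal sketch of that serialization follows; the function name and the use of single spaces between segments are assumptions.

```python
from typing import Optional


def build_prompt(hypothesis: str, update_type: str, premise: Optional[str] = None) -> str:
    """Serialize a generation input as '[premise] p [hypo] h [type]'.

    When premise is None, the '[premise] p' segment is left out, matching the
    '[hypo] h [type]' variant described for Social Norms and H-only models.
    """
    if update_type not in ("strengthener", "weakener"):
        raise ValueError(f"unknown update type: {update_type}")
    fields = []
    if premise is not None:
        fields += ["[premise]", premise]
    fields += ["[hypo]", hypothesis, f"[{update_type}]"]
    # Whitespace joining is an assumption about how segments are delimited.
    return " ".join(fields)


prompt = build_prompt(
    hypothesis="The drinking glass broke.",
    update_type="weakener",
    premise="The drinking glass fell.",
)
```

The model is then expected to continue the sequence with an update sentence of the requested type.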

We report the performance of several strong baselines, namely fine-tuned pre-trained transformer-based language models (LMs). Specifically, we study two types of models: (1) encoder-decoder models such as Bart and T5 (Raffel et al., 2019), which initialize the state of the transformer by reading the input, and learn to generate the output; and (2) standard LMs such as GPT and GPT2 (Radford et al., 2018, 2019), which are trained with the LM objective to predict the next word. We use the Transformers package (Wolf et al., 2019) and train each model for a single epoch with a batch size of 64. Further training details are provided in the appendix.

Automatic Evaluation. We follow the common practice of reporting automated generation evaluation metrics. We report the perplexity on the test set, as is often used to measure the performance of a language model. 4 In addition, we generated predictions for the test set using beam search with 5 beams, and evaluated them using standard n-gram based metrics: the precision-oriented BLEU-4 score (Papineni et al., 2002), which considers n-grams up to n = 4, and the recall-oriented ROUGE-L score (Lin, 2004), which considers the longest common subsequences. Table 4 presents the automatic evaluation results. We observe that the model preferences are consistent among BLEU and ROUGE. The GPT2-XL models perform best for δ-ATOMIC and the social norms dataset, and only slightly worse than the best model (Bart-large) on δ-SNLI. The model size does not have a major impact on performance, with GPT2-S performing moderately worse than GPT2-XL. The T5 model had the lowest performance across tasks in terms of BLEU and ROUGE.

Table 4: Automatic and human evaluation results on the test set, for the generative models. The input for each task was [premise] p [hypo] h [type] ([hypo] h [type] for Social Norms and for the hypothesis-only models), where [type] ∈ {[strengthener], [weakener]}, and p and h are the premise and hypothesis tokens, respectively. Note that the perplexities of the encoder-decoder models are not comparable to those of the standard LM models, since the latter compute the loss over the entire sequence, while the former compute the loss only over the output sequence. All models score well below human performance.

Table 5 (excerpt):
Premise: A man just roaming on the streets during night. Hypothesis: A man is roaming the streets at night, drunk. Type: S. Generated update: The man has a beer in his hand
Premise: -. Hypothesis: It is rude to point out their weight problem. Type: W. Generated update: You are a nutritionist

Table 6: Percentage distribution of generated updates over the analysis categories 1–9 (described in §7), for each combination of task and model.

Table 6 (excerpt):
Model        1     2     3     4     5     6     7     8     9
GPT2-XL      56.0  8.00  8.00  8.00  4.00  0.00  4.00  0.00  12.0
Bart-large   32.0  24.0  8.00  24.0  0.00  4.00  0.00  0.00  8.00
Overall      44.0  16.0  8.00  16.0  2.00  2.00  2.00  4.00  6.00
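ROUGE-L, one of the automatic metrics used above, rests on the longest common subsequence (LCS) between a generated sentence and a reference. The sketch below is a simplified, sentence-level illustration rather than the scoring script behind Table 4: it assumes lowercased whitespace tokenization, a single reference, and an unweighted F1 combination of LCS precision and recall.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, via row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate: str, reference: str) -> float:
    """LCS-based F1 between a candidate and one reference (simplified ROUGE-L)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

Because the score depends on word overlap with a reference, a valid update that is phrased differently from every reference scores poorly, which is the limitation that motivates the human evaluation described next.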

Human Evaluation. Automatic metrics penalize models for lexical variability and often do not correlate with human judgements (Novikova et al., 2017). Thus, our main evaluation is human evaluation. The goal of the human evaluation is to determine the effectiveness of the models at generating weakeners and strengtheners, focusing on the best model in each category, namely GPT2-XL and Bart-large. We used the same crowdsourcing setup as the validation step in §5.3, and asked workers to rate the generated strengtheners and weakeners on a 5-point Likert scale. Table 4 shows the human evaluation for Bart-large and GPT2-XL, in terms of accuracy score (e.g. a generated weakener was considered "correct" if the workers judged it as a weakener). As opposed to the automatic evaluation, in which these two models were comparable, here the outputs from GPT2-XL were judged as substantially better than those from Bart, though still leaving room for improvement. Across models, strengtheners were not only more easily agreed upon (§5.3) but also easier to predict than weakeners. In addition, the gap between the accuracy on strengtheners versus weakeners was narrower for GPT2-XL (17%) than for Bart (34%).

When applicable, we also report the performance of a hypothesis-only variant of the best-performing model (GPT2-XL H-only in Table 4), for which the input consists of the hypothesis and the update type, excluding the premise. While this baseline performs similarly to the full model in terms of automatic metrics, the human evaluation reveals that the H-only δ-SNLI model substantially underperforms the full model, suggesting that the generative model is making productive use of the premise in δ-SNLI; in the case of δ-ATOMIC, the disparity between the H-only and full models is much smaller.

7 Analysis Of Generated Updates

In order to analyze the quality of the generated updates, we sampled 150 instances from the development set (25 for each combination of task and model), and categorized their top prediction into the following categories, exemplified in Table 5.

1 Good: a strengthener that strengthens the hypothesis or a weakener that weakens the hypothesis. For instance, it is rude to discuss people's weight problems, unless you are their nutritionist, in which case it is socially acceptable.

2 Neutral: the update neither strengthened nor weakened the hypothesis. For example, the fact that the debt is to the IRS doesn't change our perception about the extent that PersonX wants to become debt free.

3 Weakener instead of strengthener: the generated strengthener weakened the hypothesis.

4 Strengthener instead of weakener: the generated weakener strengthened the hypothesis.

5 Restating the premise: updates that roughly repeated the premise.

6 Restating the hypothesis: updates that roughly repeated the hypothesis.

7 Contradicting the premise: the generated update (implicitly or explicitly) contradicted the premise. For instance, when the premise mentions picking up someone at the airport, but the update talks about driving them there.

8 Premise or hypothesis are nonsensical: stemming from annotation errors in the original datasets.

9 Update is nonsensical or other: updates that are nonsensical.

Table 5: Examples of generations with update type (W = weakener, S = strengthener), across tasks and models, that fall into each of the nine analysis categories 1–9 described in §7.

Table 6 displays the percentage of generations in each category for each task and model. The results reconfirm the findings from the human evaluation in §6.2, with GPT2-XL producing the most good generations: more than half of its generations for δ-SNLI and δ-SOCIAL were judged as good. The Bart models suffer from various types of errors.

Dual-purpose updates. In addition, we looked into instances from the development set where a single model generated an identical sentence as both a strengthener and a weakener (for a given premise-hypothesis pair). Ideally, such instances should be rare, as a sentence may increase or decrease the likelihood of a hypothesis, but not both. In practice, we found such overlaps to be a very common failure mode. For a given premise-hypothesis input, we measure the frequency with which each model generates an identical sentence across the top five sampled strengtheners and top five sampled weakeners. The percentage of inputs resulting in such overlaps was extremely high for the Bart models: 96.53%, 97.53%, and 99.48% for δ-ATOMIC, δ-SOCIAL, and δ-SNLI, respectively (among 1900, 979, and 194 instances). The corresponding rates for the GPT2 models were much lower (although non-negligible): 48.42%, 33.91%, and 33.91%, respectively.
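The overlap measurement described above can be sketched as follows. The function names are hypothetical, and the light normalization (lowercasing and collapsing whitespace) before comparing sentences is an assumption; the text only states that identical sentences are counted.

```python
from typing import Iterable, List, Tuple


def _normalize(sentence: str) -> str:
    """Lowercase and collapse whitespace (an assumed notion of 'identical')."""
    return " ".join(sentence.lower().split())


def has_overlap(strengtheners: List[str], weakeners: List[str]) -> bool:
    """True if any sentence appears in both the strengthener and weakener samples."""
    s_set = {_normalize(s) for s in strengtheners}
    w_set = {_normalize(w) for w in weakeners}
    return bool(s_set & w_set)


def overlap_rate(outputs: Iterable[Tuple[List[str], List[str]]]) -> float:
    """Percentage of inputs whose sampled strengtheners and weakeners overlap.

    Each element of `outputs` holds the sampled strengtheners and weakeners
    (for example, the top five of each) for one premise-hypothesis input.
    """
    pairs = list(outputs)
    if not pairs:
        return 0.0
    hits = sum(has_overlap(s, w) for s, w in pairs)
    return 100.0 * hits / len(pairs)


samples = [
    (["The man has a beer in his hand.", "He is stumbling."],
     ["He is walking his dog.", "the man has a beer in his hand."]),
    (["The men are holding pitchforks."], ["The men are studying a tour map."]),
]
rate = overlap_rate(samples)
```

A single shared sentence among the sampled outputs is enough to count an input as overlapping, which is consistent with the very high rates reported for the Bart models.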

Is the Premise Necessary? In the classification task, we observe that models trained without access to the premise perform nearly as well as those trained with access to the full context (premise, hypothesis, update). This raises the interesting question of what role the premise plays in defeasible natural language inference. It is possible that in many cases, the premise is not as crucial as one might expect. Recall the classic example of defeasible reasoning: "Tweety is a bird" (premise), therefore "Tweety flies" (hypothesis), however "Tweety is a penguin" (update), and thus Tweety does not fly. In this case, it is evident that, while the premise was necessary to originally derive the hypothesis, the update alone is sufficient to conclude the hypothesis no longer holds. 5 In fact, the premise is entailed by the update, and perhaps even discernible from the hypothesis.

However, we should not conclude the premise is unnecessary in all cases. In the generative task, removing the premise makes only a slight difference in performance for δ-ATOMIC (∆1.64%) but a substantial difference for δ-SNLI (∆10.93%) (perhaps due to more specific contexts in SNLI). Because all generative models lag human performance, however, it may simply be a property of current models that they are unable to effectively leverage information from the premise; to match human performance, they may need to leverage this information.

For further analysis, we took outputs from the GPT2-XL H-only model on SNLI and asked human evaluators to assess the outputs under two conditions: (1) annotator observing only the hypothesis, and (2) annotator observing both the premise and hypothesis. In 47.8% of cases, the output was labeled correct in both conditions; 34.1% of cases were labeled incorrect in both conditions. Interestingly, in 12.4% of cases, the output was labeled correct in condition (1) and incorrect in condition (2). This finding points to a proportion of cases where the model would need to integrate information from the premise to generate valid strengtheners and weakeners.

8 Limitations Of Elicitation

To collect the strengthener and weakener sentences in this work, we elicited sentences from crowdsource workers. Elicitation as a method of text data collection has a number of known flaws. In particular, (1) annotators may use label-dependent heuristics or strategies to produce sentences that introduce superficial correlations between text features and labels (Poliak et al., 2018; Gururangan et al., 2018; Tsuchiya, 2018); (2) elicitation may result in repeated responses of salient answers that are a small subset of all possible valid answers (McRae et al., 2005); and (3) elicited responses may contain implicit judgments or stereotypic associations about gender, race, and age, among others.

To avoid the first issue of annotation artifacts, we focus primarily on the generative task formulation, which is less susceptible to this problem. Regarding the second issue of coverage (or recall), we note that in this work we are concerned with whether it is possible for models to generate any correct weakeners or strengtheners in the first place; evaluating their ability to generate more exhaustively is a challenge we leave for future work. To address the third concern, we explicitly ask annotators to avoid such stereotyped associations in their responses. (See supplement for details.) This is an imperfect but expedient solution and for this reason we caution that the collected data is intended at this stage for scientific purposes only. Furthermore, we note that the elicited strengtheners and weakeners about social norms are subjective and, often, culturally dependent. This data should therefore be understood as descriptive of social norms (and their inherent subjectivity), rather than prescriptive of them.

9 Conclusion And Future Work

To the best of our knowledge, this is the first work to attempt merging long-standing ideas in AI about defeasible reasoning with contemporary formulations of natural language inference and commonsense reasoning tasks. We do this by crowdsourcing extensions to three existing inference datasets with enriched contexts that exemplify cases in which an inference is strengthened or weakened. From the collected data, we formulate a classification task and a generation task for defeasible inference in natural language. After demonstrating that the classification task is easily solved by state-of-the-art pretrained language models, we focus instead on the generative task of creating strengtheners or weakeners for a given premise-hypothesis pair, which we liken to "thinking like a skeptic." We demonstrate that fine-tuned language models successfully generate good-quality weakeners and strengtheners in 61-68% of cases.

Machine reasoning about the plausibility of inferences (Wang et al., 2018), let alone plausibility under different circumstances, is considered an unsolved problem and an obstacle to developing machine commonsense (Davis and Marcus, 2015). An inference engine with such capabilities may be useful for various applications that require reassessing conclusions under changing conditions, such as processing legal texts (Hage, 2005) and mining arguments (Bilu and Slonim, 2016). In knowledge base completion, "closed world" or default assumptions require the ability to defeat such assumptions given appropriate counterevidence. Such an ability was built into the Cyc inference engine (Lenat, 1995), but has been largely absent from modern knowledge bases.

Yet, a number of challenges remain for future work. In our qualitative analysis of generated outputs (§7), we identify a number of systematic error types that future modeling efforts may seek to address. While this work addresses the quality and accuracy of generated outputs, we leave the more challenging task of evaluating the coverage (recall) of those outputs to future work. Finally, joint modeling between defeasible inference and related reasoning tasks such as abductive reasoning (Peirce, 1960) and counterfactual reasoning (Goodman, 1947; Qin et al., 2019; Tandon et al., 2019) is a potentially fruitful line of inquiry.

... Government is authorized to reproduce and distribute reprints for governmental purposes. The views and conclusions contained in this publication are those of the authors and should not be interpreted as representing official policies or endorsement.

A Model Hyperparameters

Classification Task All models for the classification task were trained on a single NVIDIA Tesla P100 GPU on a Google Cloud instance. All models were fine-tuned with RoBERTa-base, which has 115M parameters. Best accuracy of five runs on the development set is reported in Table 7.

Table 7: Accuracy (%) on the dev set of each classification baseline.

Generation Task All models were trained on a single NVIDIA Quadro RTX 8000 GPU. Runtime ranged from 5 minutes (GPT2-S on ATOMIC) to 3.5 hours (GPT2-XL on SNLI). The number of parameters ranges from 117M (GPT2-S) to 1.558B (GPT2-XL). Table 8 shows the generative models' performance on the dev set.

Table 8: Automatic evaluation results on the dev set, for the generative models.

B Crowdsourcing Task

Figures 3 and 4 display the full instructions shown to the crowdsourcing workers for the δ-SNLI and δ-ATOMIC update elicitation and for the social norms update elicitation, respectively.

See supplementary material for the complete HIT template.

Micro and macro perplexities were identical.

The question of the importance of the premise is perhaps relevant to another question that arose in earlier studies of defeasible inference, namely the role of human memory, and whether a belief could be defeated with new evidence if the holder of that belief did not recall the reason for it (Pollock, 1987).

Figure 3: HIT Template for update elicitation for the δ-SNLI and δ-ATOMIC data.

Figure 4: HIT Template for update elicitation for the δ-SOCIAL data.