Probing Natural Language Inference Models through Semantic Fragments

Authors

Kyle Richardson, Hai Hu, Lawrence S. Moss, Ashish Sabharwal

Abstract

Do state-of-the-art models for language understanding already have, or can they easily learn, abilities such as boolean coordination, quantification, conditionals, comparatives, and monotonicity reasoning (i.e., reasoning about word substitutions in sentential contexts)? While such phenomena are involved in natural language inference (NLI) and go beyond basic linguistic understanding, it is unclear to what extent they are captured in existing NLI benchmarks and effectively learned by models. To investigate this, we propose the use of semantic fragments---systematically generated datasets that each target a different semantic phenomenon---for probing, and efficiently improving, such capabilities of linguistic models. This approach to creating challenge datasets allows direct control over the semantic diversity and complexity of the targeted linguistic phenomena, and results in a more precise characterization of a model's linguistic behavior. Our experiments, using a library of 8 such semantic fragments, reveal two remarkable findings: (a) State-of-the-art models, including BERT, that are pre-trained on existing NLI benchmark datasets perform poorly on these new fragments, even though the phenomena probed here are central to the NLI task. (b) On the other hand, with only a few minutes of additional fine-tuning---with a carefully selected learning rate and a novel variation of "inoculation"---a BERT-based model can master all of these logic and monotonicity fragments while retaining its performance on established NLI benchmarks.

Introduction

Natural language inference (NLI) is the task of detecting inferential relationships between natural language descriptions. For example, given the pair of sentences All dogs chased some cat and All small dogs chased a cat shown in Figure 1, the goal for an NLI model is to determine that the second sentence, known as the hypothesis sentence, follows from the meaning of the first sentence (the premise sentence). Such a task is known to involve a wide range of reasoning and knowledge phenomena, including knowledge that goes beyond basic linguistic understanding (e.g., elementary logic). As one example of such knowledge, the inference in Figure 1 involves monotonicity reasoning (i.e., reasoning about word substitutions in context); here the position of dogs in the premise occurs in a downward monotone context (marked as ↓), meaning that it can be specialized (i.e., substituted with a more specific concept such as small dogs) to generate an entailment relation. In contrast, substituting dogs for a more generic concept, such as animal, has the effect of generating a NEUTRAL inference.

Figure 1: An illustration of our proposed method for studying NLI model behavior through semantic fragments. A linguistically interesting issue (e.g., monotonicity, as in the premise All↑ dogs↓ chased↑ some↑ cat↑, which entails the hypothesis All small dogs chased a cat) is given a formal specification, from which an idealized challenge dataset of NLI pairs is generated; this dataset is then used to ask three empirical questions: 1. Is this fragment learnable using existing NLI architectures? 2. How do pre-trained NLI models perform on this fragment? 3. Can models be fine-tuned/re-trained to master this fragment?

In an empirical setting, it is desirable to be able to measure the extent to which a given model captures such types of knowledge. We propose to do this using a suite of controlled dataset probes that we call semantic fragments.

While NLI has long been studied in linguistics and logic, with a focus on specific types of logical phenomena such as monotonicity inference, attention to these topics has come to empirical NLP only recently. Progress in empirical NLP has accelerated due to the introduction of new large-scale NLI datasets, such as the Stanford Natural Language Inference (SNLI) dataset (Bowman et al. 2015) and MultiNLI (MNLI) (Williams, Nangia, and Bowman 2018), coupled with new advances in neural modeling and model pre-training (Conneau et al. 2017; Devlin et al. 2018). With these performance increases has come increased scrutiny of systematic annotation biases in existing datasets (Poliak et al. 2018b; Gururangan et al. 2018), as well as attempts to build new challenge datasets that focus on particular linguistic phenomena (Glockner, Shwartz, and Goldberg 2018; Naik et al. 2018; Poliak et al. 2018a). The latter aim to more definitively answer questions such as: are models able to effectively learn and extrapolate complex knowledge and reasoning abilities when trained on benchmark tasks?

To date, studies using challenge datasets have largely been limited by the simple types of inferences that they include (e.g., lexical and negation inferences). They fail to cover more complex reasoning phenomena related to logic, and primarily use adversarially generated corpus data, which sometimes makes it difficult to identify exactly which semantic phenomenon is being tested for. There is also a focus on datasets that can easily be constructed and/or verified using crowd-sourcing techniques. Adequately evaluating a model's competence on a given reasoning phenomenon, however, often requires datasets that are hard even for humans, but that are nonetheless based on sound formal principles (e.g., reasoning about monotonicity where, in contrast to the simple example in Figure 1, several nested downward monotone contexts are involved to test the model's capacity for compositionality, cf. Lake and Baroni (2017)).

In contrast to existing work on challenge datasets, we propose using semantic fragments, synthetically generated challenge datasets of the sort used in linguistics, to study NLI model behavior. Semantic fragments provide the ability to systematically control the semantic complexity of each new challenge dataset by bringing to bear the expert knowledge encapsulated in formal theories of reasoning, making it possible to more precisely identify model performance and competence on a given linguistic phenomenon. While our idea of using fragments is broadly applicable to any linguistic or reasoning phenomena, we look at eight types of fragments that cover several fundamental aspects of reasoning in NLI, namely, monotonicity reasoning using two newly constructed challenge datasets as well as six other fragments that probe into rudimentary logic using new versions of the data from Salvatore, Finger, and Hirata Jr (2019).

As illustrated in Figure 1, our proposed method works in the following way: starting with a particular linguistic fragment of interest, we create a formal specification (or a formal rule system with certain guarantees of correctness) of that fragment, with which we then automatically generate a new idealized challenge dataset, and ask the following three empirical questions. 1) Is this particular fragment learnable from scratch using existing NLI architectures (if so, are the resulting models useful)? 2) How well do large state-of-the-art pre-trained NLI models (i.e., models trained on all known NLI data such as SNLI/MNLI) do on this task? 3) Can existing models be quickly re-trained or re-purposed to be robust on these fragments (if so, does mastering a given linguistic fragment affect performance on the original task)?

We emphasize the quickly part in the last question; given the multitude of possible fragments and linguistic phenomena that can be formulated and that we expect a wide-coverage NLI model to cover, we believe that models should be able to efficiently learn and adapt to new phenomena as they are encountered without having to learn entirely from scratch. In this paper we look specifically at the question: are there particular linguistic fragments (relative to other fragments) that are hard for these pre-trained models to adapt to or that confuse the model on its original task?

On these eight fragments, we find that while existing NLI architectures can effectively learn these particular linguistic phenomena, pre-trained NLI models do not perform well. This, as in other studies (Glockner, Shwartz, and Goldberg 2018), reveals weaknesses in the ability of these models to generalize. While most studies into linguistic probing end the story there, we take the additional step of seeing whether continuing the learning, by re-fine-tuning these models on fragments (using a novel and cheap inoculation strategy), can improve performance. Interestingly, we show that this yields mixed results depending on the particular linguistic phenomena and model being considered. For some fragments (e.g., comparatives), re-training some models comes at the cost of degrading performance on the original tasks, whereas for other phenomena (e.g., monotonicity) the learning is more stable, even across different models. These findings, and our technique of obtaining them, make it possible to identify the degree to which a given linguistic phenomenon stresses a benchmark NLI model, and suggest a new methodology for quickly making models more robust.

Related Work

The use of semantic fragments has a long tradition in logical semantics, starting with the seminal work of Montague (1973). We follow Pratt-Hartmann (2004) in defining a semantic fragment more precisely as a subset of a language equipped with semantics that translate sentences into a formal system such as first-order logic. In contrast to work on empirical NLI, such linguistic work often emphasizes the complex cases of each phenomenon in order to measure competence (see Chomsky (1965) for a discussion of competence vs. performance). For our fragments that test basic logic, the target formal system includes basic boolean algebra, quantification, set comparisons, and counting (see Figure 2), and builds on the datasets from Salvatore, Finger, and Hirata Jr (2019). For our second set of fragments, which focus on monotonicity reasoning, the target formal system is based on the monotonicity calculus of van Benthem (1986) (see the review by Icard and Moss (2014)). To construct these datasets, we build on recent work on automatic polarity projection (Hu and Moss 2018; Hu, Chen, and Moss 2019).

Figure 2: Information about the semantic fragments considered in this paper, where the first set of fragments tests basic logic (Logic Fragments) and the last set covers monotonicity reasoning (Mono. Fragments).

Our work follows other attempts to learn neural models from fragments and small subsets of language, including work on syntactic probing (McCoy, Pavlick, and Linzen 2019; Goldberg 2019), probing basic reasoning (Weston et al. 2015; Geiger et al. 2018), and probing other tasks (Lake and Baroni 2017; Chrupała and Alishahi 2019). Geiger et al. (2018) is the closest work to ours. However, they intentionally focus on artificial fragments that deviate from ordinary language, whereas our fragments (despite being automatically constructed and sometimes a bit pedantic) aim to test naturalistic subsets of English. In a similar spirit, there have been other attempts to collect datasets that target different types of inference phenomena (White et al. 2017; Poliak et al. 2018a), which have been limited in linguistic complexity. Other attempts to study complex phenomena such as monotonicity reasoning in NLI models have been limited to training data augmentation (Yanaka et al. 2019b). In contrast to other work on challenge datasets (Glockner, Shwartz, and Goldberg 2018; Naik et al. 2018), we focus on the trade-off between mastering a particular linguistic fragment or phenomenon independent of other tasks and data (i.e., Question 1 from Figure 1), while also maintaining performance on other NLI benchmark tasks (i.e., related to Question 3 in Figure 1). To study this, we introduce a novel variation of the inoculation through fine-tuning methodology of Liu, Schwartz, and Smith (2019), which emphasizes maximizing the model's aggregate score over multiple tasks (as opposed to only on challenge tasks). Since our new challenge datasets focus narrowly on particular linguistic phenomena, we take this in the direction of seeing more precisely the extent to which a particular linguistic fragment stresses an existing NLI model. In addition to the task-specific NLI models looked at in Liu, Schwartz, and Smith (2019), we inoculate with the state-of-the-art pre-trained BERT model, using the fine-tuning approach of Devlin et al. (2018), which itself is based on the transformer architecture of Vaswani et al. (2017).

Some Semantic Fragments

As shown in Figure 1, given a particular semantic fragment or linguistic phenomenon that we want to study, our starting point is a formal specification of that fragment (e.g., in the form of a set of templates/formal grammar that encapsulate expert knowledge of that phenomenon), which we can sample in order to obtain a new challenge set. In this section, we describe the construction of the particular fragments we investigate in this paper, which are illustrated in Figure 2.

The Logic Fragments The first set of fragments probes into problems involving rudimentary logical reasoning. Using a fixed vocabulary of people and place names, the individual fragments cover boolean coordination (boolean reasoning about the conjunction and), simple negation, quantification and quantifier scope (quantifier), comparative relations, set counting, and conditional phenomena, all related to a small set of traveling and height relations.

These fragments (with the exception of the conditional fragment, which was built specially for this study) were first built using the set of verb-argument templates described in Salvatore, Finger, and Hirata Jr (2019). Since their original rules were meant for 2-way NLI classification (i.e., ENTAILMENT and CONTRADICTION), we repurposed their rule sets to handle 3-way classification, and added other inference rules, which resulted in some of the templates shown in Figure 3. For each fragment, we uniformly generated 3,000 training examples and reserved 1,000 examples for testing. As in Salvatore, Finger, and Hirata Jr (2019), the people and place names for testing are drawn from a set entirely disjoint from training. We also reserve 1,000 examples for development. While we were capable of generating more data, we follow Weston et al. (2015) in limiting the size of our training sets to 3,000, since our goal is to learn from as little data as possible, and we found 3,000 training examples to be sufficient for most fragments and models.
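To make the generation procedure concrete, the following is a minimal sketch of how 3-way examples with disjoint train/test name inventories can be produced from a template. The template, vocabularies, and label logic here are illustrative stand-ins rather than the exact rule sets adapted from Salvatore, Finger, and Hirata Jr (2019).

```python
import random

# Illustrative (not the paper's exact) vocabularies; test names are disjoint from training.
TRAIN_PEOPLE = ["John", "Mary", "Anna", "Paul"]
TEST_PEOPLE = ["Kim", "Luis", "Noor", "Ravi"]
PLACES = ["Paris", "Rome", "Tokyo", "Lima"]

def negation_example(people, places, rng):
    """One hypothetical negation-style template with 3-way labels."""
    a, b = rng.sample(people, 2)
    place = rng.choice(places)
    premise = f"{a} has visited {place} and {b} has not visited {place}."
    label = rng.choice(["ENTAILMENT", "CONTRADICTION", "NEUTRAL"])
    if label == "ENTAILMENT":
        hypothesis = f"{b} has not visited {place}."          # stated in the premise
    elif label == "CONTRADICTION":
        hypothesis = f"{b} has visited {place}."              # negates the premise
    else:
        c = rng.choice([p for p in people if p not in (a, b)])
        hypothesis = f"{c} has visited {place}."              # about an unmentioned person
    return {"premise": premise, "hypothesis": hypothesis, "label": label}

def build_split(people, n, seed):
    rng = random.Random(seed)
    return [negation_example(people, PLACES, rng) for _ in range(n)]

train = build_split(TRAIN_PEOPLE, 3000, seed=0)   # 3,000 training examples
dev = build_split(TEST_PEOPLE, 1000, seed=1)      # 1,000 dev examples (disjoint names)
test = build_split(TEST_PEOPLE, 1000, seed=2)     # 1,000 test examples (disjoint names)
print(train[0])
```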

Figure 3: A description of some of the templates used for 4 of the logic fragments (stemming from Salvatore, Finger, and Hirata Jr (2019)), expressed in a quasi-logical notation with predicates p, q, only-did-p, quantifiers ∃ (there exists), ∀ (for all), ι (there exists a unique), and boolean connectives ∧ (and), → (if-then), ¬ (not).

As detailed in Figure 2, these new fragments vary in complexity, with the negation fragment being the least complex in terms of linguistic phenomena. We also note that all other fragments include basic negation and boolean operators, which we found to help preserve the naturalness of the examples in each fragment. As shown in the last column of Figure 2, some of our fragments (notably, negation and comparatives) have, on average, sentence lengths that exceed those of benchmark datasets. This is largely due to the productive nature of some of our rules. For example, the comparatives rule set allows us to create arbitrarily long sentences by generating long lists of people that are being compared (e.g., in John is taller than ..., we can list up to 15 people in the subsequent list).

Whenever creating synthetic data, it is important to ensure that one is not introducing particular annotation artifacts into the rule sets that make the resulting challenge datasets trivially learnable. As shown in the top part of Table 1, which we discuss later, we found that several strong baselines failed to solve our fragments, showing that the fragments, despite their simplicity and constrained nature, are indeed not trivial to solve.

Table 1: Baseline models and model performance (accuracy %) on NLI benchmarks and challenge test sets (before and after re-training), including the Breaking NLI challenge set from Glockner, Shwartz, and Goldberg (2018). The arrows ↓ in the last section show the average drop in accuracy on the MNLI benchmark after re-training with the fragments.

The Monotonicity Fragments The second set of fragments covers monotonicity reasoning, as first discussed in the introduction. This fragment can be described using a regular grammar with polarity facts according to the monotonicity calculus, such as the following: every is downward monotone/entailing in its first argument but upward monotone/entailing in the second, denoted by the ↓ and ↑ arrows in the example sentence every↑ small↓ dog↓ ran↑. We have manually encoded monotonicity information for 14 types of quantifiers (every, some, no, most, at least 5, at most 4, etc.) and negators (not, without) and generated sentences using a simple regular grammar and a small lexicon of about 100 words (see Appendix for details). We then use the system described by Hu and Moss (2018) (https://github.com/huhailinguist/ccg2mono) to automatically assign arrows to every token (see Figure 4; note that = means that the inference is neither monotonically up nor down in general). Because we manually encoded the monotonicity information of each token in the lexicon and built sentences via a controlled set of grammar rules, the resulting arrows assigned by Hu and Moss (2018) can be proved to be correct. Once we have the sentences with arrows, we use the algorithm of Hu, Chen, and Moss (2019) to generate pairs of sentences with ENTAIL, NEUTRAL, or CONTRADICTORY relations, as exemplified in Figure 4. Specifically, we first define a knowledge base that stores the relations of the lexical items in our lexicon, e.g., poodle ≤ dog ≤ mammal ≤ animal; also, waltz ≤ dance ≤ move; and every ≤ most ≤ some = a. For nouns, ≤ can be understood as the subset-superset relation. For higher-type objects like the determiners above, see Icard and Moss (2013) for discussion. Then, to generate entailments, we perform substitution: we substitute upward entailing tokens or constituents with something "greater than or equal to" (≥) them, or downward entailing ones with something "less than or equal to" them. To generate neutrals, substitution goes in the reverse direction. For example, all↑ dogs↓ danced↑ ENTAILS all poodles danced, while all↑ dogs↓ danced↑ is NEUTRAL with respect to all mammals danced, given the facts we have seen: poodle ≤ dog ≤ mammal. Simple rules such as "replace some/many/every in subjects by no" or "negate the main verb" are applied to generate contradictions.

Figure 4: Generating ENTAILMENT for monotonicity fragments starting from the premise (top). Each node in the tree shows an entailment generated by one substitution. Substitutions are based on a hand-coded knowledge base with information such as: all ≤ some/a, poodle ≤ dog ≤ mammal, and black mammal ≤ mammal. CONTRADICTION examples are generated for each inference using simple rules such as “replace some/many/every in subjects by no”. NEUTRALs are generated in a reverse manner as the entailments.

Using this basic machinery, we generated two separate challenge datasets, one with limited complexity (e.g., each example is limited to 1 relative clause and uses an inventory of 5 quantifiers), which we refer to throughout as monotonicity (simple), and one with more overall quantifiers and substitutions, or monotonicity (hard) (up to 3 relative clauses and a larger inventory of 14 unique quantifiers). Both are defined over the same set of lexical items (see Figure 2).
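A minimal sketch of the substitution step is shown below, assuming polarity arrows have already been assigned (e.g., by the ccg2mono system). The tiny knowledge base, premise, and plural-only lexical entries are illustrative, not the full lexicon and grammar of Figure 8.

```python
# Toy knowledge base: (x, y) means x <= y, i.e., x is at least as specific as y.
LEQ = {("poodles", "dogs"), ("dogs", "mammals"), ("mammals", "animals"),
       ("waltzed", "danced"), ("danced", "moved"),
       ("all", "most"), ("most", "some")}

def leq(x, y):
    """Reflexive-transitive closure of the <= relation."""
    if x == y:
        return True
    return any(a == x and leq(b, y) for (a, b) in LEQ)

def substitutions(tokens, arrows, label):
    """Replace one token at a time according to its polarity arrow.

    ENTAILMENT: an upward (up) token may be replaced by something >= it,
    a downward (down) token by something <= it; NEUTRAL reverses the directions.
    """
    vocab = sorted({w for pair in LEQ for w in pair})
    for i, (tok, arrow) in enumerate(zip(tokens, arrows)):
        for cand in vocab:
            if cand == tok:
                continue
            more_general = leq(tok, cand)   # cand >= tok
            more_specific = leq(cand, tok)  # cand <= tok
            if label == "ENTAILMENT":
                ok = (arrow == "up" and more_general) or (arrow == "down" and more_specific)
            else:  # NEUTRAL
                ok = (arrow == "up" and more_specific) or (arrow == "down" and more_general)
            if ok:
                yield " ".join(tokens[:i] + [cand] + tokens[i + 1:])

premise = ["all", "dogs", "danced"]
arrows = ["up", "down", "up"]  # all↑ dogs↓ danced↑
print(list(substitutions(premise, arrows, "ENTAILMENT")))
# ['most dogs danced', 'some dogs danced', 'all poodles danced', 'all dogs moved']
print(list(substitutions(premise, arrows, "NEUTRAL")))
# ['all animals danced', 'all mammals danced', 'all dogs waltzed']
```

Contradictions would then be generated from each such sentence by rules like "replace some/many/every in subjects by no", as described above.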

Experimental Setup And Methodology

To address the questions in Figure 1, we experiment with two task-specific NLI models from the literature, the ESIM model of Chen et al. (2017) and the decomposable-attention (Decomp-Attn) model of Parikh et al. (2016), as implemented in the AllenNLP toolkit (Gardner et al. 2018), and the pre-trained BERT architecture of Devlin et al. (2018); we use the BERT-base uncased model in all experiments, as implemented in HuggingFace: https://github.com/huggingface/pytorch-pretrained-BERT. When evaluating whether fragments can be learned from scratch (Question 1), we simply train models on these fragments directly using standard training protocols. To evaluate pre-trained NLI models on individual fragments (Question 2), we train BERT models on combinations of the SNLI and MNLI datasets from GLUE (Wang et al. 2018), and use pre-trained ESIM and Decomp-Attn models trained on MNLI following Liu, Schwartz, and Smith (2019).
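For concreteness, probing a pre-trained NLI model on a fragment (Question 2) amounts to batch classification of premise/hypothesis pairs. The sketch below uses the current HuggingFace transformers API rather than the older pytorch-pretrained-BERT package cited above, and the checkpoint path and label order are placeholders to be replaced with one's own SNLI/MNLI fine-tuned model.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Placeholder: substitute a BERT checkpoint fine-tuned for 3-way NLI (e.g., on SNLI+MNLI).
MODEL_NAME = "path/to/bert-finetuned-on-snli-mnli"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=3)
model.eval()

# Assumed label order; check the label mapping stored with your checkpoint.
LABELS = ["ENTAILMENT", "CONTRADICTION", "NEUTRAL"]

def predict(premise, hypothesis):
    """Classify a single premise/hypothesis pair with the NLI model."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[logits.argmax(dim=-1).item()]

print(predict("All dogs chased some cat.", "All small dogs chased a cat."))
```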

To evaluate whether a pre-trained NLI model can be re-trained to improve on a fragment (Question 3), we employ the recent inoculation by fine-tuning method (Liu, Schwartz, and Smith 2019). The idea is to re-fine-tune (i.e., continue training) the models above using k examples of fragment training data, where k ranges from 50 to 3,000 (i.e., from a very small subset of the fragment dataset to the full training set; see the horizontal axes in Figures 5, 6, and 7). The intuition is that by doing this, we see the extent to which the additional data makes the model more robust in handling each fragment, or instead stresses it, resulting in performance loss on its original benchmark. In contrast to re-training models from scratch with the original data augmented with our fragment data, fine-tuning on only the new data is substantially faster, requiring in many cases only a few minutes. This is consistent with our requirement, discussed previously, that training existing models to be robust on new fragments should be quick, given the multitude of fragments that we expect to encounter over time. For example, in coming up with new linguistic fragments, we might find newer fragments that are not represented in the model; it would be prohibitive to re-train the model each time entirely from scratch with its original data (e.g., the 900k examples in SNLI+MNLI) augmented with the new fragment.

Our approach to inoculation differs from Liu, Schwartz, and Smith (2019) in explicitly optimizing the aggregate score of each model on both its original and new task. More formally, let $k$ denote the number of examples of fragment data used for fine-tuning. Ideally, we would like to be able to fine-tune each pre-trained NLI model architecture $a$ (e.g., BERT) to learn a new fragment perfectly with a minimal $k$, while, importantly, not losing performance on the original task that the model was trained for (e.g., SNLI or MNLI). Given that fine-tuning is sensitive to hyper-parameters (we found all models to be sensitive to the learning rate, and performed comprehensive hyper-parameter searches over different learning rates, numbers of iterations, and, for BERT, random seeds), we use the following methodology: for each $k$ we fine-tune $J$ variations of a model architecture, denoted $M^{a,k}_j$ for $j \in \{1, \ldots, J\}$, each characterized by a different set of hyper-parameters. We then identify a model $M^{a,k}_*$ with the best aggregated performance based on its score $S_{\text{frag}}(M^{a,k}_j)$ on the fragment dataset and $S_{\text{orig}}(M^{a,k}_j)$ on the original dataset. For simplicity, we use the average of these two scores as the aggregated score (other ways of aggregating the two scores could be substituted; e.g., one could maximize $S_{\text{frag}}(M^{a,k}_j)$ while requiring that $S_{\text{orig}}(M^{a,k}_j)$ is not much worse relative to when the model's hyper-parameters are optimized directly for the original dataset). Thus, we have:

$$M^{a,k}_{*} = \operatorname*{argmax}_{M \in \{M^{a,k}_{1}, \ldots, M^{a,k}_{J}\}} \operatorname{AVG}\big(S_{\text{frag}}(M),\, S_{\text{orig}}(M)\big)$$

By keeping the hyper-parameter space consistent across all fragments, we can observe how certain fragments behave relative to one another.
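A minimal sketch of this selection loop is given below; fine_tune and evaluate are hypothetical helpers (standing in for whatever training and scoring code is used), and the particular grids of k values, learning rates, and seeds are illustrative rather than the exact ones used in our experiments.

```python
from itertools import product
from statistics import mean

def inoculate(pretrained_model, fragment_train, fragment_dev, orig_dev,
              fine_tune, evaluate,
              ks=(50, 100, 500, 1000, 2000, 3000),
              learning_rates=(1e-5, 2e-5, 3e-5), seeds=(1, 2, 3)):
    """Inoculation by fine-tuning with aggregate-score model selection.

    fine_tune(model, data, lr=..., seed=...) and evaluate(model, data) are
    hypothetical callables; this sketch only shows the search/selection logic.
    """
    results = {}
    for k in ks:
        subset = fragment_train[:k]                       # k fragment examples
        candidates = []
        for lr, seed in product(learning_rates, seeds):   # the J hyper-parameter variants
            model = fine_tune(pretrained_model, subset, lr=lr, seed=seed)
            s_frag = evaluate(model, fragment_dev)        # S_frag(M)
            s_orig = evaluate(model, orig_dev)            # S_orig(M), e.g., MNLI dev
            candidates.append((mean([s_frag, s_orig]), s_frag, s_orig, model))
        # argmax over variants of the averaged score AVG(S_frag, S_orig)
        best = max(candidates, key=lambda c: c[0])
        results[k] = {"avg": best[0], "frag": best[1], "orig": best[2], "model": best[3]}
    return results
```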

Additional Baselines To ensure that the challenge datasets that are generated from our fragments are not trivially solvable and subject to annotation artifacts, we implemented variants of the Hypothesis-Only baselines from Poliak et al. (2018b), as shown at the top of Table 1. This involves training a single-layered biLSTM encoder for the hypothesis side of the input, which generates a representation for the input using max-pooling over the hidden states, as originally done in Conneau et al. (2017). We used the same model to train a Premise-Only model that instead uses the premise text, as well as an encoder that looks at both the premise and hypothesis (Premise+Hyp.) separated by an artificial token (for more baselines, see Salvatore, Finger, and Hirata Jr (2019)).
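As a rough illustration of this baseline family, here is a minimal PyTorch sketch of a single-layer BiLSTM hypothesis-only classifier with max-pooling over hidden states, in the spirit of Conneau et al. (2017). Vocabulary construction, padding, and the training loop are omitted, and the dimensions shown are illustrative rather than the exact configuration used in our experiments.

```python
import torch
import torch.nn as nn

class HypothesisOnlyClassifier(nn.Module):
    """Encode only the hypothesis with a BiLSTM, max-pool over time, then classify (3-way)."""

    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512, num_labels=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, num_layers=1,
                               batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, hypothesis_ids):
        # hypothesis_ids: (batch, seq_len) token ids for the hypothesis only
        embedded = self.embed(hypothesis_ids)          # (batch, seq, embed_dim)
        hidden_states, _ = self.encoder(embedded)      # (batch, seq, 2*hidden_dim)
        pooled, _ = hidden_states.max(dim=1)           # max-pool over time steps
        return self.classifier(pooled)                 # (batch, num_labels) logits

# The Premise-Only variant runs the same encoder over the premise instead, and the
# Premise+Hyp. variant over "premise <sep> hypothesis" with an artificial separator token.
```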

Results And Findings

We discuss the different questions posed in Figure 1.

Answering Questions 1 and 2. Table 1 shows the performance of baseline models and pre-trained NLI models on our different fragments. In all cases, the baseline models did poorly on our datasets, showing the inherent difficulty of our challenge sets. In the second case, we see clearly that state-of-the-art models do not perform well on our fragments, consistent with findings on other challenge datasets. One result to note is the high accuracy of BERT-based pre-trained models on the Breaking NLI challenge set of Glockner, Shwartz, and Goldberg (2018), which previously proved to be a difficult benchmark for NLI models. This result, we believe, highlights the need for more challenging NLI benchmarks, such as our new datasets.

Figure 5 shows the results of training NLI models from scratch (i.e., without NLI pre-training) on the different fragments. In nearly all cases, it is possible to train a model to master a fragment (with counting being the hardest fragment to learn). In other studies on learning fragments (Geiger et al. 2018), this is the main result reported; however, we also show that the resulting models perform below random chance on benchmark tasks, meaning that these models are not by themselves very useful for general NLI. This even holds for results on the GLUE diagnostic test (Wang et al. 2018), which was hand-created and designed to model many of the logical phenomena captured in our fragments.

Figure 5: Dev. results on training NLI models from scratch on the different fragments and architectures.

We note that in the monotonicity examples, we included results on a development set (in dashed green) that was built by systematically paraphrasing all the nouns and verbs in the fragment to be disjoint from training. Even in this case, when lexical variation is introduced, the BERT model is robust (see Rozen et al. (2019) for a more systematic study of this using BERT for NLI in different settings).

Answering Question 3. Figures 6 and 7 show the results of the re-training study. They compare the performance of a re-trained model on the challenge tasks (dashed lines) as well as on its original benchmark tasks (solid lines; for MNLI, we report results on the mismatched dev. set). We discuss here results from the two illustrative fragments depicted in Figure 6. All 4 models can master Monotonicity Reasoning while retaining accuracy on their original benchmarks. However, non-BERT models lose substantial accuracy on their original benchmark when trying to learn comparatives, suggesting that comparatives are generally harder for models to learn. In Figure 7, we show the results for all other fragments, which show varied, though largely stable, trends depending on the particular linguistic phenomena. At the bottom of Table 1, we show the resulting accuracies on the challenge sets and the MNLI benchmark for each model after re-training (using the optimal model $M^{a,k}_*$, as described previously). In the case of BERT (SNLI+MNLI+frag), we see that despite performing poorly on these new challenge datasets before re-training, it can learn to master these fragments with minimal losses in performance on its original task (i.e., it only loses on average about 1.3% accuracy on the original MNLI dev set). In other words, it is possible to teach BERT (given its inherent capacity) a new fragment quickly through re-training without affecting its original performance, assuming however that time is spent on carefully finding the optimal model (we note that models without optimal aggregate performance are often prone to catastrophic forgetting). For the other models, there is more of a trade-off: Decomp-Attn on average never quite masters the logic fragments (but does master the Monotonicity Fragments), and incurs an average 6.7% loss on MNLI after re-training. In the case of comparatives, the inability of the model to master this fragment likely reveals an architectural limitation, given that the model is not sensitive to word order. Given such losses, perhaps a more sophisticated re-training scheme is needed in such cases in order to optimally learn particular fragments.

Figure 6: Inoculation results for two illustrative semantic fragments, Monotonicity Reasoning (left) and Comparatives (right), for 4 NLI models shown in different colors. Horizontal axis: number of fine-tuning challenge set examples used. Each point represents the model $M^{a,k}_*$ trained using hyper-parameters that maximize the accuracy averaged across the model's original benchmark dataset (solid line) and challenge dataset (dashed line).
Figure 7: Inoculation results for 6 semantic fragments not included in Figure 6, using the same setup.

Discussion And Conclusion

In this work, we explored the use of semantic fragments, systematically controlled subsets of language, to probe into NLI models and benchmarks, and investigated 8 fragments and new challenge datasets that center around basic logic and monotonicity reasoning. In answering the questions first introduced in Figure 1, we found that while existing NLI architectures are able to learn these fragments from scratch, the resulting models are of limited interest, and that pre-trained models perform poorly on these new datasets (even relative to other available challenge benchmarks), showing the weaknesses of these models. Interestingly, however, we show that many models can be quickly re-tuned (e.g., often in a matter of minutes) to master these different fragments using a variant of the inoculation through fine-tuning strategy from Liu, Schwartz, and Smith (2019).

Our results indicate the following methodology for improving models: given a particular linguistic hole in an NLI model, one can plug this hole by simply generating synthetic data and using this data to re-train a model to cover the target phenomenon. This methodology comes with some caveats, however: depending on the model and particular linguistic phenomena, there may be some trade-offs with the model's original performance, which should first be looked at empirically and compared against other linguistic phenomena.

Can we find more difficult fragments? Despite differences across fragments, we largely found NLI models to be robust when tackling new linguistic phenomena and easy to quickly re-purpose (especially with BERT). This generally positive result raises the question: are there more challenging fragments and linguistic phenomena that we should be studying?

Given the ubiquity of logical and monotonicity reasoning, we feel that there is justification for our particular fragments, and take it as a positive sign that models are able to solve these tasks. As we emphasize throughout, our general approach is amenable to any linguistic phenomena, and future work will focus on developing more complicated fragments.

Figure 8: A specification of the grammar and lexicon used to generate the monotonicity fragments.

adjectives: brown ≤ brown or black, black ≤ brown or black; iron ≤ metal, steel ≤ metal, steel ≤ iron, hardwood ≤ wooden
nouns: x ≤ animal for x in {dog, cat, rabbit, mammal, poodle, beagle, bulldog, bat, horse, stallion, badger}; x ≤ mammal for x in {dog, cat, rabbit, poodle, beagle, bulldog, bat, horse, stallion, badger}; x ≤ dog for x in {poodle, beagle, bulldog}; x ≤ object for x in N_inanimate
verbs: x ≤ moved for x in {ran, swam, waltzed, danced}; stare at ≤ saw, hit ≤ touch, waltzed ≤ danced
determiners: every = all = each ≤ most ≤ some = a; many ≤ several ≤ at least 3 ≤ at least 2 ≤ some = a; no ≤ at most 1 ≤ at most 2 ≤ ...
other rules: Adj N ≤ N, N + (SRC | ORC) ≤ N, ...
antonyms: moved-towards ⊥ moved-away-from; x ⊥ slept for x in {ran, swam, waltzed, danced, moved}; at most 4 ⊥ at least 5, exactly 4 ⊥ exactly 5, every ⊥ some but not all, ...
