Critical Thinking for Language Models


Abstract

This paper takes a first step towards a critical thinking curriculum for neural auto-regressive language models. We introduce a synthetic text corpus of deductively valid arguments, and use this artificial argument corpus to train and evaluate GPT-2. Significant transfer learning effects can be observed: Training a model on a few simple core schemes allows it to accurately complete conclusions of different, and more complex types of arguments, too. The language models seem to connect and generalize the core argument schemes in a correct way. Moreover, we obtain consistent and promising results for the GLUE and SNLI benchmarks. The findings suggest that there might exist a representative sample of paradigmatic instances of good reasoning that will suffice to acquire general reasoning skills and that might form the core of a critical thinking curriculum for language models.

1 Introduction

Pre-trained autoregressive language models (LMs) such as GPT-2 and GPT-3 achieve remarkably competitive results in a variety of language modeling benchmarks without task-specific fine-tuning (Radford et al., 2019). Yet, it is also widely acknowledged that these models struggle with reasoning tasks, such as natural language inference (NLI) or textual entailment (Askell, 2020). Actually, that doesn't come as a surprise, given the tendency of humans to commit errors in reasoning (Kahneman, 2011; Sunstein and Hastie, 2015), their limited critical thinking skills (Paglieri, 2017), the resulting omnipresence of fallacies and biases in texts, and the frequently low argumentative quality of online debates (Hansson, 2004; Guiaşu and Tindale, 2018; Cheng et al., 2017). Neural language models are known to pick up and reproduce normative biases (e.g., regarding gender or race) present in the dataset they are trained on (Gilburt, 2019); no wonder this happens with argumentative biases and reasoning flaws, too. This diagnosis suggests that there is an obvious remedy for LMs' poor reasoning capability: make sure that the training corpus contains a sufficient amount of exemplary episodes of sound reasoning.

In this paper, we take a first step towards the creation of a critical thinking curriculum for neural language models. First, we design and build a synthetic corpus of deductively valid arguments which instantiate a variety of (syllogistic) argument schemes, and which are rendered as text paragraphs (Section 3). Second, we use our synthetic argument text corpus to train and to evaluate GPT-2 (Section 4). Evaluating the models' ability to correctly complete conclusions of arguments, we observe strong transfer learning effects/generalization: Just training the models on a few central core schemes allows them to accurately complete conclusions of different types of arguments, too. The language models seem to connect and generalize the core argument schemes in a correct way. Moreover, tests with a simple hand-crafted argument produce evidence that generic language modeling skill facilitates the successful generalization of learned argument patterns. These findings suggest that there might exist a representative sample of paradigmatic instances of good reasoning that will suffice to acquire general reasoning skills. Such a collection might form the core of a critical thinking curriculum for language models.

Moreover, we test the trained models on different reasoning benchmarks. We obtain consistent and promising results for the GLUE and SNLI benchmarks, finding that training on core schemes seems to slightly improve NLI skill. However, training on the argument corpus doesn't affect the performance with regard to the semantically more demanding Argument Reasoning Comprehension task or the critical thinking assessment compiled in LogiQA. In the final section, we advance some ideas for complementing the artificial argument corpus so as to further improve the performance of LMs with regard to these reasoning benchmarks.

A philosophical caveat: Throughout this paper, we conceive of reasoning as a linguistic practice that is governed by specific (epistemological) norms. Thus understood, a system possesses reasoning skill if it is able to successfully engage in this practice, and argumentation doesn't necessarily require mental activity.

2 Related Work

To our knowledge, this paper is the first to show that autoregressive language models like GPT-2 can learn to reason by training on a text corpus of correct natural language arguments. Previous work in this field, in contrast, has modeled natural language reasoning problems as classification tasks and trained neural systems to accomplish them.

Rule-based natural language reasoning Various datasets have been developed for rule-based deductive reasoning in natural language. In these tasks, one or multiple rules, i.e. (generalized) conditionals, must be applied to a fact base in order to deductively infer a conclusion. Facts and conclusions are represented by atomic statements. Rule application closely resembles the conclusion completion task for generalized modus ponens and generalized modus tollens schemes described below. However, we go beyond previous work in investigating the ability of language models to infer conclusions that have a more complex logico-semantic structure (e.g., existential or universal statements).

• The question answering bAbI dataset (Weston et al., 2016) contains a task which involves applying very specific rules of the form "Xs are afraid of Ys" to an instance (for example: "Mice are afraid of cats. Jerry is a mouse. What is Jerry afraid of? A: cats"). Equally simple, one-step rule applications are tested in further datasets, and are also contained in the QuaRTz dataset.

• ROPES is a reading comprehension task that involves applying background knowledge to a given situation (both being presented as paragraph-long text). Correct answers can be inferred by one-step rule application; part of the challenge is to identify the relevant rule and fact in the text.

• RuleTaker, arguably the most general system for natural rule-based reasoning so far, is a transformer model that has been fine-tuned to predict whether a conclusion can be inferred from a set of rules and facts, not all of which are necessarily required to draw the conclusion (Clark et al., 2020) . Moreover, inferring the conclusion from the premise set might involve multiple inference steps. The authors show that the transformer model can be trained to perform this task nearly flawlessly and, moreover, to explain its inferences. They also observe substantial transfer learning effects.

Benchmarks for enthymematic reasoning An 'enthymeme' is an argument whose premises are not explicitly stated, e.g.: "Jerry is a mouse. Therefore, Jerry is afraid of cats." The Argument Reasoning Comprehension (ARC) dataset (Habernal et al., 2018) comprises simple informal arguments.

Each argument contains two premises: whereas the first premise is explicitly stated, there are two alternative formulations of the second premise. The task consists in identifying which of these two alternative formulations is actually assumed in the argument. For example: "Miss America gives honors and education scholarships. And since [scholarships would give women a chance to study | scholarships would take women from the home], Miss America is good for women." ARC therefore assesses the ability to make implicit premises explicit. An adversarial ARC dataset that eliminates clues in the original benchmark is available (Niven and Kao, 2019 ).

CLUTRR is a task generator for relational reasoning on kinship graphs (Sinha et al., 2019). CLUTRR takes a set of (conceptual) rules about family relations as given and constructs set-theoretic possible worlds (represented as graphs) which instantiate these rules. In such a possible (kinship) world, a target fact and a set of base facts are identified such that the base facts together with the rules deductively entail the target fact. The task consists in inferring the target fact from the base facts alone; the conceptual rules remain implicit. For example: "Kristin and her son Justin went to visit her mother Carol on a nice Sunday afternoon. They went out for a movie together and had a good time. Q: How is Carol related to Justin? A: Carol is the grandmother of Justin." So, CLUTRR assesses enthymematic deductive reasoning with implicit conceptual rules.

Critical thinking tasks LogiQA (Liu et al., 2020) is a collection of publicly available critical thinking questions, used by the National Civil Servants Examination of China to assess candidates' critical thinking and problem solving skills. LogiQA covers tasks of various types: different kinds of natural language inference problems as well as the identification of implicit premises or (practical) instrumental reasoning. The tasks are shown to be hard for current AI systems, of which a fine-tuned transformer model performs best with an accuracy of 35%, roughly 50 percentage points below human performance.

3 An Artificial Argument Corpus

This section describes the construction of a synthetic corpus of natural language arguments used for evaluating and training language models. The corpus is built around eight simple, deductively valid syllogistic argument schemes (top row in Figure 1). These base schemes have been chosen because of their logical simplicity as well as their relevance in critical thinking and argument analysis (Feldman, 2014; Bowell and Kemp, 2014; Brun and Betz, 2016). Each of these eight base schemes is manually varied in specific ways to create further valid variants.

Figure 1: Syllogistic argument schemes used to create an artificial argument corpus.

Negation variants of base schemes (second row in Figure 1 ) are created by substituting a subformula with its negation and/or by applying duplex negatio affirmat.

Complex predicates variants (third row in Figure 1) build on base schemes or their respective negation variants and are obtained by substituting atomic predicates with compound disjunctive or conjunctive ones.

De Morgan variants of base schemes (fourth row in Figure 1 ) are finally derived by applying de Morgan's law to the respective variants created before.

All in all, we thus obtain 71 different handcrafted argument schemes. Obviously, some of these schemes can be derived from others. For example, generalized modus ponens and generalized contraposition (base schemes) entail a negation variant of generalized modus tollens. Likewise, generalized contraposition and hypothetical syllogism 1 entail a (negation variant of) hypothetical syllogism 2.
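To make the first of these interdependencies explicit, consider the scheme ∀x Fx→¬Gx, Ga ⊢ ¬Fa, a negation variant of generalized modus tollens (see Figure 1). Applying generalized contraposition to the first premise, and removing the resulting double negation via duplex negatio affirmat, yields ∀x Gx→¬Fx; applying generalized modus ponens to this intermediate conclusion together with the second premise Ga then yields ¬Fa.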

In view of these interdependencies, three of the eight base schemes are marked as core schemes: generalized modus ponens, generalized contraposition, hypothetical syllogism 1.

Natural language instances of the argument schemes can be created by means of a first-order logic domain (with names and predicates) and natural language templates for the formal schemes. In order to obtain a large variety of realistic natural language arguments, we have devised

• a multi-stage templating process with

• alternative templates at each stage and

• multiple domains.

As shown in Figure 2 , this process can be split into five consecutive steps.

Figure 2: Pipeline for creating natural language instances of argument schemes with multiple templating.

In step 1, the argument scheme, which serves as formal template for the natural language argument, is chosen.

In step 2, each sentence in the formal scheme (premises and conclusion) is individually replaced by a natural language pattern in accordance with a randomly chosen template. For example, the formula "∀x Fx→Gx" might be replaced by any of the following natural language sentence schemes:

• "Every F is a G."

• "Whoever is a F is also a G."

• "Being a G is necessary for being a F."

In step 3, the entity- and property-placeholders in the resulting argument scheme are replaced argument-wise with names and predicates from a domain. (Each domain provides hundreds of entity-names, which can be paired with different binary predicates to create thousands of different unary predicates.) We hence obtain an instance of the formal argument scheme as a premise-conclusion list.

In step 4, the premises of the natural language argument are randomly re-ordered. In step 5, the premise-conclusion list is packed into a text paragraph by adding an argument intro, framing the premises, and adding an inference indicator. Again, multiple templates are available for doing so, which yields a large variety of textual renderings of an argument.
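To make the pipeline concrete, here is a minimal Python sketch of the five steps. The scheme encoding, the sentence templates, and the toy domain below are simplified, hypothetical stand-ins for the actual corpus configuration (which is released separately), not the configuration itself.

```python
import random

# Hypothetical, simplified stand-ins for the released corpus configuration.
SCHEME_GMP = {  # generalized modus ponens: ∀x Fx→Gx, Fa ⊢ Ga
    "premises": ["all_F_G", "F_a"],
    "conclusion": "G_a",
}

SENTENCE_TEMPLATES = {  # step 2: alternative NL patterns per formal sentence scheme
    "all_F_G": ["Every {F} is a {G}.", "Whoever is a {F} is also a {G}.",
                "Being a {G} is necessary for being a {F}."],
    "F_a": ["{a} is a {F}."],
    "G_a": ["{a} is a {G}"],  # conclusion ends with the predicate (no final period)
}

DOMAIN = {  # step 3: names plus binary predicates that yield unary predicates
    "names": ["Chloe", "Amanda", "Lily", "Sophie"],
    "binary_predicates": ["sister of {}", "workmate of {}", "classmate of {}"],
}

INTROS = ["It is not always easy to see who is related to whom, and in which ways.",
          "Consider the following argument."]
INDICATORS = ["Therefore,", "So, necessarily,", "It follows that"]

def unary_predicate():
    """Build a unary predicate by pairing a binary predicate with a name."""
    return random.choice(DOMAIN["binary_predicates"]).format(random.choice(DOMAIN["names"]))

def generate_argument(scheme=SCHEME_GMP):
    # step 1: the formal scheme has been chosen; step 2: pick an NL pattern per sentence
    patterns = {key: random.choice(SENTENCE_TEMPLATES[key])
                for key in scheme["premises"] + [scheme["conclusion"]]}
    # step 3: substitute placeholders with a name and domain-specific predicates
    subst = {"a": random.choice(DOMAIN["names"]),
             "F": unary_predicate(), "G": unary_predicate()}
    premises = [patterns[key].format(**subst) for key in scheme["premises"]]
    conclusion = patterns[scheme["conclusion"]].format(**subst)
    # step 4: randomly permute the premises
    random.shuffle(premises)
    # step 5: pack everything into a text paragraph
    return " ".join([random.choice(INTROS)] + premises
                    + [random.choice(INDICATORS), conclusion + "."])

print(generate_argument())
```

In the actual corpus, each domain provides hundreds of entity-names, and additional templates handle the negated, disjunctive, and conjunctive predicates required by the scheme variants.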


Following this pipeline, we create 10,000 natural language instances of each formal argument scheme, split into 9,000 train and 1,000 test items. This represents the artificial argument text corpus we use to train and evaluate GPT-2.

4 Experiments With GPT-2

We train and evaluate three compact versions of GPT-2 with 117M, 345M and 762M parameters respectively, all of which fall short of the full-scale model with 1542M parameters (Wolf et al., 2019). The fine-tuned models will be released through https://huggingface.co/models.

4.1 Training

The Artificial Argument Corpus comprises 71 × 9,000 training items that are grouped into three training sets as follows (see also the color pattern in Figure 1):

• TRAIN01: all training items which are instances of a core scheme, i.e. generalized modus ponens, generalized contraposition, hypothetical syllogism 1 (N=27,000)

• TRAIN02: all training items which are instances of a base scheme (N=72,000)

• TRAIN03: all training items in the corpus (N=639,000)

In an attempt to avoid over-fitting, we blend the training arguments with snippets from Reuters news stories (Lewis et al., 2004) and the standardized Project Gutenberg Corpus (Gerlach and Font-Clos, 2018), using a mixing ratio of 1:1. The different versions of GPT-2 are fine-tuned on each of the three enhanced training sets. This gives us nine fine-tuned model versions plus the three BASE models to evaluate.
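Fine-tuning on such an enhanced training set can be implemented with standard tooling. The following sketch uses the Hugging Face transformers Trainer; the file names, sequence length, and hyperparameters are illustrative assumptions rather than the exact settings used for the reported experiments.

```python
import random
from datasets import Dataset
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, Trainer, TrainingArguments)

# Hypothetical input files: one text paragraph per blank-line-separated block.
arguments = open("train01_arguments.txt").read().split("\n\n")
snippets = open("reuters_gutenberg_snippets.txt").read().split("\n\n")

# Blend argument texts with generic snippets at a 1:1 ratio (to counter over-fitting);
# assumes the snippet file is at least as large as the argument set.
mixed = arguments + random.sample(snippets, len(arguments))
random.shuffle(mixed)

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")  # "gpt2-medium"/"gpt2-large" for 345M/762M
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")

dataset = Dataset.from_dict({"text": mixed}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=256),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-train01", num_train_epochs=1,
                           per_device_train_batch_size=8),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM loss
)
trainer.train()
```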

4.2 Results

Wiki103 Does fine-tuning on the (enhanced) argument corpus affect general language modeling skill? We address this question first and report the perplexity of our models on the Wiki103 dataset (split into sequences of 128 tokens). Figure 3 shows that training on the argument corpus drastically increases the Wiki103 perplexity.

Figure 3: Perplexity on Wiki103 corpus.

Deterioration of overall language modeling skill is most severe for the largest training set TRAIN03 and the large model version. The increase in perplexity caused by fine-tuning on the core schemes is, in comparison, modest, though still significant. This suggests that, in future work, a higher proportion of common texts should be mixed into the training data.
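The Wiki103 evaluation can be reproduced along the following lines; the dataset identifier on the Hugging Face hub and the non-overlapping chunking are assumptions of this sketch, while the 128-token sequence length follows the setup described above.

```python
import math
import torch
from datasets import load_dataset
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # or a fine-tuned checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

# Concatenate the WikiText-103 test split into one token stream.
text = "\n\n".join(load_dataset("wikitext", "wikitext-103-raw-v1", split="test")["text"])
ids = tokenizer(text, return_tensors="pt").input_ids[0]

nlls, n_tokens = [], 0
with torch.no_grad():
    for i in range(0, ids.size(0) - 128, 128):           # non-overlapping 128-token sequences
        chunk = ids[i:i + 128].unsqueeze(0)
        loss = model(chunk, labels=chunk).loss           # mean NLL per predicted token
        nlls.append(loss * (chunk.size(1) - 1))
        n_tokens += chunk.size(1) - 1

print("perplexity:", math.exp(torch.stack(nlls).sum().item() / n_tokens))
```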

Conclusion Completion on Artificial Argument Corpus To test whether language models can reason correctly, we assess their ability to accurately complete conclusions of arguments in the artificial argument corpus. Here, we make use of the fact that, by construction, the conclusion of every argument in the corpus ends with a predicate (a property-term such as "sister of Chloe" or "supporter of Tottenham Hotspurs"). One of these argument texts begins, for example, as follows:

"It is not always easy to see who is related to whom - and in which ways. The following argument pertains to this question: ..."

So, more specifically, our evaluation proceeds as follows: we query the models with the argument text up to, but excluding, the final predicate of the conclusion; we let them generate a conditional sample; and we calculate the accuracy of the completions thus generated.

Figure 4 reports the evaluation results in detail. Its subplots are arranged in a grid that mirrors the organisation of argument schemes in Figure 1. Each subplot visualizes the ability of the models to correctly complete arguments of the corresponding scheme. Moreover, each subplot compares the BASE models (points at the very left) with the fine-tuned models trained on TRAIN01, TRAIN02, and TRAIN03 (in this order from left to right, see also Figure 5). Finally, the color code in each subplot indicates whether the corresponding scheme belongs to the respective training set. In other words, the accuracy of models that have not been fine-tuned on the particular scheme is plotted in green, the accuracy of those that have been fine-tuned on the scheme in blue.

Figure 4: Accuracy of conclusion completions for instances of different argument schemes (see Figure 1) and four model versions. See also the visual legend in Figure 5.
Figure 5: Detailed legend for Figure 4. In the illustrative accuracy plot at the right-hand side, the BASE models as well as those models fine-tuned on TRAIN01 are plotted in green, since the base scheme of disjunctive syllogism doesn’t belong to TRAIN01. However, some arguments in TRAIN02 and TRAIN03 instantiate the base scheme of disjunctive syllogism; thus the blue markers.
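The conclusion-completion check itself can be sketched as follows; the decoding parameters and the string-matching criterion are simplifying assumptions, not the exact evaluation code.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")   # or a fine-tuned checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def completes_correctly(argument_text, final_predicate, max_new_tokens=12):
    """Query the model with the argument minus its final predicate and check
    whether the sampled continuation starts with that predicate."""
    prompt = argument_text[: argument_text.rfind(final_predicate)]
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, do_sample=True, top_p=0.9,
                             max_new_tokens=max_new_tokens,
                             pad_token_id=tokenizer.eos_token_id)
    continuation = tokenizer.decode(out[0, inputs.input_ids.size(1):])
    return continuation.strip().startswith(final_predicate)

# Accuracy over (hypothetical) test items of one scheme.
test_items = [
    ("Every workmate of Lily is a sister of Amanda. Chloe is a workmate of Lily. "
     "Therefore, Chloe is a sister of Amanda.", "sister of Amanda"),
]
accuracy = sum(completes_correctly(t, p) for t, p in test_items) / len(test_items)
```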

We may observe, first of all, that even the BASE models (only pre-training, no fine-tuning) display a remarkable ability to correctly complete conclusions of some kinds of arguments. For example, GPT-2-762M achieves 50% accuracy in completing contrapositions, 30% accuracy in completing generalized modus ponens, and still 20% accuracy in completing disjunctive syllogism and dilemma arguments. Moreover, the large BASE model is more skilful than its smaller versions. These findings further corroborate the hypothesis that language models learn (basic) linguistic and reasoning skills "on the fly" by training on a large generic corpus (Radford et al., 2019).

What's also plain from Figure 4 is that training on the argument corpus effectively improves conclusion-completion skill. Nearly without exception, the lines are monotonically upward sloping and reach 100% accuracy levels as soon as the model has been trained on instances of the corresponding scheme.

Yet, the most striking result unveiled in Figure 4 concerns transfer learning/generalization, i.e., the fact that training on a few schemes improves reasoning skill with respect to all schemes.

To see this, let us compare the BASE models with the models fine-tuned on the core schemes (TRAIN01). The TRAIN01-models achieve maximal accuracy in completing the instances of the three schemes they have been trained on (three upper left subplots). Remarkably, though, these models have also learned to complete arguments of different types: They display accuracy levels of 80% or higher for three quarters of the other schemes and only struggle with complex variants of disjunctive syllogism.

We take this to be a promising result: It suggests quite generally that there might exist a representative sample of paradigmatic instances of good reasoning that will suffice to acquire general reasoning skills. Such a collection might form the core of a critical thinking curriculum for language models.

Conclusion Completion on Hand-Crafted Argument But how far do the skills acquired on (a subset of) the artificial argument corpus generalize? We have checked the models by letting them complete the conclusion of a simple hand-crafted argument:

[Hermes] Every philosopher is mortal. Hermes is not mortal. Therefore, Hermes . . .

This text differs syntactically and semantically from any argument possibly contained in the artificial argument corpus (where predicates always have the form "is/being a Y of X," and no domain covers philosophers or mortality). Obviously, it follows that Hermes "is not a philosopher." The argument instantiates generalized modus tollens, which is not a core scheme in TRAIN01. Can TRAIN01-models nonetheless fill out the unfinished argument in a sensible way?

Table 1 counts and compares the most frequent completions generated by two TRAIN01 models (762M and 117M) and by the large BASE model (762M), which has not been fine-tuned. Only the 762M model trained on the core schemes reliably predicts the correct conclusion. The other two models tend to repeat a premise, add an independent sentence, or even generate a contradiction. This is consistent with our previous statistical findings. Remarkably, although both the small and the large TRAIN01 models have been fine-tuned on precisely the same arguments, only the large model seems to correctly recognize the logical structure of the [Hermes] argument. This suggests that generic language modeling skill facilitates the successful generalization of learned argument patterns beyond the templates used to create the synthetic training data. To further understand transfer learning effects, we next examine whether fine-tuning on the artificial argument corpus improves performance in other NLP reasoning tasks.

Table 1: Absolute frequency of predicted completions for the hand-crafted [Hermes] query by three different models. Completions are entailed (✓), redundant (=), contradictory (†) or independent (◦).
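The frequency counts in Table 1 can be reproduced in spirit with a short sampling loop; the number of samples and the decoding parameters below are assumptions of this sketch.

```python
from collections import Counter

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2-large")   # or a TRAIN01 checkpoint
model = GPT2LMHeadModel.from_pretrained("gpt2-large").eval()

prompt = "Every philosopher is mortal. Hermes is not mortal. Therefore, Hermes"
inputs = tokenizer(prompt, return_tensors="pt")

counts = Counter()
with torch.no_grad():
    for _ in range(100):                                       # assumed number of samples
        out = model.generate(**inputs, do_sample=True, top_p=0.9, max_new_tokens=8,
                             pad_token_id=tokenizer.eos_token_id)
        completion = tokenizer.decode(out[0, inputs.input_ids.size(1):])
        counts[completion.split(".")[0].strip()] += 1          # keep text up to the first period

print(counts.most_common(5))
```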

GLUE AX

The GLUE datasets (Wang et al., 2018) represent standard benchmarks for natural language inference (NLI). We evaluate our models' NLI skill in terms of conditional perplexity on the curated GLUE diagnostics dataset.

As we proceed in a similar way with regard to other NLP benchmarks, we describe our evaluation method in general terms, first. Using templates, we translate each benchmark entry into alternative prompts (e.g., context and question) and alternative completions (e.g., answers). For example:

Prompt 1: The girl is eating a pizza. Therefore, . . .
Prompt 2: The girl is eating a pizza. This rules out that . . .
Completion: . . . the girl is eating food.

The correct match is obviously Prompt 1-Completion. The ability of a language model to discern that "The girl is eating a pizza" entails (and does not contradict) "The girl is eating food" will be reflected in a comparatively low conditional perplexity of the Completion given Prompt 1 and a correspondingly high conditional perplexity of the Completion given Prompt 2. That is what our metric measures.

Let us assume that there are n alternative prompts p_1, ..., p_n and m alternative completions c_1, ..., c_m, with p*, c* being the correct match. Moreover, nPP(c|p) refers to the normalized conditional perplexity of the completion c given prompt p:

nPP(c|p) := PP(c|p) / PP(c).

Now, our evaluation metric is the relative correct perplexity generated by the model:

rcPP := nPP(c*|p*) / ( (1/(nm)) · Σ_{i,j} nPP(c_j|p_i) ).

For all benchmarks considered below, either n = 1 or m = 1. (N.B.: If m = 1, PP(c) will be the same for all prompt-completion pairs and may hence be dropped from the metric.)

Informally speaking, our metric measures

• the degree to which the correct prompt, compared to the alternative prompts, makes the corresponding (correct) completion more likely (reduces its perplexity), or

• the degree to which a given (correct) prompt makes the correct completion, compared to the alternative completions, more likely (reduces its perplexity).
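A sketch of how this metric can be computed with GPT-2 is given below; the perplexity helper, in particular its handling of the unconditional case, is our own simplified illustration rather than the exact evaluation code.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(completion, prompt=""):
    """PP(c|p): perplexity of the completion tokens, conditional on the prompt."""
    comp_ids = tokenizer(completion, return_tensors="pt").input_ids
    if prompt:
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        ids = torch.cat([prompt_ids, comp_ids], dim=1)
        labels = ids.clone()
        labels[:, : prompt_ids.size(1)] = -100   # score only the completion tokens
    else:
        ids = comp_ids
        labels = ids.clone()                     # unconditional PP(c); first token unscored
    with torch.no_grad():
        loss = model(ids, labels=labels).loss    # mean NLL over the scored tokens
    return torch.exp(loss).item()

def rcPP(prompts, completions, correct):
    """Relative correct perplexity; `correct` is the index pair (i*, j*)."""
    nPP = [[perplexity(c, p) / perplexity(c) for c in completions] for p in prompts]
    mean = sum(sum(row) for row in nPP) / (len(prompts) * len(completions))
    i, j = correct
    return nPP[i][j] / mean

# GLUE-style example from above: two prompts, one completion (n = 2, m = 1).
prompts = ["The girl is eating a pizza. Therefore,",
           "The girl is eating a pizza. This rules out that"]
score = rcPP(prompts, [" the girl is eating food."], correct=(0, 0))  # < 1 is better
```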

With these explanations in mind, let us turn to the assessment of our models on the GLUE benchmarks reported in Figure 6 .

Figure 6: Relative perplexity on the GLUE diagnostics data, the SNLI dataset, the argument reasoning comprehension (ARC) benchmark, and the LogiQA dataset.

For the BASE models, the relative perplexity scatters around one, whereas the models fine-tuned on the three core schemes (TRAIN01) seem to fare better, displaying significantly, though not substantially, lower relative perplexity values. Far from being ground-breaking, an improvement of two percentage points in terms of relative perplexity is still a promising finding that points in the right direction. In addition, concerning the BASE and TRAIN01 models, the large model version achieves better results than the smaller ones, as one would expect. However, whereas the TRAIN01 models seem to slightly outperform the BASE models, this does not hold for all the TRAIN02 and TRAIN03 models anymore. Especially the performance of GPT-2-762M, when fine-tuned on the larger training sets, deteriorates back to base level. This might be due to the fact that the TRAIN02 and TRAIN03 models (and especially the large versions thereof) suffer a loss of generic language modeling skill (see Figure 3) and hence are deprived of linguistic and factual knowledge needed to address the NLI tasks. Yet, this is currently but an explanatory hypothesis in need of further investigation.

SNLI The SNLI dataset (Bowman et al., 2015) is another standard benchmark for NLI. Like the GLUE dataset, it consists of pairs of sentences which entail, contradict, or don't bear on each other. The assessment of our models with respect to the SNLI data proceeds in close analogy to the GLUE benchmark. Figure 6 reports the results.

These results are consistent with the previous findings for the GLUE benchmark: First and foremost, fine-tuning on the three core schemes (TRAIN01) improves the performance by 1-2 percentage points. In addition, the large model outperforms the smaller versions (for BASE and TRAIN01). Finally, fine-tuning on the larger and broader training sets doesn't lead to further improvements; on the contrary, relative perplexity increases again, most remarkably for the large model (762M parameters). (Again, this might be due to a loss of generic language modeling skill.)

Argument Reasoning Comprehension Task

The Argument Reasoning Comprehension (ARC) task (Habernal et al., 2018) assesses the ability to identify a missing premise in an informally reconstructed and not necessarily deductively valid argument. It is a multiple-choice task where two alternative sentences are provided, one of which is the missing premise.

We design and apply specific templates to construct prompts and completions, and calculate relative perplexity as described above.
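One way such templates might look is sketched below (reusing the rcPP helper from the metric sketch above); the exact phrasing is an assumption, chosen to mirror the "And since ..." pattern of the ARC example quoted in Section 2.

```python
def arc_item_to_prompt_and_completions(reason, claim, warrant_0, warrant_1):
    """Map an ARC item to one prompt and two candidate completions (n = 1, m = 2).
    The phrasing is an illustrative assumption, not the template actually used."""
    prompt = f"{reason} And since"
    completions = [f" {warrant_0}, {claim}", f" {warrant_1}, {claim}"]
    return prompt, completions

prompt, completions = arc_item_to_prompt_and_completions(
    reason="Miss America gives honors and education scholarships.",
    claim="Miss America is good for women.",
    warrant_0="scholarships would give women a chance to study",
    warrant_1="scholarships would take women from the home",
)
# score = rcPP([prompt], completions, correct=(0, 0))  # reuse the metric sketch above
```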

As shown in Figure 6 , the findings are inconclusive. Larger models are not necessarily better than smaller ones in this task, and training on the artificial argument corpus doesn't seem to have an effect.

LogiQA LogiQA is a collection of nearly 9,000 multiple-choice questions (four alternative answers each) used in critical thinking assessments. These questions span the whole range of critical thinking tasks.

We design and apply specific templates to construct prompts and completions (one prompt and four completions per question), and calculate relative perplexity as described above.

As can be seen from Figure 6 , training on the artificial argument corpus has no effect whatsoever on the ability of the models to handle the critical thinking tasks collected in LogiQA.

5 Conclusion

This paper has taken a first step towards the creation of a critical thinking curriculum for neural language models. It presents a corpus of deductively valid, artificial arguments, and uses this artificial argument corpus to train and evaluate GPT-2 (Section 4). The observation of strong transfer learning effects/generalization is its main finding: Training a model on a few central core schemes allows it to accurately complete conclusions of different types of arguments, too. The language models seem to connect and to generalize the core argument schemes in a correct way. Moreover, there is evidence that generic language modeling skill facilitates the successful generalization of learned argument patterns. These findings are consistent with previous work on rule-based reasoning (Clark et al., 2020) and suggest that there might exist a representative sample of paradigmatic instances of good reasoning that will suffice to acquire general reasoning skills. Such a collection might form the core of a critical thinking curriculum for language models.

Moreover, the trained models have been tested on different reasoning benchmarks. We obtain consistent and promising results for the GLUE and SNLI benchmarks. But training on the argument corpus doesn't affect the performance with regard to the semantically more demanding Argument Reasoning Comprehension task or the critical thinking assessment compiled in LogiQA.

Our work suggests different directions for advancing the approach adopted in this paper and further improving the general reasoning skill of neural language models:

• The syllogistic argument text corpus might be complemented with corpora of arguments that instantiate different kinds of correct schemes, e.g., propositional inference schemes, modal schemes, argument schemes for practical reasoning, complex argument schemes with intermediary conclusions or assumptions for the sake of the argument, etc. (Technically, we provide the infrastructure for doing so, as all this might be achieved through adjusting the argument corpus configuration file.)

• To succeed in NLI tasks, it doesn't suffice to understand 'what follows.' In addition, a system needs to be able to explicitly discern contradictions and non sequiturs (relations of logical independence). This suggests that the artificial argument corpus might be fruitfully supplemented with corpora of correctly identified aporetic clusters (Rescher, 1987) as well as corpora containing correctly diagnosed fallacies.

• In addition, the idea of curriculum learning for ML (Bengio et al., 2009) might be given a try. Accordingly, a critical thinking curriculum with basic exemplars of good reasoning would not only be used to fine-tune a pre-trained model, but would also be employed as a starting point for training a language model from scratch.

Natural language templating is a fundamental technique used throughout this paper: both in constructing the artificial argument corpus and in transforming the NLP benchmark datasets into text that can be processed by language models. The concrete templates applied have been designed in a trial-and-error process. It is far from clear that these represent optimal choices for effectively eliciting a language model's skills. Like Petroni et al. (2020), we may thus argue that our findings establish lower bounds on the actual semantic capabilities of the models. Still, following Jiang et al. (2020), it seems of great importance to gain a more systematic understanding of different templating strategies and their effects on accuracy- and perplexity-based metrics.

In conclusion, designing a critical thinking curriculum for neural language models seems to be a promising and worthwhile research program to pursue.

The corpus as well as the source code used to generate it will be released at https://github.com/ debatelab/aacorpus.