Thinking Aloud: Dynamic Context Generation Improves Zero-Shot Reasoning Performance of GPT-2
Authors
Abstract
Thinking aloud is an effective meta-cognitive strategy human reasoners apply to solve difficult problems. We propose to improve the reasoning ability of pre-trained neural language models in a similar way, namely by expanding a task’s context with problem elaborations that are dynamically generated by the language model itself. Our main result is that dynamic problem elaboration significantly improves the zero-shot performance of GPT-2 in a deductive reasoning and natural language inference task: while the model uses a syntactic heuristic for predicting an answer, it is capable (to some degree) of generating reasoned additional context which facilitates the successful application of its heuristic. We explore different ways of generating elaborations, including fewshot learning, and find that their relative performance varies with the specific problem characteristics (such as problem difficulty). Moreover, the effectiveness of an elaboration can be explained in terms of the degree to which the elaboration semantically coheres with the corresponding problem. In particular, elaborations that are most faithful to the original problem description may boost accuracy by up to 24%.
1 Introduction
Transformer-based language models [Vaswani et al., 2017] have conquered, over the last three years, the leaderboards of NLP benchmarks: bidirectional models like BERT [Devlin et al., 2019] and RoBERTa [Liu et al., 2019b] excel in ever more challenging natural language understanding (NLU) tasks, whereas autoregressive models such as BART [Lewis et al., 2019] or GPT-3 [Brown et al., 2020] are capable of generating high-quality texts that humans fail to tell apart from passages written by human authors [Brown et al., 2020]. These technologies are not only reshaping the field of NLP, but are likely to have far-reaching repercussions for how we read, study, and write texts in academia (especially in the humanities and social sciences), and beyond.
As language models are continuously improving in terms of language understanding and linguistic reasoning skill, the question that naturally arises is whether there are any upper limits on what these systems will be able to do (with words). Are there hard problems that language models will never master? Shane Frederick's cognitive reflection test, which includes the following question, is an interesting case in point [Frederick, 2005]:
In a lake, there is a patch of lily pads. Every day, the patch doubles in size. If it takes 48 days for the patch to cover the entire lake, how long would it take for the patch to cover half of the lake?
In the remainder of the paper, we will refer to the generation of text that effectively analyzes a given problem as "dynamic problem elaboration," rather than using the term "thinking aloud" (because of its mental presumptions). "Dynamic" means that the language model is supposed to newly generate the elaboration in response to being challenged, and specifically for the problem at hand. Moreover, we will investigate a bootstrapping scenario where one and the same language model is used to answer the question and to generate the problem elaboration. In other words, the language model expands the context of each problem and feeds it back to itself (see also Section 2.2) before predicting the answer. The example in Figure 1 illustrates this basic idea.
We test the effect of different kinds of problem elaboration on the performance of GPT-2 [Radford et al., 2019] in a deductive, multi-hop natural language reasoning task inspired by RuleTaker and named "ChainRuler" (Subsection 3.1). Given a context consisting of natural language rules and facts (e.g., the context illustrated in Figure 1), the goal is to answer yes/no questions (e.g., Marion is upset?) that, by construction, require performing correct deductive reasoning over the provided context (Subsection 3.3). Free and fewshot elaborations consist in text generated in response to a generic, unanswered question, whereas piecemeal elaborations are assembled from multiple generated text fragments that address task-specific queries [as, e.g., in Shwartz et al., 2020] (Subsection 3.2).
Here is a preview of our main results: GPT-2 follows a simple syntactic heuristic [similar to those discussed in McCoy et al., 2019] when prompted with a ChainRuler reasoning problem. In benevolent cases this heuristic is effective and leads to high accuracy, but it causes systematic bias as soon as there is sufficient effective distraction or the task involves contraposition (Section 4). Against this baseline, dynamic problem elaborations can, depending on the problem characteristics, increase accuracy by up to 9 percentage points, either improving zero-shot skill or effectively de-biasing the model (Section 4). The observed variance in elaboration effectiveness may be explained in view of the elaborations' coherence with the problem to be solved. Specifically, the most faithful piecemeal elaborations (recursive and structured) boost accuracy by 24% and 16%, respectively, compared to the no elaboration treatment (Subsection 5.2). Likewise, different types of fewshot elaborations excel under different kinds of problem characteristics (Section 4), especially so when negations are absent in the corresponding problem descriptions (Subsection 5.3).
2.1 Reasoning Performance Of Neural Language Models
Most of the studies that assess reasoning and inference skills of neural language models (LMs) seem to support the following claims [see also Rogers et al., 2020] :
1. Pre-trained neural language models, whether uni- or bidirectional, display a poor zero-shot performance on reasoning tasks [e.g., Yanaka et al., 2019]. Even GPT-3, while achieving impressive zero-shot results for other NLU benchmarks, struggles with the task of natural language inference (NLI) in particular [Brown et al., 2020, Sec. 3.8]. Moreover, Kassner and Schütze [2020], extending the LAMA probe by Petroni et al. [2020b], show that LMs are vulnerable to mispriming effects and have major difficulties in getting negations right [consistent with Talmor et al., 2020a]. Similarly, probes of language models with semantic fragments find that even models that are fine-tuned on NLI datasets fail to cope with, e.g., elementary logical relations. However, there is evidence that pre-trained language models do possess substantial conceptual knowledge, which shows in their ability to correctly draw conceptual (as opposed to formal or logical) inferences [Richardson and Sabharwal, 2019, Talmor et al., 2020a] and to rely on these relations as implicit background assumptions in answering questions [Talmor et al., 2020b].
2. With task-specific fine-tuning or so-called inoculation [Liu et al., 2019a], however, these models can achieve state-of-the-art results and almost perfectly master many reasoning tasks. While zero-shot performance is generally poor, language models trained on task-specific data have propelled SOTA accuracy levels above 90% for major benchmarks (such as SNLI [Bowman et al., 2015], MultiNLI [Williams et al., 2018] and RTE [Dagan et al., 2005]). Language models quickly learn to master logical fragments given appropriate training data [Kassner and Schütze, 2020, Richardson and Sabharwal, 2019], and can be fine-tuned to correctly draw complex deductive inferences [Betz et al., 2020] and to generate informal reasons [Camburu et al., 2018, Brahman et al., 2020]. Schick and Schütze [2020a,b] introduce "Pattern Exploiting Training" (PET) and show that unsupervised pattern-recognition and annotation of training data substantially boosts the performance of the language model that is trained on the labeled data.
Against this background, the novelty of our study is to show that GPT-2 has a strong zero-shot performance on an NLI task involving deductive reasoning over rules. Our particular focus on zero-shot performance follows much recent work on zero-shot evaluation of pre-trained language models [Ma et al., 2020, Banerjee and Baral, 2020, Bosselut et al., 2019], which takes zero-shot performance of pre-trained models without specialized fine-tuning as an insightful benchmark for better understanding LMs' reasoning abilities.
2.2 Dynamic Templating And Context Retrieval
It is well known that the performance of neural LMs in NLP tasks depends sensitively on the wording of the query [Petroni et al., 2020b, Jiang et al., 2020]. Accordingly, Petroni et al. [2020b] argue that by assessing a language model with a given set of manually defined queries one measures a lower bound of the system's full skill. Recent studies have explored two directions for dynamically adjusting and expanding LM queries, which are conceptually related to automatic query expansion in (classic) information retrieval systems [Carpineto and Romano, 2012]:
1. Dynamic templating refers to the automatic and case-specific optimization of natural language templates which are used to construct a query from given data. Specifically, Jiang et al. [2020] explore three strategies for improving manually generated prompts: mining effective prompts from a database (e.g., Wikipedia), paraphrasing prompts (e.g., through two-way translation, back and forth), and pooling multiple prompts. Each of these strategies is shown to significantly improve predictions in QA tasks.
2. Dynamic context expansion refers to the automatic retrieval and/or generation of additional context, over and above the task data, that is embedded in the query. Chen et al. [2019] extract and add "reasoning chains" to problem descriptions, which improves performance on multi-hop QA tasks [Yang et al., 2018]. Likewise, Petroni et al. [2020a] assess whether automatic context expansion boosts the performance of the RoBERTa model [Liu et al., 2019b] on a QA task. Standard information retrieval systems are shown to increase accuracy by 10% to 40%, depending on the specific task. If, however, a text that is generated with a language model is added to the context, precision actually drops as compared to the baseline performance without context expansion [Petroni et al., 2020a]. Whereas such free context expansion deteriorates performance, Shwartz et al. [2020], introducing self-talk, demonstrate that task-specific and highly structured generation of additional context (called "clarifications") may improve the performance of various language models in commonsense QA tasks. Retrieval augmented generation (RAG) [Lewis et al., 2020] pushes dynamic context expansion one step further by coupling, in one global net, a transformer-based neural document retrieval system with a language model for answer prediction. RAG leads to substantially more accurate, factive and specific answers than obtained by the bare generative language model [Lewis et al., 2020]. Moreover, dynamic context expansion has recently been successfully applied to reasoning tasks. PRover [Saha et al., 2020] is a multi-head model based on RoBERTa [Liu et al., 2019b], which both constructs proof chains and predicts answers for a deductive reasoning task. Saha et al. [2020] show that this kind of structured problem elaboration significantly boosts accuracy (by 6% in the zero-shot setting). Likewise, Gontier et al. [2020] demonstrate that transformer language models can be trained to generate effective context expansions that allow the model to solve reasoning tasks from CLUTRR, a database of inference problems with implicit general premises [Sinha et al., 2019].
Against this background, the novelty of our study consists in showing that bootstrapping context generation, where one and the same language model that is used for answer prediction is also employed for dynamic context expansion, can increase the zero-shot reasoning performance of an autoregressive transformer model in an NLI task.
3 Experiments
We study the effect of problem elaboration on GPT-2's reasoning skill by adding different types of dynamically generated texts to the context in a reasoning task. Roughly speaking, we proceed in three steps: First, we synthesize test data for our reasoning task (Subsection 3.1). Second, we generate and add problem elaborations for each entry in the dataset (Subsection 3.2). Third, we append the generated elaborations to the context and predict the answers (Subsection 3.3).
3.1 Synthesizing The ChainRuler Data
In order to test the zero-shot reasoning skill of GPT-2 and the effectiveness of dynamic problem elaboration, we design a deductive reasoning task, inspired by RuleTaker, and construct a corresponding synthetic dataset. In a nutshell, the task consists in correctly inferring a conclusion from a set of rules and a fact. More specifically, each problem is composed of:
1. the conclusion (correct answer): a singular, possibly negated statement (e.g., "a is G");
2. two false alternatives which contradict the conclusion: the logical negation of the conclusion ("a is not G") and a singular statement which contradicts the conclusion for conceptual reasons ("a is $\bar{G}$", with $\bar{G}$ being conceptually complementary to G);
3. the fact: a singular statement "a is F" (or "a is not F"), which serves as premise;
4. the rule chain: l generalized conditionals that allow one to infer the correct answer from the fact, $(F \supset I_1,\ I_1 \supset I_2,\ \ldots,\ I_{l-1} \supset G)$. If the problem is of type "contraposition", then the last conditional is transposed (replaced by $\text{not-}G \supset \text{not-}I_{l-1}$);
5. the distractors: a set of k confounding rules whose consequent terms equal the target predicate or its logical / conceptual complement: $H_1 \supset X_1,\ H_2 \supset X_2,\ \ldots,\ H_k \supset X_k$ with $X_i \in \{G, \text{not-}G, \bar{G}, \text{not-}\bar{G}\}$.
The problem description (context) of a single task item consists in a random permutation of the fact, the relevant rules (rule chain) and the confounding rules (distractors). By "depth" of a problem, we refer to the length l of its rule chain, whereas the breadth denotes the number k of confounding rules.
Note that the distractors, and especially their consequent terms, are sampled randomly. So, by mere chance, all confounding rules in a problem description might actually point towards the correct answer (consequent term = target predicate). To capture this property of a problem, we introduce the notion of effective distraction, which is the number of distractors whose consequent term is not identical with the conclusion's predicate.

Table 1, example item 1:
fact: "Jill is green."
rulechain: "If someone is green, then they are loud.", "If someone is loud, then they are guilty."
distractors: "If someone is empty, then they are innocent."
conclusion: "Jill is guilty."
alternatives: "Jill is not guilty.", "Jill is innocent."
depth: 2
breadth: 1
contraposition: False
eff. distraction: 1

Table 1, example item 2:
fact: "Lily is blue."
rulechain: "If someone is blue, then they are careful.", "If someone is careful, then they are loud.", "If someone is not generous, then they are not loud."
distractors: "If someone is in need of money, then they are not generous.", "If someone is guilty, then they are not generous."
conclusion: "Lily is generous."
alternatives: "Lily is not generous.", "Lily is stingy."

The construction of the synthetic test dataset can be broken down into two main steps:
1. We randomly sample a balanced set of formal problem descriptions that instantiate the above structure, while systematically varying problem characteristics such as depth and breadth.
2. Drawing from a database of (i) names, (ii) pairs of conceptually contradictory predicates, and (iii) simple natural language templates, we create natural language instances of the formal problems by simple substitution. Table 1 illustrates the ChainRuler task by presenting two example items from the dataset.
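The problem structure and the notion of effective distraction described above can be illustrated with a minimal Python sketch. It covers only items without contraposition; the `Problem` class, the helper names, and the sentence template are illustrative assumptions and not the code used to build the dataset.

```python
import random
from dataclasses import dataclass


@dataclass
class Problem:
    fact: str
    rule_chain: list
    distractors: list
    conclusion: str
    depth: int
    breadth: int
    effective_distraction: int


def rule(antecedent, consequent):
    # Assumed natural language template for a generalized conditional.
    return f"If someone is {antecedent}, then they are {consequent}."


def make_problem(name, chain_predicates, distractor_pairs):
    """Build one item without contraposition.

    chain_predicates is [F, I_1, ..., G]; distractor_pairs is [(H_i, X_i), ...].
    """
    target = chain_predicates[-1]
    rule_chain = [rule(a, b) for a, b in zip(chain_predicates, chain_predicates[1:])]
    distractors = [rule(h, x) for h, x in distractor_pairs]
    # Effective distraction: distractors whose consequent differs from the target predicate.
    effective = sum(1 for _, x in distractor_pairs if x != target)
    return Problem(
        fact=f"{name} is {chain_predicates[0]}.",
        rule_chain=rule_chain,
        distractors=distractors,
        conclusion=f"{name} is {target}.",
        depth=len(rule_chain),
        breadth=len(distractors),
        effective_distraction=effective,
    )


def context(problem, seed=0):
    # The context is a random permutation of fact, rule chain, and distractors.
    sentences = [problem.fact, *problem.rule_chain, *problem.distractors]
    random.Random(seed).shuffle(sentences)
    return " ".join(sentences)


# Rebuilds the first example item of Table 1.
item = make_problem("Jill", ["green", "loud", "guilty"], [("empty", "innocent")])
```

For the first example item, this yields depth 2, breadth 1, and an effective distraction of 1, in line with Table 1.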
3.2 Generating Problem Elaborations
We distinguish and study six ways of generating problem elaborations (cf. Figure 2 ).
Free elaboration. We prompt the model with an unanswered question and generate one single completion. The first four sentences of this generated completion represent the "free elaboration" of the problem. The query for eliciting this free elaboration presents the context and asks which of the alternative answers is correct, e.g.: "Here is what we know: context Does this mean that Loretta is not hungry, is hungry, or is not full? Explain!"
The fewshot elaborations are generated similarly to the free elaborations, with the exception that two "sample solutions" are prepended to the prompt. Each sample solution features a problem description and a proof chain serving as paradigmatic elaboration of the problem. More specifically, we explore the following three kinds of paradigmatic elaborations:
• IC elaboration, which consists in the problem's fact, the intermediary conclusions that can be inferred from the fact by consecutively applying the rules, and the final conclusion;
• PC elaboration, which consists in the problem's fact, the rule chain (correctly ordered), and the final conclusion;
• PCIC elaboration, which consists in the problem's fact, followed alternately by the relevant rules and conclusions one can infer, until the final conclusion is reached.
This gives, correspondingly, the following fewshot elaborations: Fewshot IC, Fewshot PC, and Fewshot PCIC.
With free and fewshot elaboration, the model generates, given its prompt, a single completion.
Structured and recursive elaboration, described below, are, in contrast, piecemeal methods, which prompt the model not once but four times. The four generated completions are then post-processed and concatenated to obtain the problem elaboration.
Structured elaboration. The model generates, independently of each other, four completions given one and the same prompt. The four sentences which come first in each conditionally generated text are concatenated and represent the "structured elaboration" of the problem. The specific query used to elicit the structured elaboration states the context and ends with a cloze-style question about what one may infer about the subject, e.g.: "Here is what we know: context Therefore, Loretta".
Recursive elaboration. The model generates a single sentence given the prompt used for structured elaboration. The generated sentence is then added to the context, before the model is prompted again to generate a second sentence, which is once more appended to the context, and so on, until four sentences are iteratively generated. These four statements make up the recursive elaboration of the problem.
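The difference between the two piecemeal methods can be sketched as follows. This is a minimal illustration under stated assumptions: `generate` is a hypothetical callable that maps a prompt to a text completion, `toy_generate` is a deterministic stand-in rather than a language model, and the naive sentence splitting is not the post-processing used in the study.

```python
import re


def first_sentence(text):
    # Keep only the first sentence of a completion (naive split on end punctuation).
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return parts[0] if parts else ""


def structured_elaboration(generate, context, subject, n=4):
    # One and the same prompt; the n completions are generated independently.
    prompt = f"Here is what we know: {context} Therefore, {subject}"
    return [f"{subject} {first_sentence(generate(prompt))}" for _ in range(n)]


def recursive_elaboration(generate, context, subject, n=4):
    # Each generated sentence is appended to the context before the next call.
    sentences = []
    for _ in range(n):
        expanded = " ".join([context, *sentences])
        prompt = f"Here is what we know: {expanded} Therefore, {subject}"
        sentences.append(f"{subject} {first_sentence(generate(prompt))}")
    return sentences


def toy_generate(prompt):
    # Hypothetical stand-in for a language model: the output depends on the prompt content.
    step = prompt.count("Therefore") + prompt.count("step")
    return f"is step {step}. Extra text."


structured = structured_elaboration(toy_generate, "Loretta is hungry.", "Loretta")
recursive = recursive_elaboration(toy_generate, "Loretta is hungry.", "Loretta")
```

With the toy generator, the structured variant returns four identical sentences because the prompt never changes, whereas the recursive variant produces a different sentence at each step because earlier output is fed back into the prompt.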
The free and structured elaborations are generated with top-p nucleus sampling (we set p = 0.5). The remaining elaborations are decoded with beam search. Table 2 displays examples of elaborations thus elicited for two different ChainRuler problem items. To put the results in perspective, we compare the effects of dynamic problem elaboration with four synthetic context expansions that can be directly generated from the test data as follows:
Answers (Baseline): We randomly pick one of the three alternative answers and repeat it four times.
Context (Baseline): We concatenate four randomly picked statements from the context.
Intermediary Conclusions (Oracle): We adjoin all intermediary conclusions about the subject that can be inferred by successively applying the given rules to the fact.
Final Conclusion (Oracle): We repeat the final conclusion (i.e., the correct answer) four times.
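The four synthetic context expansions can be sketched in a few lines of Python. The function names are illustrative, and sampling context statements with replacement is an assumption, since the text does not say how the four statements are drawn.

```python
import random


def answers_baseline(alternatives, rng):
    # Pick one of the three alternative answers and repeat it four times.
    return " ".join([rng.choice(alternatives)] * 4)


def context_baseline(context_sentences, rng):
    # Four randomly picked context statements (assumption: drawn with replacement).
    return " ".join(rng.choice(context_sentences) for _ in range(4))


def intermediary_conclusions_oracle(name, chain_predicates):
    # chain_predicates is [F, I_1, ..., G]; only I_1 .. I_{l-1} are intermediary.
    return " ".join(f"{name} is {p}." for p in chain_predicates[1:-1])


def final_conclusion_oracle(conclusion):
    # Repeat the correct answer four times.
    return " ".join([conclusion] * 4)


rng = random.Random(0)
alternatives = ["Jill is guilty.", "Jill is not guilty.", "Jill is innocent."]
answers = answers_baseline(alternatives, rng)
oracle = intermediary_conclusions_oracle("Jill", ["green", "loud", "guilty"])
```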
3.3 Predicting Answers
To predict answers, we calculate the conditional probability that the language model assigns to each possible answer given the context and, depending on the experimental treatment, the corresponding elaboration.
The most likely answer is then predicted to be the correct one. Formally, consider context c, elaboration e and possible answers $a_1, a_2, a_3$. Let $p(s \mid s_c)$ be the conditional probability our language model assigns to sequence s given sequence $s_c$ (as prompt). The correct answer is predicted according to $\arg\max_{i=1,2,3} p(a_i \mid c, e)$.
Table 2: Elaboration examples, corresponding to the entries in Table 1. We color generated sentences in accordance with their logical relation to the given context (independent/digression, implied/explicit in context, implied/implicit in context, inconsistent).

In order to assess the quality of the model's probabilistic predictions, we reduce the problem to a binary classification task, where for each context c and elaboration e either a or ¬a is the correct answer. (We drop, in effect, the second false alternative from each item's answer set, cf. Section 3.1.) The probabilistic scores for this binary task are obtained by normalizing the corresponding conditional probabilities, e.g., $\mathrm{prob}(a) = p(a \mid c, e) / \big(p(a \mid c, e) + p(\neg a \mid c, e)\big)$, and likewise for ¬a, so that $\mathrm{prob}(a) + \mathrm{prob}(\neg a) = 1$.
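The prediction rule and the binary normalization can be sketched as follows. The sketch assumes that the conditional log-probabilities of the answers have already been obtained from the language model; how they are computed is outside its scope, and the function names are illustrative.

```python
import math


def predict(log_probs):
    # Index of the answer with the highest conditional log-probability (argmax).
    return max(range(len(log_probs)), key=lambda i: log_probs[i])


def binary_score(logp_a, logp_not_a):
    # prob(a) = p(a) / (p(a) + p(not a)), evaluated in log space for stability.
    d = logp_not_a - logp_a
    if d > 0:
        e = math.exp(-d)
        return e / (1.0 + e)
    return 1.0 / (1.0 + math.exp(d))


# Hypothetical probabilities for a_1, a_2, a_3 given context and elaboration.
log_probs = [math.log(0.2), math.log(0.5), math.log(0.3)]
best = predict(log_probs)
score = binary_score(math.log(0.2), math.log(0.3))
```

Working with log-probabilities avoids numerical underflow, since the probability of a multi-token answer is a product of many small per-token probabilities.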
Throughout this study, we use the HuggingFace implementation of the 1.5B-parameter GPT-2 model [Wolf et al., 2019].
As should be transparent from this Section's outline of the experiment, our study does not involve any training. We merely assess the zero-shot performance of the pre-trained GPT-2 model.
4 Results
First of all, we find that GPT-2 follows a simple heuristic for solving the ChainRuler task: its predictions are seemingly just based on how frequently the predicate of an answer option appears in the consequent of the problem's rules. Whenever a problem description contains, by chance, many distractors whose "then"-part corresponds to the correct answer, GPT-2 achieves very high accuracy. This can be seen from Figure 3a, which displays no elaboration accuracy as a function of a problem's depth and its effective distraction (see Section 3.1). If the model is, however, not lucky and many distractors coincidentally point towards the wrong answers (i.e., high effective distraction), then the model typically gets the answer wrong and performs substantially worse than naïve random guessing (accuracy=.33). Following the simple heuristic, the model systematically commits fallacies and is substantially biased. This is especially the case in tasks with contraposition, where the simple heuristic doesn't even work in the absence of distractors and performance is, accordingly, particularly weak. All this suggests that the pre-trained model does not consecutively apply modus ponens, modus tollens, or chain rule to infer the correct answer. It does not, per se, engage in deductive reasoning. To further corroborate this conjecture, we have, in an additional experiment, replaced all antecedent conditions in the problems' rules with unrelated / non-sense statements, which, however, doesn't affect GPT-2's zero-shot performance on the tasks.

[Figure 3: accuracy by problem depth (1 to 6) and effective distraction (0 to 5).]
Against this background, dynamic problem elaborations have a twofold effect (Figure 3b): On the one hand, they prevent the model from effectively applying its simple heuristic and hence reduce performance in cases where the model was lucky and baseline accuracy was high (esp. no cntrp and effective distraction=0). On the other hand, dynamic problem elaboration is both a successful de-biasing strategy and a way to further boost reasoning performance. If the baseline performance is worse than random guessing (e.g., if effective distraction>3), then context expansion increases accuracy by up to 9 percentage points. In cases with slight distraction (e.g., no cntrp, effective distraction=2), the substantial baseline performance is further increased by up to 6 percentage points. All in all, the observed performance gains are in the upper range reported by Shwartz et al. [2020] for the similar self-talk design in commonsense QA tasks.
Moreover, there is no single type of dynamic elaboration which performs best across the entire spectrum of different tasks (Figure 3c , see also Appendix B): Without contraposition, recursive elaboration performs mostly best (and actually outperforms the intermediary conclusions oracle elaboration) unless effective distraction and depth are very high. In the latter, arguably most difficult cases, fewshot elaborations are most effective. Regarding problems with contraposition, fewshot IC elaboration is top in problems with few effective distractors or many distractors and low depth; fewshot elaborations with proof chains are efficient given high effective distraction and great depth; and piecemeal elaborations perform best otherwise. One emerging overall pattern here is that fewshot elaborations tend to perform better than piecemeal elaborations if the task is very difficult and the model is negatively biased (baseline below random guessing).
5 Analysis And Discussion
The findings so far can be summarized as follows. GPT-2 follows a simple heuristic and predicts answers in line with their frequency of previous occurrence when prompted with a ChainRuler problem. This heuristic is effective in some lucky cases, but quickly leads to systematic failure when effective distraction increases. While dynamic problem elaboration decreases the effectiveness of the heuristic in the lucky cases, it also reduces bias and substantially improves performance across a wide spectrum of problem constellations. Different elaboration methods display characteristic performance fingerprints.
In the following, we further differentiate and explain these results in terms of
1. the degree to which generated elaborations facilitate the successful application of the simple syntactic heuristic used by the model (Subsection 5.1);
2. the degree to which generated elaborations cohere with the original problem to be solved, i.e., the verisimilitude, pertinence, and faithfulness of the elaborations (Subsection 5.2);
3. the degree to which generated elaborations syntactically resemble the problem-specific "ideal" elaborations, as alluded to in the fewshot sample solutions (Subsection 5.3);
4. the degree to which piecemeal elaborations are syntactically redundant and internally coherent (Subsection 5.4).
5.1 Do Generated Elaborations Facilitate The Application Of The Simple Heuristic?
If the model is initially lucky, i.e. there are few effective distractors, its syntactic heuristic is highly effective and adding additional context just tends to reduce overall accuracy (Figure 3b ). Yet, what's the mechanism underlying the performance boost due to dynamic problem elaboration we have observed? Does problem elaboration (A) block the application of the syntactic heuristic whenever it's not successful and incite the model to deploy a better prediction strategy? Or, (B) does it expand the problem in a way such that the simple syntactic heuristic becomes more effective if applied on the extended context?
To address this question, we introduce the notion of total (epistemic) luck, a counterpart to the concept of effective distraction (Subsection 3.1). Total epistemic luck refers to the number of occurrences of the conclusion's predicate both in the original problem description and the generated elaboration (provided the conclusion's predicate is not preceded by "not"). Given a context with high total luck, the simple syntactic heuristic is likely to yield the correct answer to a ChainRuler problem. Figure 4a plots the model's prediction score of the correct answer (cf. Subsection 3.3) as a function of total epistemic luck for different types of elaboration. For baseline none (gray), we observe a clear linear relationship: the model scores the correct answer with p=.25 if total luck equals 0, compared to p>.8 if total luck is 6. This is another way to say that the model uses a simple syntactic heuristic for prediction. Now, importantly, we observe a similar relationship for the predictions based on elaborations, too (where the relationship is slightly stronger for piecemeal elaborations). This suggests that the model is relying on the syntactic heuristic no matter whether it bases its prediction on the original or the dynamically expanded context. What seems to drive the performance boost by problem elaboration, then, is an expansion of the context that facilitates the application of the simple heuristic. In other words, the model is overall more lucky with than without problem elaboration. In fact, and consistent with our analysis, Figure 4b shows that problem elaboration increases total epistemic luck by, on average, 0.35-0.7 points.
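Total epistemic luck, as defined above, amounts to a simple count. The following sketch is one possible reading of that definition: the whitespace-and-letters tokenization and the handling of multi-word predicates are assumptions, and the predicate is passed in its un-negated form.

```python
import re


def total_luck(predicate, context, elaboration=""):
    # Count occurrences of the predicate that are not directly preceded by "not".
    tokens = re.findall(r"[a-z']+", f"{context} {elaboration}".lower())
    target = predicate.lower().split()
    n = len(target)
    count = 0
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target and (i == 0 or tokens[i - 1] != "not"):
            count += 1
    return count


ctx = (
    "Jill is green. If someone is green, then they are loud. "
    "If someone is loud, then they are guilty. "
    "If someone is empty, then they are innocent."
)
luck_without = total_luck("guilty", ctx)
luck_with = total_luck("guilty", ctx, "Jill is guilty. Jill is not guilty.")
```

In this example the elaboration raises total luck from 1 to 2: its first sentence repeats the conclusion's predicate, while the negated second sentence is not counted.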
5.2 Do Generated Elaborations Cohere With The Problem To Be Solved?
Verisimilitude, pertinence and faithfulness measure the degree to which an elaboration coheres with different aspects of a given problem.

Figure 5: Accuracy in ChainRuler task for six types of elaborations as a function of (a) their verisimilitude, that is the semantic similarity between generated elaboration and correct answer (conclusion), (b) their pertinence, that is the semantic similarity between generated elaboration and sequence of possible answers, and (c) their faithfulness, that is the semantic similarity between generated elaboration and context. Top row: without contraposition. Bottom row: with contraposition.
Verisimilitude. Informal explication: degree to which the elaboration is semantically similar to the ground truth. Formal operationalization: cosine similarity between sentence-embeddings of elaboration and conclusion.
Pertinence. Informal explication: degree to which the elaboration is semantically similar to the disjunction of possible answers. Formal operationalization: cosine similarity between sentence-embeddings of elaboration and question.
Faithfulness. Informal explication: degree to which the elaboration is semantically similar to the problem description (premises). Formal operationalization: cosine similarity between sentence-embeddings of elaboration and context.
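The three operationalizations share one computation and differ only in the text the elaboration is compared with. The sketch below assumes that sentence embeddings are already available as plain lists of floats; the embedding model itself is not part of the sketch, and the function names and toy vectors are illustrative.

```python
import math


def cosine_similarity(u, v):
    # Cosine of the angle between two embedding vectors of equal length.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def coherence_scores(emb_elaboration, emb_conclusion, emb_question, emb_context):
    # Same measure, three reference texts: conclusion, question, and context.
    return {
        "verisimilitude": cosine_similarity(emb_elaboration, emb_conclusion),
        "pertinence": cosine_similarity(emb_elaboration, emb_question),
        "faithfulness": cosine_similarity(emb_elaboration, emb_context),
    }


# Toy two-dimensional vectors standing in for real sentence embeddings.
scores = coherence_scores([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [3.0, 4.0])
```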
Transformer embeddings offer an elegant operationalization of the metaphoric notion of semantic similarity. (Technically speaking, we calculate cosine similarity between the DistilBERT-embeddings of the corresponding texts [Reimers and Gurevych, 2019] .) Figure 5a plots GPT-2's accuracy on ChainRuler tasks as a function of the elaborations' verisimilitude.
As expected, the more a dynamically generated elaboration resembles the correct answer, the more likely the model is to provide the correct answer given the elaboration. This observation is consistent with our analysis of total epistemic luck (Subsection 5.1) as well as with the finding that oracle elaborations which just repeat the correct answer (maximum verisimilitude) boost accuracy levels above 80% (cf. Appendix B). Moreover, these highly plausible results also corroborate the method of semantic similarity analysis based on transformer embeddings. Figure 5b plots GPT-2's accuracy on ChainRuler tasks as a function of the generated elaboration's semantic similarity to the problem's question, which presents the three alternative answers. We observe a positive relation between pertinence and accuracy, especially for recursive and structured elaborations. If a piecemeal elaboration really addresses the question, then it is, on average, more likely to be effective. Figure 5c plots GPT-2's accuracy on ChainRuler tasks as a function of the generated elaboration's faithfulness to the problem description. For ChainRuler tasks without contraposition, we obtain clear and highly plausible results (upper row): The more faithful the dynamic elaboration, the more effective it is in terms of helping the model to predict the correct answer. The relationship is most pronounced for piecemeal elaborations, such that the most faithful (top 7%) recursive and structured elaborations increase accuracy by 24 and 16 percentage points, respectively (as compared to no elaboration). Concerning the ChainRuler tasks with contraposition (bottom plot in Figure 5c), faithfulness as measured by embedding similarity seems to have no effect on accuracy.
However, a manual re-analysis of the data reveals that faithfulness is positively correlated with accuracy and that cosine similarity between BERT-embeddings simply fails to reflect deductive implications as soon as contraposition is involved [see also Kassner and Schütze, 2020] .
All in all, variance in elaboration effectiveness can partly be explained in terms of coherence with the original problem (as confirmed by a logistic regression analysis: the R² statistic with depth and effective distraction as explanatory variables equals 2.7%, but increases to 9.8% if verisimilitude, pertinence and faithfulness are included as further explanatory variables). A further take-away from Figure 5 is that piecemeal elaborations benefit most from cohering with the problem. The effectiveness of free and fewshot elaborations increases to a much lesser extent with rising pertinence or faithfulness. This might be due to the following difference: Free and fewshot elaborations may resemble a question or a problem description in argumentatively irrelevant ways (simply by repeating the question or mimicking the syntactic structure of the rules). Piecemeal elaborations, however, consist by design of statements about the problem's subject and are hence much more likely to cohere with the problem in inferentially relevant ways, if they cohere with it at all.
5.3 Do Generated Elaborations Resemble Ideal Elaborations?
We consider two kinds of problem-specific "ideal" elaborations. Given a ChainRuler problem, the perfect proof chain consists of the fact, the rule chain (in correct order), and the final conclusion. The intermediary and final conclusions are simply the intermediary conclusions (in the order in which they can be inferred by applying the rules) plus the final conclusion. We use BLEU2-scores to measure the extent to which a given problem elaboration syntactically resembles the corresponding ideal elaboration.
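To make the two ideal elaborations and the similarity measure concrete, the following minimal sketch assembles both from a ChainRuler-style example and scores a candidate elaboration against them. It rests on assumptions that are not stated in this section: whitespace tokenization, a single reference, and a plain sentence-level BLEU-2 (geometric mean of clipped unigram and bigram precision times a brevity penalty) without smoothing. The function name `bleu2` and the variable names are illustrative only.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu2(candidate, reference):
    """Unsmoothed sentence-level BLEU-2 against one reference (assumed variant)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if len(cand) < 2 or len(ref) < 2:
        return 0.0
    precisions = []
    for n in (1, 2):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped overlap: each candidate n-gram counts at most as often
        # as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if overlap == 0:
            return 0.0
        precisions.append(overlap / sum(cand_counts.values()))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.sqrt(precisions[0] * precisions[1])


# Ideal elaborations for an example problem about "Chris".
fact = "Chris is gray."
rule_chain = [
    "If someone is gray, then they are clever.",
    "If someone is clever, then they are tall.",
]
conclusions = ["Chris is clever.", "Chris is tall."]  # intermediary, then final

perfect_proof_chain = " ".join([fact, *rule_chain, conclusions[-1]])
intermediary_and_final = " ".join(conclusions)

elaboration = "Chris is clever. Chris is tall."
score_pc = bleu2(elaboration, perfect_proof_chain)
score_ic = bleu2(elaboration, intermediary_and_final)
```

In this toy case the elaboration coincides with the intermediary and final conclusions, so `score_ic` is maximal, while `score_pc` is low because the elaboration is much shorter than the proof chain and omits the rules.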
As the boxplot in Figure 6a reveals, fewshot PC and fewshot PCIC elaborations are syntactically highly similar to the corresponding perfect proof chains. Similarly, fewshot IC elaborations syntactically resemble intermediary and final conclusions to a greater extent than the other free and fewshot elaborations, albeit less so than piecemeal elaborations (Figure 6b). Thus, the fewshot samples are clearly shaping the generated elaborations. Yet, does this effect pay off in terms of accuracy? Do elaborations which syntactically resemble an "ideal" elaboration tend to be more effective? The barplots in Figures 6a and 6b answer this question by reporting the degree to which syntactic similarity leads to higher accuracy, as measured by a logistic regression (which controls for depth and breadth). Fewshot IC and piecemeal elaborations tend to be much more effective if they resemble the perfect proof chain. Accordingly, one reason for the overall poor performance of fewshot IC (cf. Appendix B) seems to be that the model mostly fails to generate the correct intermediary and final conclusions, even if "told so" by fewshot examples. This is not the case at all for fewshot PC and fewshot PCIC elaborations. As soon as the model is "told to" generate proof chains, syntactic similarity to the ideal proof chain ceases to be an indicator of accuracy.
The ability of GPT-2 to generate and exploit ideal elaborations (in particular, proof chains) is strongly influenced by the presence of negations in the problem description. Once more, "not" turns out to be a trouble-maker. To see this, we consider ChainRuler tasks whose singular premise (fact) is not negated. Table 3 reports the accuracy difference in these tasks as compared to all tasks. The fewshot elaborations with proof chain examples are significantly more effective with unnegated facts (actually, fewshot PC now outperforms baseline none). This seems to be due not only to the generation of more accurate proof chains, but also to a better ability of the model to exploit good elaborations, as the increased accuracy of oracle elaborations suggests.
5.4 Are Piecemeal Elaborations Syntactically Redundant And Internally Coherent?
Piecemeal elaborations are composed of four separately generated statements about the problem's subject. We assess the syntactic internal redundancy of such an elaboration by averaging over pairwise BLEU2-scores, and take the mean cosine similarity of the sentence-embeddings to be a measure of the elaboration's semantic internal coherence.
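The averaging scheme behind both measures can be sketched as follows. The helper `pairwise_mean` averages any symmetric score over all unordered sentence pairs; plugging in a BLEU-2 scorer would yield the redundancy measure, and plugging in the cosine similarity of sentence embeddings yields the coherence measure. As an explicit assumption, the sketch replaces the transformer sentence embeddings with a toy bag-of-words vector (`toy_embed`), purely to keep the example self-contained.

```python
import math
from collections import Counter
from itertools import combinations


def pairwise_mean(sentences, score):
    """Average a symmetric score over all unordered pairs of sentences."""
    pairs = list(combinations(sentences, 2))
    if not pairs:
        raise ValueError("need at least two sentences")
    return sum(score(a, b) for a, b in pairs) / len(pairs)


def toy_embed(sentence):
    # Stand-in for a transformer sentence embedding (assumption):
    # a sparse bag-of-words count vector.
    return Counter(sentence.lower().rstrip(".").split())


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def coherence(sentences):
    """Mean pairwise cosine similarity of the (toy) sentence embeddings."""
    return pairwise_mean(sentences, lambda a, b: cosine(toy_embed(a), toy_embed(b)))


# A piecemeal elaboration consists of four statements about the subject.
repetitive = ["Chris is clever."] * 4
varied = ["Chris is gray.", "Chris is clever.", "Chris is tall.", "Chris is not small."]

coherence_repetitive = coherence(repetitive)
coherence_varied = coherence(varied)
```

An elaboration that repeats one sentence four times reaches the maximum coherence under this measure, whereas the varied elaboration scores lower.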
As Figures 6c and 6d show, piecemeal elaborations are highly redundant and internally coherent; recursive elaborations even more so than structured ones. (Roughly half of the recursive elaborations simply consist of one and the same sentence repeated four times.) Redundancy and coherence have, however, opposite effects on the effectiveness of recursive versus structured elaborations. Recursive elaborations are more effective the less syntactically redundant and semantically coherent they are. Structured elaborations, in contrast, gain from redundancy and coherence.
These findings can be explained in terms of the underlying generation methods. With recursive elaboration, a first sentence about the subject is generated and appended to the context. Then, the model is prompted to generate a second statement about the subject given the updated context. Either the model "sees" what else can be inferred about the subject given the newly added first sentence, and generates another sentence, which leads to a sensible proof chain and to low redundancy. Or the model does not "see" what else can be inferred from the updated context, and then simply generates again what it has generated before (additionally encouraged to do so by positive feedback effects observed in Holtzman et al. [2019]), namely the first sentence, which is a sign of poor inferential insight and results in high redundancy. This is why low redundancy goes along with high effectiveness of recursive elaborations. By contrast, the four individual sentences that make up a structured elaboration are generated independently of each other. So, the more confident the model is about how to complete a sentence about the subject, the more likely it is that this sentence will be generated several times when the model is prompted repeatedly and independently. For structured elaboration, redundancy and internal coherence are therefore indicators of confidence, which explains (assuming that models are all in all decently calibrated for ChainRuler tasks) why high redundancy coincides with high accuracy.
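The difference between the two generation schemes can be sketched schematically. The sketch below abstracts the language model into a hypothetical callable `generate(context, subject)` that returns one statement about the subject; prompt wording and decoding settings are not given in this section and are deliberately left out. The rule-based `toy_generate` is a stand-in for GPT-2 (an assumption made only so that the example runs without a model).

```python
import re
from typing import Callable, List

# Hypothetical interface: (context, subject) -> one generated statement.
Generate = Callable[[str, str], str]


def recursive_elaboration(context: str, subject: str,
                          generate: Generate, n: int = 4) -> List[str]:
    """Each statement is generated from the context plus all earlier statements."""
    statements: List[str] = []
    for _ in range(n):
        updated_context = " ".join([context, *statements])
        statements.append(generate(updated_context, subject))
    return statements


def structured_elaboration(context: str, subject: str,
                           generate: Generate, n: int = 4) -> List[str]:
    """Each statement is generated independently from the original context."""
    return [generate(context, subject) for _ in range(n)]


# Toy stand-in for the language model (assumption): applies one rule to the
# most recent property asserted of the subject, or repeats it if no rule applies.
RULES = {"gray": "clever", "clever": "tall"}


def toy_generate(context: str, subject: str) -> str:
    known = re.findall(rf"{re.escape(subject)} is (\w+)\.", context)
    latest = known[-1]
    return f"{subject} is {RULES.get(latest, latest)}."


context = ("If someone is gray, then they are clever. "
           "If someone is clever, then they are tall. Chris is gray.")

recursive = recursive_elaboration(context, "Chris", toy_generate)
structured = structured_elaboration(context, "Chris", toy_generate)
```

With this toy generator, the recursive scheme advances along the rule chain until nothing new can be inferred and then repeats itself, while the structured scheme yields the same first-step inference four times, mirroring the redundancy patterns discussed above.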
6 Conclusion And Future Work
In this paper, we introduce ChainRuler, a dataset for multi-hop deductive argumentation, and assess GPT-2's zero-shot ability both to solve the inference tasks and to generate effective problem elaborations, i.e., texts which, once added to the context, improve performance. Our main findings are:
• GPT-2 follows a simple heuristic when prompted with a ChainRuler reasoning problem. This leads to high accuracy in benevolent cases, but causes systematic bias as soon as effective distraction is high or the task involves contraposition: pre-trained GPT-2 then performs much worse than random guessing. (Section 4)
• Dynamic context expansion with generated problem elaborations can, depending on the problem characteristics, increase accuracy by up to 9%, i.e., by an order of magnitude observed in comparable experiments on other tasks [Saha et al., 2020]. Elaborations possess, depending on how they are generated, characteristic "accuracy fingerprints" over the problem spectrum. (Section 4)
• Dynamic problem elaboration doesn't prevent the model from applying its heuristic. On the contrary, it expands the context so that the syntactic heuristic can be applied more successfully. Bluntly put: The reasoning is all in the context generation, the final prediction remains "stupid". (Subsection 5.1)
• Variance in elaboration effectiveness can be explained in view of the extent to which an elaboration coheres with the problem to be solved. Moreover, the most faithful so-called recursive and structured elaborations boost accuracy by 24 and 16 percentage points, respectively, compared to the no elaboration treatment. (Subsection 5.2)
• Fewshot learning (in the sense of [Brown et al., 2020]) powerfully shapes the generated elaborations (Subsection 5.3), but does not lead to significantly stronger overall performance (Section 4). Rather, different types of fewshot elaborations excel under different kinds of problem characteristics (Section 4), especially so when negations are absent in the corresponding problem descriptions (Subsection 5.3).
• Redundancy is not necessarily a flaw of an elaboration. Rather, repeating a statement over and over again can be a sign of a model's strong confidence and enable it to successfully exploit the generated elaboration (Subsection 5.4).
All these results are obtained with pre-trained GPT-2 and without further fine-tuning. This is certainly the reason why we observe substantial, yet still clearly limited inference skill and ability to generate effective problem elaborations. That said, it seems worthwhile to explore, in future research, whether generative Transformer language models can learn to think aloud. Obviously, there are alternative set-ups for training language models to generate and exploit sound problem elaborations, for example:
• The language model is fine-tuned on the specific task, e.g., the ChainRuler data.
• The language model is fine-tuned on a given corpus of good problem elaborations (like the ones considered in Subsection 5.3).
• The language model is fine-tuned on a dynamically evolving dataset: The model generates free elaborations. Those elaborations that increase prediction accuracy to the greatest extent are added to the training data. The model is fine-tuned on the training data. Next, another round of free elaborations is generated; once more, the best elaborations are added to the training data, and so on.
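The third set-up can be read as a simple loop, sketched below under stated assumptions. The callables `generate`, `gain` and `fine_tune` are hypothetical placeholders for free elaboration generation, the measured accuracy gain of an elaboration, and a fine-tuning step; the fraction `keep` of best elaborations retained per round is likewise an assumption, since the text does not specify a selection threshold.

```python
from typing import Any, Callable, List, Tuple

Example = Tuple[str, str]  # (problem, elaboration)


def self_training_rounds(
    model: Any,
    problems: List[str],
    generate: Callable[[Any, str], str],      # hypothetical: free elaboration
    gain: Callable[[Any, str, str], float],   # hypothetical: accuracy gain
    fine_tune: Callable[[Any, List[Example]], Any],  # hypothetical: training step
    rounds: int = 3,
    keep: float = 0.1,                        # assumed selection fraction
) -> Tuple[Any, List[Example]]:
    """Iteratively grow the training data with the most helpful elaborations."""
    training_data: List[Example] = []
    for _ in range(rounds):
        # 1. Generate one free elaboration per problem with the current model.
        candidates = [(p, generate(model, p)) for p in problems]
        # 2. Rank elaborations by how much they increase prediction accuracy.
        ranked = sorted(candidates, key=lambda pe: gain(model, pe[0], pe[1]),
                        reverse=True)
        # 3. Add the best ones to the training data.
        n_keep = max(1, int(len(ranked) * keep))
        training_data.extend(ranked[:n_keep])
        # 4. Fine-tune on the accumulated training data.
        model = fine_tune(model, training_data)
    return model, training_data


# Toy usage with stub components: the "model" is just a version counter.
final_model, data = self_training_rounds(
    model=0,
    problems=["p1", "p2", "p3", "p4"],
    generate=lambda m, p: f"{p}:v{m}",
    gain=lambda m, p, e: float(p[1:]),
    fine_tune=lambda m, d: m + 1,
    rounds=2,
    keep=0.5,
)
```

The stubs only exercise the control flow; in an actual experiment each callable would wrap the language model and an evaluation on held-out problems.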
Besides improvements in accuracy and reliability, transfer learning effects would be of major interest in this context. For instance, it would be important to study whether language models are able to generalize a problem solving heuristic, and to produce effective elaborations beyond the tasks they have been trained on.
[Figure: heatmap panels with x-axis depth (1-6) and y-axis effective_distraction (0-5). Panels for cntrp=False: (free_elaboration, none), (fewshot_ic_elaboration, none), (fewshot_pc_elaboration, none), (fewshot_pcic_elaboration, none), (structured_elaboration, none), (recursive_elaboration, none). Panels for cntrp=True (partially extracted): (fewshot_ic_elaboration, none), (fewshot_pcic_elaboration, none). Cell values are omitted here.]
[...] they are clever. If someone is clever, then they are tall. Chris is gray. If someone is tired, then they are small. Does this mean that Chris is tall, is not tall, or is small? Explain! Well, it says that Chris is gray. It follows that Chris is clever. And therefore, Chris is tall.
Fewshot PC
Here is what we know: If someone is lonely, then they are not brown. If someone is big, then they are not popular. Bill is brown. Does this mean that Bill is lonely, is not lonely, or is not popular? Explain! Well, it says that Bill is brown. If someone is brown, then they are not lonely. Therefore, Bill is not lonely.
Here is what we know: If someone is boring, then they are tall. If someone is gray, then they are clever. If someone is clever, then they are tall. Chris is gray. If someone is tired, then they are small. Does this mean that Chris is tall, is not tall, or is small? Explain! Well, it says that Chris is gray. If someone is gray, then they are clever. If someone is clever, then they are tall. Therefore, Chris is tall.
Fewshot PCIC
Here is what we know: If someone is lonely, then they are not brown. If someone is big, then they are not popular. Bill is brown. Does this mean that Bill is lonely, is not lonely, or is not popular? Explain! Well, it says that Bill is brown. If someone is brown, then they are not lonely. Therefore, Bill is not lonely.
Here is what we know: If someone is boring, then they are tall. If someone is gray, then they are clever. If someone is clever, then they are tall. Chris is gray. If someone is tired, then they are small. Does this mean that Chris is tall, is not tall, or is small? Explain! Well, it says that Chris is gray. If someone is gray, then they are clever. It follows that Chris is clever. If someone is clever, then they are tall. And therefore, Chris is tall.
Figure 7 reports detailed accuracy gains achieved by various kinds of dynamic problem elaborations, including oracles.