Transformers as Soft Reasoners over Language

P. Clark
Oyvind Tafjord
Kyle Richardson
ArXiv
2020
View in Semantic Scholar

Abstract

Beginning with McCarthy's Advice Taker (1959), AI has pursued the goal of providing a system with explicit, general knowledge and having the system reason over that knowledge. However, expressing the knowledge in a formal (logical or probabilistic) representation has been a major obstacle to this research. This paper investigates a modern approach to this problem where the facts and rules are provided as natural language sentences, thus bypassing a formal representation. We train transformers to reason (or emulate reasoning) over these sentences using synthetically generated data. Our models, that we call RuleTakers, provide the first empirical demonstration that this kind of soft reasoning over language is learnable, can achieve high (99%) accuracy, and generalizes to test data requiring substantially deeper chaining than seen during training (95%+ scores). We also demonstrate that the models transfer well to two hand-authored rulebases, and to rulebases paraphrased into more natural language. These findings are significant as it suggests a new role for transformers, namely as limited "soft theorem provers" operating over explicit theories in language. This in turn suggests new possibilities for explainability, correctability, and counterfactual reasoning in question-answering.

1 Introduction

AI has long pursued the goal of giving a system explicit knowledge, and having it reason over that knowledge to reach conclusions, dating back to the earliest years of the field, e.g., McCarthy's Advice Taker (1959) , and Newell and Simon's Logic Theorist (1956) . While this has resulted in impressive applications, e.g., [Metaxiotis et al., 2002] , building and reasoning over the required formal representations has also proved challenging [Musen and Van der Lei, 1988] . In this work, we explore a modern approach to this goal, and ask whether transformers can be trained to directly reason (or emulate reasoning), but using rules expressed in language, thus bypassing a formal representation. If this can be Alan is blue. Alan is rough. Alan is young. Bob is big. Bob is round. Charlie is big. Charlie is blue. Charlie is green. Dave is green. Dave is rough.

If someone is young and round then they are kind. Big people are rough. If someone is round and big then they are blue. All rough people are green. Q1: Bob is green. True/false? [Answer: T] Q2: Bob is kind. True/false? [F] Q3: Dave is blue. True/false? [F] Figure 1: Questions in our datasets involve reasoning with rules. The inputs to the model are the context (facts + rules) and a question. The output is the T/F answer to the question. Here the underlying reasoning for the true fact (Q1) is: Bob is big, therefore rough (rule2) therefore green (rule4). (This is an example of depth 2 reasoning). Note that the rules themselves change for different questions in the datasets. Facts not provably true are assumed false (closed-world assumption). done, new opportunities for question-answering, explainability, correctability, and counterfactual reasoning may become possible.

Figure 1: Questions in our datasets involve reasoning with rules. The inputs to the model are the context (facts + rules) and a question. The output is the T/F answer to the question. Here the underlying reasoning for the true fact (Q1) is: Bob is big, therefore rough (rule2) therefore green (rule4). (This is an example of depth 2 reasoning). Note that the rules themselves change for different questions in the datasets. Facts not provably true are assumed false (closed-world assumption).

This goal is quite distinct from question-answering as selecting an answer span in a passage, today's prevailing paradigm, e.g., [Rajpurkar et al., 2016] . Rather, we want the system to reason over the provided rules to find conclusions that follow. Our goal is also distinct from that of inducing rules from examples, e.g., given instances of family relationships, inducing that a parent's parent is a grandparent [Sinha et al., 2019] , something that transformers are already known to do well. Rather, here we provide rules explicitly, and wish transformers to draw appropriate conclusions, as illustrated in Figure 1 . Here, rather than inducing rules from examples, our task involves learning to emulate a reasoning algorithm.

We provide the first demonstration that this is possible, i.e., that transformers can reason with explicit rule sets in language. This result suggests a new role for transformers, namely as a kind of limited "soft theorem prover" (Figure 2) . 1 This in turn may allow inspection and control of the knowledge that the model is manipulating, with potential benefits for explanation, correctability, and counterfactual reasoning.

Figure 2: (a) Traditional formal reasoning applies a theorem prover to axioms in order to answer a question. (b) Our work here strives for a linguistic analog, where a transformer serves as a “soft theorem prover” over knowledge expressed linguistically.

Our investigations here are in a limited setting: Rules are linguistic expressions of conjunctive implications condition [∧ condition]* → conclusion, 2 with the semantics of logic programs with negation [Apt et al., 1988] ; and reasoning is the deduction of a statement's truth according to these semantics. 3 However, although there is still a potentially large gap to natural language inference (NLI), 4 our approach also suggests a path to teaching machines to reason over broader language, with similar potential benefits.

We leave open the question of whether the transformer is actually "reasoning", and even what that might mean in a neural setting. Rather, we show that transformers can reliably emulate the i/o behavior of a formal reasoner, including applied to test data requiring more reasoning that at training time, two hand-authored rulebases, and rulebases rephrased into more natural (crowdsourced) language.

The paper is organized to address the following questions, and contributes the following results:

1. Can transformers learn to reason with explicit rules?

Figure 3. Not extracted; please refer to original document.

Figure 4: Learning rates (in-distribution test accuracy vs. number of training examples). The lower depth models require less training data, but do not generalize well (boxed area in Table 1).

Figure 5. Not extracted; please refer to original document.

Figure 6. Not extracted; please refer to original document.

Figure 7. Not extracted; please refer to original document.

Figure 8. Not extracted; please refer to original document.

We train and test on rules expressed in (synthetic) language, and find high (99%) accuracy, including on test questions requiring a greater depth of reasoning than seen during training (scoring up to 95%, Table 1 ). 2. Can the trained model solve hand-authored reasoning problems? We find the trained models are able to solve five of six variants of two independently authored rule-based problems, zero shot (90%+ scores, Table 4 ). 3. Do the results transfer to theories expressed in more natural language? Models also perform well when trained and tested on theories paraphrased into more natural (crowdsourced) language (98% score). The best earlier model can even partially solve these problems zero-shot (66% accuracy, Table 5 ). 4. Can the model identify which facts an answer depends on? We show that the model is largely able to do this (94% F1), including perfectly for over 70% of the questions. This is a first step towards having a model create an explanation for its conclusions. (Section 4.5 and Figure 9 ). 5. Can other neural architectures learn to reason? Our experiments show a particular transformer (RoBERTa) is sufficient for our tasks, but is it necessary? We show that two other systems, BERT and ESIM (an LSTMbased model) [Chen et al., 2016] , are also able to learn these tasks, albeit with lower scores (95% and 80% respectively, vs. 98%). This suggests that our results are not specific to RoBERTa or transformers, although transformers learn the tasks more easily (Table 6 ).

Figure 9: Counts of the F1 scores for predicting which sentences are critical to the proofs of questions in DMax (test, no negation subset). For over 70% of the questions, the model predicts critical sentences perfectly (F1=1.0), with high F1 in the remaining case.

Table 1: Accuracy of models (Mod0,...) trained and tested on the five datasets (“Test (own)” row), and tested on all, and different slices, of the DMax test set. The boxed area indicates test problems at depths unseen during training.

Table 4: Accuracy of the earlier models tested on hand-crafted rulebases (zero shot, no fine-tuning). Note that the models were only trained on the earlier datasets (e.g., Figures 1 and 3), and thus the new rulebases’ entities, attributes, and predicates (bar is()) are completely unseen until test time.

Table 5: Accuracy with rules paraphrased into more natural language (ParaRules), without fine-tuning (zero shot) and with (last column only). The strongest zero-shot model (MMax) partially solves (66.6%) this problem zero-shot, with strongest performance for depth 0 and 1 inferences.

Table 6: Transformers (RoBERTa,BERT) are sufficient but not strictly necessary for this task, although other architectures (ESIM) do not score as well. DECOMP was run as a sanity check that the datasets are not trivially solvable - its low score (random baseline is 50%) suggests they are not.

Figure 2: (a) Traditional formal reasoning applies a theorem prover to axioms in order to answer a question. (b) Our work here strives for a linguistic analog, where a transformer serves as a "soft theorem prover" over knowledge expressed linguistically.

Although our demonstrations are within constrained environments, the work suggests new avenues for using transformers for both formal theorem proving and natural language inference (NLI), discussed in Section 5. In particular, the ability to derive reasoned conclusions from stated knowledge suggests new opportunities for explainability, correctability/machine instruction, and counterfactual reasoning in question-answering.

2 Related Work

While our work is, to the best of our knowledge, the first systematic study of transformers reasoning over explicitly stated rule sets, there are several datasets that make a first step towards this by testing whether neural systems can apply explicit, general knowledge in a particular situation. Two synthetic datasets that test whether a single rule can be applied correctly are as follows: 1. Task 15 in the bAbI dataset [Weston et al., 2015] applies rules of the form "Xs are afraid of Ys" to an instance, e.g., "Sheep are afraid of wolves. Gertrude is a sheep. What is Gertrude afraid of? A:wolves" 2. The synthetic, conditional probes in [Richardson et al., 2020] test single rule application, e.g., "If Joe has visited Potsdam then Anne has visited Pampa. Joe has visited Potsdam. Has Anne visted Pampa? A:yes" The associated papers also show that neural systems can learn these tasks almost perfectly (100% and >97% accuracy respectively).

Similarly, two reading comprehension datasets involve applying general rule(s) to text:

3. In the QuaRTz dataset , the "rules" are qualitative relationships, e.g., "A sunscreen with a higher SPF protects the skin longer.", and the dataset tests whether a system can apply a rule correctly to a situation, e.g., "Billy is wearing sunscreen with a lower SPF than Lucy. Who will be best protected from the sun?", thus testing one-step rule application. 4. ROPES (Reasoning over Paragraph Effects in Situations) [Lin et al., 2019] is similar, except the general knowledge is contained within a paragraph-length passage and must be applied to a pararagraph-length situation. While the non-synthetic nature of QuaRTz and ROPES adds realism to the task, it also complicates them as a test of "machine reasoning", in the sense of (producing the i/o behavior of) chaining rules to reach a conclusion. Rather, they test reading comprehension, requiring not just "reasoning" but also use of background knowledge and handling the diversity of language.

Although our core datasets may seem similar to the bAbI dataset [Weston et al., 2015] in using synthetic data, our probes are qualitatively different. Specifically, apart from bAbI Task 15 (above), the underlying rules needed to infer an answer in the bAbI tasks are implicit. For example, answering Task 1 questions such as "Mary went to the hallway. John went to the office. Where is Mary? A: hallway" requires inducing state-change rules such as "X moves to Y → X at Y". In other words, the bAbI tasks test whether a system can learn and apply these underlying rules from examples, while our concern here is reasoning with explicit rule sets, potentially different for each example (Figure 1) .

Similarly, our work is qualitatively distinct from work on multihop reasoning datasets, e.g., HotpotQA [Yang et al., 2018] . Again, for those problems, the implicit rules of inference (i.e., which multihop chains are valid) need to be inferred from examples (e.g., that "causes" is transitive). In our case, rules of inference are explicitly stated.

Our work can be seen as evaluating transformers for (a subset of) Natural Logic [MacCartney and Manning, 2014; Moss, 2010] , i.e., formal inference over statements expressed in language. It is also related to textual entailment and Natural Language Inference (NLI) [Manning and MacCartney, 2009] , but with the important difference that NLI also allows unsupported inferences that "a person would typically infer" [Dagan et al., 2013] . We discuss bridging the gap between our work and NLI in Section 5.4.

Several researchers have developed methods for Neural Theorem Proving (NTP), combining symbolic and neural methods to reason step-wise over language-derived structures, e.g., [Weber et al., 2019; . Similarly, there has been work on SAT solving [Selsam et al., 2018] , approximate (DNF) model counting [Abboud et al., 2020] , and formula embedding [Abdelaziz et al., 2020; Crouse et al., 2019] with neural networks, to help solve formal reasoning problems. While our goals are similar, we do not impose any structure on the neural reasoning process, instead wanting to know if the (i/o of the) reasoning process itself is learnable, using knowledge expressed in language.

Our task can perhaps best be viewed as one of algorithm emulation, here for systematic reasoning with rules. There have been numerous other demonstrations that transformers either already know [Talmor et al., 2019] or can learn to emu-late other algorithms, including for semantic parsing [He and Choi, 2019] , machine translation [Wang et al., 2019] , symbolic integration [Lample and Charton, 2019] , and mathematics [Saxton et al., 2019] . Here we investigate a transformer's ability to learn rule-based reasoning.

3 Dataset Generation

To investigate a transformer's ability to emulate rulebased reasoning, we generate five datasets requiring various depths of inference to answer the questions, as we now describe. Each example in a dataset is a triple (context,statement,answer), where context has the form (fact*,rule*), statement is the question, namely a declarative sentence to prove, and answer is either T (true) if statement deductively follows from the context, or F if it does not (false under a closed-world assumption, CWA). Facts, rules, and the question statements are expressed in (synthetic) English. Each example is essentially a (linguistic) standalone logical theory with an "Is it true?" question (the statement) posed against it. By virtue of the dataset generation procedure (below), we know every question can be answered by a formal reasoner (under a CWA). We are interested in whether a transformer can similarly learn to answer these questions, apparently requiring some neural form of reasoning.

3.1 Overview

To generate each example, we first generate a small theory (facts + rules) in logic, perform forward inference to derive all its implications, then select question statements from those implications (answer=true), and from unproven (positive) facts (answer=false, under the CWA).

The facts, rules, and questions are then expressed in (synthetic) English using simple natural language templates. We generate five datasets, each constrained by the maximum depth of inference required to prove the facts used in its questions (up to depths D=0, D≤1, D≤2, D≤3 and D≤5 respectively). Depth D=0 means the true facts can be "proved" by simple lookup in the context (no inference). The fifth dataset, called DMax, contains questions up to depth 5, and is used to test generalization to depths unseen in training on the other four datasets.

3.2 Theory Generation

Theories contain two types of facts:

• atributes is(e i , a j ) e.g., is(Alan,Big).

• relations r k (e i , e k ) e.g., eats(Dog,Rabbit).

The is() predicate assigns attributes to entities, while the r k () predicates relate two entities. Like people names, the symbols Dog, Rabbit, etc. also denote specific entities, i.e., denote "the dog", "the rabbit", etc. Rules are of the form:

condition [∧ condition]* → conclusion.

The first condition is a predicate whose first argument is a variable, 5 and second argument is an attribute or entity. For each subsequent condition and the conclusion, they are also

The bald eagle does not eat the dog. The cat chases the dog. The cat eats the bald eagle. The cat is nice. The cat likes the dog. The cat likes the rabbit. The dog is furry. The rabbit chases the bald eagle. The rabbit eats the bald eagle.

If someone does not eat the cat then they do not eat the dog. If someone likes the bald eagle then they do not like the rabbit. If someone eats the bald eagle and they do not eat the rabbit then they are furry. If someone is furry then they like the cat. Figure 3 : An example of a rulebase and 3 questions using relations with negation. The reasoning for the [T] answer is: The rabbit eats the bald eagle (given), therefore the rabbit is furry (rule3), therefore the rabbit likes the cat (rule4).

Q1

predicates whose first argument is either the same variable or a previously mentioned entity, and the second argument is a new attribute or entity. (In this way, rules are constrained to have at most one variable. Rules are implicitly universally quantified over that variable). For example, the formal form of the first rule in Figure 1 looks:

// If someone is young and round then they are kind.

is(?X,Young) ∧ is(?X,Round) → is(?X,Kind).

Each theory contains 1-16 facts and 1-9 rules generated at random. We generate two types of theory:

1. Type 1 uses only the is() predicate, with 4 entities {Alan,Bob,...} and 7 (non-mutually-exclusive) attributes {Blue,Rough,Young,...}, drawn randomly from pools of 10 names and 14 attributes respectively. 2. Type 2 uses is() and 3 other predicates {likes(), chases(), ...}, 4 entities {Cat,Dog,BaldEagle,...}, and 5 attributes {Big,Furry,...}, drawn randomly from pools of size 6, 10, and 10 respectively. We also generate a version of each that adds negation (not) in the facts and rule conditions/conclusions. Figure 1 is an example of Type 1, without negation. Figure 3 is an example of Type 2, with negation. Each dataset contains 100k examples (25k of each Type × without/with negation). Data is randomly split 70/10/20 into train/dev/test partitions, ensuring no overlap of theories between each partition.

3.3 Forward Inference

Given a randomly generated theory (facts+rules), we perform exhaustive forward inference to find all its implications, noting their proof(s). (As the domains are finite, the number of implications are finite too). For semantics, we treat the rulebase as a logic program, and infer the minimal, supported answer set implied by the program [Apt et al., 1988] . Negations in the rules' conditions are treated as negation as failure (NAF), and we ensure that the rulebase is stratified to avoid ambiguity and cycles [Bidoit and Froidevaux, 1991] . Inference is performed layerwise to find the minimal supported model, and inconsistent and unstratified rulebases are discarded. We also check that inference proceeds to the depth required for the target dataset, e.g., for the D≤3 dataset, at last one fact must require depth 3 inference to infer it for all its theories.

3.4 Question Generation And English Synthesis

For each theory, we generate several questions with answer 'true' by selecting from the inferred facts, one at each depth of inference from 0 to the dataset's target depth (e.g., for the D≤2 dataset, we generate 3 'true' questions at depths d = 0, 1, and 2 for each theory). For each 'true' question we also generate a 'false' question by negating a conclusion proven at the same depth. We then generate the same number of questions using facts that are unproven (false under a closed-world assumption), drawing equally from unproven, instantiated positive rule conclusions or other unproven positive facts. Half are used as questions labeled as false (via the CWA), and for diversity, half are flipped by negating the fact and changing the label to true (i.e., "f ? False" becomes "Not f ? True"). Thus a theory for depth d has (up to) 4(d+1) questions, with an equal balance of true and false answers. Each question is also annotated with the inference depth needed to answer it. 6 Finally the theories and questions are converted into (synthetic) English, using simple natural language templates plus rules to improve fluency (e.g., using pronouns). We use three templates (randomly selected per rule): "If condition [and condition]* then conclusion.", "All attribute* people|things are attribute.", and "attribute* people|things are attribute.", the last two only applicable to rules involving just attributes. Examples are shown in Figures 1 and 3 .

4 Experiments

We now describe five sets of experiments we have conducted, addressing the five questions described in the Introduction.

4.1 Models

We conduct all our experiments (bar Section 4.6) using RoBERTa-large, additionally fine-tuned on the RACE dataset [Lai et al., 2017] . This additional fine-tuning step has been previously shown to help with sensitivity to hyperparameters [Phang et al., 2018] and improve question-answering [Sun et al., 2018] . We use fixed hyperparameters (learning rate etc), inheriting the settings from RoBERTa on RACE .

We train RoBERTa to predict true/false (i.e., binary classification) for each question statement. Questions are supplied to RoBERTa as:

[CLS] context [SEP] statement [SEP],

where context is the theory (facts+rules, expressed in language) and statement is the fact to try and prove. The [CLS] output token is projected to a single logit. A logit score of >0 is treated as predicting true, otherwise the answer is false. Training is performed using cross-entropy loss. For evaluation, we measure accuracy. (The test data has an equally balance of TRUE/FALSE answers, hence the baseline of random guessing is 50%).

4.2 Can Roberta Answer Reasoning Questions?

We train and test RoBERTa models on each of our datasets D=0, D≤1, D≤2, D≤3, and DMax, containing problems requiring reasoning up to depths 0, 1, 2, 3, and 5 respectively. We then test the models on the DMax dataset, that includes problems at depths greater than the other datasets. The results are shown in Table 1 . The results suggest the following findings:

1. RoBERTa is able to master the test data almost perfectly (99% accuracy, row 1) even though the specific reasoning problems (facts+rules) in each test question are distinct from those in the training set. 2. The Depth=0 model, Mod0, only trained on lookup questions, is (unsurprisingly) unable to answer questions requiring reasoning (column Mod0). 7 3. As we train with increasingly deep inference, the models' ability to generalize improves. The D≤2 model (questions involving problems up to depth 2) achieves 71.1% on Depth=3 problems, while the D≤3 model generalizes well right up to the maximum depth tested (e..g, 97.6% for Depth=5 problems). Looking at the models' accuracies (on their own indistribution test data) as a function of training size, we see the higher depth models require more data to learn (Figure 4 ), but have better out-of-domain performance (boxed area in Table 1) .

We additionally test the robustness of the models' answers by perturbing the original theories. Specifically, for each test fact f that is true, we test whether removing a sentence that is part of the proof of f causes the prediction to (desirably) flip from true to false. We call these sentences in the proof tree critical sentences, as the truth of f depends on them. 7 In fact, we see an interesting learning artifact, namely Mod0 scores worse than random (50%) at depths higher than 2. This arises as Mod0 learns to predict all (positive) facts are false except the explicitly given facts, as that is all it has seen at training time. However, most (90%) of the facts proven at depth 2+ in DMax are positive facts proven true, hence Mod0 gets these all wrong. In a few cases, a fact is proven false at depth 2+, and thus Mod0 will by chance get these right. However, it is rare (10%) to see a negated fact with a proof, as half the theories do not use negation, and for the remainder, negative rule conclusions are generated with lower probability than positive rule conclusions during dataset generation. We see a similar artifact with the Mod1 model. Table 1 ).

Conversely, removing an irrelevant sentence should cause no change to the model's prediction. As we know the original proof trees for each fact f in the dataset, we can identify the critical and irrelevant sentences by simple inspection of those trees. 8 Typically, 1-6 sentences of the ≈15-20 sentences are critical for proving each provable fact. We test this using the no-negation 9 half of the DMax test set (≈10k questions). In this partition, 5904 questions have proofs (are true). (The remaining questions are false under the CWA). 10 For each of these questions, we remove each of the theory sentences s i in turn, and measure the prediction accuracy on each result. As there are about 19 sentences/theory on average, this results in 113978 "sentence removed" probes (of which 20746 have a critical sentence removed, and 93232 have an irrelevant sentence removed). Ideally, removing a sentence critical to a question f should flip the model's prediction from T to F, while removing a noncritical sentence should leave the prediction unchanged as T. We also measure overall performance on the entire dataset of questions with perturbed theories.

The results are shown in Tables 2 and 3. We observe: 1. The overall accuracy is largely unchanged on the full collection of questions with perturbed theories, suggesting robustness to these variants (last column, Table 2 ). 2. For the (20k) questions where the prediction is expected to flip from true to false, we see this flip occurs 81% of the time, Table 3 . This suggests moderate robustness to this specific type of perturbation, although notably less than for a formal theorem prover (that would make this flip 100% of the time). For the remaining (93k) questions, the prediction (correctly) stays true over 99% of the time (no Table) . Table 2 : Accuracy on the DMax (no negation) subset, and all its (113k) perturbed (one context sentence removed) variants. The overall accuracy (Remove Any, last column) is largely unchanged, but with a drop for the subset where a critical sentence was removed.

Table 2: Accuracy on the DMax (no negation) subset, and all its (113k) perturbed (one context sentence removed) variants. The overall accuracy (Remove Any, last column) is largely unchanged, but with a drop for the subset where a critical sentence was removed.

Table 3: On the true questions that were originally answered correctly (column 1), the predicted T answer should flip to predicted F when a critical sentence is removed. In practice, we observe this happens 81% of the time (16654/(16654+3895)). In a few (197) cases, the predicted answer was incorrect to start with (column 2). When an irrelevant sentence is removed, the predicted answer stays correct (T) over 99% of the time (not shown).

Original predictions for true (positive) facts: T F New T 3895 (should have flipped) 10 (incorrectly flips) Pred. F 16654 (correct flips)

187 (becomes correct) Table 3 : On the true questions that were originally answered correctly (column 1), the predicted T answer should flip to predicted F when a critical sentence is removed. In practice, we observe this happens 81% of the time (16654/(16654+3895)). In a few (197) cases, the predicted answer was incorrect to start with (column 2).

When an irrelevant sentence is removed, the predicted answer stays correct (T) over 99% of the time (not shown).

4.3 Performance On Hand-Authored Problems

To further test robustness and out-of-distribution performance, we test the trained models on two hand-authored reasoning problems, both including reasoning with negation, written independently of our datasets. Note that these new datasets are used purely as test sets (no training on them, i.e., zero-shot performance); their vocabulary of entities, attributes, and predicates (bar is()) are all new to the models at test time.

Test Datasets Birds The "birds" rulebase is a well-known logic problem illustrating the use of "abnormality" predicates [McCarthy, 1984] . We entered Sergot's formulation of it 11 verbatim (bar syntax), and generated a series of test questions using the same procedure as earlier. Figure 5 illustrates the problem (in restricted English, exactly as presented to our model) and four example questions. We created two linguistic expressions of the formal theory, Birds1 and Birds2. Birds2 is shown in Figure 5, while Birds1 is identical except "can/cannot fly" is replaced with "is/is not flying" to make the negation ("not") more explicit (this turns out not to matter). Electricity We also created a small rulebase about an electrical circuit, describing the conditions for an appliance to function. We created 4 variants of increasing complexity, containing 5, 6, 11, and 12 rules respectively. For each rulebase, we generate different scenarios (the facts) by randomly selecting from possible ground facts. Questions are then generated against each scenario using the same procedure as earlier, resulting in 4 test sets. Figure 6 shows the Electricity2 rulebase with an example scenario plus three questions. (Appendix B shows all four rulebases).

Results

The results are in Table 4 , tested using the earlier trained models. Note that these new problems and vocabularies were un-If someone is a bird and not abnormal then they can fly. If someone is an ostrich then they are a bird. If someone is an ostrich then they are abnormal. If someone is an ostrich then they cannot fly. If someone is a bird and wounded then they are abnormal. If someone is wounded then they cannot fly. Figure 5: Sergot's "birds" puzzle includes reasoning about abnormality predicates. The dataset contains these and other questions about the single theory.

The circuit has a switch.

The switch is on. The circuit has a light bulb.

If a circuit has a switch and the switch is on then the circuit is complete. If a circuit does not have a switch then the circuit is complete. If a circuit is complete then a current runs through the circuit. If a current runs through a circuit and the circuit has a light bulb then the light bulb is glowing. If a current runs through a circuit and the circuit has a bell then the bell is ringing. If a current runs through a circuit and the circuit has a radio then the radio is playing. seen during training (i.e., are zero-shot). We observe: 1. The "birds" problems are solved (almost) perfectly by all but the non-reasoning (Mod0) model (MMax gets one question wrong on Birds1). 2. The MMax model (trained on DMax) solves all but one of these datasets with 90%+ scores. These are two point demonstrations that the trained models can be used to solve novel reasoning problems with high reliability (90%+ in all but one case).

We see one surprising anomaly also: the models trained with deeper reasoning depths do slightly worse on Electric-ity4 than the depth 1 model, Mod1. From investigation, we find almost all failing questions at higher depths are those where the queried fact f is an unsatisfied rule conclusion (hence should be false), in particular when the first argument of f is not the first argument of one of the rule's conditions. Because of the way the original dataset was generated, examples similar to this are very rare in the training data, possibly causing this anomaly. More generally this illustrates that even when trained on a diversity of problems, the trained model can have unanticipated blind-spots. We discuss this further in Table 4 : Accuracy of the earlier models tested on hand-crafted rulebases (zero shot, no fine-tuning). Note that the models were only trained on the earlier datasets (e.g., Figures 1 and 3) , and thus the new rulebases' entities, attributes, and predicates (bar is()) are completely unseen until test time.

Section 5.2.

We ran the earlier trained models on the ParaRules test partition (no fine-tuning, i.e., zero shot). The results are shown in Table 5 . The strongest model, MMax, partially solves this dataset with a score of 66.6%, higher for questions requiring less inference, and lower for questions requiring more inference. (The below-random scores for D=0 reflect the same artifact as earlier, namely predicting everything as false except for facts explicitly given. See Footnote 7).

Note that these results are for zero-shot, with no model exposure to the paraphrased data during training. In constrast, we also trained a model using both of the D≤3 and ParaRules training partitions. The resulting model (last column in Table 5) has an accuracy of 98.8% on the ParaRules test partition (even though the ParaRules test rewordings are distinct from train and dev), showing near-perfect performance is learnable. Although a limited study, this suggests that our findings may extend to rulebases expressed in more natural language.

4.4 Reasoning With Paraphrased Rules

Our experiments so far have been with synthetic language, but our ultimate goal is to reason over full natural language. To test transfer to more natural linguistic forms, we generated a new dataset of 40k examples, using crowdworkers to paraphrase our theories. Of course, this only tests robustness to paraphrasing, not to abitrary natural language. Nevertheless, it is a small first step in this direction.

To generate our data, we follow a similar approach to [Sinha et al., 2019], described below. For this experiment, we used Type 1 theories without negation, i.e., the same form as in Figure 1 .

Dataset Generation

To generate the new dataset, called ParaRules, we first generated a novel collection of 10k theories (facts+rules) expressed in synthetic language, as before, then extracted the "fact groups" and rules from each. A "fact group" is all the facts in a theory about a particular person, e.g., (from Figure 1) "Alan is blue. Alan is rough. Alan is young.", while a rule is just the original "If...then..." sentence. We then asked crowdworkers to creatively re-express the fact-groups and rules, shown to them in English, in their own words. For example, the earlier fact-group might be rewritten as: "Alan is on the young side, but rough. He often feels rather blue.". Rewritten fact-groups were then turned into templates by variabilizing the person name. Turkers also rephrased each rule (no variabilization needed). Rephrasings were automatically checked to make sure that all the key attributes were mentioned (and no others included), and rejected otherwise.

We use these to assemble the new ParaRules dataset of 40k questions against ≈2k theories expressed in the paraphrased language. To build each theory, facts were collected by randomly sampling and instantiating fact-group templates with people's names, and rules were randomly sampled. An example is shown in Figure 7 . The train, dev, and test sets were generated using different partitions of the templates, to ensure that no templates were shared between partitions.

As we kept track of the corresponding logic underlying each fact group and rule, we can then generate questions as before: Exhaustively forward-chain on the (logic version of) the theory, discard if a contradiction is hit or reasoning is of insufficient depth (we require at least depth 3 reasoning), and Alan, who is round, red, kind, and also green, tends to be rather blue. In the snow sits Bob, crying from being cold. Charlie has green teeth and rough skin. People also notice his blue eyes.

A quite nice person who is red and green is also big. Any big, kind person that turns red is cold to the touch. Young, kind people have a habit of being nice. A kind person will certainly be young. Figure 7 : A paraphrased theory in the ParaRules dataset. The reasoning for the true answer here is: Alan is kind (given), therefore young (rule4), therefore nice (rule3). Table 5 : Accuracy with rules paraphrased into more natural language (ParaRules), without fine-tuning (zero shot) and with (last column only). The strongest zero-shot model (MMax) partially solves (66.6%) this problem zero-shot, with strongest performance for depth 0 and 1 inferences. then for each depth select inferred and non-inferred facts as true/false questions as before.

4.5 Generating Explanations

In Section 4.2, we tested (for the no-negation theories) whether removing a theory sentence s i caused the prediction for a true fact f to flip to false, and found that sentences causing a flip were very often (98%) part of the original proof of f (i.e., critical sentences), while sentences that did not were not (97%). Using that data about which removed sentences Above, the model has correctly identified the sentences critical to the answer (shown in green). Here, the underlying line of reasoning is: "The lion is red (first highlighted statement), therefore young (third highlighted), and as it also cold (second) it visits the rabbit (fourth)." Above, the model has identified two critical sentences (green), but has missed three (purple) and included an irrelevant rule (red). Here, the underlying line of reasoning is: "The lion eats the squirrel (fourth highlighted statement), therefore is rough (third), therefore young (first), and as it also sees the squirrel (sixth) it is green (second)." --------- Figure 9: Counts of the F1 scores for predicting which sentences are critical to the proofs of questions in DMax (test, no negation subset). For over 70% of the questions, the model predicts critical sentences perfectly (F1=1.0), with high F1 in the remaining case. caused a flip, we can build a map of the theory paragraph showing which sentences the model considers critical to a conclusion, a potentially first step to providing an explanation for the model's answers. Figure 8 shows two examples, the first showing where the model has perfectly identified the critical sentences, and the second where it has made some errors.

��

We can quantify this "explanatory" performance by measuring the per-proof scores of predicted vs. actual critical sentences for each question, measuring the precision, recall, and F1 scores for each question in turn. The (macro)average P/R/F1 scores are P=98.7, R=86.9, and F1=92.4, suggesting a high degree of reliability in predicting sentences critical to a proof. (This is essentially an alternative view on the earlier robustness data, viewed from a per-proof perspective). A histogram of the F1 scores is shown in Figure 9 , indicating perfect critical sentence identification for over 70% of the questions, and high F1 for the remaining questions. This suggests the model has some knowledge of the dependencies between the context sentences and a particular conclusion. Table 6 : Transformers (RoBERTa,BERT) are sufficient but not strictly necessary for this task, although other architectures (ESIM) do not score as well. DECOMP was run as a sanity check that the datasets are not trivially solvable -its low score (random baseline is 50%) suggests they are not.

4.6 Other Architectures

To what extent are our results specific to RoBERTa? To explore this, we also trained BERT and ESIM (an LSTM-based model for natural language inference) [Chen et al., 2016] on our datasets. Also, as a sanity check that our datasets are not trivially solvable, we ran the decompable attention model (DECOMP) on our data [Parikh et al., 2016] . The results are shown in Table 6 .

We observe that the strongest BERT model trained up to depth 3 (Mod3) masters the dataset that includes higher inference depths (DMax) with 95%+ accuracy, while ESIM's scores are lower (≈80%). Note that unlike RoBERTa and BERT, ESIM was not pre-trained on large amounts of text, perhaps contributing to its lower scores. This suggests that our results are not specific to RoBERTa or transformers, although transformers seem to learn the tasks more easily. As expected, DECOMP does not do well (a random baseline score is 50%), suggesting the datasets are not trivially solvable.

A related question is: how important is pretraining? To test this, we generated a version of the D≤3 dataset in which every word was (systematically) replaced by a random word, so that there was no grammaticality in the theories. After training (using the best hyperparameter settings we could find, using the dev partition), RoBERTa scores 83.3% on the test partition, substantially below the 99.3% on the original dataset in restricted English (Table 1 ). This suggests that the pretrained knowledge in RoBERTa is playing a role in its performance and making learning easier. Similarly the zero-shot transfer to hand-authored language, Figure 7 and Table 5 , suggests pretrained knowledge of language may be playing a role.

5 Discussion

We now discuss the bounds of the work we have conducted, broader research questions that it opens up, and opportunities for future work.

5.1 Scope Of Our Theories

While we have demonstrated that transformers can emulate a form of deductive reasoning, our demonstrations have been with small theory sizes (< 20 facts, < 10 rules), small domains (< 100 possible ground facts), and with a limited rule language (at most one variable that is universally quantified over) in propositional logic. 12 In addition, our training datasets have been generated under particular assumptions and distributions (e.g., that all entities to be queried will be mentioned at least once in the theory facts) that may not hold in general. While we have shown several data points suggesting that the trained model generalizes outside the training regimen, the full scope of its generalization is still unknown.

More generally, demonstrating reasoning with more complex theories and richer rule languages is still an open task. For example, we limited rules to have at most one variable (denoted by "someone"/"something"), because expressing multi-variable rules can become linguistically cumbersome (e.g., "If a person's father is a second person, and the second person's father is a third person, then the first person's grandfather is the third person."). Finding and extending the rule language to more linguistically palatable, multi-variable implications (e.g., allowing conciser rules such as "Someone's father's father is their grandfather.") is a possible next step.

5.2 Models Of Inference

We have shown that our trained model emulates reasoning, but which particular model of reasoning is it emulating? Our training data is generated using a particular semantic theory of inference, but alternative semantics are possible. For example, we make a closed-world assumption (CWA) that anything unproven is false, and (following the semantics of logic programs [Apt et al., 1988] ) we require true facts to be supported. 13 However, these are just design decisions, and other alternatives are possible. For example, we could have made an open world assumption that unproven statements have an unknown truth value, and trained the system using examples generated with a three-valued logic. Or, we could use an inference model that handles uncertainty, e.g., admiting linguistic modifiers such as "often", "sometimes", "a few", etc. and train a transformer with examples from that.

What exactly is the nature of the transformer's reasoning? Perhaps it is "simply" memorizing different canonical problem structures, although if so, our results suggest it can still use this information compositionally to solve novel problems at unseen depths (Table 1 ). Nevertheless, Section 4.3 also suggests that a trained model may have blind-spots, possibly due to unusual classes of problem unseen in training. If so, finding ways to characterize the different types of inference problems (e.g., defining canonical problem "signatures"), and designing training curricula to ensure they are systematically covered and/or the model is capable of generalizing to them, would be valuable.

5.3 Formal Theorem Proving

If a transformer can reliably emulate the results of correct reasoning, it may have utility for the formal reasoning commmunity. In particular, if its output generalizes to more complex problems than seen during training (as our experiments demonstrate for one particular setting, Table 1), one might be able to train on problems that are solvable with reasonable computing resources, but then apply it to more complex problems that are computationally infeasible, and still obtain high accuracy. Even if accuracy is not perfect, the results might help inform more rigorous theorem-proving. For example, one could imagine a transformer as a "proof assistant" that finds likely true conclusions that could then be verified by a formal reasoner (where verification is easier than proof from scratch). Our results in Section 4.5 suggest first steps to reconstructing proofs using the facts and rules in the theory. Similarly, a system that can identify facts that are likely true may help guide model generators, e.g., [Niemelä and Simons, 1997] .

Although we have presented theories in English, another open question is whether transformers could reason with theories expressed in their native (formal) form, rather than translated. Similarly, we have not investigated the transformer's behavior with inconsistent theories, again a potential avenue for further work.

Note that as we increase the complexity/size of the theory, we reach the limit on the number of tokens a transformer can handle (≤512). Thus handling larger theories might also require new transformer architectures, or new methods for fetching relevant rules to reason with from a larger knowledge base.

5.4 Natural Language Inference (Nli)

Our long-term goal is reasoning over full language, i.e., natural language inference (NLI). Although we have trained using restricted English, we want the model to still behave correctly if the knowledge is expressed in different ways. Our experiments with paraphrased rulebases suggests that indeed there is transfer from synthetic to more natural language (Table 5) , at least for deductive reasoning.

More generally, though, there are many natural language statements that do not neatly fit our rule language and its underlying representation (e.g., "Most birds fly", "It often rains in Seattle in winter."). While our experiments with paraphrased rulebases suggests the model may still draw the right conclusions, i.e., there seems to be some transfer, the results are likely to become less predictable as the semantics get further from those seen at training time. To apply our methodology to statements with more complex semantics would require new training data, either synthesized from a richer formal representation and model of inference, 14 or collected from people.

Similarly, the deductive model of inference that we have taught the system may still be quite far from that required for NLI. In particular, NLI allows for unsupported inferences that "a person would typically infer" [Dagan et al., 2013], while we have used a precise model of inference in which all of a rule's conditions need to be proven true in order for the conclusion to follow. For NLI, we would like our model to still proceed if there are gaps in the explicitly provided knowledge, providing the missing knowledge is "obvious" (and not contradicted by the explicitly provided facts), perhaps by leveraging its pretrained knowledge. Similarly, our model's treatment of negation as failure (NAF) sometimes clashes with intuitions about NLI, for example given (just) "If my car does not have gas then it is not working." our model will conclude (given nothing else) that "My car is not working." as it cannot prove that "My car has gas.". This raises a fundamental tension about the nature of NLI: We want reasoning to be rigorous (conclusions justified by the information provided), but also "soft" (tolerant of phrasing differences and commonsense knowledge gaps), and strictly speaking these two goals are in conflict. Our experiments with Turk-authored language illustrates tolerance of phrasing differences, which we view as desirable, although in a strict deductive sense it is unjustified to conclude (say) "A person is green" from "Charlie has green teeth" (Figure 7) . Similarly we would like the model to tolerate minor, unstated taxonomic gaps, for example given "Buildings have roofs" conclude "My house has a roof", even if "Houses are buildings" is not explicitly stated (but not conclude that result if it is explicitly stated that "Houses are not buildings"). 15 Char-acterizing which inferences should be deductive vs. which can be assumed in NLI, and how to control such flexibility, remain significant open questions. Regardless of how these issues are resolved, though, our experiments suggest that transformers may may be able to learn the target reasoning given appropriate training. 16

6 Conclusion

Can transformers emulate reasoning with explicitly stated rulebases? We have presented substantial evidence that they can, at least within the constrained environments explored in this paper. Within these environments, we have shown that transformers can reliably (95%+) replicate the i/o behavior of a formal reasoner, including on questions at greater inference depths than in training, independently hand-crafted problems, and problems expressed in more natural language. In addition, we have shown that they can, with high reliability (94.1% F1), identify which context sentences a provably true fact depends on, suggesting they have learned some model of fact dependency -i.e., a kind of inference chain -internally. The exact nature of the underlying neural inference remains an open question.

Regardless of mechanism, the ability to reason (or emulate reasoning) over rules expressed in language has potentially far-reaching implications. For example, rules might be retrieved from natural sources (e.g., science texts, Wikipedia), and if a system's answer is derivable from a subset of retrieved rules, then those rules form a valid explanation for the answer. Similarly, if the answer is wrong, the explanation can be traced to find the source of the error, and the bad knowledge potentially corrected. Finally, the mechanism opens the door to performing neural counterfactual reasoning, by providing counterfactual facts to reason with. For example, we can modify the earlier "birds" rulebase to describe a world in which birds typically don't fly, but where ostriches can fly, and see the consequences; or imagine a world in which plastic is a type of metal, and see how the conductivity of objects change (Appendix C). To encourage further progress, all our datasets and an interactive demo are available at http://rulereasoning.apps.allenai.org/.

Strictly speaking, the model is not a theorem prover as it only outputs the proof results, not the proof itself. In Section 4.5, we describe first steps to also recovering the chain of reasoning.

where conditions and conclusions are (possibly negated) literals using binary predicates.3 We only consider rulebases where all statements have a welldefined truth value, i.e., consistent, stratified rulebases.4 NLI is informally defined as making inferences from language that "a person would typically infer"[Dagan et al., 2013], and includes use of many linguistic forms, unstated background knowledge, and sometimes unsound inference steps.

Or with 20% probability, an entity, in order to include some fully grounded rules in the datasets.

For proven facts, we use the depth of the shallowest proof tree. For unproven facts, we use the depth of the shallowest branch of the proof tree that fails (or the deepest of the shallowest if there are multiple (failed) proofs).

If there are multiple, alternative proofs for f , we define a critical sentence as one that is used in all the proofs. To support tihs, we generate and record all possible proofs for each provable fact f during dataset generation.9 With negation, the definition of critical sentence becomes more complex because the the theory is non-monotonic (i.e., removing a sentence may cause a fact to become true). Hence, we omit theories with negation for this analysis.10 For questions that were flipped from "f ? False" to "Not f ? True" to encourage diversity, we undo the flips and just consider the positive forms of facts f .

https://www.doc.ic.ac.uk/∼mjs/teaching/KnowledgeRep491/ ExtendedLP 491-2x1.pdf, p5, also in Appendix A.

Although our rules include variables, they are universally quantified over finite domains, hence our theories can be reduced to propositional (variable-free) theories.

A fact is supported if it is either known, or is the conclusion of a rule whose body is true. Requiring support provides a mechanism for resolving ambiguities that would otherwise exist. For example, the theory {¬a → b} has three possible models {¬a, b}, {a, ¬b}, and {a, b}, but the latter two are disallowed by the semantics of logic programs as a is unsupported by the theory.

If one even exists -formal reasoning is still far from modeling all of natural language inference.15 Similarly, the hand-authored explanations for 4th Grade Sci-

https://www.doc.ic.ac.uk/∼mjs/teaching/KnowledgeRep491/ ExtendedLP 491-2x1.pdf, p5

Live demo available at http://rule-reasoning.apps.allenai.org/