
Enriching a Model's Notion of Belief using a Persistent Memory


Abstract

Although pretrained language models (PTLMs) have been shown to contain significant amounts of world knowledge, they can still produce inconsistent answers to questions when probed, even after using specialized training techniques to reduce inconsistency. As a result, it can be hard to identify what the model actually "believes" about the world. Our goal is to reduce this problem, so systems are more globally consistent and accurate in their answers. Our approach is to add a memory component, a BeliefBank, that records a model's answers, and two mechanisms that use it to improve consistency among beliefs. First, a reasoning component (a weighted SAT solver) improves consistency by flipping answers that significantly clash with others. Second, a feedback component re-queries the model, but using known beliefs as context. We show that, in a controlled experimental setting, these two mechanisms improve both accuracy and consistency. This is significant as it is a first step towards endowing models with an evolving memory, allowing them to construct a more coherent picture of the world.1

1 Introduction

How might we ascribe a notion of belief to a model? Prior work has shown that, while pretrained language models (PTLMs) contain substantial world knowledge (Petroni et al., 2019; Roberts et al., 2020) , their answers to probing questions can be inconsistent (Elazar et al., 2021; Ravichander et al., 2020; Kassner and Schütze, 2020) , making it hard to pin down what a model actually "believes" about a proposition. Our goal is to reduce this problem by having systems provide more globally consistent answers to questions.

Prior work on reducing inconsistency has focused on retraining the model itself to be more consistent in its answers, e.g., (Ribeiro et al., 2019; Li et al., 2019), but with imperfect results. We present an alternative approach in which the model is unchanged, but an evolving, persistent memory of beliefs, called the BeliefBank, is layered on top, and two mechanisms use it to improve consistency among beliefs. First, a reasoning component, a weighted SAT (satisfiability) solver, flips beliefs that clash with others. Second, a feedback component re-queries the model, but using known beliefs as context, aiming for more accurate and consistent answers. Thus, the overall system attempts to build a more coherent representation of the world from the model's raw answers, by ruminating on the answers seen so far. This can also be viewed as assembling a simple "mental model" of the world (Johnson-Laird, 1983) from the noisy output of a raw PTLM.

Figure 1: The proposed architecture. The model's raw answers are stored in a persistent memory (BeliefBank), and two mechanisms attempt to improve them: (a) A constraint solver flips beliefs that clash significantly with others; (b) A feedback mechanism re-queries the model about those beliefs, but uses other, relevant beliefs as the query context. We find both consistency and accuracy of the overall system improve.

We explore this in a controlled experimental setting where both candidate facts and constraints are provided. Candidate facts are simple sentences that may be true or false, e.g., "An eagle is a bird" (T), "An eagle is a mammal" (F). Constraints are between (variabilized) facts, e.g., "X is a bird → X has wings". These allow us both to probe and measure improvement in the system's consistency and accuracy.

We make the following contributions:

1. We augment a PTLM with a global memory, the BeliefBank, and show how it can be leveraged to produce more globally consistent answers. Specifically, we compare two mechanisms for using the BeliefBank, a reasoning and a feedback mechanism, and demonstrate that both can improve system accuracy and consistency in a restricted setting.
2. We contribute a controlled dataset to measure a PTLM's consistency against given constraints.
3. We provide an analysis of the failure modes and directions for future work.

These are significant as they enrich a model's notion of "belief", helping it construct a more coherent picture of the world.

2 Related Work

PTLMs are known to contain extensive world knowledge (Petroni et al., 2019; Roberts et al., 2020 ), yet be inconsistent in their answers to probing questions (Ettinger, 2020; Davison et al., 2019; Kassner and Schütze, 2020; Ravichander et al., 2020; Elazar et al., 2021) . While there has been some prior work on improving answer consistency, the primary approach has been through modified model training. Ribeiro et al. (2019) improved consistency by adding question paraphrases and question implications to the training data (data augmentation). Others have trained models with (small) sets of examples with known constraints between them, and included an additional loss term reflecting inconsistency among set members during training (Minervini and Riedel, 2018; Li et al., 2019; Asai and Hajishirzi, 2020) . However, the constraints are unused at test time (beyond what the model may have internalized), and inconsistent answers are still produced.

For problems requiring a structured answer, e.g., predicting a sequence of state changes, domain-specific constraints have been used to downgrade/block answers that violate them (Tandon et al., 2018; Du et al., 2019). This encourages consistency within a single answer structure, but not among different answers, which is our goal here.

In the area of knowledge graph construction, Pujara et al. (2013) define "knowledge graph identification" as the task of building a maximally consistent knowledge graph given noisy facts, their extraction confidences, and ontological constraints between them. They develop a solution using probabilistic soft logic (PSL) (Broecheler et al., 2010) as their constraint reasoner. Similarly, Berant et al. (2010) learn the globally optimal set of entailments from a large database of candidate entailment pairs (with associated confidences), by applying a global transitivity constraint (if X entails Y and Y entails Z, then X entails Z) using Integer Linear Programming. In our case, we follow similar ideas but show how they can be usefully applied to the noisy predictions of a PTLM. Specifically, we formulate the task as a weighted SAT (satisfiability) problem, and use the SMT solver Z3 (De Moura and Bjørner, 2008) to solve it (Section 4.2.1).

In the area of formal knowledge-bases (KBs), efficient algorithms have been developed for detecting, measuring, and resolving inconsistency (Hansen and Jaumard, 2000; Andersen and Pretolani, 2001; Thimm, 2009; Muiño, 2011; Thimm, 2013) . Our contribution is to leverage some of these methods for PTLMs, adding a reasoning capability that the PTLMs alone lack.

An important part of our contribution is the use of a dynamic, persistent memory. While there are neural architectures that include an associated memory, e.g., (Henaff et al., 2016; Sukhbaatar et al., 2015) , these components typically play the role of a short-term working memory to help computation. In contrast, our BeliefBank memory layer is a persistent, long-term memory of explicit beliefs.

Finally, our feedback mechanism uses old answers to help answer new questions. This builds on prior work such as Self-Talk (Shwartz et al., 2020) , where a model asks itself related questions to help with new answers. In our case, feedback is selected from a global BeliefBank, rather than generated with templated subqueries, potentially allowing more control over feedback selection.

3.1 Beliefs

What does it mean to believe a proposition, say p = eagles are birds? In general, a system can be said to (appear to) believe something if it acts as if it were true. In the specific context of a QA system, we would expect it to produce answers consistent with p (and its other beliefs). Pragmatically, we expect the system to (a) give a consistent answer to different paraphrases of the question "p?" ("Are eagles birds?", "Is an eagle a type of bird?", ...), and (b) give correct answers about implications of p ("Eagles lay eggs", "Eagles have feathers", ...). Of course, a system may not perfectly answer such implications, as the implications may have exceptions, or the system may not be a perfect reasoner.2 Thus, to the external observer, there are degrees to which a system acts as if it believes something.

3.2 Task Definition

Our goal is to ascribe a stronger notion of "belief" to a system that includes a model M, by improving the consistency and accuracy of its answers (compared with M). To measure this we consider a true/false probing task, where we are also given a set of constraints between answers:

Given:

• a set of sentences S
• a set of constraints C(S) between (the truth values of) sentences in S, each annotated with a weight w_i (a penalty w_i is applied if c_i ∈ C(S) is violated)
• a model M that takes as input a True/False natural language (NL) question Q and optionally an (NL) context X, and predicts a True/False answer A with confidence score F

Predict:

• the True/False labels for S, so as to maximally improve accuracy (with respect to gold labels) and consistency (minimize total penalties of constraint violations) compared with model M's raw answers

4 Approach

Our approach is to add a memory layer, called the BeliefBank, on top of the model to globally track beliefs. Two mechanisms are then used to modify BeliefBank beliefs, namely (a) constraint reasoning and (b) re-asking queries augmented with feedback from the BeliefBank.

Let

• a belief b_i be a triple (s_i, l_i, w_i), where
  - s_i is a sentence ∈ S,
  - label l_i ∈ {T, F} denotes whether s_i is believed true or false,
  - w_i denotes the strength of that belief.
  For example, ("a poodle is a dog", T, 0.9) denotes the belief (strength 0.9) that "a poodle is a dog" is a true statement (T).
• a BeliefBank B(S) = a set of beliefs over assertions S = s_1, ..., s_n
• a constraint c_i = a 4-tuple of the form (s_i → s_j, l_j, w_i), where
  - s_i, s_j are sentences ∈ S,
  - l_j ∈ {T, F} denotes the expected truth of s_j if s_i is true,
  - w_i denotes the strength of that expectation (a penalty w_i is applied if it is violated).

For convenience, we allow a shared variable X to be used in s i , s j , allowing a set of grounded constraints to be expressed in a single statement, e.g.

("X is a dog" → "X has a tail", T, 0.8) expresses that if something is a dog, then it should (T) have a tail, with a penalty of 0.8 applied if it does not. Mutual exclusivity is expressed using two rules, e.g., that fish and birds are mutually exclusive is expressed:

("X is a bird" → "X is a fish", F, 1.0) ("X is a fish" → "X is a bird", F, 1.0) where "F" indicates the conclusion should be false if the condition is true.

• A constraint graph C(S) = a set of constraints c_i over assertions S

Given a set of beliefs B(S) about S and a set of constraints C(S), we measure consistency using (the complement of) Li et al. (2019)'s conditional constraint violation (τ) metric, namely the fraction of constraints whose condition s_i is believed true, but whose conclusion (that s_j has truth value l_j, denoted l_j.s_j) is not. In other words, over all constraints c_i ∈ C(S), inconsistency τ is

τ = |{ c_i | ¬(s_i → l_j.s_j) }| / |{ c_i | s_i }|

i.e., the size of the set of violated constraints (s_i → l_j.s_j is false) divided by the size of the set of applicable constraints (i.e., those where the condition s_i is true). We then define:

consistency = 1 - τ
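To make these definitions concrete, the following is a minimal sketch (our illustration, not the authors' code) of the belief and constraint representations and of the τ computation above; the sentences, labels, and weights are invented examples.

```python
# Minimal sketch (not the authors' code) of the BeliefBank data structures
# and the consistency metric defined above. Sentences and weights are
# illustrative examples only.
from typing import Dict, List, NamedTuple

class Belief(NamedTuple):
    sentence: str
    label: bool      # True = believed true, False = believed false
    weight: float    # strength of the belief

class Constraint(NamedTuple):
    premise: str     # s_i
    conclusion: str  # s_j
    label: bool      # l_j: expected truth of s_j when s_i is true
    weight: float    # penalty w_i if violated

def ground(constraint: Constraint, entity: str) -> Constraint:
    """Replace the shared variable X with a concrete entity."""
    return Constraint(constraint.premise.replace("X", entity),
                      constraint.conclusion.replace("X", entity),
                      constraint.label, constraint.weight)

def inconsistency(beliefs: List[Belief], constraints: List[Constraint]) -> float:
    """tau = |violated constraints| / |applicable constraints|."""
    truth: Dict[str, bool] = {b.sentence: b.label for b in beliefs}
    applicable = [c for c in constraints if truth.get(c.premise) is True]
    # A missing conclusion is simply counted as a violation here; in the
    # paper's setting every probed sentence has an answer in the BeliefBank.
    violated = [c for c in applicable if truth.get(c.conclusion) != c.label]
    return len(violated) / len(applicable) if applicable else 0.0

# Example: the BeliefBank thinks a poodle is a dog but has no tail.
beliefs = [Belief("a poodle is a dog", True, 0.9),
           Belief("a poodle has a tail", False, 0.6)]
rule = ground(Constraint("X is a dog", "X has a tail", True, 0.8), "a poodle")
tau = inconsistency(beliefs, [rule])
print("consistency =", 1 - tau)   # 0.0: the single applicable rule is violated
```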

Figure 2: Simplified illustration of iteratively improving the BeliefBank (oval). +/- denote true/false predictions for 10 facts about swallows. The model alone makes 4 prediction errors (M). Re-querying for those beliefs using other selected beliefs as context (feedback) fixes 2 of those errors (F). Running constraint solving on the updated BeliefBank fixes another error (C), resulting in just 1 error in the final BeliefBank. Here, the sequence is Model → Feedback → Constraints.


4.2 Methods

We evaluate two methods for improving the BeliefBank's accuracy and consistency:

Constraint solving: Given a model M's answers (with confidences) in the BeliefBank, a constraint solver seeks to reduce constraint violations by potentially flipping answers that maximally clash with other answers.

Feedback: Beliefs are checked by re-querying the model, but additionally using relevant other beliefs as context for those (re-)queries.

Figure 1 shows these components, and Figure 2 illustrates how they can iteratively improve the BeliefBank. In Figure 2, there are 10 beliefs of interest about swallows. When probed about these, the raw model gets 4 of these wrong, including inconsistently believing that a swallow is both a bird and a mammal (see (M) in Figure 2). Applying feedback, we re-ask the model the 10 questions but adding in the most relevant known beliefs as context for those (re-)queries. The BeliefBank is updated with these new answers, fixing 2 of the errors (F). We then run the constraint solver over the BeliefBank, resulting in one answer being flipped (question 9, from T to F), thus fixing another error (see (C)). The final BeliefBank has just 1 error and greater consistency. Note that updates can also introduce new errors (not shown in Figure 2), so improvement is not guaranteed.
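The Model → Feedback → Constraints sequence of Figure 2 can be sketched as a simple driver loop. Everything below is a hypothetical outline, not the authors' code: `ask_model`, `select_feedback`, and `run_constraint_solver` are placeholder names standing in for MAQAw and the components described in the following subsections, stubbed out here so the sketch runs end to end.

```python
# Hypothetical driver for the Model -> Feedback -> Constraints sequence in
# Figure 2. The helper functions are stubs standing in for the real
# components (MAQAw, feedback selection, weighted SAT solving).
from typing import Dict, List, Optional, Tuple

def ask_model(question: str, context: Optional[List[str]] = None) -> Tuple[bool, float]:
    return True, 0.5          # stub: MAQAw would return (answer, confidence)

def select_feedback(question: str, bank: Dict[str, Tuple[bool, float]]) -> List[str]:
    return []                 # stub: pick up to 3 relevant beliefs (Section 4.2.2)

def run_constraint_solver(bank: Dict[str, Tuple[bool, float]]) -> Dict[str, Tuple[bool, float]]:
    return bank               # stub: weighted SAT solving (Section 4.2.1)

def build_beliefbank(questions: List[str]) -> Dict[str, Tuple[bool, float]]:
    bank = {q: ask_model(q) for q in questions}                  # (M) raw answers
    for q in questions:                                          # (F) feedback pass
        bank[q] = ask_model(q, context=select_feedback(q, bank))
    return run_constraint_solver(bank)                           # (C) constraint pass

print(build_beliefbank(["Is a swallow a bird?", "Does a swallow have gills?"]))
```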

4.2.1 Constraint Solving

Given a set of beliefs and constraints, the constraint solver has two competing objectives: (a) flip beliefs so as to minimize constraint violations; (b) don't flip beliefs, so as to preserve the model's raw beliefs, i.e., minimize conflict between the model and the BeliefBank. To implement this tradeoff, the model's beliefs are themselves treated as just another constraint, e.g., the belief "a poodle is a dog" (weight 0.9) is treated as a constraint "a poodle is a dog", with penalty 0.9 if it is violated (i.e., labeled as false by the constraint solver). To balance the two objectives (a) and (b), the belief weights are scaled by a learned hyper-parameter λ, trained on a calibration part of our dataset, disjoint from the data then used in experiments (Section 5).

To implement constraint solving, we translate the task into a weighted SAT (satisfiability) problem P, for which efficient algorithms with guarantees exist. Each belief becomes a weighted assertion in P, e.g., the belief ("a poodle is a dog", T, 0.9) is expressed in SAT syntax as:3

0.9 "a poodle is a dog"

while the constraint ("a poodle is a dog" → "a poodle has a tail", T, 0.8) is expressed as:

0.8 "a poodle has a tail" -"a poodle is a dog"

(literally: "a poodle has a tail" OR NOT ("-") "a poodle is a dog"). We then apply the solver Z3 (De Moura and Bjørner, 2008) to P, which outputs a set of truth assignments for all individual sentences in P so as to minimize the weighted sum of violations. If the truth assignment of any sentence has changed, the BeliefBank is correspondingly updated.
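As a concrete illustration, the same weighted encoding can be reproduced with Z3's Python optimization API (Optimize with soft constraints). This is our sketch under stated assumptions, not the authors' implementation: the beliefs, weights, λ value, and integer weight scaling are illustrative choices.

```python
# Minimal sketch (not the authors' code) of casting belief + constraint
# reconciliation as weighted MaxSAT using Z3's Optimize API. Sentences,
# confidences, LAMBDA, and SCALE are illustrative assumptions.
from z3 import Bool, Implies, Not, Optimize, is_true

# Model beliefs: (sentence, believed label, confidence)
beliefs = [
    ("a poodle is a dog", True, 0.9),
    ("a poodle has a tail", False, 0.6),   # an (incorrect) raw answer
]
# Grounded constraints: (premise, conclusion, expected label, weight)
constraints = [
    ("a poodle is a dog", "a poodle has a tail", True, 0.8),
]

LAMBDA = 1.0   # hypothetical belief-vs-constraint scaling factor
SCALE = 100    # soft-constraint weights are easiest to pass as integers

opt = Optimize()
var = {}       # one Boolean variable per sentence

def v(sentence):
    if sentence not in var:
        var[sentence] = Bool(sentence)
    return var[sentence]

# Each belief is itself a soft clause: penalty = lambda * confidence.
for s, label, conf in beliefs:
    lit = v(s) if label else Not(v(s))
    opt.add_soft(lit, int(LAMBDA * conf * SCALE))

# Each constraint (s_i -> l_j.s_j) is a soft clause with its own penalty.
for premise, conclusion, label, w in constraints:
    concl = v(conclusion) if label else Not(v(conclusion))
    opt.add_soft(Implies(v(premise), concl), int(w * SCALE))

opt.check()
model = opt.model()
for s, x in var.items():
    print(s, "->", is_true(model.evaluate(x, model_completion=True)))
# Here the solver flips the weaker "no tail" belief (cost 60) rather than
# violating the belief "a poodle is a dog" (90) or the constraint (80).
```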

4.2.2 Feedback

Feedback involves re-asking the model a question, but with the benefit of knowing answers to related questions. To use these answers in the re-query, selected beliefs are added to the query context before re-asking the model. (Note that the selected beliefs are not guaranteed to be correct, of course.) Our conjecture is that if the model is explicitly reminded of relevant beliefs when answering a new question, it will answer the question more accurately and consistently. For example, in Figure 1, when asked "Do swallows have gills?", our model M incorrectly answers "yes". But if reminded that swallows are not fish, by asking: "CONTEXT Swallows are not fish. QUERY Do swallows have gills?", the model now correctly answers "no".

We evaluate two policies for choosing which beliefs to feed back to M when re-asking question Q:

1. randomly selected from the BeliefBank

2. most relevant, using the constraint graph to identify relevance. As the constraint graph captures potential clashes that the answer to Q could cause, we use the graph to select those beliefs that would be most affected by that answer. For example, if the current query is "Is a poodle an animal?", the constraint graph identifies potential clashes that would occur if the model answered "yes", and also clashes if it answered "no". In this example, if the model answered "no", the resulting belief ("a poodle is not an animal") would strongly clash with other beliefs "A poodle is a dog.", "A poodle is a mammal.", and "A poodle is a domesticated canine.", so all three are strong candidates for the context. We select the three overall strongest clashing beliefs found in this way, considering both a "yes" and a "no" answer to Q.

In both cases, three (if possible) beliefs are selected, a number empirically found to be most effective. (For experiments with small amounts of data, sometimes fewer than three relevant beliefs are found.)
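The constraint-guided policy can be sketched as scoring each stored belief by how strongly it would clash with a hypothetical "yes" or "no" answer to the query, then prepending the top-scoring beliefs as context. The code below is our illustration, not the authors' implementation; the sentences, constraints, prompt format, and the deliberately crude string-based negation are all invented for the example.

```python
# Minimal sketch (not the authors' code) of constraint-guided feedback:
# score how strongly each stored belief would clash with a hypothetical
# "yes" or "no" answer to the query, keep the top three, and prepend them
# to the query as context.
from typing import Dict, List, Tuple

# Grounded constraints: (premise, conclusion, expected_label, weight)
Constraint = Tuple[str, str, bool, float]

def clash_weight(query: str, answer: bool, belief_sentence: str,
                 belief_label: bool, constraints: List[Constraint]) -> float:
    """Total penalty incurred between a candidate answer and one stored belief."""
    total = 0.0
    for premise, conclusion, expected, w in constraints:
        # Constraint runs from the query sentence to the stored belief ...
        if premise == query and conclusion == belief_sentence:
            if answer and belief_label is not expected:
                total += w
        # ... or from the stored belief to the query sentence.
        if premise == belief_sentence and conclusion == query:
            if belief_label and answer is not expected:
                total += w
    return total

def select_feedback(query: str, bank: Dict[str, bool],
                    constraints: List[Constraint], k: int = 3) -> List[str]:
    scored = []
    for sentence, label in bank.items():
        score = max(clash_weight(query, True, sentence, label, constraints),
                    clash_weight(query, False, sentence, label, constraints))
        if score > 0:
            # Crude negation for illustration only.
            text = sentence if label else sentence.replace(" is ", " is not ")
            scored.append((score, text))
    return [text for _, text in sorted(scored, reverse=True)[:k]]

bank = {"a swallow is a fish": False, "a swallow is a bird": True}
constraints = [("a swallow is a fish", "a swallow has gills", True, 0.9),
               ("a swallow has gills", "a swallow is a fish", True, 0.8)]
context = select_feedback("a swallow has gills", bank, constraints)
print("CONTEXT " + " ".join(context) + " QUERY Does a swallow have gills?")
```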

5 Dataset

We create a dataset to test our approach in a controlled way, allowing us to perform systematic experiments to evaluate behavior. The dataset contains two parts, constraints and facts, defined over simple sentences such as "a swallow is a bird".

5.1 Constraints

The dataset contains two kinds of constraints:

positive implications (conclusion truth value l_i = True), e.g.,
  "X is a dog" → [TRUE] "X has a tail."

mutual exclusivities, expressed as a pair of constraints with l_i = False, e.g.,
  "X is a dog" → [l=FALSE] "X is a bird."
  "X is a bird" → [l=FALSE] "X is a dog."

which together express that an entity cannot be both a dog and a bird at the same time.

Positive implications were manually gathered from ConceptNet (Speer et al., 2017). First, we identified 176 general concepts of interest, e.g., "mammal", choosing concepts with high occurrence (> 100 times) in ConceptNet, avoiding significantly ambiguous terms (e.g., "bat"), and filtering out plurals and obscure concepts. For these entities, we then collected all ConceptNet facts involving 6 relations: IsA, HasA, MadeOf, PartOf, HasProperty, and CapableOf, and re-expressed them as constraints. For example, the ConceptNet triple (dog, HasA, tail) gives rise to the constraint "X is a dog" → "X has a tail." (Triples are converted into English sentences using simple templates.) We then manually filter these constraints for factual correctness. We also add the constraint in the backwards direction, "X has a tail" → "X is a dog", expecting these to have lower weight, i.e., be weaker. (These backwards rules discourage the trivial solution that everything is false.) Finally, weights are assigned to all the constraints using a combination of crowdsourcing and calibration, described shortly (Section 5.2).
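The conversion from ConceptNet triples to forward and backward constraints can be sketched with simple per-relation templates. The template wordings and the backward weight below are our illustrative assumptions, not the authors' exact values.

```python
# Illustrative sketch of turning ConceptNet triples into forward and backward
# constraints via per-relation templates. Wordings and the backward weight
# are assumptions, not the authors' exact choices.
TEMPLATES = {
    "IsA":         "X is a {}",
    "HasA":        "X has a {}",
    "MadeOf":      "X is made of {}",
    "PartOf":      "X is part of a {}",
    "HasProperty": "X is {}",
    "CapableOf":   "X can {}",
}

def triple_to_constraints(subj, relation, obj, fwd_weight, bwd_weight=0.1):
    premise = TEMPLATES["IsA"].format(subj)          # "X is a dog"
    conclusion = TEMPLATES[relation].format(obj)     # "X has a tail"
    return [
        (premise, conclusion, True, fwd_weight),     # forward rule
        (conclusion, premise, True, bwd_weight),     # weaker backward rule
    ]

print(triple_to_constraints("dog", "HasA", "tail", fwd_weight=0.8))
# [('X is a dog', 'X has a tail', True, 0.8), ('X has a tail', 'X is a dog', True, 0.1)]
```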

Mutual exclusivities were gathered from the "isa" taxonomies in ConceptNet and WordNet (Fellbaum, 2005) , using the approximation that (generally) siblings in the noun hierarchy are mutually exclusive. Thus, for any pair of concepts in our concepts of interest that reside in different taxonomic subtrees, we add a mutual exclusivity constraint (using two constraint rules).

We collected 12,147 constraints in this fashion (1798 implications, 8599 mutual exclusivities). The set of constraints can be viewed as a constraint graph of implications from one sentence to another.

5.2 Constraint Weights

We used a combination of crowdsourcing and calibration to assign a reasonable constraint weight w_i to each constraint. Workers were shown each constraint and asked to judge if the implication held always/usually/sometimes/rarely/never, with raw scores 4/3/2/1/0. Three workers independently scored each constraint, and the scores were averaged.

When used with a particular model M , the raw scores need to be recalibrated to appropriately match the confidences output by that model. This is described shortly in Section 6.2.

5.3 Facts

We also collect a set of truth-labeled facts about different entities, relevant to the constraints. To do this, we select a new entity, e.g., "poodle", that is a member of one of our general concepts, e.g., "dog", then instantiate the constraint graph with that entity (i.e., set X = "poodle"). We then identify the leaf (source) nodes of that graph, just considering forward implication rules, i.e., find facts not implied by other facts in the graph, and manually annotate their True/False labels. We then use the implications to infer other True/False labels for other sentences, i.e., propagate the annotated labels through the graph. This provides "silver" labels for sentences reachable in this way (a subset of all the sentences in the graph), silver because the implications are soft, hence not guaranteed to hold for all entities.
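Label propagation through the grounded constraint graph can be sketched as a breadth-first pass from the annotated leaf facts over the implications. The version below is a minimal sketch of that idea, assuming grounded (premise, conclusion, expected label) edges; the example sentences are invented.

```python
# Minimal sketch (not the authors' code) of propagating annotated leaf labels
# through grounded implications to obtain "silver" labels.
from collections import deque

def propagate(leaf_labels, implications):
    """leaf_labels: {sentence: True/False}; implications: (premise, conclusion, l_j)."""
    labels = dict(leaf_labels)
    queue = deque(s for s, lab in leaf_labels.items() if lab)   # only true premises fire
    while queue:
        s = queue.popleft()
        for premise, conclusion, l_j in implications:
            if premise == s and conclusion not in labels:
                labels[conclusion] = l_j        # implied truth value for s_j
                if l_j:
                    queue.append(conclusion)    # true conclusions can fire further rules
    return labels

leaves = {"a poodle is a dog": True}
rules = [("a poodle is a dog", "a poodle is a mammal", True),
         ("a poodle is a mammal", "a poodle is a bird", False)]
print(propagate(leaves, rules))
# {'a poodle is a dog': True, 'a poodle is a mammal': True, 'a poodle is a bird': False}
```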

We repeat this for 27 entities (17 animals, 10 plants), resulting in a final dataset containing 4998 "silver" facts (sentences + True/False labels).

6 Experiment Environment

6.1 Model

The fixed model M that we use for our experiments is the multi-angle QA system MAQAw (Bhakthavatsalam et al., 2021). MAQAw is a derivative of UnifiedQA, a state-of-the-art T5 model fine-tuned on ≈400k question-answer pairs (Khashabi et al., 2020). MAQAw was then further fine-tuned on several thousand science questions and trained on different permutations of inputs and outputs (e.g., query Q + context X → answer A; XA → Q; etc.). To query the model's beliefs we pose the query and let the model choose between the two answer options "yes" and "no". MAQAw also outputs an answer confidence, used as the belief weight. Note that we do not retrain MAQAw for this work; rather, it is used as a black-box QA module in the broader system (Figure 1).

6.2 Calibration

To appropriately mix MAQAw's confidences and the constraint scores, a calibration step is needed. To do this, we use the 1239 facts about seven entities in our dataset as a calibration set, and then perform experiments using (only) the remaining facts.

We calibrate the crowdsourced raw constraint scores using sigmoid-scaling (two parameters). We also calibrate the relative weight λ of the model's beliefs and the constraints, the relative weight of the backward implications (compared with the forward), and the relative weight of mutual exclusivity rules. To do this we perform a grid search over these parameters, finding the values that result in the highest F1 (accuracy) after running the constraint solver over the raw model's beliefs about these facts.
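The calibration step can be sketched as a sigmoid rescaling of the crowdsourced scores plus a grid search over the remaining relative weights. The parameter grids and the helper `solve_and_score` below are illustrative placeholders, not the authors' actual values or code.

```python
# Illustrative sketch of the calibration step: sigmoid-scale the raw
# crowdsourced constraint scores (0-4), then grid-search the relative
# weights on the calibration facts. The grids and `solve_and_score`
# (run the constraint solver, return F1) are assumptions.
import itertools
import math

def sigmoid_scale(raw_score: float, a: float, b: float) -> float:
    """Map a raw 0-4 crowd score to a (0, 1) constraint weight."""
    return 1.0 / (1.0 + math.exp(-(a * raw_score + b)))

def calibrate(calibration_beliefs, raw_constraints, solve_and_score):
    best_params, best_f1 = None, -1.0
    grids = (
        [0.5, 1.0, 2.0],    # a: sigmoid slope
        [-4.0, -2.0, 0.0],  # b: sigmoid offset
        [0.5, 1.0, 2.0],    # lambda: belief vs. constraint scaling
        [0.1, 0.3],         # backward-implication weight multiplier
        [0.5, 1.0],         # mutual-exclusivity weight multiplier
    )
    for a, b, lam, bwd, mutex in itertools.product(*grids):
        constraints = [(c, sigmoid_scale(score, a, b))
                       for c, score in raw_constraints]
        f1 = solve_and_score(calibration_beliefs, constraints, lam, bwd, mutex)
        if f1 > best_f1:
            best_params, best_f1 = (a, b, lam, bwd, mutex), f1
    return best_params
```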

7 Experiments

We evaluate accuracy (F1)4 and consistency (1 - τ, Section 4.1) for 3 different-sized, randomly selected subsets of our data: 10%, 50%, 100%. We average results across entities. The results, shown in Table 1, give the model's a priori accuracy and consistency (line 1), and the effects of constraint-solving, feedback, or both (lines 2-5).

Table 1: Results after seeing different proportions of data. model (raw) are scores for the model stand-alone. In model + constraint-solving, the constraint solver is run over all the raw answers so far. In model + feedback, questions are re-asked using context selected from the other raw answers so far. In model + feedback + constraint-solving, the constraint solver is additionally run on the answers with feedback.

7.1 Results

Several conclusions can be drawn:

1. The basic model is both imprecise and inconsistent, with F1 ≈ 73% and consistency ≈ 75%. This reflects similar observations of model inconsistency made for other PTLMs (Section 2), and illustrates the challenge we wish to address. As we show shortly, the main source of error is in precision, rather than recall, i.e., the large majority of errors are false positives.

2. Constraint-solving removes almost all inconsistency with respect to our constraints, even with only a subset of the data. (Of course, with less data, there are fewer applicable constraints that can be violated.) It also significantly improves accuracy (≈+10% F1), indicating that the flipped truth values not only reduce clashes but better align with the true (gold) labels. Note that improvement in accuracy is not guaranteed: constraint-solving can (and sometimes does) flip truth values the wrong way, in order to reduce constraint violations. The improvement suggests that the model is getting enough answers correct, a priori, to steer the constraint-solving in the right direction. Comparing accuracy after seeing 10%, 50% and 100% of the data, we note incremental gains the more beliefs and constraints the constraint-solver is able to leverage.

3. Feedback (both types) also improves both accuracy and consistency, although not to the same extent as constraint-solving. It is perhaps surprising that explicitly reminding the model of facts it already believes, when answering a new question, can improve the results. The context may be behaving loosely like an attention mechanism, encouraging the model to focus on the facts provided in the context, even if they are already latently known, thus influencing question-answering behavior. Similar effects have been observed by Shwartz et al. (2020).

4. Constraint-guided feedback is more effective than random feedback when the BeliefBank is large (100% column), with F1 rising to 88% and consistency to 91%. Conversely, randomly selected feedback performs similarly across different data sizes. We analyze this further in Section 7.2.

5. Combining constraint-guided feedback with constraint solving further improves accuracy by ≈5 percentage points compared with only using feedback, and by ≈2 percentage points compared with the constraint-solver alone. Again, the addition of the constraint-solver results in almost perfect consistency of 99%. This setup reflects the incremental setup depicted in Figure 2, where the constraint-solver can leverage the updated BeliefBank.

7.2 Analysis

Both constraint-solving and feedback can cause the overall system to flip beliefs, with the goal that the overall system's beliefs (in the BeliefBank) are more accurate and consistent than those of the underlying model. However, both mechanisms can also flip beliefs in the wrong direction. We provide some analysis of both good and bad flips here.

The System Behaving as Desired: As an illustration of desired behavior, the raw model incorrectly believes that a pine is both a plant (correct) and a vertebrate (incorrect), when queried. However, this violates a mutual exclusivity rule, so the constraint-solver considers flipping one of these.

Flipping "pine is a plant" from T to F would result in numerous other violations, e.g., "pine is a tree" (which the model also believes) would become violated. As a result, it prefers to (correctly) disbelieve "pine is a vertebrate", improving both the accuracy and consistency of the BeliefBank.

Types of Error:

From an analysis of the data, we see that the majority of the raw model errors are false positives -the MAQAw model generally answers (almost) all the positive facts correctly (recall is ≈97%), but mistakenly thinks some negative facts are also true (precision is ≈60%). These false positives are typically rather unusual facts, e.g., "A poodle is a bathroom." (MAQAw's answer: True).

It is unsurprising that the model knows most of the positive facts, as they are simple statements about common entities ("eagles can fly"), likely seen in pre-training. However, the fact that the model makes (what a person would view as) catastrophic errors when asked more unusual questions, e.g., believing that "a poodle is made of fermented milk and bacteria", reveals that the PTLM's grasp of the world is still incomplete and problematic. The constraint mechanism proposed here essentially asks the model to think about its answers and their consequences, so that it can spot problems that the PTLM alone does not see, and repair them accordingly.

Sensitivity to Constraints and Weights: The constraint reasoner also makes mistakes, sometimes flipping things the wrong way so as to improve consistency, at the expense of accuracy. For example, the raw model correctly believes that a "a rat is not a cat". However, the constraint solver then (incorrectly) flips this to "a rat is a cat". It does this as there are multiple constraint rules weakly suggesting rats are cats given the model's other beliefs ("rats catch mice", "rats have tails", "rats have fur",...), which together add up. However, the model also (correctly) believes "a rat is not a feline", and there is a constraint that "a cat is a feline", so in principle this should prevent the belief "a rat is a cat". In practice, though, the constraint "a cat is a feline" does not have infinite weight, so here the constraint mechanism allows it to be violated, allowing the wrong conclusion ("a rat is a cat") to be drawn. Of course, one could increase the weight on the "a cat is a feline" constraint to solve this particular problem, or add new constraint rules like "cats are larger than squirrels". In general, though, the effectiveness of constraint reasoning is sensitive to both the number and weights on the constraints. Although we have used automated techniques to recalibrate the various weights (Section 6.2), the system's behavior remains sensitive to them. Finding more effective ways to discover and appropriately tune constraints remains an open problem.

Ambiguity: Although we have avoided obvious cases of ambiguity, e.g., reasoning about "bats", there are smaller cases of ambiguity that may explain some of the raw model's errors. For example, although the model (incorrectly) believes "a swallow is a fish", there is a fish called "a black swallower" that it may have seen in pre-training, confusing it. Similarly, some unusual statements such as "A penguin is a protein" (gold label is False, model believes is True) are partially ambiguous, possibly contributing to the discrepancy.

Does the Model Know the Constraints? Finally, we are checking consistency using externally defined constraints, which the raw model may not itself be aware of (i.e., may not have acquired in pre-training). For example, although the model may appear to be internally inconsistent if it thinks a swallow is both a bird and a fish, this is only true if it also knows these beliefs are mutually exclusive. To check for internal inconsistency of a model's behaviour, we would also need to check if the model knew the violated constraints themselves, e.g., using additional probes. Again this is an area for future exploration.

The Influence of Feedback: We currently can only speculate why feedback (i.e., providing relevant facts to the model as context for question-answering) improves results. One explanation is that feedback helps the model focus attention. For example, reminding the model that "a swallow is not a fish" should help it realize that "a swallow has gills" is False (Figure 1). However, we also observe significant gains when feeding back random facts about the entity being queried. Possibly these facts still encourage the model to focus on the entity. Or, given the majority of beliefs are negative, the simple presence of negation in the context may discourage false positive answers, the primary cause of raw model errors.

8 Future Work

These findings were made using a restricted experimental setup where facts are short sentences, constraints are provided, and both facts and constraints use the same language so it is clear when constraints apply. To broaden the approach, several developments are required:

• A system would need to automatically gather constraints and queries. Constraints might be mined from text, or even extracted from the model itself by direct querying. Alternatively, domain-specific constraints (e.g., "X is a bird → X has wings") could be replaced with more generally applicable constraint patterns (e.g., "X is a Y & Y has Z → X has Z"), turning the domain-specific parts of constraints into facts that themselves could reside in the BeliefBank.

• Our queries have been simple facts. To fully utilize the BeliefBank for more complex queries (e.g., "What color is the gas that plants produce?"), a mechanism would be needed to decompose compound queries into their primitives ("What gas do plants produce?", "What color is that gas?"), and ensure they were consistent with BeliefBank beliefs (e.g., "Oxygen is colorless"); in other words, ensure complex answers were faithful to the BeliefBank (Subramanian et al., 2020).

• In general, detecting when a constraint applies to a fact is non-trivial, and requires appropriate machinery for identifying sentence/constraint alignment, e.g., an entailment model (Seo et al., 2016).

• Although constraint/model weights are automatically calibrated (Section 6.2), the system is still sensitive to those weights and can make mistakes (Section 7.2). Improved mechanisms for automatic calibration would significantly help.

• Our system is passive, in that it answers questions and uses answers in the order they are provided. An alternative design would be an active system that, given a new question, actively identifies which auxiliary questions would be most useful to also ask, to help answer the main question (either with constraint solving or feedback).

Finally, scaling to a massive BeliefBank would require addressing several engineering requirements, including bounding the constraint checking, and not re-asking every prior question as new answers were discovered. For example, we might only check constraints and their implications up to depth D from a new fact, rather than exhaustively. Similarly, the system may maintain a BeliefBank where earlier answers were generated using only feedback available at that point, rather than re-asking all earlier questions as new knowledge becomes available.
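The depth-bounded checking mentioned above could, for instance, look like a bounded breadth-first traversal of the constraint graph around a newly added fact. The sketch below is purely speculative (future work, not something implemented in the paper), and the graph representation is our assumption.

```python
# Speculative sketch (not from the paper) of depth-bounded constraint
# checking: gather only the grounded constraints reachable within `depth`
# hops of a newly added fact, instead of re-checking the whole BeliefBank.
from collections import deque
from typing import Dict, List, Tuple

def constraints_within_depth(new_fact: str,
                             edges: Dict[str, List[Tuple[str, object]]],
                             depth: int) -> List[object]:
    """edges maps a sentence to (neighbor sentence, constraint) pairs."""
    seen = {new_fact}
    frontier = deque([(new_fact, 0)])
    selected: List[object] = []
    while frontier:
        sentence, d = frontier.popleft()
        if d == depth:
            continue                      # do not expand beyond the bound
        for neighbor, constraint in edges.get(sentence, []):
            selected.append(constraint)
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, d + 1))
    return selected
```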

9 Conclusions

PTLMs can be inconsistent in their answers to probing questions, and can still give (what to a person appears as) naively wrong answers to unusual questions. This work is a first step to alleviating these problems. By augmenting a PTLM with a persistent, global memory, the BeliefBank, and using it for both constraint-solving and feedback, we have shown that both consistency and accuracy can be significantly improved. The additional memory layer can loosely be seen as a simple "mental model", constructed from the PTLM's raw answers.

Our experiments were conducted in a restricted, controlled setting, and further development is needed to scale to larger and more complex tasks. Nevertheless, the work here is significant as it is a first step towards endowing models with a globally consistent notion of belief, allowing them to construct a more coherent picture of the world. The dataset is available at https://allenai.org/data/beliefbank.

1 The dataset is available at https://allenai.org/data/beliefbank

2 Similarly, people do not always behave fully consistently with their professed beliefs.

3 In practice, strings are replaced with numeric identifiers in SAT syntax, but for clarity we leave them as strings here.

4 We measure accuracy with F1 (on the True class) rather than % correct because the True/False distribution in our dataset is unbalanced, with significantly fewer True than False answers. F1 avoids scores being dominated by negative answers.