Bridging Knowledge Gaps in Neural Entailment via Symbolic Models
Authors
Abstract
Most textual entailment models focus on lexical gaps between the premise text and the hypothesis, but rarely on knowledge gaps. We focus on filling these knowledge gaps in the Science Entailment task by leveraging an external structured knowledge base (KB) of science facts. Our new architecture combines standard neural entailment models with a knowledge lookup module. To facilitate this lookup, we propose a fact-level decomposition of the hypothesis and verify the resulting sub-facts against both the textual premise and the structured KB. Our model, NSnet, learns to aggregate predictions from these heterogeneous data formats. On the SciTail dataset, NSnet outperforms a simpler combination of the two predictions by 3% and the base entailment model by 5%.
1 Introduction
Textual entailment, a key challenge in natural language understanding, is a sub-problem in many end tasks such as question answering and information extraction. In one of the earliest works on entailment, the PASCAL Recognizing Textual Entailment Challenge, Dagan et al. (2005) define entailment as follows: text (or premise) P entails a hypothesis H if typically a human reading P would infer that H is most likely true. They note that this informal definition is "based on (and assumes) common human understanding of language as well as common background knowledge".
While current entailment systems have achieved impressive performance by focusing on the language understanding aspect, these systems, especially recent neural models (e.g., Parikh et al., 2016), do not directly address the need to fill knowledge gaps by leveraging common background knowledge. Figure 1 illustrates an example of P and H from SciTail, a recent science entailment dataset (Khot et al., 2018), that highlights the challenge of knowledge gaps: sub-facts of H that aren't stated in P but are universally true. In this example, an entailment system that is strong at filling lexical gaps may align large blood vessel with major artery to help conclude that P entails H. Such a system, however, would equally well, but incorrectly, conclude that P entails a hypothetical variant H' of H where artery is replaced with vein. A typical human, on the other hand, could bring to bear a piece of background knowledge, that aorta is a major artery (not a vein), to break the tie.
Figure 1:
P: The aorta is a large blood vessel that moves blood away from the heart to the rest of the body.
H (entailed): Aorta is the major artery carrying recently oxygenated blood away from the heart.
H' (not entailed): Aorta is the major vein carrying recently oxygenated blood away from the heart.
Motivated by this observation, we propose a new entailment model that combines the strengths of the latest neural entailment models with a structured knowledge base (KB) lookup module to bridge such knowledge gaps. To enable KB lookup, we use a fact-level decomposition of the hypothesis, and verify each resulting sub-fact against both the premise (using a standard entailment model) and the KB (using a structured scorer). The predictions from these two modules are combined using a multi-layer "aggregator" network. Our system, called NSnet, achieves 77.9% accuracy on SciTail, substantially improving over the baseline neural entailment model, and is comparable to the structured entailment model proposed by the authors of SciTail (Khot et al., 2018).
Figure 2: Neural-symbolic learning in NSnet. The bottom layer contains the QA pairs and their supporting text from SciTail, and the knowledge base (KB). The middle layer has three modules: Neural Entailment (blue), and Symbolic Matcher and Symbolic Lookup (red). The top layer takes the outputs (black and yellow) and intermediate representations from the middle modules, and is trained hierarchically against the final labels. All modules and the aggregator are jointly trained in an end-to-end fashion.
2 Neural-Symbolic Learning
A general solution for combining neural and symbolic modules remains a challenging open problem. As a step towards this, we present a system in the context of neural entailment that demonstrates a successful integration of the KB lookup model and simple overlap measures, opening up a path to achieve a similar integration in other models and tasks. The overall system architecture of our neural-symbolic model for textual entailment is presented in Figure 2. We describe each layer of this architecture in more detail in the following sub-sections.
2.1 Inputs
We decompose the hypothesis and identify relevant KB facts in the bottom "inputs" layer (Fig. 2).
Hypothesis Decomposition: To identify knowledge gaps, we must first identify the facts stated in the hypothesis $h = (h_1, h_2, \ldots)$. We use ClausIE (Del Corro and Gemulla, 2013) to break $h$ into sub-facts. ClausIE tuples need not be verb-mediated, and ClausIE generates multiple tuples from conjunctions, leading to higher recall than alternatives such as Open IE (Banko et al., 2007).¹
Knowledge Base (KB): To verify these facts, we use the largest available clean knowledge base for the science domain (Dalvi et al., 2017), with 294K simple facts, as input to our system. The knowledge base contains subject-verb-object (SVO) tuples with short, one- or two-word arguments (e.g., hydrogen; is; element). Using these simple facts ensures that the KB is used only to fill basic knowledge gaps and not to directly prove the hypothesis irrespective of the premise.
KB Retrieval: The large number of tuples in the knowledge base makes it infeasible to evaluate each hypothesis sub-fact against the entire KB. Hence, we retrieve the top-100 relevant knowledge tuples, $K'$, for each sub-fact based on a simple Jaccard word overlap score.
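The retrieval step can be illustrated with a minimal sketch. It assumes whitespace tokenization, lowercasing, and pooling the words of all three tuple fields into one set. None of these details are specified in the text, and the names `tokens`, `jaccard`, and `retrieve` are hypothetical.

```python
from typing import List, Set, Tuple

Fact = Tuple[str, str, str]  # (subject, verb, object)


def tokens(fact: Fact) -> Set[str]:
    """Lowercased word set over all three fields (tokenization is an assumption)."""
    return {w for field in fact for w in field.lower().split()}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard word overlap |a ∩ b| / |a ∪ b|, defined as 0 for two empty sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(sub_fact: Fact, kb: List[Fact], k: int = 100) -> List[Fact]:
    """Return the k KB tuples with the highest word overlap with the sub-fact."""
    query = tokens(sub_fact)
    ranked = sorted(kb, key=lambda t: jaccard(query, tokens(t)), reverse=True)
    return ranked[:k]


# Toy KB for illustration only; these are not actual TupleKB entries.
kb = [
    ("hydrogen", "is", "element"),
    ("aorta", "is", "artery"),
    ("vein", "carry", "blood"),
]
top = retrieve(("aorta", "is", "major artery"), kb, k=2)
```

Sorting the full KB for every sub-fact is only practical for a toy example. With several hundred thousand tuples, an inverted index over words would be a natural replacement.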
2.2 Modules
We use a Neural Entailment model to compute the entailment score based on the premise, as well as two symbolic models, Symbolic Matcher and Symbolic Lookup, to compute entailment scores based on the premise and the KB respectively (middle layer in Fig. 2).
Neural Entailment
We use a simple neural entailment model, Decomposable Attention (Parikh et al., 2016), one of the state-of-the-art models on the SNLI entailment dataset (Bowman et al., 2015). However, our architecture can just as easily use any other neural entailment model. We initialize the model parameters by training it on the Science Entailment dataset. Given the sub-facts from the hypothesis, we use this model to compute an entailment score $n(h_i, p)$ from the premise to each sub-fact $h_i$.
Symbolic Matcher
In our initial experiments, we noticed that the neural entailment models would often either get distracted by similar words in the distributional space (false positives) or completely miss an exact mention of $h_i$ in a long premise (false negatives). To mitigate these errors, we define a Symbolic Matcher model that compares exact words in $h_i$ and $p$, via a simple asymmetric bag-of-words overlap score:
$$m(h_i, p) = \frac{|h_i \cap p|}{|p|}$$
One could instead use more complex symbolic alignment methods such as integer linear programming (Khashabi et al., 2016; Khot et al., 2017).
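As a minimal sketch of the overlap score above, the following assumes whitespace tokenization, lowercasing, and sets of unique words; these are choices the text does not specify. The function name `symbolic_match` is hypothetical. The normalization by the premise size follows the formula as stated.

```python
def symbolic_match(sub_fact: str, premise: str) -> float:
    """Asymmetric word overlap |h_i ∩ p| / |p| over sets of unique lowercased words."""
    h = set(sub_fact.lower().split())
    p = set(premise.lower().split())
    # An empty premise gives no evidence, so the score is defined as 0 here.
    return len(h & p) / len(p) if p else 0.0


score = symbolic_match("aorta is artery", "the aorta is a large blood vessel")
```

In this example the premise has seven unique words and shares two of them ("aorta", "is") with the sub-fact, so the score is 2/7.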
Symbolic Lookup
This module verifies the presence of the hypothesis sub-fact $h_i$ in the retrieved KB tuples $K'$, by comparing the sub-fact to each tuple and taking the maximum score. Each field in the KB tuple $kb_j$ is scored against the corresponding field in $h_i$ (e.g., subject to subject) and averaged across the fields. To compare a field, we use a simple word-overlap based Jaccard similarity score, $\mathrm{Sim}(a, b) = \frac{|a \cap b|}{|a \cup b|}$. The lookup match score for the entire sub-fact and kb-fact is:
$$\mathrm{Sim}_f(h_i, kb_j) = \frac{\sum_k \mathrm{Sim}(h_i[k], kb_j[k])}{3}$$
and the final lookup module score for $h_i$ is:
$$l(h_i) = \max_{kb_j \in K'} \mathrm{Sim}_f(h_i, kb_j)$$
Note that the Symbolic Lookup module assesses whether a sub-fact of H is universally true. Neural models, via embeddings, are quite strong at mediating between P and H. The goal of the KB lookup module is to complement this strength, by verifying universally true sub-facts of H that may not be stated in P (e.g. "aorta is a major artery" in our motivating example).
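The lookup score can be sketched as follows, assuming both the sub-fact and the KB entries are (subject, verb, object) triples and that fields are compared as sets of lowercased, whitespace-separated words. The names `field_sim`, `fact_sim`, and `lookup` are hypothetical, and returning 0 for an empty retrieved set is an assumption.

```python
from typing import List, Tuple

Fact = Tuple[str, str, str]  # (subject, verb, object)


def field_sim(a: str, b: str) -> float:
    """Jaccard similarity Sim(a, b) between the word sets of two fields."""
    x, y = set(a.lower().split()), set(b.lower().split())
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def fact_sim(h: Fact, kb_fact: Fact) -> float:
    """Sim_f: field-wise similarity averaged over the three fields."""
    return sum(field_sim(hf, kf) for hf, kf in zip(h, kb_fact)) / 3


def lookup(h: Fact, retrieved: List[Fact]) -> float:
    """l(h_i): best match over the retrieved tuples K' (0 if none were retrieved)."""
    return max((fact_sim(h, t) for t in retrieved), default=0.0)


h = ("aorta", "is", "major artery")
retrieved = [("aorta", "is", "artery"), ("aorta", "is", "vein")]
best = lookup(h, retrieved)
```

Here the artery tuple scores (1 + 1 + 0.5) / 3 and the vein tuple scores (1 + 1 + 0) / 3, so the lookup prefers the artery tuple, in line with the motivating example.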
2.3 Aggregator Network
For each sub-fact $h_i$, we now have three scores: $n(h_i, p)$ from the neural model, $m(h_i, p)$ from the symbolic matcher, and $l(h_i)$ from the symbolic lookup model. The task of the Aggregator network is to combine these to produce a single entailment score. However, we found that using only the final predictions from the three modules was not effective. Inspired by recent work on skip/highway connections (He et al., 2016; Srivastava et al., 2015), we supplement these scores with intermediate, higher-dimensional representations from two of the modules.
[...] computation), $n_v(h_i, p) = [v_1; v_2]$.
We define a hybrid layer that takes as input a simple concatenation of these representation vectors from the different modules:
$$\mathrm{in}(h_i, p) = [h_i^{enc};\ l(h_i);\ m(h_i, p);\ n(h_i, p);\ emb_i;\ n_v(h_i, p)]$$
The hybrid layer is a single-layer MLP for each sub-fact $h_i$ that outputs a sub-representation $out_i = \mathrm{MLP}(\mathrm{in}(h_i, p))$. A compositional layer then uses a two-layer MLP over a concatenation of the hybrid layer outputs from different sub-facts, $\{h_1, \ldots, h_I\}$, to produce the final label,
$$\mathrm{label} = \mathrm{MLP}([out_1; out_2; \cdots; out_I])$$
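The data flow through the hybrid and compositional layers can be sketched as a NumPy forward pass. Everything beyond the concatenation order and the layer depths is an assumption made for illustration: the dimensions, the ReLU activations, the random untrained weights, the softmax over two labels, and zero-padding to a fixed number of sub-facts. All names are hypothetical, and training with the cross-entropy loss is not shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the text does not specify any of them.
D_ENC, D_EMB, D_NV = 8, 100, 16      # h_enc, emb (top-100 scores), n_v
D_OUT, D_HID = 32, 32                # hybrid output and compositional hidden size
MAX_FACTS, N_LABELS = 3, 2           # padding length (assumption) and label count
D_IN = D_ENC + 3 + D_EMB + D_NV      # plus the three scalar scores l, m, n

# Randomly initialised, untrained weights.
W_h, b_h = rng.normal(scale=0.1, size=(D_IN, D_OUT)), np.zeros(D_OUT)
W_1, b_1 = rng.normal(scale=0.1, size=(MAX_FACTS * D_OUT, D_HID)), np.zeros(D_HID)
W_2, b_2 = rng.normal(scale=0.1, size=(D_HID, N_LABELS)), np.zeros(N_LABELS)


def relu(x):
    return np.maximum(x, 0.0)


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def hybrid_input(h_enc, l, m, n, emb, n_v):
    """in(h_i, p) = [h_enc; l; m; n; emb; n_v] as one flat vector."""
    return np.concatenate([h_enc, [l, m, n], emb, n_v])


def aggregate(sub_fact_inputs):
    """Single-layer MLP per sub-fact, then a two-layer MLP over the concatenation."""
    outs = [relu(x @ W_h + b_h) for x in sub_fact_inputs[:MAX_FACTS]]
    while len(outs) < MAX_FACTS:             # zero-pad missing sub-facts
        outs.append(np.zeros(D_OUT))
    hidden = relu(np.concatenate(outs) @ W_1 + b_1)
    return softmax(hidden @ W_2 + b_2)       # distribution over labels


# Two sub-facts with made-up feature values.
inputs = [
    hybrid_input(rng.normal(size=D_ENC), 0.83, 0.29, 0.6,
                 rng.uniform(size=D_EMB), rng.normal(size=D_NV))
    for _ in range(2)
]
probs = aggregate(inputs)
```

Padding to a fixed number of sub-facts is one simple way to make the concatenation well-defined when hypotheses decompose into different numbers of sub-facts; the text does not say how this is handled.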
Finally, we use the cross-entropy loss to train the Aggregator network jointly with the representations in the neural entailment and symbolic lookup models, in an end-to-end fashion. We refer to this entire architecture as the NSnet network.
To assess the effectiveness of the aggregator network, we also use a simpler baseline model, Ensemble, that works as follows. For each sub-fact $h_i$, it combines the predictions from the models using a probabilistic-OR function, treating each model score $P_m$ as a probability of entailment. This function computes the probability that at least one model predicts that $h_i$ is entailed, i.e.,
$$P(h_i) = 1 - \prod_m (1 - P_m), \quad m \in \{n(h_i, p),\ m(h_i, p),\ l(h_i)\}$$
We average the probabilities from all the sub-facts to get the final entailment probability.²
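The Ensemble baseline is simple enough to sketch directly. The function names and the example scores below are hypothetical, and returning 0 for a hypothesis with no sub-facts is an assumption.

```python
from typing import List, Sequence


def prob_or(scores: Sequence[float]) -> float:
    """Probabilistic OR: 1 - prod(1 - P_m) over the module scores for one sub-fact."""
    none_entail = 1.0
    for p in scores:
        none_entail *= 1.0 - p
    return 1.0 - none_entail


def ensemble(sub_fact_scores: List[Sequence[float]]) -> float:
    """Average the per-sub-fact OR probabilities into one entailment probability."""
    if not sub_fact_scores:
        return 0.0
    per_fact = [prob_or(s) for s in sub_fact_scores]
    return sum(per_fact) / len(per_fact)


# Each triple holds (neural, matcher, lookup) scores for one sub-fact.
p_entail = ensemble([(0.5, 0.2, 0.0), (0.1, 0.0, 0.5)])
```

The first sub-fact gives 1 - 0.5 * 0.8 * 1.0 = 0.6 and the second gives 1 - 0.9 * 1.0 * 0.5 = 0.55, so the averaged probability is 0.575.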
3 Experiments
We use the SciTail dataset³ for our experiments, which contains 27K entailment examples with an 87.3%/4.8%/7.8% train/dev/test split. The premise and hypothesis in each example are natural sentences authored independently as well as independent of the entailment task, which makes the dataset particularly challenging. We focused mainly on the SciTail dataset, since other crowd-sourced datasets, large enough for training, contained limited linguistic variation (Gururangan et al., 2018), leading to limited gains achievable via external knowledge.
For background knowledge, we use version v4 of the aforementioned Aristo TupleKB⁴ (Dalvi et al., 2017), containing 283K basic science facts. We compare our proposed models to the Decomposed Graph Entailment Model (DGEM) and the Decomposable Attention Model (DecompAttn) (Parikh et al., 2016).
Table 1 summarizes the validation and test accuracies of various models on the SciTail dataset. The DecompAttn model achieves 74.3% on the test set but drops by 1.6% when the hypotheses are decomposed. The Ensemble approach uses the same hypothesis decomposition and is able to recover 2.1 percentage points by using the KB. The end-to-end NSnet network further improves the score by 3.1% and is statistically significantly better (at p-value 0.05) than the baseline neural entailment model. The model is marginally better than DGEM, a graph-based entailment model proposed by the authors of the SciTail dataset. We show significant gains over our base entailment model by using an external knowledge base, which are comparable to the gains achieved by DGEM through the use of hypothesis structure. These are orthogonal approaches, and one could replace the base DecompAttn model with DGEM or more recent models (Tay et al., 2017; Yin et al., 2018).
In Table 2, we evaluate the impact of the Symbolic Matcher and Symbolic Lookup modules on the best reported model. As we see, removing the symbolic matcher, despite its simplicity, results in a 3.2% drop. Also, the KB lookup model is able to fill some knowledge gaps, contributing 2.1% to the final score. Together, these symbolic matching models contribute 4% to the overall score.
Figure 3 shows a few randomly selected examples from the test set. The first two examples show cases where the symbolic models help change the neural alignment's prediction (F) to the correct prediction (T) in our proposed Ensemble or NSnet models. The third example shows a case where the NSnet architecture learns a better combination of the neural and symbolic methods to correctly identify the entailment relation while Ensemble fails to do so.
Figure 3: A few randomly selected examples from the test set, comparing symbolic-only, neural-only, Ensemble, and NSnet inference. The symbolic-only model shows its most similar knowledge from the knowledge base in parentheses. The first two examples show cases where knowledge helps fill a gap that the neural model cannot. The third example shows a case where NSnet predicts correctly while Ensemble fails.
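The test accuracies implied by the deltas quoted in this section can be checked with a few lines of arithmetic. Only the 74.3% figure and the three deltas come from the text. The intermediate values are derived here, not read from Table 1, and the sketch assumes each delta is measured on the test set relative to the preceding model. The variable names are hypothetical.

```python
# Accuracies are kept in tenths of a percentage point to avoid float rounding.
decomp_attn = 743                            # 74.3%, stated in the text
decomp_attn_decomposed = decomp_attn - 16    # drops by 1.6 with decomposition
ensemble_acc = decomp_attn_decomposed + 21   # Ensemble recovers 2.1 via the KB
nsnet_acc = ensemble_acc + 31                # NSnet improves by a further 3.1

for name, tenths in [
    ("DecompAttn", decomp_attn),
    ("DecompAttn + decomposition", decomp_attn_decomposed),
    ("Ensemble", ensemble_acc),
    ("NSnet", nsnet_acc),
]:
    print(f"{name}: {tenths / 10:.1f}%")

gain_over_decomposed_base = (nsnet_acc - decomp_attn_decomposed) / 10
```

Under these assumptions the chain ends at 77.9%, matching the NSnet accuracy quoted in the introduction, and the gain over the decomposed base model is 5.2 points, which appears consistent with the roughly 5% improvement cited in the abstract and conclusion.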
4 Related Work
Compared to neural-only (Bowman et al., 2015; Parikh et al., 2016) or symbolic-only (Khot et al., 2017; Khashabi et al., 2016) systems, our model takes advantage of both, an approach often called neural-symbolic learning (Garcez et al., 2015). Various neural-symbolic models have been proposed for question answering (Liang et al., 2016) and causal explanations (Kang et al., 2017). We focus on end-to-end training of these models specifically for textual entailment.
Contemporaneous to this work, Chen et al. (2018) have incorporated knowledge bases within the attention and composition functions of a neural entailment model, while Kang et al. (2018) generate adversarial examples using symbolic knowledge (e.g., WordNet) to train a robust entailment model. We focused on integrating knowledge bases via a separate symbolic model to fill the knowledge gaps.
5 Conclusion
We proposed a new entailment model that attempts to bridge knowledge gaps in textual entailment by incorporating structured knowledge base lookup into standard neural entailment models. Our architecture, NSnet, can be trained end-to-end, and achieves a 5% improvement on SciTail over the baseline neural model used here. The methodology can easily be applied with more complex entailment models (e.g., DGEM) as the base neural entailment model. Accurately identifying the sub-facts of a hypothesis is a challenging task in itself, especially when dealing with negation. Improvements to the fact decomposition should further improve the model.
¹ While prior work on question answering in the science domain has successfully used Open IE to extract facts from sentences (Khot et al., 2017), one of the key reasons for errors was the lossy nature of Open IE.
² While more intuitive, performing an AND aggregation resulted in worse performance (cf. Appendix ?? for details).
³ http://data.allenai.org/scitail
⁴ http://data.allenai.org/tuple-kb