Knowledge Enhanced Contextual Word Representations


  • Matthew E. Peters
  • Mark Neumann
  • Robert L Logan IV
  • Roy Schwartz
  • V. Joshi
  • Sameer Singh
  • Noah A. Smith
  • 2019


Contextual word representations, typically trained on unstructured, unlabeled text, do not contain any explicit grounding to real world entities and are often unable to remember facts about those entities. We propose a general method to embed multiple knowledge bases (KBs) into large scale models, and thereby enhance their representations with structured, human-curated knowledge. For each KB, we first use an integrated entity linker to retrieve relevant entity embeddings, then update contextual word representations via a form of word-to-entity attention. In contrast to previous approaches, the entity linkers and self-supervised language modeling objective are jointly trained end-to-end in a multitask setting that combines a small amount of entity linking supervision with a large amount of raw text. After integrating WordNet and a subset of Wikipedia into BERT, the knowledge enhanced BERT (KnowBert) demonstrates improved perplexity, ability to recall facts as measured in a probing task and downstream performance on relationship extraction, entity typing, and word sense disambiguation. KnowBert’s runtime is comparable to BERT’s and it scales to large KBs.

1 Introduction

Large pretrained models such as ELMo (Peters et al., 2018), GPT (Radford et al., 2018), and BERT (Devlin et al., 2019) have significantly improved the state of the art for a wide range of NLP tasks. These models are trained on large amounts of raw text using self-supervised objectives. However, they do not contain any explicit grounding to real world entities and as a result have difficulty recovering factual knowledge (Logan et al., 2019).

Knowledge bases (KBs) provide a rich source of high quality, human-curated knowledge that can be used to ground these models. In addition, they often include complementary information to that found in raw text, and can encode factual knowledge that is difficult to learn from selectional preferences either due to infrequent mentions of commonsense knowledge or long range dependencies.

We present a general method to insert multiple KBs into a large pretrained model with a Knowledge Attention and Recontextualization (KAR) mechanism. The key idea is to explicitly model entity spans in the input text and use an entity linker to retrieve relevant entity embeddings from a KB to form knowledge enhanced entity-span representations. Then, the model recontextualizes the entity-span representations with word-to-entity attention to allow long range interactions between contextual word representations and all entity spans in the context. The entire KAR is inserted between two layers in the middle of a pretrained model such as BERT.

In contrast to previous approaches that integrate external knowledge into task-specific models with task supervision (e.g., Yang and Mitchell, 2017; Chen et al., 2018), our approach learns the entity linkers with self-supervision on unlabeled data. This results in general purpose knowledge enhanced representations that can be applied to a wide range of downstream tasks.

Our approach has several other benefits. First, it leaves the top layers of the original model unchanged so we may retain the output loss layers and fine-tune on unlabeled corpora while training the KAR. This also allows us to simply swap out BERT for KnowBert in any downstream application. Second, by taking advantage of the existing high capacity layers in the original model, the KAR is lightweight, adding minimal additional parameters and runtime. Finally, it is easy to incorporate additional KBs by simply inserting them at other locations.

arXiv:1909.04164v1 [cs.CL] 9 Sep 2019

KnowBert is agnostic to the form of the KB, subject to a small set of requirements (see Sec. 3.2). We experiment with integrating both WordNet (Miller, 1995) and Wikipedia, thus explicitly adding word sense knowledge and facts about named entities (including those unseen at training time). However, the method could be extended to commonsense KBs such as ConceptNet (Speer et al., 2017) or domain specific ones (e.g., UMLS; Bodenreider, 2004).

We evaluate KnowBert with a mix of intrinsic and extrinsic tasks. Despite being based on the smaller BERT BASE model, the experiments demonstrate improved masked language model perplexity and ability to recall facts over BERT LARGE. The extrinsic evaluations demonstrate improvements for challenging relationship extraction, entity typing and word sense disambiguation datasets, and often outperform other contemporaneous attempts to incorporate external knowledge into BERT.

2 Related Work

Pretrained word representations Initial work learning word vectors focused on static word embeddings using multi-task learning objectives (Collobert and Weston, 2008) or corpus level co-occurrence statistics (Mikolov et al., 2013a; Pennington et al., 2014). Recently the field has shifted toward learning context-sensitive embeddings (Dai and Le, 2015; Peters et al., 2018; Devlin et al., 2019). We build upon these by incorporating structured knowledge into these models.

Entity embeddings Entity embedding methods produce continuous vector representations from external knowledge sources. Knowledge graph-based methods optimize the score of observed triples in a knowledge graph. These methods broadly fall into two categories: translational distance models (Bordes et al., 2013; Wang et al., 2014b; Lin et al., 2015; Xiao et al., 2016), which use a distance-based scoring function, and linear models (Nickel et al., 2011; Yang et al., 2014; Trouillon et al., 2016; Dettmers et al., 2018), which use a similarity-based scoring function. We experiment with TuckER (Balazevic et al., 2019) embeddings, a recent linear model which generalizes many of the aforecited models. Other methods combine entity metadata with the graph (Xie et al., 2016), use entity contexts (Chen et al., 2014; Ganea and Hofmann, 2017), or a combination of contexts and the KB (Wang et al., 2014a; Gupta et al., 2017). Our approach is agnostic to the details of the entity embedding method and as a result is able to use any of these methods.

Entity-aware language models Some previous work has focused on adding KBs to generative language models (LMs) (Ahn et al., 2017; Logan et al., 2019) or building entity-centric LMs (Ji et al., 2017). However, these methods introduce latent variables that require full annotation for training, or marginalization. In contrast, we adopt a method that allows training with large amounts of unannotated text.

Task-specific KB architectures Other work has focused on integrating KBs into neural architectures for specific downstream tasks (Yang and Mitchell, 2017; Sun et al., 2018; Chen et al., 2018; Bauer et al., 2018; Mihaylov and Frank, 2018; Wang and Jiang, 2019; Yang et al., 2019). Our approach instead uses KBs to learn more generally transferable representations that can be used to improve a variety of downstream tasks.

3 KnowBert

KnowBert incorporates knowledge bases into BERT using the Knowledge Attention and Recontextualization component (KAR). We start by describing the BERT and KB components. We then move to introducing KAR. Finally, we describe the training procedure, including the multitask training regime for jointly training KnowBert and an entity linker.

3.1 Pretrained BERT

We describe KnowBert as an extension to (and candidate replacement for) BERT, although the method is general and can be applied to any deep pretrained model including left-to-right and right-to-left LMs such as ELMo and GPT. Formally, BERT accepts as input a sequence of N WordPiece tokens (Sennrich et al., 2016; Wu et al., 2016), $(x_1, \ldots, x_N)$, and computes L layers of D-dimensional contextual representations $\mathbf{H}_i \in \mathbb{R}^{N \times D}$ by successively applying non-linear functions $\mathbf{H}_i = F_i(\mathbf{H}_{i-1})$. The non-linear function is a multi-headed self-attention layer followed by a position-wise multilayer perceptron (MLP) (Vaswani et al., 2017):

$$F_i(\mathbf{H}_{i-1}) = \mathrm{TransformerBlock}(\mathbf{H}_{i-1}) = \mathrm{MLP}(\mathrm{MultiHeadAttn}(\mathbf{H}_{i-1}, \mathbf{H}_{i-1}, \mathbf{H}_{i-1})).$$

The multi-headed self-attention uses $\mathbf{H}_{i-1}$ as the query, key, and value to allow each vector to attend to every other vector.
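To make the query, key, and value roles concrete, the following is a minimal sketch of single-head scaled dot-product attention in plain Python. It is an illustration only: it omits the multiple heads, learned projections, MLP, residual connections, and layer normalization of a real transformer block, and the function names are assumptions rather than anything from the BERT codebase.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Each output row is a convex combination of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def self_attention(h):
    """Self-attention: the same matrix serves as query, key, and value."""
    return attention(h, h, h)

H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # N = 3 vectors, D = 2
H_next = self_attention(H)
```

Passing a different matrix for the keys and values than for the queries turns the same routine into the cross-attention used later by the KAR.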

BERT is trained to minimize an objective function that combines both next-sentence prediction (NSP) and masked LM log-likelihood (MLM):

$$\mathcal{L}_{BERT} = \mathcal{L}_{NSP} + \mathcal{L}_{MLM}.$$

Given two inputs $x_A$ and $x_B$, the next-sentence prediction task is binary classification to predict whether $x_B$ is the next sentence following $x_A$. The masked LM objective randomly replaces a percentage of input word pieces with a special [MASK] token and computes the negative log-likelihood of the missing token with a linear layer and softmax over all possible word pieces.

3.2 Knowledge Bases

The key contribution of this paper is a method to incorporate knowledge bases (KBs) into a pretrained BERT model. To encompass as wide a selection of prior knowledge as possible, we adopt a broad definition of a KB as a fixed collection of K entity nodes, $e_k$, from which it is possible to compute entity embeddings, $\mathbf{e}_k \in \mathbb{R}^E$. This includes KBs with a typical (subj, rel, obj) graph structure, KBs that contain only entity metadata without a graph, and those that combine both a graph and entity metadata, as long as there is some method for embedding the entities in a low dimensional vector space. We also do not make any assumption that the entities are typed. As we show in Sec. 4.1, this flexibility is beneficial: we compute entity embeddings from WordNet using both the graph and synset definitions, but link directly to Wikipedia pages without a graph by using embeddings computed from the entity description.

We also assume that the KB is accompanied by an entity candidate selector that takes as input some text and returns a list of C potential entity links, each consisting of the start and end indices of the potential mention span and $M_m$ candidate entities in the KB:

$$\mathcal{C} = \{ \langle (\mathrm{start}_m, \mathrm{end}_m), (e_{m,1}, \ldots, e_{m,M_m}) \rangle \mid m \in 1 \ldots C,\ e_k \in 1 \ldots K \}.$$

In practice, these are often implemented using precomputed dictionaries (e.g., CrossWikis; Spitkovsky and Chang, 2012), KB specific rules (e.g., a WordNet lemmatizer), or other heuristics (e.g., string match; Mihaylov and Frank, 2018) . Ling et al. (2015) showed that incorporating candidate priors into entity linkers can be a powerful signal, so we optionally allow for the candidate selector to return an associated prior probability for each entity candidate. In some cases, it is beneficial to over-generate potential candidates and add a special NULL entity to each candidate list, thereby allowing the linker to discriminate between actual links and false positive candidates. In this work, the entity candidate selectors are fixed but their output is passed to a learned context dependent entity linker to disambiguate the candidate mentions.
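A dictionary-based candidate selector of the kind described above can be sketched as follows. This is a toy illustration, not the selector used in the paper: the alias table, the `NULL` identifier, the zero prior assigned to the NULL entity, and the `max_span_len` parameter are all assumptions made for the example.

```python
NULL = "@@NULL@@"  # hypothetical identifier for the special NULL entity

# Hypothetical alias dictionary: lowercased surface string -> [(entity, prior)]
ALIAS_TABLE = {
    "prince": [("Prince_(musician)", 0.7), ("Prince", 0.3)],
    "purple rain": [("Purple_Rain_(album)", 0.6), ("Purple_Rain_(film)", 0.4)],
}

def select_candidates(tokens, alias_table, max_span_len=3,
                      max_candidates=30, add_null=True):
    """Return a list of ((start, end), entities, priors) with inclusive indices."""
    results = []
    for start in range(len(tokens)):
        for end in range(start, min(start + max_span_len, len(tokens))):
            surface = " ".join(tokens[start:end + 1]).lower()
            if surface not in alias_table:
                continue
            # Keep only the highest-prior candidates for this mention span.
            ranked = sorted(alias_table[surface], key=lambda c: -c[1])
            ranked = ranked[:max_candidates]
            entities = [entity for entity, _ in ranked]
            priors = [prior for _, prior in ranked]
            if add_null:
                # Lets a downstream linker reject false positive candidates.
                entities.append(NULL)
                priors.append(0.0)
            results.append(((start, end), entities, priors))
    return results

candidates = select_candidates("Prince sang Purple Rain".split(), ALIAS_TABLE)
```

Because the lookup is a fixed rule, its cost depends on the sentence length and the candidate cap, not on the total number of entities in the KB.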

Finally, by restricting the number of candidate entities to a fixed small number (we use 30), KnowBert's runtime is independent of the size of the KB, as it only considers a small subset of all possible entities for any given text. As the candidate selection is rule-based and fixed, it is fast and in our implementation is performed asynchronously on CPU. The only overhead for scaling up the size of the KB is the memory footprint to store the entity embeddings.

3.3 KAR

The Knowledge Attention and Recontextualization component (KAR) is the heart of KnowBert. The KAR accepts as input the contextual representations at a particular layer, $\mathbf{H}_i$, and computes knowledge enhanced representations $\mathbf{H}'_i = \mathrm{KAR}(\mathbf{H}_i, \mathcal{C})$. This is fed into the next pretrained layer, $\mathbf{H}_{i+1} = \mathrm{TransformerBlock}(\mathbf{H}'_i)$, and the remainder of BERT is run as usual.

In this section, we describe the KAR's key components: mention-span representations, retrieval of relevant entity embeddings using an entity linker, update of mention-span embeddings with retrieved information, and recontextualization of entity-span embeddings with word-to-entity-span attention. We describe the KAR for a single KB, but extension to multiple KBs at different layers is straightforward. See Fig. 1 for an overview.

Figure 1: The Knowledge Attention and Recontextualization (KAR) component. BERT word piece representations ($\mathbf{H}_i$) are first projected to $\mathbf{H}_i^{\mathrm{proj}}$ (1), then pooled over candidate mention spans (2) to compute $\mathbf{S}$, and contextualized into $\mathbf{S}^e$ using mention-span self-attention (3). An integrated entity linker computes weighted average entity embeddings $\tilde{\mathbf{E}}$ (4), which are used to enhance the span representations with knowledge from the KB (5), computing $\mathbf{S}'^e$. Finally, the BERT word piece representations are recontextualized with word-to-entity-span attention (6) and projected back to the BERT dimension (7) resulting in $\mathbf{H}'_i$.

Mention-span representations The KAR starts with the KB entity candidate selector that provides a list of candidate mentions which it uses to compute mention-span representations. $\mathbf{H}_i$ is first projected to the entity dimension (E, typically 200 or 300, see Sec. 4.1) with a linear projection,

$$\mathbf{H}_i^{\mathrm{proj}} = \mathbf{H}_i \mathbf{W}_1^{\mathrm{proj}} + \mathbf{b}_1^{\mathrm{proj}}. \tag{1}$$

Then, the KAR computes C mention-span representations $\mathbf{s}_m \in \mathbb{R}^E$, one for each candidate mention, by pooling over all word pieces in a mention-span using the self-attentive span pooling from Lee et al. (2017). The mention-spans are stacked into a matrix $\mathbf{S} \in \mathbb{R}^{C \times E}$.
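The pooling step can be sketched as below: each word piece in a span receives a scalar score, the scores are softmax-normalized within the span, and the span vector is the weighted average. This is a simplified reading of self-attentive span pooling; scoring with a single vector `w_attn` is an assumption for the example, and all names are hypothetical.

```python
import math

def span_pool(h_proj, start, end, w_attn):
    """Pool word pieces start..end (inclusive) into a single span vector."""
    rows = h_proj[start:end + 1]
    # One scalar attention score per word piece in the span.
    scores = [sum(w * x for w, x in zip(w_attn, row)) for row in rows]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(rows[0])
    return [sum(w * row[j] for w, row in zip(weights, rows)) for j in range(dim)]

H_proj = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]  # N = 4, E = 2
spans = [(0, 1), (2, 2)]                                   # C = 2 candidate mentions
w = [0.0, 0.0]  # a zero scoring vector gives uniform weights (mean pooling)
S = [span_pool(H_proj, s, e, w) for s, e in spans]         # C x E matrix
```

With a learned, non-zero `w_attn`, the pooling can emphasize the most informative word piece in a multi-token mention instead of averaging uniformly.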

Entity Linker

The entity linker is responsible for performing entity disambiguation for each potential mention from among the available candidates. It first runs mention-span self-attention to compute

$$\mathbf{S}^e = \mathrm{TransformerBlock}(\mathbf{S}). \tag{2}$$

The span self-attention is identical to the typical transformer layer, with the exception that the self-attention is between mention-span vectors instead of word piece vectors. This allows KnowBert to incorporate global information into each linking decision so that it can take advantage of entity-entity co-occurrence and resolve which of several overlapping candidate mentions should be linked.¹ Following Kolitsas et al. (2018), $\mathbf{S}^e$ is used to score each of the candidate entities while incorporating the candidate entity prior from the KB. Each candidate span m has an associated mention-span vector $\mathbf{s}_m^e$ (computed via Eq. 2), $M_m$ candidate entities with embeddings $\mathbf{e}_{mk}$ (from the KB), and prior probabilities $p_{mk}$. We compute $M_m$ scores using the prior and dot product between the entity-span vectors and entity embeddings,

$$\psi_{mk} = \mathrm{MLP}(p_{mk},\ \mathbf{s}_m^e \cdot \mathbf{e}_{mk}), \tag{3}$$

with a two-layer MLP (100 hidden dimensions). If entity linking (EL) supervision is available, we can compute a loss with the gold entity $e_{mg}$. The exact form of the loss depends on the KB, and we use both log-likelihood,

$$\mathcal{L}_{EL} = -\sum_m \log \left( \frac{\exp(\psi_{mg})}{\sum_k \exp(\psi_{mk})} \right), \tag{4}$$

and max-margin,

$$\mathcal{L}_{EL} = \max(0,\ \gamma - \psi_{mg}) + \sum_{e_{mk} \neq e_{mg}} \max(0,\ \gamma + \psi_{mk}), \tag{5}$$

formulations (see Sec. 4.1 for details).
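The scoring and loss computations for a single mention span can be sketched as follows. Everything here is illustrative: the MLP has two hidden units rather than 100, its weights are made up, and the max-margin function implements one common hinge formulation with a hypothetical margin of 1.0, which is an assumption about the general shape of such a loss rather than a statement of the exact objective used in the paper.

```python
import math

def mlp_score(prior, similarity, params):
    """Two-layer MLP over the pair [prior, dot product]; weights are hypothetical."""
    w1, b1, w2, b2 = params
    hidden = [max(0.0, w[0] * prior + w[1] * similarity + b)  # ReLU units
              for w, b in zip(w1, b1)]
    return sum(wo * h for wo, h in zip(w2, hidden)) + b2

def score_candidates(span_vec, entity_embs, priors, params):
    """One score per candidate entity for a single mention span."""
    scores = []
    for emb, prior in zip(entity_embs, priors):
        similarity = sum(a * b for a, b in zip(span_vec, emb))
        scores.append(mlp_score(prior, similarity, params))
    return scores

def log_likelihood_loss(scores, gold):
    """Negative log of the softmax probability assigned to the gold entity."""
    top = max(scores)
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return log_z - scores[gold]

def max_margin_loss(scores, gold, margin=1.0):
    """Hinge loss: push the gold score above +margin and the others below -margin."""
    loss = max(0.0, margin - scores[gold])
    loss += sum(max(0.0, margin + s) for k, s in enumerate(scores) if k != gold)
    return loss

params = ([[1.0, 1.0], [0.5, -0.5]], [0.0, 0.0], [1.0, 1.0], 0.0)
scores = score_candidates([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], [0.4, 0.6], params)
ll = log_likelihood_loss(scores, gold=0)
mm = max_margin_loss(scores, gold=0)
```

In this toy case the candidate with the lower prior wins because its embedding matches the span vector, which is the interaction between prior and context that the learned scorer is meant to capture.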

Knowledge enhanced entity-span representations KnowBert next injects the KB entity information into the mention-span representations computed from BERT vectors ($\mathbf{s}_m^e$) to form entity-span representations. For a given span m, we first disregard all candidate entities with score $\psi$ below a fixed threshold, and softmax normalize the remaining scores:

$$\tilde{\psi}_{mk} = \begin{cases} \dfrac{\exp(\psi_{mk})}{\sum_{\psi_{mk} \geq \delta} \exp(\psi_{mk})}, & \psi_{mk} \geq \delta \\ 0, & \psi_{mk} < \delta. \end{cases}$$

Then the weighted entity embedding is

$$\tilde{\mathbf{e}}_m = \sum_k \tilde{\psi}_{mk} \mathbf{e}_{mk}.$$

If all entity linking scores are below the threshold $\delta$, we substitute a special NULL embedding for $\tilde{\mathbf{e}}_m$. Finally, the entity-span representations are updated with the weighted entity embeddings

$$\mathbf{s}'^e_m = \mathbf{s}_m^e + \tilde{\mathbf{e}}_m,$$

which are packed into a matrix $\mathbf{S}'^e \in \mathbb{R}^{C \times E}$.
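The thresholded softmax, weighted entity embedding, NULL fallback, and span update can be traced in a few lines of plain Python. This is a sketch for a single span; the function and argument names are hypothetical.

```python
import math

def enhance_span(span_vec, scores, entity_embs, null_emb, delta):
    """Add the score-weighted entity embedding to one mention-span vector."""
    # Discard candidates whose linking score falls below the threshold delta.
    kept = [(s, e) for s, e in zip(scores, entity_embs) if s >= delta]
    if not kept:
        # Every candidate was rejected: fall back to the special NULL embedding.
        weighted = list(null_emb)
    else:
        top = max(s for s, _ in kept)
        exps = [math.exp(s - top) for s, _ in kept]
        z = sum(exps)
        dim = len(span_vec)
        # Softmax over the surviving scores, then a weighted sum of embeddings.
        weighted = [sum((x / z) * e[j] for x, (_, e) in zip(exps, kept))
                    for j in range(dim)]
    return [a + b for a, b in zip(span_vec, weighted)]

linked = enhance_span([0.0, 0.0], [1.0, 1.0, -5.0],
                      [[2.0, 0.0], [0.0, 2.0], [9.0, 9.0]],
                      null_emb=[0.5, 0.5], delta=0.0)
unlinked = enhance_span([1.0, 1.0], [-1.0, -2.0],
                        [[2.0, 0.0], [0.0, 2.0]],
                        null_emb=[0.5, 0.5], delta=0.0)
```

In the first call the third candidate is dropped by the threshold and the two survivors are averaged equally; in the second call no candidate survives, so the NULL embedding is added instead.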

Recontextualization After updating the entity-span representations with the weighted entity vectors, KnowBert uses them to recontextualize the word piece representations. This is accomplished using a modified transformer layer that substitutes the multi-headed self-attention with a multi-headed attention between the projected word piece representations and knowledge enhanced entity-span vectors. As introduced by Vaswani et al. (2017), standard self-attention uses the same contextual representations for the query, key, and value. The word-to-entity-span attention in KnowBert instead uses $\mathbf{H}_i^{\mathrm{proj}}$ for the query, and $\mathbf{S}'^e$ for both the key and value:

$$\mathbf{H}'^{\mathrm{proj}}_i = \mathrm{MLP}(\mathrm{MultiHeadAttn}(\mathbf{H}_i^{\mathrm{proj}}, \mathbf{S}'^e, \mathbf{S}'^e)). \tag{6}$$

This allows each word piece to attend to all entity-spans in the context, so that it can propagate entity information over long contexts. After the multi-headed word-to-entity-span attention, we run a position-wise MLP analogous to the standard transformer layer.² Finally, $\mathbf{H}'^{\mathrm{proj}}_i$ is projected back to the BERT dimension with a linear transformation and a residual connection added:

$$\mathbf{H}'_i = \mathbf{H}'^{\mathrm{proj}}_i \mathbf{W}_2^{\mathrm{proj}} + \mathbf{b}_2^{\mathrm{proj}} + \mathbf{H}_i. \tag{7}$$
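The recontextualization path (project down, attend from word pieces to entity spans, project back, add the residual) can be sketched end to end in plain Python. This is a single-head illustration that omits the position-wise MLP; the matrices and helper names are hypothetical and chosen so the result can be checked by hand.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add_bias(rows, bias):
    return [[v + b for v, b in zip(row, bias)] for row in rows]

def word_to_span_attention(h_proj, s_enh):
    """Each word piece (query) attends over all entity spans (keys and values)."""
    scale = math.sqrt(len(s_enh[0]))
    out = []
    for q in h_proj:
        scores = [sum(x * y for x, y in zip(q, k)) / scale for k in s_enh]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        z = sum(exps)
        out.append([sum((e / z) * v[j] for e, v in zip(exps, s_enh))
                    for j in range(len(s_enh[0]))])
    return out

def recontextualize(h, s_enh, w1, b1, w2, b2):
    h_proj = add_bias(matmul(h, w1), b1)                # project D -> E
    h_proj_new = word_to_span_attention(h_proj, s_enh)  # position-wise MLP omitted
    back = add_bias(matmul(h_proj_new, w2), b2)         # project E -> D
    # Residual connection with the original word piece representations.
    return [[x + y for x, y in zip(r, orig)] for r, orig in zip(back, h)]

H = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]     # N = 2 word pieces, D = 3
S_enh = [[1.0, 2.0]]                       # C = 1 entity span, E = 2
W1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]  # D x E
W2 = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]    # E x D
H_out = recontextualize(H, S_enh, W1, [0.0, 0.0], W2, [0.0, 0.0, 0.0])
```

With a single entity span every word piece puts all of its attention weight on that span, so each output row is the span vector mapped back to D dimensions plus the original row.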

Alignment of BERT and entity vectors As KnowBert does not place any restrictions on the entity embeddings, it is essential to align them with the pretrained BERT contextual representations. To encourage this alignment we initialize $\mathbf{W}_2^{\mathrm{proj}}$ as the matrix inverse of $\mathbf{W}_1^{\mathrm{proj}}$ (Eq. 1). The use of dot product similarity (Eq. 3) and residual connection (Eq. 7) further aligns the entity-span representations with entity embeddings.

Algorithm 1 (final step): minimize $\mathcal{L}_{KnowBert} = \mathcal{L}_{BERT} + \sum_{i=1}^{j} \mathcal{L}_{EL_i}$.

3.4 Training Procedure

Our training regime incrementally pretrains increasingly larger portions of KnowBert before fine-tuning all trainable parameters in a multitask setting with any available EL supervision. It is similar in spirit to the "chain-thaw" approach in Felbo et al. (2017) , and is summarized in Alg. 1.

We assume access to a pretrained BERT model and one or more KBs with their entity candidate selectors. To add the first KB, we begin by pretraining entity embeddings (if not already provided from another source), then freeze them in all subsequent training, including task-specific finetuning. If EL supervision is available, it is used to pretrain the KB specific EL parameters, while freezing the remainder of the network. Finally, the entire network is fine-tuned to convergence by minimizing

$$\mathcal{L}_{KnowBert} = \mathcal{L}_{BERT} + \mathcal{L}_{EL}.$$

We apply gradient updates to homogeneous batches randomly sampled from either the unlabeled corpus or EL supervision.
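Sampling homogeneous batches can be sketched as a generator that first picks a task and then draws a whole batch from that task's data. The mixing probability `p_el`, the task labels, and the function name are assumptions for illustration; the text does not specify how often each source is sampled.

```python
import random

def multitask_batches(unlabeled, el_examples, batch_size, steps, p_el=0.2, seed=0):
    """Yield (task, batch) pairs where each batch comes from exactly one source."""
    rng = random.Random(seed)
    for _ in range(steps):
        # Choose the task for this step, then sample the entire batch from it.
        if rng.random() < p_el:
            task, pool = "el", el_examples
        else:
            task, pool = "lm", unlabeled
        yield task, rng.sample(pool, min(batch_size, len(pool)))

unlabeled = [("lm", i) for i in range(100)]
el_examples = [("el", i) for i in range(20)]
schedule = list(multitask_batches(unlabeled, el_examples, batch_size=4, steps=50))
```

A training loop would compute the masked LM and next-sentence losses on an "lm" batch and the entity linking loss on an "el" batch, applying one gradient update per batch.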

To add a second KB, we repeat the process, inserting it in any layer above the first one. When adding a KB, the BERT layers above it will experience large gradients early in training, as they are subject to the randomly initialized parameters associated with the new KB. They are thus expected to move further from their pretrained values before convergence compared to parameters below the KB. By adding KBs from bottom to top, we minimize disruption of the network and decrease the likelihood that training will fail. See Sec. 4.1 for details of where each KB was added.

The entity embeddings and selected candidates contain lexical information (especially in the case of WordNet) that will make the masked LM predictions significantly easier. To prevent leaking into the masked word pieces, we adopt the BERT strategy and replace all entity candidates from the selectors with a special [MASK] entity if the candidate mention span overlaps with a masked word piece.³ This prevents KnowBert from relying on the selected candidates to predict masked word pieces.
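The candidate masking rule, together with the 80%/10%/10% split given in the paper's footnote on this step, can be sketched as below. The data layout, the `vocab` of replacement entities, and the function name are hypothetical; the random draw is made once per overlapping span here, which is one possible reading of the scheme.

```python
import random

MASK_ENTITY = "[MASK]"

def mask_candidates(candidate_spans, masked_positions, vocab, rng):
    """Hide candidates for every span that overlaps a masked word piece."""
    out = []
    for (start, end), entities in candidate_spans:
        overlaps = any(start <= p <= end for p in masked_positions)
        if not overlaps:
            out.append(((start, end), list(entities)))
            continue
        r = rng.random()
        if r < 0.8:
            new = [MASK_ENTITY] * len(entities)         # 80%: mask all candidates
        elif r < 0.9:
            new = [rng.choice(vocab) for _ in entities]  # 10%: random candidates
        else:
            new = list(entities)                         # 10%: leave unchanged
        out.append(((start, end), new))
    return out

spans = [((0, 1), ["Prince_(musician)", "Prince"]), ((3, 3), ["rain.n.01"])]
masked = mask_candidates(spans, masked_positions={1},
                         vocab=["dog.n.01", "Paris"], rng=random.Random(0))
```

Spans that do not touch a masked position pass through untouched, so the linker still sees real candidates for the rest of the sentence.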

4.1 Experimental Setup

We used the English uncased BERT BASE model (Devlin et al., 2019) to train three versions of KnowBert: KnowBert-Wiki, KnowBert-WordNet, and KnowBert-W+W that includes both Wikipedia and WordNet.

KnowBert-Wiki The entity linker in KnowBert-Wiki borrows both the entity candidate selectors and embeddings from Ganea and Hofmann (2017). The candidate selectors and priors are a combination of CrossWikis, a large, precomputed dictionary that combines statistics from Wikipedia and a web corpus (Spitkovsky and Chang, 2012), and the YAGO dictionary (Hoffart et al., 2011). The entity embeddings use a skip-gram-like objective (Mikolov et al., 2013b) to learn 300-dimensional embeddings of Wikipedia page titles directly from Wikipedia descriptions without using any explicit graph structure between nodes. As such, nodes in the KB are Wikipedia page titles, e.g., Prince (musician). Ganea and Hofmann (2017) provide pretrained embeddings for a subset of approximately 470K entities. Early experiments with embeddings derived from Wikidata relations⁴ did not improve results.

We used the AIDA-CoNLL dataset (Hoffart et al., 2011) for supervision, adopting the standard splits. This dataset exhaustively annotates entity links for named entities of person, organization and location types, as well as a miscellaneous type. It does not annotate links to common nouns or other Wikipedia pages. At both train and test time, we consider all selected candidate spans and the top 30 entities, to which we add the special NULL entity to allow KnowBert to discriminate between actual links and false positive links from the selector. As such, KnowBert models both entity mention detection and disambiguation in an end-to-end manner. Eq. 5 was used as the objective.

KnowBert-WordNet Our WordNet KB combines synset metadata, lemma metadata and the relational graph. To construct the graph, we first extracted all synsets, lemmas, and their relationships from WordNet 3.0 using the nltk interface. After disregarding certain symmetric relationships (e.g., we kept the hypernym relationship, but removed the inverse hyponym relationship) we were left with 28 synset-synset and lemma-lemma relationships. From these, we constructed a graph where each node is either a synset or lemma, and introduced the special lemma in synset relationship to link synsets and lemmas. The candidate selector uses a rule-based lemmatizer without part-of-speech (POS) information.⁵ Our embeddings combine both the graph and synset glosses (definitions), as early experiments indicated improved perplexity when using both vs. just graph-based embeddings.

We used TuckER (Balazevic et al., 2019) to compute 200-dimensional vectors for each synset and lemma using the relationship graph. Then, we extracted the gloss for each synset and used an off-the-shelf state-of-the-art sentence embedding method (Subramanian et al., 2018) to produce 2048-dimensional vectors. These are concatenated to the TuckER embeddings. To reduce the dimensionality for use in KnowBert, the frozen 2248-dimensional embeddings are projected to 200 dimensions with a learned linear transformation.

For supervision, we combined the SemCor word sense disambiguation (WSD) dataset (Miller et al., 1994) with all lemma example usages from WordNet⁶ and link directly to synsets. The loss function is Eq. 4. At train time, we did not provide gold lemmas or POS tags, so KnowBert must learn to implicitly model coarse grained POS tags to disambiguate each word. At test time, we restricted candidate entities to just those matching the gold lemma and POS tag, consistent with the standard WSD evaluation.

Training details To control for the unlabeled corpus, we concatenated Wikipedia and the Books Corpus and followed the data preparation process in BERT with the exception of heavily biasing our dataset to shorter sequences of 128 word pieces for efficiency. Both KnowBert-Wiki and KnowBert-WordNet insert the KB between layers 10 and 11 of the 12-layer BERT BASE model. KnowBert-W+W adds the Wikipedia KB between layers 10 and 11, with WordNet between layers 11 and 12. Earlier experiments with KnowBert-WordNet in a lower layer had worse perplexity. We generally followed the fine-tuning procedure in Devlin et al. (2019); see supplemental materials for details.

Table 1: Comparison of masked LM perplexity, Wikidata probing MRR, and number of parameters (in millions) in the masked LM (word piece embeddings, transformer layers, and output layers), KAR, and entity embeddings for BERT and KnowBert. The table also includes the total time to run one forward and backward pass (in seconds) on a TITAN Xp GPU (12 GB RAM) for a batch of 32 sentence pairs with total length 80 word pieces. Due to memory constraints, the BERT LARGE batch is accumulated over two smaller batches.
Table 2: Fine-grained WSD F1.

Table 3: End-to-end entity linking strong match, micro averaged F1.

System                    AIDA-A   AIDA-B
Daiber et al. (2013)      49.9     52.0
Hoffart et al. (2011)     68.8     71.9
Kolitsas et al. (2018)    86.6     82.6
KnowBert-Wiki             80.2     74.4
KnowBert-W+W              82.1     73.7

4.2 Intrinsic Evaluation

Perplexity Table 1 compares masked LM perplexity for KnowBert with BERT BASE and BERT LARGE. To rule out minor differences due to our data preparation, the BERT models are fine-tuned on our training data before being evaluated. Overall, KnowBert improves the masked LM perplexity, with all KnowBert models outperforming BERT LARGE, despite being derived from BERT BASE.

Factual recall To test KnowBert's ability to recall facts from the KBs, we extracted 90K tuples from Wikidata (Vrandečić and Krötzsch, 2014) for 17 different relationships such as companyFoundedBy. Each tuple was written into natural language such as "Adidas was founded by Adolf Dassler" and used to construct two test instances, one that masks out the subject and one that masks the object. Then, we evaluated whether a model could recover the masked entity by computing the mean reciprocal rank (MRR) of the masked word pieces.

[...] of (frozen) parameters in the entity embeddings (Table 1). KnowBert is much faster than BERT LARGE. By taking advantage of the already high capacity model, the number of trainable parameters added by KnowBert is a fraction of the total parameters in BERT. The faster speed is partially due to the entity parameter efficiency in KnowBert, as only a small fraction of parameters in the entity embeddings are used for any given input due to the sparse linker. Our candidate generators consider the top 30 candidates and produce approximately O(number of tokens) candidate spans. For a typical 25 token sentence, approximately 2M entity embedding parameters are actually used. In contrast, BERT LARGE uses the majority of its 336M parameters for each input.
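The mean reciprocal rank used in the factual recall probe can be computed as in the following sketch. The data layout (one ranked prediction list per masked position) is an assumption for illustration, and a gold item missing from the ranked list is scored as zero.

```python
def mean_reciprocal_rank(ranked_predictions, gold):
    """Average of 1 / rank of the gold item; 0 when the gold item is absent."""
    total = 0.0
    for predictions, target in zip(ranked_predictions, gold):
        if target in predictions:
            total += 1.0 / (predictions.index(target) + 1)
    return total / len(gold)

# Hypothetical ranked guesses for three masked word pieces.
ranked = [["adolf", "rudolf"], ["puma", "adidas"], ["nike"]]
gold = ["adolf", "adidas", "reebok"]
mrr = mean_reciprocal_rank(ranked, gold)  # (1 + 1/2 + 0) / 3
```

A model that always ranks the correct word piece first scores 1.0, and the score falls off quickly as the correct answer moves down the list.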

Integrated EL It is also possible to evaluate the performance of the integrated entity linkers inside KnowBert using diagnostic probes without any further fine-tuning. As these were trained in a multitask setting primarily with raw text, we do not a priori expect high performance as they must balance specializing for the entity linking task and learning general purpose representations suitable for language modeling.

Table 2 displays fine-grained WSD F1 using the evaluation framework from [...] and the ALL dataset (combining SemEval 2007 [...]). By linking to nodes in our WordNet graph and restricting to gold lemmas at test time we can recast the WSD task under our general entity linking framework. The ELMo and BERT baselines use a nearest neighbor approach trained on the SemCor dataset, similar to the evaluation in Melamud et al. (2016), which has previously been shown to be competitive with task-specific architectures. As can be seen, KnowBert provides competitive performance, and KnowBert-W+W is able to match the performance of KnowBert-WordNet despite incorporating both Wikipedia and WordNet.

Table 3 reports end-to-end entity linking performance for the AIDA-A and AIDA-B datasets. Here, KnowBert's performance lags behind the current state-of-the-art model from Kolitsas et al. (2018), but still provides strong performance compared to other established systems such as AIDA (Hoffart et al., 2011) and DBpedia Spotlight (Daiber et al., 2013). We believe this is due to the selective annotation in the AIDA data that only annotates named entities. The CrossWikis-based candidate selector used in KnowBert generates candidate mentions for all entities including common nouns from which KnowBert may be learning to extract information, to the detriment of specializing to maximize linking performance for AIDA.

4.3 Downstream Tasks

This section evaluates KnowBert on downstream tasks to validate that the addition of knowledge improves performance on tasks expected to benefit from it. Given the overall superior performance of KnowBert-W+W on the intrinsic evaluations, we focus on it exclusively for evaluation in this section. The main results are included in this section; see the supplementary material for full details.

The baselines we compare against are BERT BASE, BERT LARGE, the pre-BERT state of the art, and two contemporaneous papers that add similar types of knowledge to BERT. ERNIE (Zhang et al., 2019) uses TAGME (Ferragina and Scaiella, 2010) to link entities to Wikidata, retrieves the associated entity embeddings, and fuses them into BERT BASE by fine-tuning. Soares et al. (2019) learn relationship representations by fine-tuning BERT LARGE with large scale "matching the blanks" (MTB) pretraining using entity linked text.

Relation extraction Our first task is relation extraction using the TACRED (Zhang et al., 2017) and SemEval 2010 Task 8 (Hendrickx et al., 2009) datasets. Systems are given a sentence with a marked subject and object, and asked to predict which of several different relations (or no relation) holds. Following Soares et al. (2019), our model uses special entity tokens, including [E1] and [E2], to mark the location of the subject and object in the input sentence, then concatenates the contextual word representations for [E1] and [E2] to predict the relationship. For TACRED, we also encode the subject and object types with special tokens and concatenate them to the end of the sentence.
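The entity marking step for relation extraction can be sketched as follows. The opening markers [E1] and [E2] come from the text; the closing markers [/E1] and [/E2] are an assumption made by analogy with the [E] and [/E] pair used for entity typing, and the function name and span convention (inclusive token indices) are hypothetical.

```python
def mark_entities(tokens, subj, obj):
    """Wrap the subject and object spans (inclusive indices) in marker tokens."""
    out = []
    for i, token in enumerate(tokens):
        if i == subj[0]:
            out.append("[E1]")
        if i == obj[0]:
            out.append("[E2]")
        out.append(token)
        if i == subj[1]:
            out.append("[/E1]")   # assumed closing marker
        if i == obj[1]:
            out.append("[/E2]")   # assumed closing marker
    return out

tokens = "Adidas was founded by Adolf Dassler".split()
marked = mark_entities(tokens, subj=(0, 0), obj=(4, 5))
# Positions whose contextual vectors would be concatenated for classification.
e1_index, e2_index = marked.index("[E1]"), marked.index("[E2]")
```

A classifier head would then take the contextual representations at `e1_index` and `e2_index`, concatenate them, and predict the relation label.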

For TACRED (Table 4), KnowBert-W+W significantly outperforms the comparable BERT BASE systems including ERNIE by 3.5%, improves over BERT LARGE by 1.4%, and is able to match the performance of the relationship specific MTB pretraining in Soares et al. (2019). For SemEval 2010 Task 8 (Table 5), KnowBert-W+W F1 falls between the entity aware BERT BASE model from Wang et al. (2019b), and the BERT LARGE model from Soares et al. (2019).

Table 4: Single model test set results on the TACRED relationship extraction dataset. † with MTB pretraining.
Table 5: Test set F1 for SemEval 2010 Task 8 relationship extraction. † with MTB pretraining.

Words in Context (WiC) WiC (Pilehvar and Camacho-Collados, 2019) is a challenging task that presents systems with two sentences both containing a word with the same lemma and asks them to determine if they are from the same sense or not. It is designed to test the quality of contextual word representations. We follow standard practice and concatenate both sentences with a [SEP] token and fine-tune the [CLS] embedding. As shown in Table 6, KnowBert-W+W sets a new state of the art for this task, improving over BERT LARGE by 1.4% and reducing the relative gap to 80% human performance by 13.3%.

Table 6: Test set results for the WiC dataset (v1.0). †Pilehvar and Camacho-Collados (2019) ††Wang et al. (2019a)
Table 7: Test set results for entity typing using the nine general types from (Choi et al., 2018).

Entity Typing

We also evaluated KnowBert-W+W using the entity typing dataset from Choi et al. (2018). To directly compare to ERNIE, we adopted the evaluation protocol in Zhang et al. (2019) which considers the nine general entity types.⁷ Our model marks the location of a target span with the special [E] and [/E] tokens and uses the representation of the [E] token to predict the type. As shown in Table 7, KnowBert-W+W shows an improvement of 0.6% F1 over ERNIE and 2.5% over BERT BASE.

5 Conclusion

We have presented an efficient and general method to insert prior knowledge into a deep neural model. Intrinsic evaluations demonstrate that the addition of WordNet and Wikipedia to BERT improves the quality of the masked LM and significantly improves its ability to recall facts. Downstream evaluations demonstrate improvements for relationship extraction, entity typing and word sense disambiguation datasets. Future work will involve incorporating a diverse set of domain specific KBs for specialized NLP applications.

[...] the task does not define a standard development split, we randomly sampled 500 of the 8000 training examples for development. The hyperparameter search space was:

• learning rate: [3e-5, 5e-5]

• number epochs: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

with $\beta_2 = 0.98$.

We used the provided semeval2010_task8_scorer-v1.2.pl script to compute F1. The maximum development F1 averaged across the random restarts was 89.1 ± 0.77 (maximum value was 90.5 across the seeds).

WiC WiC is a binary classification task with 7.5K annotated sentence pairs. Due to the small size of the dataset, we found it helpful to use model averaging to reduce the variance in the development accuracy across random restarts. The hyperparameter search space was:

• learning rate: [1e-5, 2e-5, 3e-5, 5e-5]

• number epochs: [2, 3, 4, 5]

• beta 2: [0.98, 0.999]

• weight averaging decay: [no averaging, 0.95, 0.99]

The maximum development accuracy was 72.6.
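The WiC search space above can be enumerated as a grid, and the weight averaging can be read as an exponential moving average of the parameters. Both the dictionary keys and the exponential-moving-average interpretation of "weight averaging decay" are assumptions for this sketch; whether the search was exhaustive or sampled is not stated.

```python
from itertools import product

SEARCH_SPACE = {
    "learning_rate": [1e-5, 2e-5, 3e-5, 5e-5],
    "num_epochs": [2, 3, 4, 5],
    "beta2": [0.98, 0.999],
    "weight_averaging_decay": [None, 0.95, 0.99],  # None means no averaging
}

def grid(space):
    """Enumerate every combination of hyperparameter values."""
    keys = list(space)
    return [dict(zip(keys, values))
            for values in product(*(space[k] for k in keys))]

def ema_update(average, weights, decay):
    """One exponential-moving-average step over a flat list of parameters."""
    return [decay * a + (1.0 - decay) * w for a, w in zip(average, weights)]

configs = grid(SEARCH_SPACE)  # 4 * 4 * 2 * 3 combinations
averaged = ema_update([1.0, 0.0], [0.0, 1.0], decay=0.95)
```

Evaluating with the averaged weights instead of the final weights is one way to reduce the run-to-run variance that the text attributes to the small dataset size.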

Entity typing As described in Section 4.3, we evaluated on a subset of data corresponding to entities classified by nine different general classes: person, location, object, organization, place, entity, object, time, and event. $\beta_2$ was set to 0.98. The hyperparameter search space was:

• learning rate: [2e-5, 3e-5]

• number epochs: [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]

• beta 2: [0.98, 0.999]

• weight averaging decay: [no averaging, 0.99]

The maximum development F 1 was 75.5 ± 0.38 averaged across five seeds, with maximum value of 76.0.

¹ We found a small transformer layer with four attention heads and a 1024 feed-forward hidden dimension was sufficient, significantly smaller than each of the layers in BERT. Early experiments demonstrated the effectiveness of this layer with improved entity linking performance.

² As for the multi-headed entity-span self-attention, we found a small transformer layer to be sufficient, with four attention heads and 1024 hidden units in the MLP.

³ Following BERT, for 80% of masked word pieces all candidates are replaced with [MASK], 10% are replaced with random candidates and 10% left unmasked.

⁴ https://github.com/facebookresearch/PyTorch-BigGraph

⁵ https://spacy.io/

⁶ To provide a fair evaluation on the WiC dataset which is partially based on the same source, we excluded all WiC train, development and test instances.

Table 8: Full results on the Wikidata probing task including all relations.