GenericsKB: A Knowledge Base of Generic Statements
We present a new resource for the NLP community, namely a large (3.5M+ sentence) knowledge base of *generic statements*, e.g., "Trees remove carbon dioxide from the atmosphere", collected from multiple corpora. This is the first large resource to contain *naturally occurring* generic sentences, as opposed to extracted or crowdsourced triples, and thus is rich in high-quality, general, semantically complete statements. All GenericsKB sentences are annotated with their topical term, surrounding context (sentences), and a (learned) confidence. We also release GenericsKB-Best (1M+ sentences), containing the best-quality generics in GenericsKB augmented with selected, synthesized generics from WordNet and ConceptNet. In tests on two existing datasets requiring multihop reasoning (OBQA and QASC), we find using GenericsKB can result in higher scores and better explanations than using a much larger corpus. This demonstrates that GenericsKB can be a useful resource for NLP applications, as well as providing data for linguistic studies of generics and their semantics. GenericsKB is available at https://allenai.org/data/genericskb.
While deep learning systems have achieved remarkable performance trained on general text, NLP researchers frequently seek out additional repositories of general/commonsense knowledge to boost performance further, e.g., (Icarte et al., 2017; Wang et al., 2018; Yang et al., 2019; Peters et al., 2019; Liu et al., 2019; Paul and Frank, 2019). However, there are only a limited number of repositories currently available, with ConceptNet (Speer et al., 2017) and WordNet (Fellbaum, 1998) being popular choices. In this work we contribute a new resource, namely a large collection of contextualized generic sentences, as an additional source of general knowledge, and to help fill gaps in existing repositories. The resource, called GENERICSKB, is the first to contain naturally occurring generic sentences, as opposed to extracted or crowdsourced triples, and thus is rich in high-quality, general, semantically complete statements.
1 GENERICSKB is available at https://allenai.org/data/genericskb
Figure 1: (1) Example generics about "tree" in GENERICSKB, and (2) an example entry, including metadata.
1. Example generics about "tree" in GENERICSKB
Trees are perennial plants that have long woody trunks.
Trees are woody plants which continue growing until they die.
Most trees add one new ring for each year of growth.
Trees produce oxygen by absorbing carbon dioxide from the air.
Trees are large, generally single-stemmed, woody plants.
Trees live in cavities or hollows.
Trees grow using photosynthesis, absorbing carbon dioxide and releasing oxygen.
2. An example entry, including metadata
Term: tree
Sent: Most trees add one new ring for each year of growth.
Quantifier: Most
Score: 0.35
Before: ...Notice how the extractor holds the core as it is removed from inside the hollow center of the bit. Tree cores are extracted with an increment borer.
After: The width of each annual ring may be a reflection of forest stand dynamics. Dendrochronology, the study of annual growth rings, has become prominent in ecology...
Statements in GENERICSKB were culled from over 1.7 billion sentences from three corpora. To collect statements, we first clean the source data, then filter it using linguistic rules to identify likely generics, then apply a BERT-based scoring step to distinguish generics that are meaningful on their own (avoiding generics with contextual meaning such as "Meals are on the third floor"). The resulting KB contains over 3.5M statements, each including metadata about its topic, surrounding context, and a confidence measure. Figure 1 shows some example generics, together with a full entry illustrating the metadata. We also create GENERICSKB-BEST (1M+ sentences), containing the best-quality generics in GENERICSKB plus selected, synthesized generics from WordNet and ConceptNet.
We also report results using GENERICSKB for two tasks, namely question-answering (using the OpenbookQA dataset (Mihaylov et al., 2018)), and explanation generation (using the QASC dataset (Khot et al., 2019)). Our goal is not to build a new model, but to see how an existing model's performance changes when the GENERICSKB corpus replaces a larger corpus for these tasks. We find that GENERICSKB can sometimes produce higher question-answering scores, and always produces better quality explanations. This suggests that GENERICSKB may also have value for other NLP tasks, either standalone or as an additional source of general knowledge to help train models. Finally, independent of deep learning, GENERICSKB may be a valuable resource for those studying generics and their semantics in linguistics.
2 Related Work
A generic statement is one that makes a blanket statement about the members of a category, e.g., "Tigers are striped." 2 Because generics apply to many entities, they are particularly important for reasoning. Although generics are common in language, their semantics has been a topic of considerable debate in linguistics, e.g., (Carlson and Pelletier, 1995; Schubert and Pelletier, 1989; Leslie, 2015; Liebesman, 2011; Schubert and Pelletier, 1987; Leslie, 2011). Rather than repeat that debate here, we note that our primary goal is to collect rather than interpret generics. We hope that our resource can contribute to the study of their semantics.
Several repositories of general knowledge are available already, but with different characteristics and coverage from GENERICSKB, e.g., (Sap et al., 2019; Tandon et al., 2014; Van Durme et al., 2009). ConceptNet (Speer et al., 2017) is perhaps the most used, containing approximately 1M English triples (excluding RelatedTo, Synonym, and [Lexical]FormOf links), or 34M triples total. ConceptNet triples can be rendered as short generics, thus covering just simple (typically three-word) generic statements about 28 relationships. Similarly, WordNet taxonomic and meronymic links express short, specific relationships but leave most uncovered (compare with Figure 1). Triple stores, e.g., (Clark and Harrison, 2009), acquired from open information extraction (Banko et al., 2007), contain larger and less constrained collections of knowledge, but typically with low precision (Mishra et al., 2017), making it difficult to exploit them in practice. GENERICSKB thus fills a gap in this space, containing naturally occurring generic statements that an author considered salient enough to write down.
Table 1: Corpus sizes (# sentences).
Corpus        Original   Cleaned   Filtered
Waterloo      ∼1.7B      ∼500M     ∼3.1M
SimpleWiki    ∼900k      ∼790k     ∼13k
ARC           ∼14M       ∼6.2M     ∼338k
GENERICSKB    ∼1.7B      ∼513M     ∼3.4M
To construct GENERICSKB, sentences were selected from over 1.7B sentences in three corpora (Table 1):
• The Waterloo corpus is 280GB of English plain text, gathered by Charles Clarke (Univ. Waterloo) using a webcrawler in 2001 from .edu domains. It was made available to us and was previously used in (Clark et al., 2016).
• SimpleWikipedia is a filtered scrape of SimpleWikipedia pages (simple.wikipedia.org).
• The ARC corpus is a collection of 14M science and general sentences, released as part of the ARC challenge.
GENERICSKB was then assembled in the following three steps:
As the source corpora originated from web scrapes, they contain noise in various forms, such as blocks of code, non-English text, hyperlinks, and emails. The corpora were cleaned using the following:
• Regular expressions to capture frequently occurring lexical properties of noise.
• Sentence and token length heuristics to filter out malformed sentences.
• Text cleanup using the Fixes Text For You (ftfy) Python library, which fixes various encoding-related errors.
• Language detection using spaCy to filter out non-English text.
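As a rough illustration of the first two cleaning steps, the sketch below combines a few regular expressions with length heuristics. The patterns, thresholds, and the function name `is_clean` are assumptions made for this example, not the ones used to build GENERICSKB. The ftfy and spaCy steps are omitted to keep the example dependency-free, and a crude ASCII-letter ratio stands in for language detection (it only catches text in non-Latin scripts).

```python
import re

# Hypothetical noise patterns; the actual regular expressions are not given in the text.
NOISE_PATTERNS = [
    re.compile(r"https?://\S+|www\.\S+"),         # hyperlinks
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),  # email addresses
    re.compile(r"[{}<>]"),                        # code-like or markup fragments
]

# Assumed length bounds (in whitespace-separated tokens and characters per token).
MIN_TOKENS, MAX_TOKENS = 3, 40
MAX_TOKEN_CHARS = 30
MIN_ASCII_LETTER_RATIO = 0.9


def is_clean(sentence: str) -> bool:
    """Return True if the sentence passes all (illustrative) cleaning checks."""
    text = sentence.strip()
    # Regular-expression check: reject anything that looks like noise.
    if any(pattern.search(text) for pattern in NOISE_PATTERNS):
        return False
    # Sentence-length and token-length heuristics.
    tokens = text.split()
    if not MIN_TOKENS <= len(tokens) <= MAX_TOKENS:
        return False
    if any(len(token) > MAX_TOKEN_CHARS for token in tokens):
        return False
    # Crude stand-in for language detection: require mostly ASCII letters.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return False
    ascii_ratio = sum(ch.isascii() for ch in letters) / len(letters)
    return ascii_ratio >= MIN_ASCII_LETTER_RATIO
```

In a real pipeline the checks would be ordered from cheapest to most expensive, so that the large majority of noisy lines are discarded before any model-based language detection runs.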
no-bad-first-word: Sentence does not start with a determiner ("a", "the",...) or selected other words.
remove-non-verb-roots: Remove if root is a non-verb.
remove-present-participle-roots: Do not consider any present participle roots.
has-no-modals: Sentences containing modals ("could", "would", etc.) are rejected.
all-propn-exist-in-wordnet: All (normalized, non-stop) words are in WordNet's vocabulary.
Figure 2: Example filtering rules. (See supplementary material for the full list.)
We next use a set of 27 hand-authored lexicosyntactic rules to identify standalone generic sentences, and reject others. For example, sentences that start with a bare plural ("Dogs are...") are considered good candidates, while those starting with a determiner ("A man said...") or containing a present participle ("A bear is running...") are not. Similarly, sentences containing pronouns ("He said...") are likely to have contextual rather than standalone meaning, and so are also rejected. A sample of the filtering rules is summarized in Figure 2, and the full list of rules is given in the Appendix. Given the size and redundancy of the initial corpus, these rules aim to filter the corpus aggressively to produce a set of high-quality candidates, rather than catch all possible standalone generics.
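To make the rule-based filtering concrete, here is a minimal sketch of two of the rules named in Figure 2 (no-bad-first-word and has-no-modals), written as simple token checks. The word lists are assumptions for illustration, and the rules that depend on a syntactic parse (for example, those about sentence roots) are not reproduced here because they need a parser.

```python
import re

# Assumed word lists; the paper's exact lists are in its supplementary material.
BAD_FIRST_WORDS = {"a", "an", "the", "this", "that", "these", "those",
                   "he", "she", "it", "they", "we", "i", "you"}
MODALS = {"could", "would", "should", "might"}


def tokenize(sentence: str) -> list:
    """Lowercase the sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", sentence.lower())


def no_bad_first_word(sentence: str) -> bool:
    """Pass if the sentence does not start with a determiner or pronoun."""
    words = tokenize(sentence)
    return bool(words) and words[0] not in BAD_FIRST_WORDS


def has_no_modals(sentence: str) -> bool:
    """Pass if the sentence contains none of the listed modal verbs."""
    return not MODALS.intersection(tokenize(sentence))


# Rule name -> predicate; a sentence must pass every rule to be kept.
RULES = {
    "no-bad-first-word": no_bad_first_word,
    "has-no-modals": has_no_modals,
}


def failed_rules(sentence: str) -> list:
    """Return the names of all rules the sentence fails, in rule order."""
    return [name for name, rule in RULES.items() if not rule(sentence)]


def is_candidate(sentence: str) -> bool:
    """A sentence is a candidate generic only if it fails no rule."""
    return not failed_rules(sentence)
```

Returning the names of the failed rules, rather than a bare boolean, makes it easy to inspect which rule is responsible for rejecting a given sentence when tuning an aggressive filter of this kind.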
Finally, we train and apply a BERT classifier to score sentences by how well they describe a useful, general truth. To build the classifier, a random subset (size 10k) of the 3.4M candidate generics was labeled by crowdworkers as to whether each sentence expressed a useful, general truth about the world (with options yes, no, unsure), guided by examples. Specifically, workers were asked to reject:
(1) Sentences which do not stand on their own, e.g., "Free parking is provided"
(2) Subjective and/or not useful statements, e.g., "Life is too serious, sometimes."
(3) Vague statements, e.g., "All cats are essentially cats."
(4) Statements about people and companies, e.g., "Apple makes lots of iPhones"
(5) Facts that are incorrect in isolation, e.g., "All maps are hand-drawn."
Each fact was annotated twice and the two scores (yes/unsure/no = 1/0.5/0) were averaged. The joint probability of agreement (i.e., that both annotators agreed) was 70.1% (approximately 1/3 of the agreed annotations being "yes", 2/3 "no"), and Cohen's Kappa κ was 0.52 ("moderate agreement"). The dataset was then split 70:10:20 into train:dev:test, and a BERT classifier 3 was fine-tuned on the training set. Each sentence is input simply as [CLS] sentence. The output is pooled, then run through a linear layer which outputs two logits representing the two classes (yes/no), followed by a softmax to obtain class probabilities. This classifier scored 83% on the held-out test set. The classifier was then used to score all 3.4M extracted generic sentences.
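The annotation statistics above can be illustrated with a small sketch that averages the two labels per sentence and computes the joint agreement and Cohen's kappa. The toy label pairs are invented for the example, and computing kappa over all three labels (yes/unsure/no) is an assumption, since the text does not say how "unsure" was treated.

```python
from collections import Counter

# Numeric value of each label, as given in the text (yes/unsure/no = 1/0.5/0).
LABEL_SCORE = {"yes": 1.0, "unsure": 0.5, "no": 0.0}


def averaged_score(label_a: str, label_b: str) -> float:
    """Average the two annotators' numeric scores for one sentence."""
    return (LABEL_SCORE[label_a] + LABEL_SCORE[label_b]) / 2


def joint_agreement(pairs) -> float:
    """Fraction of sentences on which both annotators chose the same label."""
    return sum(a == b for a, b in pairs) / len(pairs)


def cohens_kappa(pairs) -> float:
    """Cohen's kappa: observed agreement corrected for chance agreement.

    Undefined (division by zero) if chance agreement is exactly 1.
    """
    n = len(pairs)
    observed = joint_agreement(pairs)
    counts_a = Counter(a for a, _ in pairs)
    counts_b = Counter(b for _, b in pairs)
    # Chance agreement: probability both pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in LABEL_SCORE) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented toy annotations: (annotator A label, annotator B label) per sentence.
pairs = [("yes", "yes"), ("no", "no"), ("no", "no"),
         ("yes", "no"), ("unsure", "no"), ("no", "no")]

print(averaged_score("yes", "unsure"))
print(joint_agreement(pairs))
print(cohens_kappa(pairs))
```

The averaged scores (0, 0.25, 0.5, 0.75, or 1) are what a calibration such as the one mentioned in footnote 4 would be compared against.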
3.4 GENERICSKB and GENERICSKB-BEST
The final GENERICSKB contains 3,433,000 sentences. We also create GENERICSKB-BEST, comprising GENERICSKB generics with a score > 0.23, 4 augmented with short generics synthesized from three other resources 5 for all the terms (generic categories) in GENERICSKB-BEST. GENERICSKB-BEST contains 1,020,868 generics (774,621 from GENERICSKB plus 246,247 synthesized).
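The assembly of GENERICSKB-BEST described above can be sketched as a threshold filter followed by template-based synthesis. Only the 0.23 threshold comes from the text. The `Generic` record, the sentence templates, the example triples, and the choice to leave synthesized generics unscored are all hypothetical choices made for this illustration.

```python
from dataclasses import dataclass
from typing import Optional

SCORE_THRESHOLD = 0.23  # threshold stated in the text


@dataclass(frozen=True)
class Generic:
    term: str               # topical term (generic category)
    sentence: str
    score: Optional[float]  # classifier confidence; None for synthesized generics (assumption)
    source: str             # "genericskb" or "synthesized"


# Hypothetical templates for rendering (subject, relation, object) triples.
# Article choice ("a" vs "an") and pluralization are deliberately ignored.
TEMPLATES = {
    "isa": "A {subj} is a {obj}.",
    "hasPart": "A {subj} has a {obj}.",
    "usedFor": "A {subj} is used for {obj}.",
}


def build_best(generics, triples):
    """Keep generics scoring above the threshold, then add synthesized
    generics for every term that survives the filter."""
    kept = [g for g in generics if g.score is not None and g.score > SCORE_THRESHOLD]
    terms = {g.term for g in kept}
    synthesized = [
        Generic(subj, TEMPLATES[rel].format(subj=subj, obj=obj), None, "synthesized")
        for subj, rel, obj in triples
        if subj in terms and rel in TEMPLATES
    ]
    return kept + synthesized


# Toy data: the first score is from the Figure 1 entry, the second is invented.
generics = [
    Generic("tree", "Most trees add one new ring for each year of growth.", 0.35, "genericskb"),
    Generic("map", "All maps are hand-drawn.", 0.05, "genericskb"),
]
triples = [("tree", "isa", "plant"), ("map", "isa", "document"), ("tree", "relatedTo", "forest")]

best = build_best(generics, triples)
```

In this toy run only the "tree" generic passes the threshold, so only triples about "tree" with a known template are rendered as sentences.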
For some initial indications of whether GENERICSKB can be useful, we performed two experiments.
We evaluate using GENERICSKB for a question-answering task, namely OpenbookQA (Mihaylov et al., 2018), comparing it to using an alternative, large, publicly available corpus (QASC-17M, (Khot et al., 2019)). For both, we use the BERT-MCQ QA system (Khot et al., 2019). Note that our goal is to evaluate the corpora, not the QA system. The results are shown in Table 2, indicating that using the high-quality version GENERICSKB-BEST can, at least in this case, result in improved QA performance over using the original corpus, even though it is a fraction of the size.
Table 3: Comparative quality of two-hop explanations (sentence chains), generated using two different corpora for two different question sets.
4.2 Explanation Quality
We also experimented with using GENERICSKB-BEST to generate explanations for a (given) answer, where an explanation is a chain of two sentences drawn from the corpus. For example:
What can cause a forest fire? storms because:
Storms can produce lightning AND Lightning can start fires
Good explanations typically use generic sentences, reflecting the underlying formal structure of the explanation. This suggests that a corpus of generics may help in this task.
We test this hypothesis using the QASC dataset. We can do this because the BERT-MCQ system described earlier already finds candidate good chains as part of its retrieval step (Khot et al., 2019) (specifically, it finds pairs of sentences from the corpus that maximally overlap the question, answer, and each other). We can thus collect these chains found using the original QASC-17M corpus, and using GENERICSKB-BEST, and compare quality.
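The idea of choosing a sentence pair that overlaps the question, the answer, and each other can be sketched with a simple word-overlap score. This is only an illustration: the scoring function, stopword list, and crude plural stripping below are assumptions, and the actual BERT-MCQ retrieval step (Khot et al., 2019) is not reproduced here.

```python
import re
from itertools import combinations

# Assumed stopword list for the illustration.
STOPWORDS = {"a", "an", "the", "is", "are", "can", "do", "what",
             "of", "for", "and", "to", "in", "on"}


def content_words(text: str) -> set:
    """Lowercase, drop stopwords, and strip a trailing 's' as a crude stemmer."""
    words = re.findall(r"[a-z]+", text.lower())
    return {
        w[:-1] if w.endswith("s") and len(w) > 3 else w
        for w in words
        if w not in STOPWORDS
    }


def chain_score(question: str, answer: str, first: str, second: str) -> int:
    """Overlap of the chain with the question and answer, plus the overlap
    between the two sentences (the shared 'bridge' words)."""
    target = content_words(question) | content_words(answer)
    s1, s2 = content_words(first), content_words(second)
    return len((s1 | s2) & target) + len(s1 & s2)


def best_chain(question: str, answer: str, corpus):
    """Return the highest-scoring pair of distinct corpus sentences."""
    return max(
        combinations(corpus, 2),
        key=lambda pair: chain_score(question, answer, *pair),
    )


corpus = [
    "Storms can produce lightning",
    "Lightning can start fires",
    "Trees live in cavities or hollows",
    "Now people say it's time to move on",
]
chain = best_chain("What can cause a forest fire?", "storms", corpus)
```

On this toy corpus the selected pair is the forest-fire chain from the example above: it matches "storm" and "fire" in the question and answer, and the two sentences share the bridge word "lightning". Scoring every pair is quadratic in corpus size, so a real system would first retrieve a small candidate set.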
To evaluate these chains, we train a simple BERT-model using the QASC training data, which comes with a gold reasoning chain for every correct answer. We use the gold chains as examples of good chains, and BERT-MCQ-generated chains for incorrect answer options as examples of bad (invalid) chains. We can then use the trained model to evaluate the chains collected earlier.
The results are in Table 3, and indicate that substantially better explanations are generated with GENERICSKB-BEST. The same result was found using the OBQA dataset. In particular, because of the eclectic nature of the QASC-17M corpus, nonsensical explanations can often occur, e.g.:
What do vehicles transport? people because:
What to say what vehicle to use AND Now people say it's time to move on.
compared with the GENERICSKB-BEST explanation:
What do vehicles transport? people because:
A vehicle is transport AND Transportation is used for moving people
Here, the QASC-17M explanation is nonsensical, whereas the explanations produced with GENERICSKB-BEST are more often valid because the corpus is rich in stand-alone generics.
4.3 GENERICSKB Quality
Finally we note that even with filtering, some (undesirable) contextual generics occasionally pass through. Examples include:
• All results are confidential.
• Complications are usually infrequent.
• Democracy Is Four Wolves And A Lamb Voting On What To Have For Lunch.
These examples exhibit ellipsis, vagueness, and metaphor, complicating their interpretation. Ideally, the scoring model would score these low, but this may not always happen: recognizing contextuality often requires world knowledge. For example, consider distinguishing the good, standalone generic "Murder is illegal" from the contextual one "Parking is illegal".
To evaluate the extent of this, two annotators independently annotated 100 random sentences from GENERICSKB-BEST (restricted to those originating from GENERICSKB rather than synthesized) as to whether they represented useful, general truths (the same criterion as in Section 3.3), and found that 85% (averaged over the two annotators) met this criterion. This suggests that such problems are relatively uncommon.
With the growing use of deep learning in NLP, researchers have often sought out additional general knowledge resources to improve their systems. To help meet this need, as well as provide a general resource for linguistics, we have created GENERICSKB, the first large-scale resource of naturally occurring generic statements, as well as an augmented subset GENERICSKB-BEST, including important metadata about each statement.
While GENERICSKB is not a replacement for a Web-scale corpus, we have shown it can assist in both question-answering and explanation construction for two existing datasets. These positive examples of utility suggest that GENERICSKB has potential as a large, new resource of general knowledge for the community. GENERICSKB is available at https://allenai.org/data/genericskb.
2 We also include near-universally quantified statements such as "Most tigers are striped" in GENERICSKB, although their status as generics is sometimes disputed by semanticists.
3 We use the BERT-for-classification package provided by AllenNLP, https://allenai.github.io/allennlpdocs/api/allennlp.models.bert_for_classification.html
4 By calibration, equivalent to an annotator score of 0.5, i.e., more likely good than bad.
5 ConceptNet (isa, hasPart, locatedAt, usedFor); WordNet (isa, hasPart); and the Aristo TupleKB (at https://allenai.org/data/tuple-kb). For WordNet, we use just the most frequent sense for each generic term.