
COMET-ATOMIC 2020: On Symbolic and Neural Commonsense Knowledge Graphs

Authors

  • Jena D. Hwang
  • Chandra Bhagavatula
  • Ronan Le Bras
  • Jeff Da
  • Keisuke Sakaguchi
  • Antoine Bosselut
  • Yejin Choi
  • AAAI 2021

Abstract

Recent years have brought about a renewed interest in commonsense representation and reasoning in the field of natural language understanding. The development of new commonsense knowledge graphs (CSKG) has been central to these advances as their diverse facts can be used and referenced by machine learning models for tackling new and challenging tasks. At the same time, there remain questions about the quality and coverage of these resources due to the massive scale required to comprehensively encompass general commonsense knowledge. In this work, we posit that manually constructed CSKGs will never achieve the coverage necessary to be applicable in all situations encountered by NLP agents. Therefore, we propose a new evaluation framework for testing the utility of KGs based on how effectively implicit knowledge representations can be learned from them. With this new goal, we propose ATOMIC 2020, a new CSKG of general-purpose commonsense knowledge containing knowledge that is not readily available in pretrained language models. We evaluate its properties in comparison with other leading CSKGs, performing the first large-scale pairwise study of commonsense knowledge resources. Next, we show that ATOMIC 2020 is better suited for training knowledge models that can generate accurate, representative knowledge for new, unseen entities and events. Finally, through human evaluation, we show that the few-shot performance of GPT-3 (175B parameters), while impressive, remains ~12 absolute points lower than a BART-based knowledge model trained on ATOMIC 2020 despite using over 430x fewer parameters.

1 Introduction

Commonsense understanding and reasoning remain longstanding challenges in general artificial intelligence. Meanwhile, large-scale language models have brought tremendous progress to the sub-field of natural language processing. Such large-scale language models (Radford et al. 2018; Devlin et al. 2019; Brown et al. 2020), trained on extreme-scale data, have been shown to effectively adapt to diverse downstream tasks, achieving significant performance gains across natural language benchmarks (Wang et al. 2019). Interestingly, as these models have grown larger (and been trained on larger amounts of data), their benchmark performance has continued to improve (Raffel et al. 2019) despite limited conceptual improvements, leaving open questions regarding the source of these remarkable generalization properties.

Recent work has hypothesized that many of these performance gains could be a result of language models being able to memorize facts in their parameters during training (Roberts, Raffel, and Shazeer 2020) that can be leveraged at evaluation time. As a result, a new paradigm of language models as knowledge bases has emerged (Petroni et al. 2019). In this setting, language models are prompted with natural language prefixes or questions, and they express knowledge through language generation. The initial success of this paradigm for representing commonsense knowledge (Davison, Feldman, and Rush 2019; Tamborrino et al. 2020) has led to the optimistic claim that language models comprehensively encode commonsense knowledge and remove the need for structured knowledge resources.

We take a more skeptical view of this capacity of language models: does scaling up language models actually endow them with commonsense knowledge? While language models can successfully express certain types of knowledge, their best results are observed in narrow conditions. We show (cf. §5) that they perform better when evaluated on knowledge bases that prioritize ontological relations and whose examples resemble language-like assertions (e.g., mango IsA fruit) — an observation supported by Brown et al. (2020)'s GPT-3 model, whose best few-shot performance on commonsense knowledge benchmarks comes on the PhysicalIQA (Bisk et al. 2020) and HellaSwag (Zellers et al. 2019) datasets. Consequently, the types of knowledge that can be directly accessed through the language model's interface remain limited.

However, prior work has also shown that training language models on knowledge graph tuples leads them to learn to express their implicit knowledge directly (Bosselut et al. 2019), allowing them to provide commonsense knowledge on demand. These adapted knowledge models have exhibited promising results on commonsense benchmarks compared with methods that require linking entities to knowledge graphs (Shwartz et al. 2020; Liu et al. 2020). Inspired by these successes, we propose a dual use for commonsense knowledge bases going forward: as static graphs that can be linked to for discrete knowledge access, and as resources for adapting language models to hypothesize commonsense knowledge about un-annotated entities and events.

With this second purpose in mind, we propose evaluating commonsense knowledge resources based on the complementary information they can bring to pretrained language models. We construct ATOMIC2020, a new, high-quality knowledge graph with 1.33M commonsense knowledge tuples across 23 commonsense relations. We compare ATOMIC2020 with respect to its coverage and accuracy against other widely used CSKGs, such as CONCEPTNET (Speer, Chin, and Havasi 2017). Our results show that ATOMIC2020 covers more correct facts about more diverse types of commonsense knowledge than any existing, publicly available commonsense knowledge resource. However, our results also indicate that there remains a large amount of exclusivity between these KGs, highlighting the challenge of creating resources that cover the scale and diversity of general commonsense knowledge.

Furthermore, we formalize the COMET framework of Bosselut et al. (2019) across different seed language models and training knowledge graphs, and evaluate the commonsense knowledge hypothesized by these adapted knowledge models. Our empirical study yields two promising conclusions. First, it confirms that KG-adapted language models learn to express knowledge more precisely than naive language models trained only on language. Second, we show that ATOMIC2020 as a transfer resource leads to COMET models that achieve the largest increase over their seed language model (across all seed LMs) for the commonsense knowledge types it covers, validating the importance of constructing knowledge resources with examples of knowledge not readily found in language models.

Table 1: Relations in ATOMIC2020 along with illustrative examples and their respective sizes. Relations that reflect semantically identical categories to CONCEPTNET are marked with an asterisk (∗).

Key Contributions: In summary, we make three key contributions in this paper. We present ATOMIC2020, a new commonsense knowledge graph covering social, physical, and eventive aspects of everyday inferential knowledge (cf. §3). Next, we compare ATOMIC2020 with other prominent CSKGs head-to-head and show that our new symbolic knowledge graph is more accurate than any current CSKG (see Table 2) (cf. §4). Finally, we show that our new neural knowledge model COMET-ATOMIC2020 successfully transfers ATOMIC2020's declarative knowledge to beat GPT-3, the largest pre-trained language model, in spite of using over 430x fewer parameters (see Table 6) (cf. §5). This demonstrates the utility and importance of the high-quality symbolic knowledge provided by ATOMIC2020 for generalizing to commonsense information that LMs cannot expressively capture on their own (cf. §6).

Table 2: Accuracy - Percentage (%) of tuples in the knowledge base evaluated by human crowdworkers as either always true or likely (Accept), farfetched/never or invalid (Reject), or unclear (No Judgment).
Table 6: Human evaluation of generation accuracy (%). Each model uses greedy decoding to generate the tail of 5K randomly-sampled test prefixes (head, relation) from each knowledge graph. GPT2-XL, GPT-3 and BART have 1.5B, 175B and 406M parameters, respectively.
Figure 1: A tiny subset of ATOMIC2020, a large atlas of social and physical commonsense relations. Relations in the top-left quadrant reflect relations from ATOMIC.

2 Background

Commonsense Knowledge Graphs. Large-scale commonsense knowledge graphs are ubiquitous tools in natural language processing tasks, as access to their facts allows models to learn to reason over commonsense knowledge to make predictions (Lin et al. 2019; Feng et al. 2020). In this work, we evaluate three existing knowledge graphs, CONCEPTNET, ATOMIC, and TRANSOMCS, on their coverage and precision relative to our new resource ATOMIC2020.[3]

The CONCEPTNET (v5.7) knowledge graph (Speer, Chin, and Havasi 2017) consists of 36 relations focusing mostly on taxonomic and lexical knowledge (e.g., RelatedTo, Synonym, IsA) and physical commonsense knowledge (e.g., MadeOf, PartOf). CONCEPTNET (v5.7) contains 3.4M entity-relation tuples (in English) collected by crowdsourcing and merged with existing knowledge databases from DBPedia, WordNet, Wiktionary, and OpenCyc. Since the knowledge is derived from human efforts, the accuracy of CONCEPTNET (v5.7) is fairly high, though the quality does vary depending on the sources of knowledge and relation types. However, as highlighted in (Davis and Marcus 2015; Sap et al. 2019), and shown in Figure 2, the coverage of CONCEPTNET (v5.7) is limited to mostly taxonomic, lexical, and object-centric physical commonsense knowledge. In fact, out of 3.4M tuples, 90% correspond to taxonomic (e.g., IsA) or lexical (e.g., Synonym, RelatedTo) knowledge, making the commonsense portion of CONCEPTNET (v5.7) relatively small.

Figure 2: ATOMIC2020 tuple count distribution compared to ATOMIC (Sap et al. 2019) and CONCEPTNET, either its commonsense subset (Li et al. 2016) or the full set (Speer, Chin, and Havasi 2017).

The ATOMIC (Sap et al. 2019) knowledge graph consists of 880K tuples across 9 relations that cover social commonsense knowledge (e.g., X gets X's car repaired xIntent to maintain the car), including dynamic aspects of events such as causes and effects, if-then conditional statements, and mental states. The ATOMIC dataset is collected and validated completely through crowdsourcing.

The TRANSOMCS (Zhang et al. 2020a) knowledge graph consists of 18.48M tuples that were automatically converted from syntactic parses of sentences from various web sources including Wikipedia, Yelp, and Reddit. The set of relations used for the mapping is copied from CONCEPTNET. Although TRANSOMCS is much larger than other commonsense knowledge graphs, the precision of the extracted knowledge is significantly lower than that of other resources (cf. §4), and TRANSOMCS performs poorly as an adaptation resource relative to other KGs (cf. §5).

Language Models as Knowledge Bases. Recent work hypothesizes that pretrained language models represent commonsense knowledge implicitly (Petroni et al. 2019; Roberts, Raffel, and Shazeer 2020). However, the results motivating these observations are often limited to narrowly scoped subsets of commonsense knowledge that primarily include taxonomic knowledge (e.g., mango IsA fruit) and that are often found explicitly stated in text. Commonsense facts, by contrast, are often only implied (Gordon and Van Durme 2013), and as will be seen in our studies (cf. §4), state-of-the-art neural models struggle to express implicit commonsense knowledge that involves complex relationships.

To overcome this limitation, Bosselut et al. (2019) combine the best of both worlds: commonsense knowledge graphs and pretrained language models. The commonsense transformer, or COMET, adapts pretrained neural language models by training on example tuples from commonsense knowledge graphs. It takes a head/source phrase and a relation (e.g., take a nap Causes) and generates the tail/target phrase (e.g., have energy). Bosselut et al. (2019) show that COMET trained on the CONCEPTNET and ATOMIC knowledge graphs is able to adapt to generate novel (and valid) commonsense knowledge tuples.
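To make the input/output format concrete, here is a minimal sketch of how a tuple can be serialized for a COMET-style model. This is not the authors' released code; the [GEN] delimiter and the exact template are assumptions for illustration.

```python
from typing import Optional

def format_tuple(head: str, relation: str, tail: Optional[str] = None) -> str:
    """Serialize (head, relation[, tail]) into one text sequence.

    During training, the full sequence is the target; at inference,
    the model generates the tail after the prefix.
    """
    prefix = f"{head} {relation} [GEN]"  # [GEN] marks where the tail begins
    return f"{prefix} {tail}" if tail is not None else prefix

# Training example: the model learns to continue the prefix with the tail.
print(format_tuple("take a nap", "Causes", "have energy"))
# -> take a nap Causes [GEN] have energy

# Inference example: prompt with the prefix only and decode the tail.
print(format_tuple("PersonX buys a gift", "xIntent"))
# -> PersonX buys a gift xIntent [GEN]
```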

Importantly, these neural knowledge models can produce commonsense knowledge on demand for any head entity that can be expressed through language. This flexibility allows them to be used out of the box, and they have been applied to new, previously unexplored tasks, such as sarcastic comment generation (Chakrabarty et al. 2020), therapy chatbots (Kearns et al. 2020), and automated story plot generation (Ammanabrolu et al. 2020). These contributions show that progress on knowledge models opens up new downstream applications that were challenging to model before.

3 ATOMIC2020

We present ATOMIC2020, a commonsense knowledge graph with 1.33M everyday inferential knowledge tuples about entities and events. ATOMIC2020 represents a large-scale commonsense repository of textual descriptions that encode both the social and the physical aspects of common human everyday experiences, collected with the aim of being complementary to the commonsense knowledge encoded in current language models. ATOMIC2020 introduces 23 commonsense relation types. They can be broadly classified into three categories: 9 social-interaction commonsense relations, 7 physical-entity commonsense relations, and 7 event-centered commonsense relations concerning situations surrounding a given event of interest. The full inventory of ATOMIC2020 relations is listed in Table 1.

In terms of physical and event-centered commonsense, by far the two largest new relations in ATOMIC2020 are ObjectUse and HinderedBy. For ObjectUse, we focused on affordances of everyday objects, such as a "popcorn bucket" that may be used for "holding popcorn" or "storing things". For HinderedBy, we explore the notion that many events in the real world are defeasible (Lascarides and Asher 1991) by collecting hindrances to goals that may be useful for tasks such as counterfactual reasoning. For example, X's desire to adopt a cat may be hindered by finding out that X is allergic to cats, which would require X to adjust future actions accordingly (say, opt for hypoallergenic options like tortoises).

In the case of ObjectUse, we collected over 130K everyday object-use pairs by asking crowdworkers for necessary objects and their uses for each event in ATOMIC2020. For example, given "X eats popcorn" we elicited items such as "popcorn bucket" along with their various expected uses. This number also reflects atypical usages gathered in a separate pass, where workers were asked to provide creative or resourceful but feasible uses of the objects. Given "popcorn bucket", for instance, one might "wear it as a hat" for, say, a costume party. For HinderedBy, we crowdsourced over 100K tuples of hindrances to existing ATOMIC2020 events, asking the workers to provide situations or events that might act as deterrents should the event be considered an achievable goal (see Appendix A for further details). For social-interaction commonsense, we primarily incorporated tuples from ATOMIC, but also crowdsourced an additional 34K tuples using the same approach as Sap et al. (2019).

ATOMIC2020 also pulls commonsense tuples from the English subset of CONCEPTNET (v5.7), the latest version available (Speer, Chin, and Havasi 2017); a CONCEPTNET (v5.7) fact is considered English if both the head and tail concepts are marked with '/en/' in the edge id. Of the 3.4M English tuples in CONCEPTNET (v5.7), a small subset of 172K tuples was selectively chosen for integration into ATOMIC2020. This subset represents data carefully identified to reflect commonsense information dealing with qualitative human experiences. Among the eliminated data are tuples with edge weight ≤ 0.5, dictionary- or etymology-based knowledge (e.g., synonyms/antonyms, inflections), hyper/hyponymic lexical relationships such as IsA or InstanceOf, and relations based on lexical co-occurrence (e.g., RelatedTo or LocatedNear), which are easily recoverable from language models.[5] After selective removal of these relations and a post-processing step to remove deterministic information such as geographic facts (e.g., "shenzhen" AtLocation "china"), tuples from each CONCEPTNET relation were examined for further splits or joins to align with the existing structure of ATOMIC2020. A random 10% of tuples from each selected relation were then put through crowdsourced validity testing (akin to the process described later in §4). Tuples that were directly incorporated without further edits passed with an acceptance rate of 93% or higher. A subset of relations (i.e., CapableOf, MadeUpOf, HasProperty) was put through additional crowdsourcing to weed out tuples that were either invalid or found to hold prejudiced descriptions of human entities. In the end, only 5 relations (marked with an asterisk in Table 1) retain CONCEPTNET's original meaning, with a few relations that are cognates in ATOMIC2020 (more details in Appendix A).

4 Symbolic Knowledge Graph Comparison

In this work, we compare our new ATOMIC2020 knowledge graph to three other prominent CSKGs: ATOMIC (Sap et al. 2019), CONCEPTNET[6] (Li et al. 2016), and TRANSOMCS (Zhang et al. 2020a). We measure the accuracy of tuples in each KG and compare the coverage of each CSKG w.r.t. the others head-to-head.

Accuracy Assessment

Human Evaluation. In order to assess the accuracy of the represented knowledge, 3K random instances were extracted from each of the knowledge graphs and presented in the form of (head, relation, tail) for a crowdsourced evaluation. To expedite the human assessment of the tuples, each relation (e.g., xWant or AtLocation) was translated into a human-friendly natural language form (e.g., "as a result, PersonX wants" and "located or found at/in/on", respectively; cf. Appendix B). The workers were asked to rate the tuples along a 4-point Likert scale: always/often (the knowledge assertion presented is always or often true), sometimes/likely (it is sometimes or likely true), farfetched/never (it is false or farfetched at best), and invalid (it is invalid or makes no sense). Tuples receiving the former two labels are counted as Accept and the latter two as Reject. The workers were also given the choice to opt out of assessment if the concepts were too unfamiliar for a fair evaluation (No Judgment). Each task (HIT) included 5 tuples of the same relation type, and each tuple was labeled by 3 workers; for the results, we take the majority vote among the 3 workers.
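The label aggregation described above can be summarized with a short sketch. The tie-breaking rule is an assumption, since three workers and three collapsed labels can split 1/1/1:

```python
from collections import Counter

ACCEPT = {"always/often", "sometimes/likely"}
REJECT = {"farfetched/never", "invalid"}

def collapse(rating: str) -> str:
    """Map a 4-point Likert rating (or an opt-out) to a coarse label."""
    if rating in ACCEPT:
        return "Accept"
    if rating in REJECT:
        return "Reject"
    return "No Judgment"

def aggregate(ratings: list) -> str:
    """Majority vote over one tuple's 3 worker ratings."""
    votes = Counter(collapse(r) for r in ratings)
    label, count = votes.most_common(1)[0]
    # Tie-breaking is not specified in the text; defaulting a 1/1/1
    # split to No Judgment is an assumption.
    return label if count >= 2 else "No Judgment"

print(aggregate(["always/often", "sometimes/likely", "invalid"]))  # Accept
```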

Results. ATOMIC2020 outperforms the other KGs in crowdsourced accuracy, as shown in Table 2. ATOMIC ties with CONCEPTNET at reasonably high accuracy, while TRANSOMCS lags behind the others with far lower accuracy. Overall inter-rater agreement, measured by Fleiss' κ, is 0.46. We provide a per-relation breakdown of accuracies in Table 3.

Table 3: KG accuracy values broken down by relation. Gray cells indicate statistically significant difference from ATOMIC2020 values. Dark gray cells signal instances where ATOMIC2020 values are significantly higher than its counterpart KB. Relational cognates have been grouped together and exact matches are asterisked (*) (cf. Table 1).

Between ATOMIC2020 and ATOMIC, the variations in the assessed accuracies are not statistically significant. Among the ATOMIC2020 and CONCEPTNET relations that represent exact matches (marked with * in Table 3), the differences are either not statistically significant or, when they are, ATOMIC2020 improves upon the associated facts, reflecting that the preprocessing stages of the CONCEPTNET integration helped improve the quality of these relations (§3). Among cognates in ATOMIC2020 and CONCEPTNET relations, two sets of relations fare significantly worse in ATOMIC2020 than in CONCEPTNET. In the case of ObjectUse/UsedFor, this is likely because ATOMIC2020's ObjectUse includes atypical affordances (cf. §3). In an annotation setting where workers are asked to evaluate the truth or likelihood of an assertion rather than the feasibility of a use, a portion of the atypical usages are seen as 'farfetched' and thus rejected. In the case of MadeUpOf/MadeOf, there may be some room for improvement for ATOMIC2020. Unlike ATOMIC2020's HasSubEvent label, which successfully joins CONCEPTNET's HasFirstSubevent/HasLastSubevent labels for improved accuracy, ATOMIC2020's MadeUpOf, a union of MadeOf, PartOf, and a subset of HasA, does not seem to have resulted in improved quality. The rest of the ATOMIC2020 cognates see significantly higher or similar accuracy in comparison to CONCEPTNET.

Coverage Assessment

We make a pairwise comparison between the CSKGs to assess their coverage with regard to the commonsense knowledge they contain. For a reliable head-to-head comparison, we map relations and tuples between the various KGs.

Mapping Relations. Since ATOMIC2020 is built on existing ATOMIC relations, we primarily need to align relations between ATOMIC2020 and CONCEPTNET. We carefully align them based on the label definitions supplied by the two graphs; the resulting alignment was then verified by randomly sampling approximately 20 instances per relation.

Mapping Tuples. In order to resolve syntactic differences in how the concepts are expressed in each of the KGs (e.g., ATOMIC's "PersonX eats breakfast" vs. CONCEPTNET's "eat breakfast"), we preprocess each of the head and tail concepts of each tuple in each KG in the following manner: (1) the concept is lowercased and stripped of extra spaces, punctuation, and stopwords; (2) any exact tuple duplicates within each KB are removed; and (3) the remaining content words are lemmatized according to their POS category. For ATOMIC and ATOMIC2020, an extra step removes mentions of "PersonX", "PersonY", and "PersonZ" if they occur at the beginning of a string, and replaces them with 'person' if they occur elsewhere (e.g., "PersonX greets PersonY").
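A rough re-implementation of this normalization might look as follows. NLTK is an assumption here (the paper does not name its tooling), and its tokenizer, stopword, WordNet, and tagger resources are assumed downloaded:

```python
import re
import string
from nltk import pos_tag, word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

LEMMATIZER = WordNetLemmatizer()
STOPWORDS = set(stopwords.words("english"))

def wordnet_pos(tag: str) -> str:
    """Map a Penn Treebank tag to a WordNet POS for lemmatization."""
    return {"J": "a", "V": "v", "R": "r"}.get(tag[0], "n")

def normalize(phrase: str, is_atomic: bool = False) -> str:
    if is_atomic:
        # Drop a leading PersonX/Y/Z; replace other mentions with 'person'.
        phrase = re.sub(r"^Person[XYZ]\b", "", phrase)
        phrase = re.sub(r"Person[XYZ]\b", "person", phrase)
    tokens = word_tokenize(phrase.lower().strip())
    tokens = [t for t in tokens
              if t not in STOPWORDS and t not in string.punctuation]
    return " ".join(LEMMATIZER.lemmatize(tok, wordnet_pos(tag))
                    for tok, tag in pos_tag(tokens))

print(normalize("PersonX eats breakfast", is_atomic=True))  # eat breakfast
print(normalize("eat breakfast"))                           # eat breakfast
```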

Metrics. We use two metrics to evaluate the coverage of knowledge graphs. For each pair of CSKGs, we compute precision and recall with respect to a target KG. Coverage precision assesses the proportion of tuples in the source KG that are correct according to tuples in the target KG. Coverage recall reflects the proportion of tuples in the target KG that the tuples in the source KG successfully recall.
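Treating each KG as a set of normalized tuples, the two metrics reduce to set operations, as in the sketch below. The actual comparison tolerates many-to-one matches across relation mappings (which is why a recall value in Table 5 can exceed 100%); exact set intersection is a simplification:

```python
def coverage_precision(source: set, target: set) -> float:
    """Fraction of source-KG tuples also found in the target KG."""
    return len(source & target) / len(source)

def coverage_recall(source: set, target: set) -> float:
    """Fraction of target-KG tuples recovered by the source KG."""
    return len(source & target) / len(target)

src = {("eat breakfast", "xNeed", "wake up"), ("mango", "IsA", "fruit")}
tgt = {("eat breakfast", "xNeed", "wake up"), ("bread", "MadeUpOf", "wheat")}
print(coverage_precision(src, tgt))  # 0.5
print(coverage_recall(src, tgt))     # 0.5
```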

Results. Tables 4 and 5 show that ATOMIC2020 has the widest coverage: it recalls all of ATOMIC (as expected) and just under half of CONCEPTNET. There is very little overlap between ATOMIC and CONCEPTNET, which is unsurprising, as ATOMIC focuses on social behaviors that CONCEPTNET does not cover, while CONCEPTNET leans toward physical commonsense that falls outside ATOMIC's scope. Overall, TRANSOMCS intersects very little with any of the other three KGs.

5 Neural Knowledge Graph Comparison

Language models are powerful tools for representing knowledge, but their ability to serve as generative knowledge bases is limited by the fact that they are trained to represent the distribution of language. Previous work shows that knowledge graphs can help language models better transfer as knowledge engines (Bosselut et al. 2019) by re-training them on examples of structured knowledge. As a result, knowledge graphs take on a new purpose: helping language models generalize to hypothesizing knowledge tuples.

Experimental Setup. To evaluate whether knowledge graphs can help language models effectively transfer to knowledge models, we train different pretrained language models on the knowledge graphs described in Section 4.

GPT2 (Radford et al. 2019) is a Transformer-based (Vaswani et al. 2017) language model. In our experiments, we use the largest GPT2 model, GPT2-XL, which has 1.5B parameters. We fine-tune GPT2-XL on each of our CSKGs to predict the tail of a tuple (e.g., wheat) given the head (e.g., bread) and a relation (e.g., MadeUpOf). The hyperparameter settings used for training are described in more detail in Appendix C. Additionally, we use GPT2-XL in a zero-shot setting as a baseline to measure the effect of transfer learning on knowledge graphs. For a fair comparison, we manually convert each relation to an English language prompt, expecting the tail of each tuple as the output generated by the model.

BART (Lewis et al. 2020) is a Bidirectional and Autoregressive Transformer, an adaptation of BERT (Devlin et al. 2019) that is better suited for natural language generation (e.g., translation, summarization). Additional training details are provided in Appendix C.

GPT-3 (Brown et al. 2020) is an autoregressive language model that has 175B parameters (over 100x more than GPT2-XL) and is trained on a corpus of web text. We use the GPT-3 API to prime the language model to generate the tail for a given prefix, i.e., a (head, relation) pair. Thus, GPT-3 is evaluated in a few-shot setting. Additional details of our implementation are provided in Appendix C.
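For illustration, the zero-shot conversion of relations into English prompts can be sketched as below. These template strings are paraphrases, not the exact prompts, which follow the human-readable relation forms in Appendix B:

```python
# Hypothetical template strings; the actual prompts follow the
# human-readable relation forms listed in the paper's appendix.
TEMPLATES = {
    "xWant":      "{head}. As a result, PersonX wants",
    "AtLocation": "{head} is located or found at",
    "MadeUpOf":   "{head} is made up of",
    "ObjectUse":  "{head} is used for",
}

def to_prompt(head: str, relation: str) -> str:
    return TEMPLATES[relation].format(head=head)

print(to_prompt("bread", "MadeUpOf"))  # "bread is made up of"
# The language model's continuation of this prompt is scored as its
# zero-shot prediction of the tail (e.g., "wheat").
```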

Table 4: Coverage Precision - Average number of times (in %) a tuple in Source KB is found in Target KB.
Table 5: Coverage Recall - Average number of times (in %) a tuple in Target KB is found in Source KB. †This value is greater than 100 because multiple tuples in ATOMIC2020 can map to the same tuple in ATOMIC.

Evaluation Setup. To assess language-to-knowledge transfer capabilities, we evaluate how language models generalize to new, unseen entities, concepts, or events. We split each knowledge graph into training, validation, and test sets such that the heads of the knowledge tuples do not overlap between these sets. This adversarial split forces the language models to generalize the relationships they learn from training on the knowledge graphs to entities learned during language pretraining. Also, to avoid overpopulating the validation and test sets with generic heads (e.g., "I", "You", "He", "We", and "They" collectively account for over 2.2M tuple heads in TRANSOMCS), we enforce that the head of any knowledge tuple in the dev and test sets is involved in at most 500 tuples. Finally, we remove low-quality tuples from TRANSOMCS by imposing a confidence score of ≥ 0.5. We score the tuples generated by these knowledge models using common evaluation metrics for text generation: BLEU (Papineni et al. 2002), ROUGE (Lin 2004), CIDEr (Vedantam, Lawrence Zitnick, and Parikh 2015), and BERTScore (Zhang et al. 2020b). For a subset of 5,000 generated tuples from the test set of each knowledge graph, we also run the same human evaluation described in Section 4.

Table 7: Automated metrics for the quality of the tail generations of the GPT2-XL language model and the knowledge models COMET(GPT2-XL) and COMET(BART). Each approach uses greedy decoding on 5K sampled test prefixes for each KG. The 5K prefixes correspond to the ones used for the human evaluation. Similar results are obtained on the full test sets (cf. Appendix C).
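A sketch of the head-disjoint split under stated assumptions (the split ratios here are illustrative; the paper does not report them in this passage):

```python
import random
from collections import defaultdict

def split_by_head(tuples, dev_frac=0.1, test_frac=0.1,
                  max_head_tuples=500, seed=0):
    """Head-disjoint train/dev/test split of (head, relation, tail) tuples."""
    by_head = defaultdict(list)
    for head, rel, tail in tuples:
        by_head[head].append((head, rel, tail))
    heads = sorted(by_head)
    random.Random(seed).shuffle(heads)
    n_dev = int(dev_frac * len(heads))
    n_test = int(test_frac * len(heads))
    train, dev, test = [], [], []
    for i, h in enumerate(heads):
        group = by_head[h]
        if len(group) > max_head_tuples or i >= n_dev + n_test:
            train.extend(group)   # generic heads always stay in train
        elif i < n_dev:
            dev.extend(group)
        else:
            test.extend(group)
    return train, dev, test
```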

Results. We present our main results in Tables 6 and 7. First, we note the large divide between the zero-shot GPT2-XL model, which produces commonsense knowledge without any fine-tuning, and the two COMET models across the ATOMIC2020, ATOMIC, and CONCEPTNET knowledge graphs (Table 6). This large gap indicates that language models can benefit from learning facts from commonsense knowledge graphs; they do not have the means to precisely express this knowledge from pretraining on language alone. This observation is supported by the gaps between these models in the automatic evaluations (Table 7) as well. Additionally, human evaluation of GPT-3 (Table 6) shows a ∼12 point deficit compared to the performance of COMET(BART), in spite of GPT-3 (175B) having over ∼430 times more parameters than COMET(BART) (406M). Similarly, we see a large gap in performance across all automated metrics in Table 7. This performance gap indicates that high-quality declarative knowledge is valuable even after the advent of extreme-scale language models.

In addition to this main result, two particularly interesting observations emerge. First, we note that the gap between the zero-shot model and COMET is larger on the ATOMIC2020 and ATOMIC knowledge graphs than on CONCEPTNET, supporting the observation that ATOMIC2020 covers categories of knowledge that are more difficult to learn from pretraining. Second, the results of the human evaluation show that COMET models trained on TRANSOMCS are not able to generalize knowledge to new entities, implying that language models benefit more from accurate knowledge examples, which TRANSOMCS lacks (cf. §4).

6 Discussion

Do pretrained language models already encode commonsense knowledge? Our conclusions on this subject are mixed and hinge on the ambiguous meaning of what it means to encode knowledge. Despite the conclusions of prior work (Petroni et al. 2019; Roberts, Raffel, and Shazeer 2020; Tamborrino et al. 2020), our results in Table 6 are clear that language models fail to express large varieties of knowledge when prompted for it in a zero-shot manner. When converted to COMET models by training on a knowledge graph, their performance at hypothesizing knowledge tuples skyrockets: a 47.9% absolute difference between COMET(BART) and GPT2-XL on ATOMIC2020. However, the evaluation tuples are adversarially selected to not include head entities that were in the training set. The model must generalize its learned representations of relations to entities it has not observed these relationships for during fine-tuning, meaning the representation of these entities is formulated solely from learning language. As a result, language models may still encode this knowledge in their parameters, even if they are not capable of expressing it directly. With this framing in mind, the COMET training paradigm proposed by Bosselut et al. (2019) can perhaps be viewed less as a means of learning knowledge from KGs, and more as a method of learning an interface for language models to hypothesize encoded knowledge through language generation. We look forward to future work in this space that attempts to disentangle these two ideas.

What considerations should be made when designing commonsense knowledge resources? Based on our results in Section 5, we outline desiderata for the design and development of future commonsense knowledge graphs. Because certain types of knowledge are already encoded and expressible by pretrained language models, CSKG designers should focus on collecting examples and categories of knowledge that are less likely to be known by language models. For example, of the 378 test tuples containing the HinderedBy relation that were evaluated for the GPT2-XL zero-shot model, only 1.3% were deemed plausible by human raters, jumping to 85% plausibility for COMET(BART), pointing to an advantage in constructing ATOMIC2020 with this relationship in mind (see Appendix C for per-relation accuracy).

Second, commonsense knowledge resources should be designed with the goals of accuracy and relationship coverage in mind. Because language models exhibit powerful adaptation (Brown et al. 2020), they can generalize many commonsense relationships as long as they have examples on which to train. Consequently, we should construct commonsense resources that encapsulate larger numbers of relations, so that the knowledge in pretrained language models can be grounded to a variety of relationships. However, language models also benefit from learning from precise examples. Being able to train on a large collection of examples from TRANSOMCS (see Appendix C) did not allow COMET models to generalize to unseen entities, as these examples were not of sufficient quality (see Table 2). Resources should be carefully validated for the quality of their facts, an example set by Speer, Chin, and Havasi (2017) and Sap et al. (2019).

7 Conclusion

In this work, we formalize a use for commonsense knowledge graphs as transfer learning tools for pretrained language models. With this new purpose, we hypothesize that commonsense knowledge graphs should be designed to contain knowledge that is not already easily expressible by language models (e.g., not taxonomic and lexical knowledge). Consequently, we propose ATOMIC2020, a novel commonsense knowledge graph containing tuples whose relations are specifically selected to be challenging for pretrained language models to express. Our empirical studies demonstrate that ATOMIC2020 contains high-accuracy knowledge tuples across multiple novel relations not found in existing CSKGs or expressible by LMs. Furthermore, we show that ATOMIC2020 can be effectively used as a training set for adapting language models as knowledge models to generate high-quality tuples on demand.

A ATOMIC2020 Details

ATOMIC2020 Relations

In this section we detail the relations in ATOMIC2020. Figure 3 shows the hierarchical breakdown of the ATOMIC2020 relation labels. While there is no internal structure directly encoded in ATOMIC2020 relations, they fall into three natural categories based on their meaning: physical-entity, social-interaction, and event-centered commonsense.

Figure 3: ATOMIC2020 relations organized into a hierarchical structure.

Physical-Entity Commonsense. Physical-entity commonsense deals with inferential knowledge about common entities and objects. Physical commonsense such as this is crucial for interacting with the world: it allows us to distinguish the dangerous (e.g., "fire can be painful") from the harmless (e.g., "teddy bears are comforting"), manipulate objects for our use (e.g., "helmets protect the head"), and solve problems (e.g., "how do I open this door?"). We identify seven relations under this category.

• ObjectUse describes everyday affordances or uses of objects, and includes both typical and atypical uses. For example, "popcorn bucket" can typically be used to "hold popcorn" but it could also serve "as a hat" in atypical situations.

• MadeUpOf and HasProperty, two property relations, denote the relationship between an entity and its composition or characteristics. MadeUpOf describes a part, portion, or makeup of an entity. For example, "cake" can be MadeUpOf "eggs" (composition/ingredient) or "icing" (part/portion). Similarly, HasProperty usually describes an entity's general characteristics, such as "rose" is "red," or subjective attributes, such as "thirst" is "uncomfortable." In certain cases, the relation can also map to descriptors that speak to the substance or value of items, such as "meat" having the property of being "stored in the freezer" or "bike" being "powered by person's legs."

• AtLocation is a spatial relation that describes the location in/on/at which an entity is likely to be found (e.g., "gambler" can be found in "casino," "wrench" can be found in "garage").

• CapableOf is designed to describe the abilities and capabilities of everyday living entities (e.g., humans, animals, insects) and natural entities that can exert a force (e.g., sun, storms). CapableOf includes general capabilities, such as a "human" is capable of "thinking and reasoning" or "drinking coffee." It also includes specialized capabilities, such as a "surgeon" is capable of "operating on a patient."

• Desires and NotDesires are relations that deal with the desires[8] of sentient entities; e.g., "doctors" likely desire to "cure patient" but do not desire a "malpractice suit."

Social-Interaction Commonsense. Social-interaction relations comment on socially-triggered states and behaviors. Social commonsense is useful for gauging people's intentions and purpose, and predicting situationally-relevant human reactions and behaviors. Following the definitions for ATOMIC relations (Sap et al. 2019) , we identify a total of nine relations within this category.

• Three mental state relations address the emotional or cognitive states of the participants in a given event.

xIntent defines the likely intent or desire of an agent (X) behind the execution of an event. Given the head "X gives Y gifts," an xIntent might be that X wanted "to be thoughtful." Relations xReact and oReact define the emotional reactions on the part of X or other participants in an event. As a result of gift giving, X might feel "good about [one]self" and others (in this case, Y) might feel "appreciated."

• Five behavioral relations address the socially relevant responses to an event. xNeed describes a precondition for X achieving the event. For example, in order for X to give Y gifts, X must first "buy the presents." xWant and oWant are postcondition desires on the part of X and others, respectively. As a result of X giving Y gifts, X may also desire "to hug [Y]" and Y may want to "open the gift." xEffect and oEffect are social actions that may occur after the event: X may "get hugged" and Y may "blush" in response.

• The last relation xAttr describes X's persona or attribute as perceived by others given an event. In the gift giving example, X may be seen as "generous" or "giving." In contrast, in an event such as "X steals a car," X may be perceived as "evil."

Event-Centered Commonsense. While social-interaction commonsense gauges human behaviors and mental states given an event, the event-centered commonsense provides intuitions about how common events are related to one another. Commonsense about event interaction is useful for understanding likely causes and effects of events in the world. This knowledge allows humans to strategize and explore the best solutions for their objectives, make contingency plans, and revise goals when circumstances deviate from expectation. There are seven relations that fall under this category.

• We group three relations under force dynamics.[9] This group conceptualizes dynamic interactions between events with regard to exerted causal forces and impelled actions. Causes specifically captures the causal relation between two events or entities; e.g., an "accident" can cause "injury." Causes has some overlap with behavioral relations such as xEffect in that both are postconditions of an event, but the postcondition in Causes is not socially triggered and can exist outside human control (e.g., "bad weather" causes "power outages"). HinderedBy introduces hindrances that obstruct the natural path to the achievement of a goal. For example, the event "X adopts a cat" can be obstructed if "X is allergic to cats." xReason provides a post-fact explanation of the cause of an event (e.g., why one has to "walk" could be explained by "car has broken down"), which is related to, but distinct from, xIntent's intentions (i.e., "X walks" because X wanted to "go home").

• Three relations provide reasoning about event scripts or sequences. isAfter and isBefore introduce events that can precede or follow an event, respectively. For example, "X is in a hurry to get to work" can happen after "X wakes up 15 minutes late" and before "X drives too fast." These relations are distinguished from the behavioral relations xNeed (precondition) and xEffect (postcondition) in that isAfter and isBefore are temporally situated without specific regard to the needs or reactions of person X. For example, "X pushes X's luck" can happen before "X gets a broken nose," but getting a broken nose is not an action X intentionally takes after pushing one's luck. The relation HasSubEvent provides the internal structure of an event, each tail denoting a step within the larger head event.

• The last relation in the event-centered category, isFilledBy, provides a filler phrase for an event with a blank that is sensical and commonly acceptable for the event. For example, the blank in an event such as "X catches ___ in the act" can be commonly filled by entities such as a "cheater," a "burglar," or a "mouse."

ATOMIC2020 Tuples

In this section, we detail the population of the ATOMIC2020 tuples (see Table 1 for counts per relation).

Social-Interaction Tuples. For social-interaction relations, we incorporated 877K tuples from ATOMIC, and crowdsourced an additional 34K tuples using the same approach as Sap et al. (2019) . The rest of this section will refer to the head events in the social-interaction tuples as base events.

Crowdsourced Tuples. Tuples for the relations ObjectUse, HinderedBy, isFilledBy, isBefore, and isAfter were crowdsourced via Amazon Mechanical Turk. We paid an average of $15 an hour for our crowdsourcing efforts. We release all crowdsourcing templates as part of our codebase.[10]

• For the collection of HinderedBy, we crowdsourced over 100K event-hindrance tuples by prompting the workers with base events from ATOMIC and eliciting reasons why one may not be able to achieve the event. In order to make the prompt events more worker-friendly, we processed the events as desires (e.g., "X adopts a cat" → "X wants to adopt a cat").[11] We specifically elicited personal causes (e.g., "X is allergic to cats"), situational causes (e.g., "there are no pet stores nearby"), and social causes (e.g., "X's landlord disallows pets").

• 33K isFilledBy tuples were collected by presenting workers with base events. The workers were asked to provide two (and up to four) common objects or entities that would make sense in the sentence.

• 46K tuples for isBefore and isAfter were collected together as sequences of events. Given a base event, the workers were asked to write a short 3-sentence story by providing a preceding and a following event. The workers were given the option to opt out of writing a story if they felt the given event did not make enough sense to build a story around.

Table 8: CONCEPTNET relations mapped to ATOMIC2020 relations. For labels mapping to multiple ATOMIC2020 relations, the one that received the majority mapping is bolded.

• As discussed in the main text (§3), 130K ObjectUse tuples were crowdsourced by eliciting common objects and their uses for every event in the collected event sequences. For each event, a worker was asked to provide 2-4 common items that would be needed during the displayed event. Atypical ObjectUse was collected in a second pass, where for each unique collected object, the workers were prompted with the object and asked to provide an atypical, creative, or resourceful use for the item shown.

Integration of CONCEPTNET Tuples. The tuples for the remaining relations are populated through the integration of the commonsense portion of CONCEPTNET. As discussed in the main text, a select subset of CONCEPTNET (v5.7) tuples (172K) was integrated into ATOMIC2020. The primary challenge in integrating CONCEPTNET tuples into ATOMIC2020 was identifying knowledge that is most likely to reflect commonsense information. CONCEPTNET (v5.7) contains tuples built not only on concept relationships directly sourced from human informants, but also on information pulled from other lexical sources such as WordNet (Miller 1995) and DBpedia (Auer et al. 2007), which automatically extracts knowledge from Wikipedia articles (Speer, Chin, and Havasi 2017). As a result, even those relations designed to primarily represent commonsense knowledge (i.e., the OMCS relations) include, among the mix, tuples that reflect factual or lexical co-occurrence knowledge. These examples deviate from the type of knowledge we would ideally consider "commonsense," i.e., qualitative experiential knowledge gained through subjective observation of and interaction with the world. Relations such as InstanceOf ("is instance/example of") stand as a case in point (e.g., "tortilla" is an example of "flatbread" or "toffee" is an example of "candy"). While included within the OMCS relations, the encoded information can be hard to distinguish from the more accepted taxonomic relations such as IsA ("is a kind/type of").[12] Relationships found in relations such as RelatedTo and DistinctFrom are too underspecified with regard to the meaning they represent, and for other relations such as LocatedNear, and negative forms such as NotCapableOf or NotHasProperty, the relationships amount to general lexical relationships.

Thus, the process of CONCEPTNET (v5.7) knowledge selection (described in §3) was judiciously guided by three competing priorities: when possible, we prioritized (1) qualitative commonsense over factual knowledge, (2) general knowledge over highly specific knowledge (e.g., personal names), and (3) meanings that are specific enough to be meaningfully categorized. Since the ideal route of verifying imported data via crowdsourcing can be resource-intensive, we opted for an approach whereby relations were first selected based on the data they represent; tuples were then pruned based on heuristics that leverage lexical and syntactic information of the concepts. As mentioned in the main text, 10% of the data selected for integration was validated by crowdworkers, yielding a greater than 93% acceptance rate. Three relations, namely HasProperty, CapableOf, and MotivatedByGoal, were sent for instance-by-instance crowdsourcing for the purpose of debiasing human-related descriptions and subdividing semantically distinct elements within the category (e.g., MotivatedByGoal mapped to xIntent and xReason). The resulting CONCEPTNET-to-ATOMIC2020 relation mapping is shown in Table 8.
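The pruning heuristics described above can be illustrated with a small sketch; the field names and the exact exclusion list are assumptions based only on the relations named in this section:

```python
EXCLUDED_RELATIONS = {
    # lexical / taxonomic / co-occurrence relations named above
    "Synonym", "Antonym", "IsA", "InstanceOf", "RelatedTo",
    "DistinctFrom", "LocatedNear", "NotCapableOf", "NotHasProperty",
}

def keep_tuple(edge: dict) -> bool:
    """Heuristic filter for CONCEPTNET(v5.7) edges before integration."""
    is_english = "/en/" in edge["head_id"] and "/en/" in edge["tail_id"]
    return (is_english
            and edge["weight"] > 0.5                 # drop weight <= 0.5
            and edge["relation"] not in EXCLUDED_RELATIONS)

edge = {"head_id": "/c/en/wrench", "tail_id": "/c/en/garage",
        "relation": "AtLocation", "weight": 2.0}
print(keep_tuple(edge))  # True
```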

B Symbolic Knowledge Graph Details

Human Evaluation

Qualifying Crowdsource Workers. To ensure high-quality annotations, we qualified a pool of 173 workers through a paid qualification task that tested their ability to follow directions and provide reasonable answers. The qualification test contained 6 manually selected tuples from ATOMIC and CONCEPTNET, including both easy and tricky relations to annotate. A worker was qualified if they provided 100% acceptable answers; workers providing 5 of 6 correct answers were also accepted, but only if they provided a reasonable written substantiation for their incorrect choice. Workers were paid an average of $15 per hour for their evaluations.

Human Readable Relation Templates. Since the KB relation labels are rather telegraphic on their own, we used human-readable language forms (based on ATOMIC2020 and CONCEPTNET definitions) for prompt display in the crowdsourced evaluations. The complete list is available in Table 9.

Table 9: Human readable templates for each relation used for crowdsourced human evaluations.

Kb Accuracy & Coverage

In Table 2, what types of tuples generally end up with No Judgment? Tuples receiving No Judgment fall into three general categories: (1) either the head or the tail concept is too specialized for the workers to judge without consulting a reference (e.g., "klebsiella" is part of "bacteria," "drug cocktail" made of "nucleoside reverse transcriptase inhibitor"); (2) concepts refer to highly specific entities or referents (e.g., "singh" capable of "bring key," "falkland island islas malvinas" part of "argentina"); and (3) Reject candidates that workers decided to hedge on (e.g., "dandelion" used for "love," "democrat" desires "matter"). Such tuples are mostly found in TRANSOMCS, which shows a high fraction of No Judgment labels and an Accept rate less than half that of ATOMIC2020 (see Table 2).

Does the accuracy ratings breakdown for each KB provide further insights? A closer look at the raw accuracy ratings shows an interesting emergent rating pattern across KBs (Table 2). For all KBs with the exception of TRANSOMCS, we observe that the majority of social-interaction Accept labels originate from the sometimes/likely rating. This preference is not seen in the physical-entity tuples, which show a slightly higher tendency toward the always/often rating. For event-centered tuples, ATOMIC2020 favors sometimes/likely, while CONCEPTNET does not. TRANSOMCS shows its highest proportions for the sometimes/likely and invalid ratings, and its patterns are invariant across the board.

One additional point to mention is that ATOMIC2020 social-interaction and event-centered tuples proportionally contain more human-crowdsourced commonsense knowledge than the physical-entity category, which, with the sole exception of ObjectUse, consists of tuples integrated from the CONCEPTNET graph. The observation that much of the knowledge in ATOMIC2020 is sometimes or likely true reflects our intentional efforts to deprioritize factual information in favor of qualitative commonsense knowledge. More importantly, it shows that most of the knowledge within the ATOMIC2020 graph can be, under the right circumstances, defeasible. That is, one can pose a likely hypothesis that a hindrance to "X writes stories" is that "X can't read"; however, such a hypothesis can be defeated if we also know that X has written stories before. We find such context-dependent ambiguities more compelling, as certainties may be better covered by language models.

C Neural Knowledge Graph Details

Dataset Split. Table 10 reports the number of tuples for each three-way split (train/dev/test) of each knowledge graph. The ATOMIC2020 split preserves the splits from ATOMIC and CONCEPTNET: any tuple in ATOMIC2020 that appears in the train (resp. dev, test) set of ATOMIC or CONCEPTNET belongs to the train (resp. dev, test) set of ATOMIC2020. Overall, ATOMIC2020 provides over 50% more tuples than the initial version of ATOMIC.

Details about GPT2-XL Training. Each tuple is encoded as a single text sequence in which special delimiter tokens mark the start and end of the tail for a given head entity/event and relation. At inference time, the head and relation of a tuple are given as input, and the model's generation following the [GEN] token is recorded as its prediction of the tail entity. We fine-tuned GPT2-XL on each CSKG for one epoch, using a batch size of 32 and a learning rate of 5e-5 on an Nvidia RTX-8000 GPU. The final trained models for each CSKG will be publicly released as part of our code release.
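A minimal sketch of this fine-tuning setup using the HuggingFace transformers library is shown below. The [EOS] tail-end delimiter and the full-sequence loss are simplifying assumptions; only [GEN] and the hyperparameters (1 epoch, batch size 32, learning rate 5e-5) come from the text:

```python
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
# [GEN] is described in the text; [EOS] as a tail-end delimiter is assumed.
tokenizer.add_special_tokens({"additional_special_tokens": ["[GEN]", "[EOS]"]})
model = GPT2LMHeadModel.from_pretrained("gpt2-xl")
model.resize_token_embeddings(len(tokenizer))

def encode(head: str, relation: str, tail: str):
    return tokenizer(f"{head} {relation} [GEN] {tail} [EOS]",
                     return_tensors="pt")

batch = encode("bread", "MadeUpOf", "wheat")
# Full-sequence LM loss; masking the loss to tail tokens only would be
# closer to the described setup.
loss = model(**batch, labels=batch["input_ids"]).loss
loss.backward()
```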

Table 10: Number of tuples per KB and per split.
Table 11: Automated metrics for the quality of the tail generations for the knowledge models COMET(GPT2-XL) and COMET(BART). Each approach uses greedy decoding for all test prefixes for each KG. Similar results were obtained on the 5K sampled prefixes that were randomly selected for the human evaluation (see Table 7).

Details about BART Training. BART (Lewis et al. 2020) is a denoising sequence-to-sequence pretrained language model. Similar to previous transformer-based language models (Devlin et al. 2019), BART's pretraining objective is to recover its input, which is corrupted through various strategies such as token and span masking and sentence permutation. For pretraining, BART uses a 160GB free-text dataset drawn from news, books, stories, and web text. We used the BART-large version of the model from HuggingFace's implementation (Wolf et al. 2019), which has 24 layers, 1024-dimensional hidden states, 16 attention heads in its self-attention layers, and 406M total parameters. For hyperparameter search, we fine-tuned BART on each commonsense knowledge graph for one epoch with batch sizes {64, 32, 16}, learning rates {1e-3, 1e-5, 1e-7}, and three random seeds.
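The reported sweep amounts to a small grid, sketched below; `finetune_bart` is a hypothetical stand-in for the actual training routine, and the seed values are not reported:

```python
from itertools import product

batch_sizes = [64, 32, 16]
learning_rates = [1e-3, 1e-5, 1e-7]
seeds = [0, 1, 2]  # three seeds; the actual values are not reported

for bs, lr, seed in product(batch_sizes, learning_rates, seeds):
    # finetune_bart(kg="atomic2020", epochs=1, batch_size=bs,
    #               learning_rate=lr, seed=seed)  # hypothetical routine
    print(f"batch_size={bs}, lr={lr}, seed={seed}")
```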

Details about GPT-3 Evaluation. We evaluate GPT-3 (Brown et al. 2020) using OpenAI's language completion API. As with the zero-shot evaluation of GPT2-XL, we use templates to evaluate the ability of the language model to generate a tail given the head and relation, using the same templates as for GPT2-XL. For priming, we prime the model with 5 examples of heads and tails per relation, randomly selected from the training set. We ran 3 random seeds for selecting priming examples, to limit the effect of spelling mistakes and other fragments from data collection. We used a temperature of 0.4.
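A sketch of the priming setup, assuming the legacy OpenAI Completion API of that era; the prompt layout and the `build_prompt` helper are illustrative assumptions, while the 5 priming examples per relation and the temperature of 0.4 come from the text:

```python
import openai  # legacy (<1.0) openai-python client

def build_prompt(examples, head, relation):
    """examples: list of 5 (head, relation, tail) priming tuples."""
    lines = [f"{h} {r} {t}" for h, r, t in examples]
    lines.append(f"{head} {relation}")  # the model completes the tail
    return "\n".join(lines)

prompt = build_prompt(
    [("bread", "MadeUpOf", "wheat")] * 5,  # 5 priming examples per relation
    "cake", "MadeUpOf",
)
# response = openai.Completion.create(
#     engine="davinci", prompt=prompt, temperature=0.4, max_tokens=16)
```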

Additional Automated Evaluation. In order to allow a direct comparison between automated and human evaluations, we report in Section 5 the automated metrics on the same test subsets that were used for human evaluation. For completeness, in this section we provide the automated evaluation results on the full test sets (Table 11). These results confirm the findings of Section 5.


Figure 4: Percentage distribution of raw accuracy ratings broken down by KB (i.e., breakdown of Table 2). From left to right are the ratings for social-interaction tuples, physical-entity tuples, and event-centered tuples. We use the CONCEPTNET-to-ATOMIC2020 relation mappings (shown in Table 8) to categorize CONCEPTNET and TRANSOMCS relations into the three categories. For multiple mappings, we map the CONCEPTNET/TRANSOMCS labels to the majority mapped label (in bold in Table 8). Note that the latter two figures do not include ATOMIC as that KB only includes social-interaction relations.
Figure 5: Example generations of models on relations from ATOMIC2020. Red, purple and green rows represent social-interaction commonsense, event-centered commonsense, and physical-entity commonsense, respectively.

D Additional Reproducibility Items

All experiments were conducted on a cluster with 8 GPUs of type NVIDIA Quadro RTX 8000, each with 48 GB of GDDR6 memory. To allow replication of results, whenever possible, a default, fixed value was assigned to the random seed that initializes the pseudo-random number generator, as specified in the source code. The details of the model experiments (i.e., GPT2-XL and BART), including their hyperparameter settings, are described in Appendix C. All the data as well as the source code required for conducting the experiments will be made publicly available upon publication.

Figure 6: A snapshot of commonsense knowledge relationships in ATOMIC2020. Gray nodes represent events. Red, purple, and green nodes represent social-interaction commonsense, event-centered commonsense, and physical-entity commonsense, respectively. The remaining colors represent intersections of the categories.

Footnotes

[3] We were unable to include Cyc (Lenat 1995) in our study due to the discontinuation of its research license and the cost of the commercial license (over $1M). CONCEPTNET includes a subset of Cyc, OpenCyc.

[5] CONCEPTNET 5.7 defines weight as "the strength with which this edge expresses this assertion." A pilot crowdsourced assessment found any tuple with weight ≤ 0.5 unreliable w.r.t. its validity.

[6] Hereafter, as we focus on CSKGs, by CONCEPTNET we refer to the commonsense subset, unless specified otherwise.

[8] Since desire relations are about cognitive states of sentient beings, they also provide a degree of commonsense about social-interaction. However, we point out that these relations indicate generic characterizations of animate entities rather than describing situationally based cognitive mental states (e.g., X being 'encouraged' only applies to the event it is situated in). For this reason, we include these relations under physical-entity commonsense.

[9] For a discussion of force dynamics in the cognitive linguistic and lexical semantic literature, cf. Herskovits (2009); Landau and Jackendoff (1991); Talmy (1988).

[10] http://anonymous

[11] To achieve this, we removed modal verbs, lemmatized the head verb of the sentence, and inserted a 'want to' phrase before the verb.

[12] In fact, CONCEPTNET(v5.7) recognizes the similarities between IsA and InstanceOf and has accordingly deprecated InstanceOf in favor of IsA. Nevertheless, InstanceOf is still found in CONCEPTNET(v5.7).