TransOMCS: From Linguistic Graphs to Commonsense Knowledge

Authors

Abstract

Commonsense knowledge acquisition is a key problem for artificial intelligence. Conventional methods of acquiring commonsense knowledge generally require laborious and costly human annotations, which are not feasible on a large scale. In this paper, we explore a practical way of mining commonsense knowledge from linguistic graphs, with the goal of transferring cheap knowledge obtained with linguistic patterns into expensive commonsense knowledge. The result is a conversion of ASER [Zhang et al., 2020], a large-scale selectional preference knowledge resource, into TransOMCS, which has the same representation as ConceptNet [Liu and Singh, 2004] but is two orders of magnitude larger. Experimental results demonstrate the transferability of linguistic knowledge to commonsense knowledge and the effectiveness of the proposed approach in terms of quantity, novelty, and quality. TransOMCS is publicly available at: https://github.com/HKUST-KnowComp/TransOMCS.

1 Introduction

Commonsense knowledge is defined as the knowledge that people share but often omit when communicating with each other. In their seminal work, [Liu and Singh, 2004] defined commonsense knowledge as the knowledge "used in a technical sense to refer to the millions of basic facts and understandings possessed by most people." After around 20 years of development, ConceptNet 5.5 [Speer et al., 2017], built on the original ConceptNet [Liu and Singh, 2004], contains 21 million edges connecting over 8 million nodes. However, most of the knowledge assertions in ConceptNet 5.5 are still facts about entities integrated from other knowledge sources. The core of ConceptNet, which is inherited from the Open Mind Common Sense (OMCS) project [Liu and Singh, 2004], only contains 600K pieces of high-quality commonsense knowledge in the format of tuples, e.g., ('song', UsedFor, 'sing'). The gap between the small scale of existing commonsense knowledge resources and the broad demand of downstream applications motivates us to explore richer approaches to acquire commonsense knowledge from raw text, which is cheaper and more feasible.

Throughout the history of AI, many works were developed to extract various kinds of knowledge from raw text with human-designed linguistic patterns. For example, OpenIE [Etzioni et al., 2008] aims at identifying open relations between different entities (e.g., 'Paris'-CapitalOf-'France') and Hearst patterns [Hearst, 1992] are used to extract hyponyms (e.g., 'apple'-IsA-'fruit'). These patterns are often high-precision, but they typically suffer from brittleness and low coverage. To overcome the limitations of pattern-based methods, supervised commonsense knowledge acquisition methods were proposed. They either model commonsense knowledge acquisition as a knowledge graph completion task, predicting relations between concepts [Li et al., 2016], or as a generation task, leveraging pre-trained language models to generate tail concepts [Bosselut et al., 2019]. However, these approaches are often supervised with expensive annotations and are restricted to the distribution of the training data.

In this paper, we propose to discover new commonsense knowledge from linguistic graphs, whose nodes are words and whose edges are linguistic relations. This is motivated by the observations [Resnik, 1997; Zhang et al., 2019] that selectional preferences over linguistic relations can reflect humans' commonsense about their word choice in various contexts. Here, we use the linguistic graphs from ASER [Zhang et al., 2020], which is extracted from raw text with a dependency parser and explicit discourse connectives, and whose core provides 27 million eventualities extracted with dependency patterns and 10 million discourse relations. We then develop an algorithm for discovering patterns from the overlap of existing commonsense and linguistic knowledge bases, and use a commonsense knowledge ranking model to select the highest-quality extracted knowledge. As a result, we build TransOMCS, a new commonsense knowledge graph.

In summary, our contributions are: (1) We formally define the task of mining commonsense knowledge from linguistic graphs and propose an approach to address it; (2) We construct a large-scale commonsense resource, TransOMCS, whose size is two orders of magnitude larger than OMCS; (3) We conduct both intrinsic and extrinsic experiments to show the value of TransOMCS.

2 Problem Definition

We start by defining the task of mining commonsense knowledge from linguistic graphs. We are given a seed commonsense knowledge set C (which contains m tuples) and a linguistic graph set G (which contains n linguistic graphs G), with m ≪ n. Each commonsense fact is a tuple (h, r, t) ∈ C, where r ∈ R, the set of human-defined commonsense relations (e.g., 'UsedFor', 'CapableOf', 'AtLocation', 'MotivatedByGoal'), and h and t are arbitrary phrases. Our objective is to infer a new commonsense knowledge set C+ (with m+ pieces of commonsense knowledge) from G with the help of C such that m+ ≫ m.
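To make the notation concrete, here is a minimal sketch of the two inputs; the class and variable names are illustrative rather than taken from the released code.

```python
from dataclasses import dataclass
from typing import List, Tuple

# A commonsense fact (h, r, t), e.g. ("song", "UsedFor", "sing").
CommonsenseTuple = Tuple[str, str, str]

@dataclass
class LinguisticGraph:
    """A linguistic graph: nodes are words, edges are linguistic relations."""
    words: List[str]                    # node labels, e.g. ["i", "sing", "a", "song"]
    edges: List[Tuple[int, str, int]]   # (source index, relation label, target index)

# Seed commonsense set C (small) and linguistic graph set G (large), with m << n.
# The goal is to infer a much larger set C+ in the same (h, r, t) format.
C: List[CommonsenseTuple] = [("song", "UsedFor", "sing")]
G: List[LinguisticGraph] = [
    LinguisticGraph(words=["i", "sing", "a", "song"],
                    edges=[(1, "nsubj", 0), (1, "dobj", 3), (3, "det", 2)]),
]
```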

3 Method

3.1 Overview

The proposed framework is shown in Figure 1 . In the beginning, for each seed commonsense tuple (h, r, t) ∈ C, we match and select supporting linguistic graphs that contain all the terms in h and t, and then extract the linguistic patterns for each commonsense relation with the matched commonsense tuple and linguistic graph pairs. Next, we use a pattern filtering module to select the highest quality patterns. Finally, we train a discriminative model to evaluate the quality of the extracted commonsense knowledge. The details are as follows.
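Read as code, the framework is a short pipeline. The sketch below is only a skeleton under the assumption that each module is available as a callable; none of the names come from the authors' implementation.

```python
def build_transomcs(seed_tuples, graphs, extract_patterns, select_patterns,
                    apply_patterns, rank_score):
    """Illustrative composition of the four modules described in this section."""
    patterns = extract_patterns(seed_tuples, graphs)         # Section 3.3: pattern extraction
    selected = select_patterns(patterns)                     # Section 3.4: pattern selection
    candidates = apply_patterns(selected, graphs)            # Section 3.4: knowledge extraction
    return sorted(candidates, key=rank_score, reverse=True)  # Section 3.5: knowledge ranking
```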

Figure 1: The overall framework.

3.2 Knowledge Resources

We first introduce the details of the selected commonsense and linguistic knowledge resources. For the seed commonsense knowledge, we use the English subset of ConceptNet 5.5 [Speer et al., 2017]. Following conventional approaches [Saito et al., 2018; Bosselut et al., 2019], we consider only the relations covered by the original OMCS project [Liu and Singh, 2004], except those with vague meanings (i.e., 'RelatedTo') or those well acquired by other knowledge resources (i.e., 'IsA' 2). In total, 36,954 words, 149,908 concepts, and 207,407 tuples are contained in the selected dataset as C.

Figure 2: Example of linguistic graphs and extracted patterns for different commonsense relations, which are extracted with the matching of words in seed commonsense tuples and the graphs. Given a linguistic graph as the input, these patterns can be applied to extract OMCS-like commonsense knowledge. Extracted head and tail concepts are indicated with blue and red circles respectively.

For the linguistic knowledge resource, we use the core subset of ASER [Zhang et al., 2020] with 37.9 million linguistic graphs 3 to form G.

3.3 Pattern Extraction

Given a matched pair of a commonsense tuple (h, r, t) ∈ C and a linguistic graph G ∈ G, the goal of the pattern extraction module is to find a pattern over linguistic relations such that given r, we can accurately extract all the words in h and t from G. Here, we formally define each pattern P as follows:

Definition 1. Each pattern P contains three components: a head structure p_h, a tail structure p_t, and an internal structure p_i. p_h and p_t are the smallest linguistic sub-graphs in G that cover all the words in h and t, respectively. p_i is the shortest path from p_h to p_t in G.

First, we extract p_h and p_t from G to cover all the words in h and t. We take the head structure as an example. For each word in h, we first find its position in G. To avoid ambiguity, if we find more than one match in G, we discard the current pair and record no pattern. Then, with the position of the first word in h as the start node, we run a breadth-first search (BFS) over G to find a sub-structure of G that covers all and only the words in h. If the BFS finds such a sub-structure, we treat it as p_h; otherwise, we discard this example and return no pattern. We extract the tail structure p_t from G and t in a similar way. After extracting p_h and p_t, we collect the internal structure p_i, which is the shortest path from p_h to p_t over G. To do so, we collapse all the nodes and edges in p_h and p_t into single 'head' and 'tail' nodes, respectively, and then run a new BFS from the 'head' node to find the shortest path to the 'tail' node. We aggregate p_h, p_i, and p_t to form the overall pattern P. Examples of patterns are shown in Figure 2. For each commonsense relation r, we first collect all the commonsense tuples with relation r in C to form the subset C_r. Then, for each (h, r, t) ∈ C_r, we go over the whole G to extract matched patterns with the aforementioned algorithm. The time complexity of the overall algorithm is O(|C| · |G| · N²), where |C| is the size of C, |G| is the size of G, and N is the maximum number of nodes in a graph G.
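The head/tail structure matching can be sketched as follows. This is a simplified illustration of the BFS described above, not the authors' code: it locates each phrase word, discards ambiguous matches, and checks that the phrase words form one connected sub-structure.

```python
from collections import deque
from typing import Dict, List, Optional, Set, Tuple

Edge = Tuple[int, str, int]  # (source word index, relation label, target word index)

def match_structure(words: List[str], edges: List[Edge],
                    phrase: List[str]) -> Optional[List[Edge]]:
    """Sketch of extracting p_h (or p_t): the smallest sub-graph covering exactly
    the words of the phrase. Returns None when a word is missing or ambiguous,
    or when the phrase words are not connected (the pair is then discarded)."""
    # 1. Locate each phrase word; discard the pair on missing or ambiguous matches.
    positions: Dict[str, int] = {}
    for w in phrase:
        hits = [i for i, g in enumerate(words) if g == w]
        if len(hits) != 1:
            return None
        positions[w] = hits[0]
    targets: Set[int] = set(positions.values())

    # 2. BFS from the first phrase word, walking only through phrase-word nodes,
    #    to check that they form one connected sub-structure ("all and only").
    adjacency: Dict[int, List[Edge]] = {}
    for s, lbl, t in edges:
        adjacency.setdefault(s, []).append((s, lbl, t))
        adjacency.setdefault(t, []).append((s, lbl, t))
    start = positions[phrase[0]]
    visited, sub_edges, queue = {start}, [], deque([start])
    while queue:
        node = queue.popleft()
        for s, lbl, t in adjacency.get(node, []):
            nxt = t if node == s else s
            if nxt in targets and nxt not in visited:
                visited.add(nxt)
                sub_edges.append((s, lbl, t))
                queue.append(nxt)
    return sub_edges if visited == targets else None
```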

3.4 Pattern Selection And Knowledge Extraction

We evaluate the plausibility of pattern P regarding commonsense relation r as follows:

EQUATION (1): Not extracted; please refer to original document.

where F(P|r) is the scoring function we use to determine the quality of P regarding r, and P_r is the set of patterns extracted for r. We design F(P|r) as follows:

EQUATION (2): Not extracted; please refer to original document.

where C(P|r) indicates the count of observing P regarding r, L(P) indicates the length of P (to encourage complex patterns), and

$$U(P|r) = \frac{C(P|r)/\sqrt{|C_r|}}{\sum_{r' \in R} C(P|r')/\sqrt{|C_{r'}|}}$$

is the uniqueness score of P regarding r. We select the patterns with a plausibility score higher than 0.05. On average, 2.8 high-confidence patterns are extracted for each relation. After extracting the matched patterns, we apply them to the whole linguistic graph set G to extract commonsense knowledge. For each G ∈ G and each relation r, we go through all the extracted patterns P ∈ P_r. If we find a match of P in G, we extract the words corresponding to p_h and p_t in G as the head words and tail words, respectively. Each extracted head and tail word pair is then considered a candidate knowledge tuple, related via r.
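A hedged sketch of this scoring and selection step. The uniqueness score follows the formula above; since Equations 1 and 2 are not reproduced here, the `combine` callable and the final normalization are placeholders for F(P|r) and the plausibility score, not the paper's exact definitions.

```python
import math
from typing import Callable, Dict

def uniqueness(counts: Dict[str, Dict[str, int]], sizes: Dict[str, int],
               pattern: str, relation: str) -> float:
    """U(P|r) = (C(P|r) / sqrt(|C_r|)) / sum_{r' in R} (C(P|r') / sqrt(|C_{r'}|))."""
    numer = counts[relation].get(pattern, 0) / math.sqrt(sizes[relation])
    denom = sum(counts[r].get(pattern, 0) / math.sqrt(sizes[r]) for r in counts)
    return numer / denom if denom > 0 else 0.0

def plausible_patterns(counts, sizes, lengths, relation,
                       combine: Callable[[int, int, float], float],
                       threshold: float = 0.05):
    """Rank the patterns of one relation and keep those above the 0.05 threshold.
    `combine(C, L, U)` stands in for the unextracted F(P|r), and normalising over
    the relation's patterns stands in for the unextracted Eq. 1 (an assumption)."""
    raw = {p: combine(c, lengths[p], uniqueness(counts, sizes, p, relation))
           for p, c in counts[relation].items()}
    total = sum(raw.values()) or 1.0
    return {p: s / total for p, s in raw.items() if s / total > threshold}
```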

3.5 Commonsense Knowledge Ranking

To minimize the influence of pattern noise, we propose a knowledge ranking module that ranks all extracted knowledge by confidence. To do so, we first ask human annotators to label a small subset of the extracted knowledge and then use it to train a classifier. We assign a confidence score to each piece of extracted knowledge via the learned classifier.

Dataset Preparation

For each commonsense relation type, we randomly select 1,000 tuples 4 to annotate. For each tuple, five annotators from Amazon Mechanical Turk are asked to decide if they think the tuple is a plausible commonsense statement. If at least four annotators vote for plausibility, we label that tuple as a positive example. Otherwise, we label it as a negative example. Additionally, we include all the matched examples from OMCS as positive examples. In total, we obtain 25,923 annotations for 20 commonsense relations. Among these annotations, 18,221 are positive examples and 7,702 are negative examples. On average, each tuple has 12.7 supporting linguistic graphs. We randomly select 10% of the dataset as the test set and the rest as the training set. 5
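The labeling rule (positive if at least four of the five workers vote plausible) is straightforward to express in code; the data layout below is illustrative.

```python
from typing import List, Tuple

def label_tuples(annotations: List[Tuple[Tuple[str, str, str], List[bool]]]):
    """Each entry pairs an extracted (h, r, t) tuple with its five worker votes.
    A tuple is positive iff at least four annotators mark it plausible."""
    positives, negatives = [], []
    for tup, votes in annotations:
        assert len(votes) == 5
        (positives if sum(votes) >= 4 else negatives).append(tup)
    return positives, negatives

# Example: 4 of 5 workers accept the first tuple, only 2 accept the second.
pos, neg = label_tuples([
    (("song", "UsedFor", "sing"), [True, True, True, True, False]),
    (("human", "CapableOf", "have"), [True, True, False, False, False]),
])
```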

Task Definition

The goal of this module is to assign confidence scores to all extracted knowledge so that we can rank it by quality. Since each extracted tuple may have multiple supporting linguistic graphs, we model this as a multi-instance learning problem. Formally, we define the overall annotated knowledge set as K. For each k ∈ K, denote its supporting linguistic graph set as G_k. We use F(k|G_k) to denote the plausibility scoring function of k given G_k, defined as follows:

EQUATION (3): Not extracted; please refer to original document.

where |G_k| is the number of graphs in G_k and f(k|g) is the plausibility score of k given g.
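Since Equation 3 is not reproduced here, the sketch below simply averages the per-graph scores f(k|g) over the supporting set G_k; treat the mean pooling as an assumption rather than the paper's exact aggregation.

```python
from typing import Callable, List, TypeVar

K = TypeVar("K")  # a candidate knowledge tuple
G = TypeVar("G")  # a supporting linguistic graph

def tuple_plausibility(k: K, support: List[G], f: Callable[[K, G], float]) -> float:
    """F(k | G_k): aggregate per-graph plausibility scores f(k|g) over the
    supporting graph set G_k. Mean pooling is assumed here (Eq. 3 not extracted)."""
    if not support:
        return 0.0
    return sum(f(k, g) for g in support) / len(support)
```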

Model Details

The proposed model, as shown in Figure 3, contains three components: a transformer, graph attention, and the final plausibility prediction.

Figure 3: Demonstration of the model. The blue and red colors denote the head words and tail words, respectively. The gray and green colors indicate the other words and the features, respectively.

Transformer: We use the transformer module as the basic layer of our model. Formally, assuming g contains n words w_1, w_2, ..., w_n, we denote the representations of all words after the transformer module 6 as e_1, e_2, ..., e_n. In our model, we adopt the architecture of BERT [Devlin et al., 2019] and use its pre-trained parameters (BERT-base) as the initialization.

Graph Attention: Different from conventional text classification tasks, linguistic relations play a critical role in our task. Thus, our model adopts the graph attention module [Velickovic et al., 2018] to encode the information of these linguistic relations. For each word w in g, we denote its updated representation as ê, which is defined as follows:

$$\hat{e} = \sum_{e' \in N(e)} a_{e,e'} \cdot e',$$

where N(e) is the set of representations of the words connected to w, and a_{e,e'} is the attention weight of e' with regard to e. Here, we define the attention weight as:

$$a_{e,e'} = \frac{\exp\big(\mathrm{NN}_a([e, e'])\big)}{\sum_{\bar{e} \in N(e)} \exp\big(\mathrm{NN}_a([e, \bar{e}])\big)},$$

where [·] indicates vector concatenation and NN_a is the dense neural network we use to predict the attention weight before the softmax. After the graph attention module, the representations of the words are denoted as ê_1, ê_2, ..., ê_n.

Plausibility Prediction: In the last part of our model, we first concatenate e and ê for every word, and then create the head embedding o_head and the tail embedding o_tail by mean pooling over [e, ê] of all words that appear in the head or the tail, respectively. Besides these embedding features, two further features, the graph frequency o_freq (how many times this graph appears) and the graph type o_type (whether this graph is a node or an edge), are also considered for the final plausibility prediction.

We define plausibility prediction function f (k|g) as:

EQUATION (6): Not extracted; please refer to original document.

where NN_p is a fully connected layer we use to predict the plausibility. We use cross-entropy as the loss function and stochastic gradient descent as the optimization method.
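A compact PyTorch-style sketch of the graph attention and plausibility prediction described above. The word representations e are assumed to come from the BERT encoder; because Equation 6 is not reproduced here, the fully connected layer over the concatenated head, tail, frequency, and type features is an assumption about its form.

```python
import torch
import torch.nn as nn

class GraphAttention(nn.Module):
    """ê_i = sum over neighbours j of a_ij · e_j, with a_ij a softmax over NN_a([e_i, e_j])."""
    def __init__(self, dim: int):
        super().__init__()
        self.nn_a = nn.Linear(2 * dim, 1)

    def forward(self, e: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # e: (n, dim) word representations; adj: (n, n) 0/1 adjacency from the graph.
        n = e.size(0)
        pairs = torch.cat([e.unsqueeze(1).expand(n, n, -1),
                           e.unsqueeze(0).expand(n, n, -1)], dim=-1)
        logits = self.nn_a(pairs).squeeze(-1)                  # (n, n) attention logits
        logits = logits.masked_fill(adj == 0, float("-inf"))   # attend to neighbours only
        return torch.softmax(logits, dim=-1).nan_to_num() @ e  # (n, dim)

class PlausibilityModel(nn.Module):
    """Sketch of f(k|g): graph attention over transformer encodings, mean pooling
    over head/tail words, plus frequency and type features; a fully connected
    layer over the concatenation is assumed for Eq. 6."""
    def __init__(self, dim: int):
        super().__init__()
        self.attn = GraphAttention(dim)
        self.nn_p = nn.Linear(4 * dim + 2, 1)

    def forward(self, e, adj, head_mask, tail_mask, o_freq, o_type):
        h = torch.cat([e, self.attn(e, adj)], dim=-1)          # [e, ê] for every word
        o_head = h[head_mask.bool()].mean(dim=0)               # mean pool over head words
        o_tail = h[tail_mask.bool()].mean(dim=0)               # mean pool over tail words
        feats = torch.cat([o_head, o_tail, o_freq.view(1), o_type.view(1)])
        return torch.sigmoid(self.nn_p(feats)).squeeze()

# Training sketch: binary cross-entropy against the crowd labels, optimised with SGD, e.g.
# loss = nn.functional.binary_cross_entropy(model(e, adj, hm, tm, fr, ty), label)
```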

Model Performance

We train the classifier for different relations separately, as some relations are not mutually exclusive (e.g., 'ReceivesAction' and 'UsedFor'), and test our model on our collected data. We compare our model with a random guess baseline and with directly using BERT, which is identical to our model excluding the graph attention part, in Table 1.

4 Intrinsic Evaluation

4.1 Evaluation Metrics

Quantity: We first measure the quantity of knowledge produced by each method. Specifically, we are interested in two metrics: the number of acquired commonsense knowledge tuples and the number of unique words.

Novelty: To measure the novelty of the outputs, we follow [Bosselut et al., 2019] and report the proportion of novel tuples and novel concepts, denoted as Novel_t and Novel_c, respectively. Here 'novel', similar to the definition in [Bosselut et al., 2019], means that the whole tuple or concept cannot be found in the training/development data via string match. For COMET and LAMA, as the head concept is given as input and only the tail concept is generated by the models, we use only the tail concepts to calculate the concept novelty.

Quality: We use Amazon Mechanical Turk to evaluate the quality of the acquired commonsense tuples. For each model, we randomly select 100 tuples from the overall set to test the quality. For each tuple, five workers are invited to annotate whether the tuple is plausible or not. If at least four annotators label the tuple as plausible, we consider it a plausible tuple. Additionally, to investigate the quality of tuples with novel concepts, for each model we also report the performance on all sampled novel tuples. As we use accuracy (# valid tuples / # all tuples) to evaluate the quality, we denote the quality of tuples with novel concepts and of the whole tuple set as ACC_n and ACC_o, respectively. We denote our method as TransOMCS, indicating that we transfer knowledge from linguistic patterns to OMCS-style commonsense knowledge.
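The novelty metrics reduce to string matching against the training/development data; a minimal sketch (names are illustrative):

```python
from typing import Iterable, Set, Tuple

def novelty(tuples: Iterable[Tuple[str, str, str]],
            seen_tuples: Set[Tuple[str, str, str]],
            seen_concepts: Set[str]) -> Tuple[float, float]:
    """Return (Novel_t, Novel_c): the proportion of tuples and of concepts that
    do not appear, by exact string match, in the training/development data."""
    tuples = list(tuples)
    novel_t = sum(t not in seen_tuples for t in tuples) / len(tuples)
    concepts = {c for h, _, t in tuples for c in (h, t)}
    novel_c = sum(c not in seen_concepts for c in concepts) / len(concepts)
    return novel_t, novel_c
```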

4.3 Implementation Details

For both COMET and LAMA, we include two variants in our experiments. The first follows COMET's paper and uses concept/relation pairs from the most confident subset of OMCS as input. Due to the small size of this subset (only 1.2K positive examples), even with 10-beam search decoding we can only generate 12K tuples. To overcome this limitation and to test whether these models can generate large-scale commonsense knowledge, we consider a different evaluation setting where we randomly select 24K concepts from the concepts extracted by our approach and then randomly pair them with commonsense relations as input. We refer to the first and modified settings with the Original and Extended subscripts, respectively.
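The extended setting amounts to pairing randomly sampled concepts with commonsense relations as model inputs; a minimal sketch under that reading:

```python
import random

def extended_inputs(concepts, relations, k=24000, seed=0):
    """Sample k concepts and pair each with a random commonsense relation,
    mirroring the extended evaluation setting (illustrative only)."""
    rng = random.Random(seed)
    sampled = rng.sample(list(concepts), k)
    return [(c, rng.choice(list(relations))) for c in sampled]
```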

Table 1: Performance of different plausibility prediction models.

4.4 Result Analysis

The summary of model evaluations is listed in Table 2. Compared with all baseline methods in their original settings, TransOMCS outperforms them in quantity. Even the smallest subset (top 1%) of TransOMCS exceeds their largest generation strategy (10-beam search) by a factor of ten. TransOMCS also outperforms COMET in terms of novelty, especially the percentage of novel concepts. The reason behind this is that COMET is a pure machine-learning approach and it learns to generate the tail concepts seen in the training set.

Table 2: Main results. For our proposed method, we present both the full TransOMCS and several subsets based on the plausibility ranking scores. TransOMCS_Original denotes the subset whose head/relation pairs appear in the test set used by COMET_Original and LAMA_Original. As LAMA is unsupervised, the novelty metric is not applicable. For our model, for a fair comparison, we exclude all annotated knowledge.
Table 3: Comparison of different commonsense resources and acquisition methods. The quality of COMET depends on the scale of its generated knowledge (high quality on the test set, but low quality in large-scale generation). COMET and TransOMCS require a small number of annotations as supervision.

It is possible that the stronger their models are, the more likely they are to overfit the training data, and the fewer novel concepts they generate.

As for quality, when the training data is similar to the test data, COMET provides the best quality. For example, in the original setting, the greedy decoding strategy achieves 90% overall accuracy. As a comparison, the quality of LAMA is less satisfying, mainly because LAMA is fully unsupervised. Besides that, in the extended setting, the quality of both models drops, mainly because the randomly generated pairs can include more rare words. Compared with them, TransOMCS (top 1%) provides quality comparable to COMET, but at a larger scale. When the quantity is comparable, TransOMCS outperforms both of them in terms of quality. We summarize the comparisons in Table 3.

4.5 Case Study

We show case studies in Figure 4 to further analyze the different acquisition methods. COMET: COMET is the only method that can generate long concepts. At the same time, it also suffers from generating meaningless words. For example, two of the generated concepts end with 'be', which is confusing and meaningless. This is probably because COMET only generates lemmas rather than normal words. Besides that, COMET can overfit the training data: even though the ten outputs are not exactly the same, four of them mean the same thing ('kill others').

Figure 4: Case study of three knowledge acquisition methods. Two head-relation pairs are selected from the original and extended settings respectively. We indicate low quality tuples with confusing face emojis. For each model, we select the top ten most confident results.

LAMA: The most significant advantage of LAMA is that it is unsupervised. However, it has two major drawbacks: (1) it can only generate one-token concepts, which is far from enough for commonsense knowledge;

(2) the quality of LAMA is not as good as the other two methods.

TransOMCS: Compared with COMET, TransOMCS can generate more novel commonsense knowledge. For example, our model knows that 'human' is capable of having cells and creating life. Besides that, unlike LAMA, TransOMCS can generate multi-token concepts. At the same time, our approach also has two limitations: (1) it cannot extract long concepts, for which it is difficult to find an exact pattern match;

(2) as the extraction process strictly follows pattern matching, it can extract semantically incomplete knowledge. For example, it extracts that 'human' is capable of 'have'. The original linguistic graph is "human have something", but as the pattern is '()<-nsubj<-()', the object is missing.

5 Extrinsic Evaluation

We conduct experiments on two downstream tasks: commonsense reading comprehension [Ostermann et al., 2018] and dialogue generation. We select [Wang et al., 2018] and [Luong et al., 2015; Zhang et al., 2020] as our base models for the comprehension and generation tasks, respectively. For a fair comparison, we keep all the model architectures and parameters the same across different trials. We report the results with the common metric of each dataset: accuracy on the comprehension task and the BLEU score [Papineni et al., 2002] on the dialogue generation task. The overall results are shown in Table 4. For the reading comprehension task, adding the top 1% of TransOMCS contributes 0.37 points of overall accuracy, compared to the 0.21 contribution of OMCS. Meanwhile, the contributions of COMET and LAMA are minor for this task. For the dialogue generation task, TransOMCS shows a remarkable improvement in the quality of generated responses. At the same time, adding other knowledge resources to OMCS does not provide any meaningful improvement to the performance.

Table 4: Experimental results on downstream tasks. For COMET and LAMA, we report the performance of their most accurate and largest setting. For TransOMCS, as current models cannot handle large-scale data, we only report the performance of the most confident 1%. All the numbers are computed based on the average of four different random seeds rather than the best seed as reported in the original paper.

The reason behind this could be that COMET and LAMA provide limited high-quality novel commonsense knowledge. For example, the original OMCS on average contributes 1.46 supporting tuples 7 and TransOMCS contributes another 3.36 supporting tuples. As a comparison, COMET_Original, COMET_Extended, LAMA_Original, and LAMA_Extended only provide 0.01, 0.07, 0.49, and 0.01 additional tuples, respectively. This result shows that TransOMCS contains more novel knowledge.

6 Related Work

Commonsense knowledge covers a variety of knowledge types, such as knowledge about the typical locations or causes of events in OMCS [Liu and Singh, 2004], event causes, effects, and temporal properties in ATOMIC and [Zhou et al., 2020], and physical attributes of objects [Elazar et al., 2019]. As an important knowledge resource for AI systems, commonsense knowledge has been shown to be helpful in many downstream tasks such as question answering [Lin et al., 2019] and reading comprehension [Wang et al., 2018]. Conventional commonsense resources (e.g., OMCS and ATOMIC) are often constructed via crowdsourcing, which yields high-quality results but whose expense restricts their scale. Recently, several attempts [Li et al., 2016; Davison et al., 2019] have been made to enrich existing commonsense knowledge by learning to predict new relations between concepts. However, these approaches cannot generate new nodes (concepts). To address this problem, several models were proposed to generate commonsense tuples in either a supervised [Bosselut et al., 2019] or an unsupervised [Petroni et al., 2019] fashion. Different from them, this work shows that linguistic patterns can be extracted for commonsense relations and consequently be used to produce valuable commonsense knowledge.

7 Conclusion

In this paper, we exhibit the transferability from linguistic knowledge (dependency and discourse knowledge) to commonsense knowledge (i.e., OMCS-style knowledge) by showing that a large amount of high-quality commonsense knowledge can be extracted from linguistic graphs. We formally define the task of mining commonsense knowledge from linguistic graphs and present TransOMCS, a commonsense knowledge resource extracted from linguistic graphs into the format of the OMCS subset of ConceptNet, but two orders of magnitude larger than the original OMCS. While TransOMCS is noisier than OMCS, it can still make significant contributions to downstream tasks due to its larger coverage, as evidenced by the extrinsic experiments.

https://github.com/HKUST-KnowComp/TransOMCS

The extraction of 'IsA' relations belongs to the task of hyponym detection, and such knowledge has been well preserved by knowledge resources like Probase [Wu et al., 2012].

As both the internal structure of eventualities and external relations between eventualities could be converted to commonsense knowledge, we treat all the eventualities and eventuality pairs in ASER as the linguistic graphs.

For the relations 'DefinedAs' and 'LocatedNear', as our model only extracted 26 and 7 tuples respectively, we annotate all of them and exclude them when computing the overall accuracy in Section 4.

As the annotated dataset is slightly imbalanced, when we randomly select the test set, we make sure the numbers of positive and negative examples are equal.

For technical details of the Transformer network, please refer to the original paper [Vaswani et al., 2017].

Here, by supporting tuple, we mean that the head and tail concepts appear in the post and the response, respectively.