
CommonGen: A Constrained Text Generation Challenge for Generative Commonsense Reasoning

Authors

  • Bill Yuchen Lin
  • Minghan Shen
  • Wangchunshu Zhou
  • Pei Zhou
  • Chandra Bhagavatula
  • Yejin Choi
  • Xiang Ren
  • Findings of EMNLP 2020

Abstract

Recently, large-scale pre-trained language models have demonstrated impressive performance on several commonsense reasoning benchmark datasets. However, building machines with commonsense to compose realistically plausible sentences remains challenging. In this paper, we present a constrained text generation task, CommonGen, associated with a benchmark dataset, to explicitly test machines for the ability of generative commonsense reasoning. Given a set of common concepts (e.g., dog, frisbee, catch, throw), the task is to generate a coherent sentence describing an everyday scenario using these concepts (e.g., "a man throws a frisbee and his dog catches it"). The CommonGen task is challenging because it inherently requires 1) relational reasoning with background commonsense knowledge and 2) compositional generalization ability to work on unseen concept combinations. Our dataset, constructed through a combination of crowdsourcing and existing caption corpora, consists of 77k commonsense descriptions over 35k unique concept-sets. Experiments show that there is a large gap between state-of-the-art text generation models (e.g., T5) and human performance (31.6% vs. 63.5% in the SPICE metric). Furthermore, we demonstrate that the learned generative commonsense reasoning capability can be transferred to improve downstream tasks such as CommonsenseQA (76.9% to 78.4% in dev accuracy) by generating additional context.

1 Introduction

Commonsense reasoning, the ability to make acceptable and logical assumptions about ordinary scenes in our daily life, has long been acknowledged as a critical bottleneck of artificial intelligence and natural language processing (Davis and Marcus, 2015). Most recent commonsense reasoning challenges, such as CommonsenseQA (Talmor et al., 2019), SocialIQA (Sap et al., 2019b), WinoGrande (Sakaguchi et al., 2019), and HellaSwag (Zellers et al., 2019b), have been framed as discriminative tasks, i.e., AI systems are required to choose the correct option from a set of choices based on a given context. While significant progress has been made on these discriminative tasks, we argue that commonsense reasoning in text generation poses a distinct and complementary challenge. In this paper, we advance machine commonsense towards generative reasoning ability.

1 Our code and data can be found at http://inklab.usc.edu/CommonGen/. Work in progress.

[Figure 1]
Concept-Set: a collection of objects/actions.
Expected Output: everyday scenarios covering all given concepts.

Concept-Set: {dog, frisbee, catch, throw}
[Humans]
- A dog leaps to catch a thrown frisbee.
- The dog catches the frisbee when the boy throws it.
- A man throws away his dog's favorite frisbee expecting him to catch it in the air.
[Machines]
- GPT-2: A dog throws a frisbee at a football player.
- UniLM: Two dogs are throwing frisbees at each other.
- BART: A dog throws a frisbee and a dog catches it.
- T5: dog catches a frisbee and throws it to a dog

Concept-Set: {exercise, rope, wall, tie, wave}
[Humans]
- A man in a gym exercises by waving ropes tied to a wall.
- The gym owner decided to tie a rope to the wall so people could make a wave in it for exercise.
[Machines]
- GPT-2: A woman is tied up in a rope and swinging a wave at a wall.
- UniLM: A man with a rope and tie is doing some exercise on a wall.
- BART: A man is tied to a rope and is waving his arms and doing exercises on the wall.

Figure 1: An example from our COMMONGEN dataset. GPT-2 (Radford et al., 2019), UniLM (Dong et al., 2019), BART (Lewis et al., 2019) and T5 (Raffel et al., 2019) are large pre-trained text generation models, finetuned on the proposed task.

Generative Commonsense Reasoning

Humans acquire the ability to compose sentences by learning to understand and use common concepts that they recognize in their surrounding environment (Tincoff and Jusczyk, 1999). The acquisition of such an ability is regarded as a significant milestone of human development (Moore, 2013). Can machines acquire such generative commonsense reasoning ability? To initiate the investigation, we present COMMONGEN, a novel constrained generation task that requires machines to generate a sentence describing a day-to-day scene using concepts from a given concept-set. For example, given the set of concepts {exercise, rope, wall, tie, wave}, machines are required to generate a sentence such as "a man in a gym exercises by waving ropes tied to a wall." To successfully solve the task, models need two key capabilities: a) relational reasoning and b) compositional generalization. Grammatically sound sentences may not always be realistic, as they might violate our commonsense (e.g., "a dog throws a frisbee ..." in Fig. 1). In order to compose a plausible sentence that describes an everyday scenario, models need to construct a grammatical sentence while adhering to and reasoning over the commonsense relations between the given concepts. Models additionally need compositional generalization ability to infer about unseen concept compounds. This encourages models to reason about a potentially infinite number of novel combinations of familiar concepts, an ability believed to be a limitation of current AI systems (Lake and Baroni, 2017; Keysers et al., 2020).

Therefore, in support of the COMMONGEN task, we present a dataset consisting of 29,599 concept-sets associated with 49,129 sentences. We explicitly design our dataset collection process to capture the key challenges of relational reasoning and compositional generalization described above. We establish comprehensive baseline performance for state-of-the-art language generation models. The best model, based on T5 (Raffel et al., 2019), achieves 36.60% in the SPICE metric, a significant gap compared to human performance of 63.50%, demonstrating the difficulty of the task. Our analysis shows that state-of-the-art models struggle at the task, generating implausible sentences, e.g., "dog throws a frisbee ...", "give massage to a table", etc., pointing to interesting future research directions for the community.

2 Task Formulation And Challenges

We formulate the proposed COMMONGEN task with mathematical notations and discuss its inherent challenges with concrete examples.

The input is an unordered set of k concepts x = {c_1, c_2, ..., c_k} ∈ X (i.e., a concept-set), where each concept c_i ∈ C is a common object (noun) or action (verb). We use X to denote the space of all possible concept-sets and C to denote the concept vocabulary (a subset of ConceptNet's single-word concepts). The expected output is a simple, grammatical sentence y ∈ Y that describes a common scenario in our daily life using all given concepts in x (morphological inflections of the concepts are allowed). A scenario can depict either a static situation or a short series of actions. The task is to learn a structured predictive function f : X → Y that maps a concept-set x to a sentence y. The unique challenges of this task come from two major aspects.

Relational Reasoning with Commonsense. Generative reasoners are expected to prioritize the most plausible scenes over an infinite number of less plausible ones. Recall the first illustrative example in Figure 1: the underlying knowledge is implicit and compositional: (a) dogs love to perform tricks with humans, (b) catching a frisbee is such a trick, and (c) humans love to play this game with dogs. For the other example in Section 1, {exercise, rope, wall, tie, wave}, we also need to compose the following commonsense facts: (i) doing exercise costs energy, (ii) waving a rope costs energy, and (iii) the rope is more useful for this exercise when it is tied to a wall.

In order to complete a scenario, a generative commonsense reasoner also needs to reasonably associate additional concepts (e.g., 'gym' and 'man') as agents or environments, so that the resulting scenario is natural and coherent in our daily life.

This not only requires understanding the underlying commonsense relations between concepts, but also incrementally composing them towards a globally optimal scenario. The underlying reasoning chains are inherently based on a variety of background knowledge such as spatial relations, object properties, physical rules, temporal event knowledge, social conventions, etc. However, such knowledge may not be recorded in any existing knowledge base.

Compositional Generalization. Humans can compose a sentence to describe a scenario about concepts they may have never seen co-occurring. For example, consider a test concept-set x̂ = {pear, basket, pick, put, tree}. The concept 'pear' never appears in the training data, and 'pick' never co-occurs with 'basket'. Meanwhile, there are some relevant training examples:

  • x_1 = {apple, bag, put} → y_1 = "a boy puts an apple in a bag."
  • x_2 = {apple, tree, pick} → y_2 = "a girl picks an apple from the tree."
  • x_3 = {apple, basket, wash} → y_3 = "a girl takes an apple from the basket and washes it."

We, humans, can generalize from these seen scenarios and infer a plausible output: ŷ = "a girl picks some pears from a tree and puts them into her basket." This compositional generalization ability via analogy, i.e., making "infinite use of finite means" (Chomsky, 1965), is challenging for machines. The challenge not only requires inference about similar concepts (e.g., 'apple' → 'pear') but also about their latent associations.
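To make the setup concrete, below is a minimal sketch (in Python, not the authors' released code) of how a COMMONGEN instance can be represented and how the compositional-generalization property of the split can be checked; the class and field names are illustrative assumptions.

```python
# A minimal sketch of a COMMONGEN instance: an unordered concept-set as input
# and one or more reference sentences as output. Field names are illustrative.
from dataclasses import dataclass, field
from typing import FrozenSet, List

@dataclass
class CommonGenExample:
    concepts: FrozenSet[str]                               # unordered input concept-set x
    references: List[str] = field(default_factory=list)    # human-written scenarios y

train_example = CommonGenExample(
    concepts=frozenset({"apple", "tree", "pick"}),
    references=["a girl picks an apple from the tree."],
)

test_example = CommonGenExample(
    concepts=frozenset({"pear", "basket", "pick", "put", "tree"}),
    references=["a girl picks some pears from a tree and puts them into her basket."],
)

# Compositional generalization: in this toy split, "pear" never appears in
# training and the pair ("pick", "basket") never co-occurs there.
novel_concepts = test_example.concepts - train_example.concepts
print(sorted(novel_concepts))
```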

3 The Commongen Dataset

We now introduce the construction and analysis of the proposed COMMONGEN dataset. To ensure that the concepts in each input concept-set are likely to appear together in an everyday scene, we utilize a wide range of existing caption corpora for sampling frequent concept-sets (Section 3.1). We also carefully control the overlap between the training set and the development/test sets, so that the task is more challenging in terms of compositional generalization. Afterwards, we employ workers on the crowd-sourcing platform AMT to collect additional human-written sentences (Section 3.2), thus enriching the diversity of the development and test sets. Finally, we present the statistics of the COMMONGEN dataset and utilize ConceptNet as an intermediate tool to investigate concept connectivity and the distribution of knowledge types (Section 3.3).

3.1 Collecting Concept-Sets From Captions

It would be nonsensical to ask a reasoner to generate a scenario about an arbitrary concept-set, which would be impossible even for humans. The expected concept-sets of our task are supposed to be very likely to co-occur in common, daily-life scenes. Such everyday scenarios are ubiquitous in images and video clips, which leads us to use image/video captioning datasets as a natural resource for collecting concept-sets and sentences. We therefore collect a large number of caption sentences from all publicly available visual caption corpora, including image captioning datasets such as Flickr30k (Young et al., 2014), MSCOCO (Lin et al., 2014), and Conceptual Captions (Sharma et al., 2018), as well as video captioning datasets such as LSMDC (Rohrbach et al., 2017), ActivityNet (Krishna et al., 2017), and VATEX.

We first conduct part-of-speech tagging over all sentences in the corpora so that words in sentences can be matched to the concept vocabulary of ConceptNet. Then, we compute the sentence frequency of concept-sets consisting of 3~5 concepts. That is, for each combination of three/four/five concepts in the vocabulary, we record how many sentences in the corpora cover all of its concepts.
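This matching pipeline can be sketched roughly as follows, assuming spaCy (with its small English model installed) for POS tagging and lemmatization and a plain Python set standing in for the ConceptNet single-word vocabulary; the authors' exact tooling may differ.

```python
# A simplified sketch of extracting concept-sets from captions and counting how
# many sentences cover each 3~5-concept combination.
from collections import Counter
from itertools import combinations
import spacy

nlp = spacy.load("en_core_web_sm")
concept_vocab = {"dog", "frisbee", "catch", "throw", "man", "park"}  # placeholder vocabulary

def extract_concepts(sentence: str) -> set:
    """Lemmatized nouns/verbs that appear in the concept vocabulary."""
    doc = nlp(sentence)
    return {tok.lemma_.lower() for tok in doc
            if tok.pos_ in {"NOUN", "VERB"} and tok.lemma_.lower() in concept_vocab}

def count_concept_sets(captions, min_size=3, max_size=5):
    """Count how many captions cover each concept combination of size 3~5."""
    freq = Counter()
    for cap in captions:
        concepts = sorted(extract_concepts(cap))
        for k in range(min_size, max_size + 1):
            for combo in combinations(concepts, k):
                freq[combo] += 1
    return freq

captions = ["A man throws a frisbee and his dog catches it in the park."]
print(count_concept_sets(captions).most_common(3))
```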

Towards building a more representative dataset, we expect the selected subset of concept-sets to reflect their distribution in the real world. A straightforward approach is to treat the frequency directly as a measure of the likelihood of a concept-set and conduct probabilistic sampling from this distribution. However, this method tends to sample concept-sets that contain one or two highly frequent concepts, leading to corpus-dependent bias. Moreover, merely using the number of sentences is an imprecise measure of scenario diversity, since many images and videos were sampled interdependently. We therefore design a scoring function that weights a concept-set x by incorporating diversity and an inverse set-frequency penalty:

score(x) = |S(x)| · ( |⋃_{s_i ∈ S(x)} {w | w ∈ s_i}| / ∑_{s_i ∈ S(x)} Length(s_i) ) · ρ(x).

We denote S(x) as the set of distinct sentences that contain all concepts {c_1, c_2, ..., c_k} = x, s_i as one of these sentences, and |S(x)| as the number of such sentences. The second term divides the number of unique words in these sentences by the sum of their lengths, which roughly represents the diversity of the scenes they describe. The result is then multiplied by the last term,

ρ(x) = |X| / ( max_{c_i ∈ x} |{x' | c_i ∈ x' and x' ∈ X}| ).

The idea is to find the concept in x with the maximum set frequency (i.e., the number of different concept-sets with non-zero weight that contain it) and take its inverse, normalized by the number of all concept-sets. This penalty effectively controls the bias towards highly frequent concepts. From the distribution induced by these scores, we sample 100k concept-sets as candidate inputs.
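A direct transcription of this scoring function into Python might look like the following sketch; the sentence index (mapping each concept-set to its matching caption sentences) is assumed to have been built as described above.

```python
# score(x) = |S(x)| * (unique words / total length) * rho(x)
def score(concept_set, sentences_by_set):
    sents = sentences_by_set[concept_set]                     # S(x)
    tokenized = [s.split() for s in sents]
    unique_words = set(w for toks in tokenized for w in toks)
    total_len = sum(len(toks) for toks in tokenized)
    diversity = len(unique_words) / total_len                 # second term
    # rho(x): |X| divided by the maximum per-concept set frequency
    num_sets = len(sentences_by_set)
    max_set_freq = max(
        sum(1 for other in sentences_by_set if c in other)
        for c in concept_set
    )
    rho = num_sets / max_set_freq
    return len(sents) * diversity * rho

# Toy index: concept-set (as a sorted tuple) -> caption sentences covering it.
sentences_by_set = {
    ("catch", "dog", "frisbee", "throw"): [
        "a dog leaps to catch a thrown frisbee",
        "the dog catches the frisbee when the boy throws it",
    ],
    ("dog", "run"): ["a dog runs in the park"],
}
print(score(("catch", "dog", "frisbee", "throw"), sentences_by_set))
```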

3.2 Crowd-Sourcing References Via Amt

Although the human-written sentences in the caption corpora can be regarded as quality annotations for the COMMONGEN task as well, they were written with a specific visual context (i.e., an image or a video clip). Toward better diversity of the scenes about the sampled concept-sets and more rigorous evaluation of systems, it is necessary to crowd-source additional human references that are written with only the concept-sets as context.

Figure 2: Connectivity analysis of 5-size concept-sets in the test set, each of which consists of 10 concept pairs. For example, the 12.0 in blue means that 12% of the concept-sets have 3 concept pairs with one-hop connections on ConceptNet.
Figure 3: One/two-hop relation frequencies in the COMMONGEN dev & test sets on ConceptNet.

We use the AMT platform to collect such sentences for the top-ranked 2,500 concept-sets in the sampled results, due to the high cost of human effort in writing sentences and the difficulty of verifying the quality of collected sentences. Each of these concept-sets is assigned to at least three different workers. To encourage workers to write about everyday scenarios for the given concept-sets, we also ask them to write rationale sentences explaining which commonsense facts they have used. Examples of rationales are shown in Figure 4.

Figure 4: A case study with a concept-set {give, lay, massage, table} for qualitative analysis of machine generations. Human references are collected from AMT and the crowd-workers are required to provide rationales. More case studies are shown in Figure 5 in Appendix.

We use these 2,500 concept-sets as the dev and test examples because of their higher weights and the better diversity of their human-written sentences. The remaining concept-sets are used as training examples, for which the associated captions serve as the target outputs. Note that we explicitly control the overlap between the training and dev/test examples by filtering out training concept-sets that have more than two overlapping concepts with any example in the dev/test sets.
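This overlap control can be sketched as a simple filter, under the assumption that concept-sets are represented as Python frozensets.

```python
# Drop any training concept-set that shares more than `max_overlap` concepts
# with some dev/test concept-set.
def filter_training_sets(train_sets, eval_sets, max_overlap=2):
    kept = []
    for x in train_sets:
        if all(len(x & e) <= max_overlap for e in eval_sets):
            kept.append(x)
    return kept

train = [frozenset({"apple", "tree", "pick"}),
         frozenset({"pear", "tree", "pick", "basket"})]
evals = [frozenset({"pear", "basket", "pick", "put", "tree"})]
print(filter_training_sets(train, evals))  # only the first training set survives
```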

The basic statistics of the final dataset are shown in Table 1. There are on average four reference sentences for each example in the dev and test sets, which provides a richer and more diverse test-bed for automatic and manual evaluation. We highlight the ratios of novel concept compositions (i.e., concepts, concept-pairs, and concept-triples) in dev/test that never (co-)occur in training examples; this makes COMMONGEN challenging in terms of compositional generalization ability.

Table 1: The basic statistics of the COMMONGEN data. We highlight the ratios of concept compositions that are unseen in training data, which assures the challenge in compositional generalization ability.

3.3 Analysis About Commonsense Knowledge

We now present a deeper analysis of the dataset, utilizing the largest commonsense knowledge graph (KG), ConceptNet (Speer et al., 2017), as a tool to study concept connectivity and relation types.

Connectivity Distribution. Intuitively, if the concepts in a given concept-set are more densely connected with each other on the KG, it is easier to write a scenario about them. In each 5-size concept-set (i.e., a concept-set consisting of five concepts), there are 10 unique pairs of concepts, whose connections we are interested in. As shown in Figure 2, if we look at one-hop links on the KG, about 60% of the 5-size concept-sets have less than one link among all concept pairs. On the other hand, if we consider two-hop links, then nearly 50% of them are almost fully connected (i.e., each pair of concepts has a connection).

These two observations together suggest that COMMONGEN has a reasonable level of difficulty: the concepts are neither too distant nor too close, so reasoning about the associated scenes is neither too difficult nor too trivial.
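The connectivity statistic above can be reproduced in spirit with a small sketch like the following, assuming ConceptNet edges between single-word English concepts have been loaded into an undirected graph (the loading step is omitted and the edges below are toy examples).

```python
# Count how many of the 10 concept pairs in a 5-size concept-set are linked
# within `max_hops` hops on the knowledge graph.
import networkx as nx
from itertools import combinations

kg = nx.Graph()
kg.add_edges_from([("dog", "frisbee"), ("dog", "catch"), ("throw", "frisbee"),
                   ("exercise", "gym"), ("gym", "rope")])  # toy edges only

def connected_pairs(concept_set, graph, max_hops=1):
    count = 0
    for a, b in combinations(sorted(concept_set), 2):
        if a in graph and b in graph:
            try:
                if nx.shortest_path_length(graph, a, b) <= max_hops:
                    count += 1
            except nx.NetworkXNoPath:
                pass
    return count

print(connected_pairs({"dog", "frisbee", "catch", "throw", "run"}, kg, max_hops=1))
print(connected_pairs({"dog", "frisbee", "catch", "throw", "run"}, kg, max_hops=2))
```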

Relation Distribution. Furthermore, the relation types of such connections can tell us what kinds of commonsense knowledge are potentially useful for relational reasoning towards generation. We report the frequency of different relation types of the one/two-hop connections among concept pairs in the dev and test examples in Fig. 3. To better summarize the distributions, we categorize these relations into five major types and present their distributions in Table 2, respectively for one/two-hop connections between concept pairs. (Relation definitions are at https://github.com/commonsense/conceptnet5/wiki/Relations.)

4 Methods

In this section, we briefly introduce the baseline methods that we test on the proposed COMMONGEN task. As there is, to the best of our knowledge, no principled approach for the proposed setting, we mainly treat it as a conditional sentence generation task that can be addressed by many sequence-to-sequence frameworks.

Encoder-Decoder Models. Bidirectional RNNs and Transformers (Vaswani et al., 2017) are the two most popular architectures for seq2seq learning. We use both with an attention mechanism (Luong et al., 2015) and copying ability (Gu et al., 2016), based on the open-source framework OpenNMT-py (Klein et al., 2017); we denote them as bRNN-CopyNet and Trans-CopyNet respectively. To alleviate the influence of concept ordering in such sequential learning methods, we randomly permute the input concepts multiple times for training and decoding and report their average performance. To explicitly eliminate the order sensitivity of inputs, we also replace the encoder with a mean-pooling-based MLP network (MeanPooling-CopyNet).
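A sketch of the permutation trick, assuming the input concepts are joined into space-separated strings before being fed to the seq2seq models; the function name is ours.

```python
# Randomly permute the input concepts so sequential encoders do not overfit
# to a single ordering of the concept-set.
import random

def permuted_inputs(concepts, n_permutations=3, seed=0):
    rng = random.Random(seed)
    perms = []
    for _ in range(n_permutations):
        order = concepts[:]
        rng.shuffle(order)
        perms.append(" ".join(order))
    return perms

print(permuted_inputs(["exercise", "rope", "wall", "tie", "wave"]))
```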

Non-autoregressive Generation. Recent advances (Lee et al., 2018; Stern et al., 2019) in conditional sentence generation reflect an emerging interest in (edit-based) non-autoregressive generation models, which iteratively refine generated sequences. We assume that these models could potentially perform better because they explicitly model iterative refinement, and thus study the most recent such model, the Levenshtein Transformer (LevenTrans) of Gu et al. (2019).

Pre-trained Language Generation Models. We also employ various pre-trained language generation models, including GPT-2 (Radford et al., 2019), UniLM (Dong et al., 2019), UniLM-v2 (Bao et al., 2020), BERT-Gen (Bao et al., 2020), BART (Lewis et al., 2019), and T5 (Raffel et al., 2019), to tackle this task and test their generative commonsense reasoning ability. We fine-tune all of the above models on our training data in a seq2seq format.

Specifically, to use GPT-2 for this sequence-to-sequence task, we condition the language model on the format "c_1 c_2 ... c_k = y" during fine-tuning, where c_i is a concept in the given concept-set, concepts are separated by blanks, and y is a target sentence. For inference, we decode from the fine-tuned GPT-2 model with beam search after the prompt "c_1 c_2 ... c_k =" and use the first generated sentence as the output. For BERT-Gen, we use the s2s-ft package (https://github.com/microsoft/unilm) to fine-tune it in a sequence-to-sequence fashion similar to the seq2seq LM objective employed by UniLM. As for T5, the state-of-the-art text-to-text pre-trained model, which is pre-trained with a multi-task objective by prepending a task description to the input text, we prepend the input concept-set with the simple prompt "generate a sentence with:" and fine-tune the model with source sequences of the form "generate a sentence with c_1 c_2 ... c_k".
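The two input formats described above can be summarized with a small sketch; the exact tokenization and training loop are omitted, and the helper names are ours, not the authors'.

```python
# Build GPT-2-style and T5-style source strings from a concept-set.
def gpt2_format(concepts, target=None):
    prompt = " ".join(concepts) + " ="
    return prompt if target is None else f"{prompt} {target}"

def t5_format(concepts):
    return "generate a sentence with: " + " ".join(concepts)

concepts = ["dog", "frisbee", "catch", "throw"]
print(gpt2_format(concepts, "A dog leaps to catch a thrown frisbee."))
# 'dog frisbee catch throw = A dog leaps to catch a thrown frisbee.'
print(t5_format(concepts))
# 'generate a sentence with: dog frisbee catch throw'
```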

5 Evaluation

In this section, we first introduce our metrics for automatic evaluation, then analyze the performance of tested systems, and finally provide qualitative analysis with case studies.

5.1 Metrics

Following other conventional generation tasks, we use several widely used automatic metrics to assess performance, such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005), which mainly measure surface similarities. We also report concept Coverage, the average percentage of input concepts that are present in the lemmatized outputs.
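A sketch of the Coverage metric, assuming spaCy lemmatization as the normalization step; the paper's exact lemmatizer may differ.

```python
# Coverage: fraction of input concepts whose lemma appears in the lemmatized output.
import spacy

nlp = spacy.load("en_core_web_sm")

def coverage(concepts, generated_sentence):
    lemmas = {tok.lemma_.lower() for tok in nlp(generated_sentence)}
    return sum(1 for c in concepts if c.lower() in lemmas) / len(concepts)

print(coverage(["dog", "frisbee", "catch", "throw"],
               "A man throws a frisbee and his dog catches it."))  # 1.0
```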

In addition, we argue that it is more suitable to use evaluation metrics specifically designed for captioning tasks, such as CIDEr (Vedantam et al., 2015) and SPICE (Anderson et al., 2016). They usually assume that system generations and human references use similar concepts, and thus focus on evaluating the associations between mentioned concepts rather than n-gram overlap. For example, the SPICE metric uses dependency parse trees as a proxy for scene graphs to measure the similarity of scenarios.

Table 2: The distributions of the relation categories on one/two-hop connections.

To estimate human performance under each metric, we treat each reference sentence in the dev/test data as a "system prediction" and compare it against all other references, which is equivalent to computing inter-annotator agreement under each metric. Systems with better generative ability than the average crowd-worker should therefore exceed this estimate.
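This leave-one-out estimate can be sketched as follows; the metric below is a stand-in unigram-overlap function, not an actual BLEU/SPICE implementation.

```python
# Score each human reference against the remaining references for the same
# concept-set, then average, to obtain a "human performance" estimate.
def dummy_metric(prediction, references):
    pred = set(prediction.lower().split())
    return max(len(pred & set(r.lower().split())) / max(len(pred), 1) for r in references)

def human_estimate(references, metric=dummy_metric):
    scores = []
    for i, ref in enumerate(references):
        others = references[:i] + references[i + 1:]
        scores.append(metric(ref, others))
    return sum(scores) / len(scores)

refs = ["A dog leaps to catch a thrown frisbee.",
        "The dog catches the frisbee when the boy throws it.",
        "A man throws a frisbee for his dog to catch."]
print(round(human_estimate(refs), 3))
```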

Table 3: Experimental results of different baseline methods on the COMMONGEN test set. The first group of models are non-pretrained models, while the second group are large pre-trained models that we have fine-tuned. The best models are in bold and the second best are underlined within each metric.

5.2 Experimental Results

Table 3 presents the experimental results of all compared methods under different metrics. All fine-tuned pre-trained models (the lower group) outperform the non-pretrained models (the upper group) by a significant margin. This is not surprising, because their pretraining objectives, including masked language modeling, word ordering, and text infilling (predicting missing words or text spans), are relevant to our task. On the other hand, we find that the key disadvantage of the non-pretrained models with CopyNet remains their failure to use all given concepts (i.e., low coverage), which leads to worse results.

Among the pre-trained models, UniLM, BART, and T5 perform the best, which may be due to their inherent sequence-to-sequence pre-training frameworks. We also find that BART has the best concept coverage, which is probably due to its comprehensive pretraining tasks that aim to recover text with noise. These results suggest that further modifications of pre-trained models are a promising direction for generative commonsense reasoning, and that our dataset can serve as a good test-bed for comparing the commonsense reasoning ability of different pre-trained language models.

[Figure 4] [Input concept-set]: {give, lay, massage, table}

[Machine generations]
- bRNN-CpNet: Lays massage someone table vertical gives on and the water.
- Trans-CpNet: Massage lays on the kitchen.
- MP-CpNet: A massage table being calling with an improvisation lay free speaker.
- LevenTrans: A man chatting at the table.
- GPT-2: A man gives a massage to a table.
- BERT-Gen: A woman lays down on a table and gives a massage to a man.
- UniLM: A woman lays down a massage on a table and gives a massage.
- UniLM-v2: A woman is laying down and giving a massage on a table.
- BART: A man lays on a table and gives a massage to a woman laying on the table.
- T5: Woman lay on a table and gives a massage.

[Human references from AMT]
1. The man lays down on the massage table and the therapist gives him a massage. [Rationale]: The man must lay down to receive a massage. The therapist is the giver of massages. The table is a massage table.
2. Lay down on the table and the masseuse will give you a neck massage. [Rationale]: A masseuse is a woman who gives massages professionally. Massages are usually done on tables.
3. The woman gives the man who lays on the table a massage. [Rationale]: Some massages are done laying down; people like to get massages; tables are used for people to get massages; people lay on tables to get massages.

Recent work (Lv et al., 2020) finds that the OMCS corpus (Singh et al., 2002), from which ConceptNet was derived, is a valuable resource for retrieving relevant commonsense facts for discriminative reasoning about questions. We follow the same steps to retrieve related facts by querying the input concepts, and then concatenate them with the original concept-sets as the final input sequences to the above-mentioned methods, mimicking abstractive summarization tasks. However, we observe only very marginal improvements when using retrieved OMCS sentences as additional inputs. We argue that imposing commonsense knowledge with explicit graph structures (Lin et al., 2019) over the input concepts is a more promising future direction for the COMMONGEN task, as graphs are naturally order-insensitive.
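A rough sketch of this retrieval-augmented input is shown below, using simple keyword matching as a stand-in for the retrieval procedure of Lv et al. (2020); the "&lt;sep&gt;" delimiter and function names are assumptions, not necessarily what was used in the experiments.

```python
# Rank OMCS sentences by how many input concepts they mention, then prepend the
# top facts to the concept-set as one input string for a seq2seq model.
def retrieve_omcs_facts(concepts, omcs_sentences, top_k=3):
    scored = [(sum(c in s.lower().rstrip(".").split() for c in concepts), s)
              for s in omcs_sentences]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for hits, s in scored[:top_k] if hits > 0]

def augmented_input(concepts, omcs_sentences):
    facts = retrieve_omcs_facts(concepts, omcs_sentences)
    return " ".join(facts) + " <sep> " + " ".join(concepts)

omcs = ["dogs like to catch a frisbee.",
        "people throw a frisbee in the park.",
        "tables have four legs."]
print(augmented_input(["dog", "frisbee", "catch", "throw"], omcs))
```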

5.3 Qualitative Analysis With A Case Study

Figure 4 shows the top generations of different models and the human references for the input concept-set {give, lay, massage, table}. We find that non-pretrained seq2seq models can successfully use part of the given concepts, but the generated sentences are neither grammatical nor coherent. The vanilla LevenTrans model uses only one of the given concepts, although it aims to model edits explicitly and generates syntactically sound sentences. bRNN-CopyNet uses all four concepts thanks to the powerful copy mechanism, but generates a nonsensical sentence.

The outputs of the fine-tuned pre-trained models are significantly more grammatical and commonsensical. Although they are not equipped with an explicit module for enforcing the use of the given concepts, most of them cover all concepts in their outputs. The scenarios in the outputs of GPT-2, UniLM-v1/v2, and T5 involve only a single person, whereas the other two models associate their scenarios with two persons. In the single-person outputs, the same person ends up performing two contradictory actions (e.g., 'laying on a table' and 'giving a massage'). Due to this issue, GPT-2 even creates the nonsensical composition 'gives a massage to a table'. Although BERT-Gen does incorporate a second person in its output, it still contains the contradiction. Within this case study, the model closest to the human references is BART; had it not generated 'lays on a table and' to describe the man, its output would be acceptable. This suggests that a second pass to remove such locally optimal but globally implausible fragments is necessary for assuring the plausibility of the scenario.

6 Related Work

Commonsense benchmark datasets. There are many emerging datasets for testing machine commonsense from different angles, such as commonsense extraction, next situation prediction (SWAG (Zellers et al., 2018), CODAH, HellaSWAG (Zellers et al., 2019b)), cultural and social understanding (Sap et al., 2019a,b), visual scene comprehension (Zellers et al., 2019a), and general commonsense question answering (Talmor et al., 2019).

Recent studies have shown that simply fine-tuning large pre-trained language models, e.g., RoBERTa, can yield near-human, or even exceeding-human, performance in these discriminative reasoning scenarios, such as on the SWAG dataset. We argue that the underlying reasons are two-fold: 1) the creation of distractor choices carries annotator bias (Geva et al., 2019), which can be easily detected by NLU models; and 2) self-supervised training objectives in BERT-like models (Devlin et al., 2019) align well with the multiple-choice QA setting; the SWAG task shares almost the same scenario as the Next Sentence Prediction (NSP) task, and because the CSQA task can be viewed as learning to recover missing words that are masked by "wh-words", it can be distantly learned via Masked Language Modeling (MLM). Therefore, these successes do not necessarily mean that machine reasoners can produce novel assumptions in an open, realistic, generative setting.

Constrained Text Generation. Constrained text generation aims to decode sentences with expected attributes such as sentiment (Hu et al., 2017), tense (Hu et al., 2017), template (Zhu et al., 2019), style (Fu et al., 2018; Li et al., 2018), topics (Feng et al., 2018), etc. A scenario similar to our task is lexically constrained decoding, which has mainly been studied in the machine translation community (Hasler et al., 2018; Dinu et al., 2019; Hokamp and Liu, 2017). One recent work in this line is CGMH (Miao et al., 2019), which samples sentences containing an ordered sequence of keywords from a language model, but it cannot be fine-tuned or adopted in our case. Topical story generation (Fan et al., 2018; Yao et al., 2019) is also a related direction, but it targets generating longer, creative stories around given topics, making it hard to adapt such methods directly to our task. Additionally, the COMMONGEN task brings the further challenges discussed in Section 2. Prior constrained generation methods cannot address these issues in a unified model, and thus we expect COMMONGEN to also serve as a benchmark dataset for future work in this direction.

Injecting Commonsense for NLG. There are also a few works that incorporate commonsense knowledge in language generation tasks such as essay generation (Guan et al., 2019), video storytelling, and conversational systems (Zhang et al., 2019). These works suggest that generative commonsense reasoning has great potential to benefit downstream applications. Our proposed COMMONGEN is, to the best of our knowledge, the very first constrained sentence generation dataset for assessing and conferring generative machine commonsense, and we hope it can benefit such applications.

7 Conclusion

Our major contributions in this paper are as follows:

1. We present COMMONGEN, a novel constrained generation task for generative commonsense reasoning, together with a large-scale dataset;

2. We carefully analyze the inherent challenges of the proposed task, i.e., a) relational reasoning with latent commonsense knowledge, and b) compositional generalization;

3. Our extensive experiments systematically examine recent pre-trained language generation models (e.g., UniLM, BART, T5) on the task, and find that their performance is still far from human performance: the models generate grammatically sound yet realistically implausible sentences.

Our study points to interesting future research directions on modeling commonsense knowledge in the language generation process, towards conferring generative commonsense reasoning ability on machines. We hope COMMONGEN will also benefit downstream NLG applications such as conversational systems and storytelling models.

[Figure 5, case 3] [Input concept-set]: {clean, ladder, squeegee, stand, window}

[Human references from AMT]
2. The man clean the window on the ladder stand by using squeegee. [Rationale]: man need to clean the window by using squeegee on the ladder stand
3. The man stood beside the ladder and cleaned the window with a squeegee. [Rationale]: people can stand next to ladders. People clean windows. Squeegees are used to clean windows.

Figure 5: Three cases for qualitative analysis of machine generations. References are collected from AMT crowd-workers, who are required to provide rationales. Note that the second one is a positive case showing that some models can successfully generate reasonable scenarios. However, most models perform poorly on the other cases.


