
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge



When answering a question, people often draw upon their rich world knowledge in addition to the particular context. Recent work has focused primarily on answering questions given some relevant document or context, and required very little general background. To investigate question answering with prior knowledge, we present CommonsenseQA: a challenging new dataset for commonsense question answering. To capture common sense beyond associations, we extract from ConceptNet (Speer et al., 2017) multiple target concepts that have the same semantic relation to a single source concept. Crowd-workers are asked to author multiple-choice questions that mention the source concept and discriminate in turn between each of the target concepts. This encourages workers to create questions with complex semantics that often require prior knowledge. We create 12,247 questions through this procedure and demonstrate the difficulty of our task with a large number of strong baselines. Our best baseline is based on BERT-large (Devlin et al., 2018) and obtains 56% accuracy, well below human performance, which is 89%.

1 Introduction

When humans answer questions, they capitalize on their common sense and background knowledge about spatial relations, causes and effects, scientific facts and social conventions. For instance, given the question "Where was Simon when he heard the lawn mower?", one can infer that the lawn mower is close to Simon, and that it is probably outdoors and situated at street level. This type of knowledge seems trivial for humans, but is still out of the reach of current natural language understanding (NLU) systems.

* The authors contributed equally.

Figure 1: (a) A source concept (‘river’) and three target concepts (dashed) are sampled from CONCEPTNET. (b) Crowd-workers generate three questions, each having one of the target concepts for its answer (✓), while the other two targets are not (✗). Then, for each question, workers choose an additional distractor from CONCEPTNET (in italics), and author one themselves (in bold).

Work on Question Answering (QA) has mostly focused on answering factoid questions, where the answer can be found in a given context with little need for commonsense knowledge (Hermann et al., 2015; Rajpurkar et al., 2016; Nguyen et al., 2016; Joshi et al., 2017). Small benchmarks such as the Winograd Schema Challenge (Levesque, 2011) and COPA (Roemmele et al., 2011) targeted common sense more directly, but have been difficult to collect at scale.

Recently, efforts have been invested in developing large-scale datasets for commonsense reasoning. In SWAG (Zellers et al., 2018b), given a textual description of an event, a probable subsequent event needs to be inferred. However, it was quickly realized that models trained on large amounts of unlabeled data (Devlin et al., 2018) capture this type of information well, and performance on SWAG is already at human level. VCR (Zellers et al., 2018a) is another very recent attempt that focuses on the visual aspects of common sense. Such new attempts highlight the breadth of commonsense phenomena, and make it evident that research on common sense has only scratched the surface. Thus, there is a need for datasets and models that will further our understanding of what is captured by current NLU models, and what the main lacunae are.

In this work, we present COMMONSENSEQA, a new dataset focusing on commonsense question answering, based on knowledge encoded in CONCEPTNET (Speer et al., 2017). We propose a method for generating commonsense questions at scale by asking crowd workers to author questions that describe the relation between concepts from CONCEPTNET (Figure 1). A crowd worker observes a source concept ('River' in Figure 1) and three target concepts ('Waterfall', 'Bridge', 'Valley') that are all related by the same CONCEPTNET relation (AtLocation). The worker then authors three questions, one per target concept, such that only that particular target concept is the answer, while the other two distractor concepts are not. This primes the workers to add commonsense knowledge to the question that separates the target concept from the distractors. Finally, for each question, the worker chooses one additional distractor from CONCEPTNET, and authors another distractor manually. Thus, in total, five candidate answers accompany each question.

Because questions are generated freely by workers, they often require background knowledge that is trivial to humans but is seldom explicitly reported on the web due to reporting bias (Gordon and Van Durme, 2013). Thus, questions in COMMONSENSEQA have a different nature compared to prior QA benchmarks, where questions are authored given an input text.

Using our method, we collected 12,247 commonsense questions. We present an analysis that illustrates the uniqueness of the gathered questions compared to prior work, and the types of commonsense skills that are required for tackling them. We extensively evaluate models on COMMONSENSEQA, experimenting with pre-trained models, fine-tuned models, and reading comprehension (RC) models that utilize web snippets extracted from Google search on top of the question itself. We find that fine-tuning BERT-LARGE (Devlin et al., 2018) on COMMONSENSEQA obtains the best performance, reaching an accuracy of 55.9%. This is substantially lower than human performance, which is 88.9%.

To summarize, our contributions are:

1. A new QA dataset centered around common sense, containing 12,247 examples.
2. A new method for generating commonsense questions at scale from CONCEPTNET.
3. An empirical evaluation of state-of-the-art NLU models on COMMONSENSEQA, showing that humans substantially outperform current models.

The dataset can be downloaded from www.tau-nlp.org/commonsenseqa. The code for all our baselines is available at github.com/jonathanherzig/commonsenseqa.

2 Related Work

Machine common sense, or the knowledge of and ability to reason about an open ended world, has long been acknowledged as a critical component for natural language understanding. Early work sought programs that could reason about an environment in natural language (McCarthy, 1959), or leverage a world-model for deeper language understanding (Winograd, 1972). Many commonsense representations and inference procedures have been explored (McCarthy and Hayes, 1969; Kowalski and Sergot, 1986) and large-scale commonsense knowledge-bases have been developed (Lenat, 1995; Speer et al., 2017). However, evaluating the degree of common sense possessed by a machine remains difficult.

One important benchmark, the Winograd Schema Challenge (Levesque, 2011), asks models to correctly solve paired instances of coreference resolution. While the Winograd Schema Challenge remains a tough dataset, the difficulty of generating examples has led to only a small available collection of 150 examples. The Choice of Plausible Alternatives (COPA) is a similarly important but small dataset consisting of 500 development and 500 test questions (Roemmele et al., 2011). Each question asks which of two alternatives best reflects a cause or effect relation to the premise. For both datasets, scalability is an issue when evaluating modern modeling approaches.

With the recent adoption of crowdsourcing, several larger datasets have emerged, focusing on predicting relations between situations or events in natural language. JHU Ordinal Commonsense Inference requests a label from 1-5 for the plausibility that one situation entails another (Zhang et al., 2017). The Story Cloze Test (also referred to as ROC Stories) pits ground-truth endings to stories against implausible false ones (Mostafazadeh et al., 2016). Interpolating these approaches, Situations with Adversarial Generations (SWAG) asks models to choose the correct description of what happens next after an initial event (Zellers et al., 2018b). LM-based techniques achieve very high performance on the Story Cloze Test and SWAG by fine-tuning a pre-trained LM on the target task (Radford et al., 2018; Devlin et al., 2018).

Investigations of commonsense datasets, and of natural language datasets more generally, have revealed the difficulty in creating benchmarks that measure the understanding of a program rather than its ability to take advantage of distributional biases, and to model the annotation process (Gururangan et al., 2018; Poliak et al., 2018). Annotation artifacts in the Story Cloze Test, for example, allow models to achieve high performance while only looking at the proposed endings and ignoring the stories (Schwartz et al., 2017; Cai et al., 2017). Thus, the development of benchmarks for common sense remains a difficult challenge.

Researchers have also investigated question answering that utilizes common sense. Science questions often require common sense, and have recently received attention (Mihaylov et al., 2018; Ostermann et al., 2018); however, they also need specialized scientific knowledge. In contrast to these efforts, our work studies common sense without requiring additional information. SQUABU created a small hand-curated test of common sense and science questions (Davis, 2016), which are difficult for current techniques to solve. In this work, we create similarly well-crafted questions but at a larger scale.

3 Dataset Generation

Our goal is to develop a method for generating questions that can be easily answered by humans without context, and require commonsense knowledge. We generate multiple-choice questions in a process that comprises the following steps.

1. We extract subgraphs from CONCEPTNET, each with one source concept and three target concepts.
2. We ask crowdsourcing workers to author three questions per subgraph (one per target concept), to add two additional distractors per question, and to verify questions' quality.
3. We add textual context to each question by querying a search engine and retrieving web snippets.

The entire data generation process is summarized in Figure 2. We now elaborate on each of the steps:

Figure 2: COMMONSENSEQA generation process. The input is CONCEPTNET knowledge base, and the output is a set of multiple-choice questions with corresponding relevant context (snippets).

Extraction from CONCEPTNET. CONCEPTNET is a graph knowledge-base $G \subseteq C \times R \times C$, where the nodes $C$ represent natural language concepts, and edges $R$ represent commonsense relations. Triplets $(c_1, r, c_2)$ carry commonsense knowledge such as '(gambler, CapableOf, lose money)'. CONCEPTNET contains 32 million triplets. To select a subset of triplets for crowdsourcing we take the following steps:

1. We filter out triplets with general relations (e.g., RelatedTo) or relations that are already well-explored in NLP (e.g., IsA). In total we use 22 relations.
2. We filter out triplets where one of the concepts is more than four words or not in English.
3. We filter out triplets where the edit distance between $c_1$ and $c_2$ is too low.

This results in a set of 236,208 triplets $(q, r, a)$, where we call the first concept the question concept and the second concept the answer concept.
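The three filtering steps above can be sketched as a single pass over the triplets. This is a minimal illustration, not the authors' code: the function names, the `excluded_relations` argument, and the `min_edit_distance=3` threshold are assumptions (the text only says the edit distance must not be "too low"), and the English-language check is omitted because it depends on how language is encoded in the input.

```python
from typing import Iterable, List, Set, Tuple

Triplet = Tuple[str, str, str]  # (c1, relation, c2)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or match)
        prev = curr
    return prev[-1]


def filter_triplets(triplets: Iterable[Triplet],
                    excluded_relations: Set[str],
                    max_words: int = 4,
                    min_edit_distance: int = 3) -> List[Triplet]:
    """Keep triplets that survive the relation, length and edit-distance filters.

    min_edit_distance is an assumed threshold; the paper does not state one.
    """
    kept = []
    for c1, rel, c2 in triplets:
        if rel in excluded_relations:
            continue  # step 1: too general or already well explored
        if any(len(c.split()) > max_words for c in (c1, c2)):
            continue  # step 2: concept longer than four words
        if edit_distance(c1, c2) < min_edit_distance:
            continue  # step 3: concepts are near-identical strings
        kept.append((c1, rel, c2))
    return kept
```

For example, calling `filter_triplets` with `excluded_relations={"RelatedTo", "IsA"}` keeps `("gambler", "CapableOf", "lose money")` and drops any `IsA` triplet.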

We aim to generate questions that contain the question concept and where the answer is the answer concept. To create multiple-choice questions we need to choose distractors for each question. Sampling distractors at random from CONCEPTNET is a bad solution, as such distractors are easy to eliminate using simple surface clues.

To remedy this, we propose to create question sets: for each question concept $q$ and relation $r$ we group three different triplets $\{(q, r, a_1), (q, r, a_2), (q, r, a_3)\}$ (see Figure 1). This generates three answer concepts that are semantically similar and have a similar relation to the question concept $q$. This primes crowd workers to formulate questions that require background knowledge about the concepts in order to answer the question.

The above procedure generates approximately 130,000 triplets (43,000 question sets), for which we can potentially generate questions.
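The grouping into question sets can be sketched as follows. The text does not say how the three answer concepts are picked when more than three are available, so this sketch assumes one randomly sampled set per (question concept, relation) pair; `build_question_sets` and its parameters are hypothetical names.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Triplet = Tuple[str, str, str]  # (question concept, relation, answer concept)


def build_question_sets(triplets: Iterable[Triplet],
                        set_size: int = 3,
                        seed: int = 0) -> List[List[Triplet]]:
    """Group triplets by (q, r) and form one set of `set_size` answer concepts each.

    Sampling one set per (q, r) pair is an assumption made for illustration.
    """
    rng = random.Random(seed)
    answers_by_key: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for q, r, a in triplets:
        if a not in answers_by_key[(q, r)]:  # ignore duplicate answers
            answers_by_key[(q, r)].append(a)

    question_sets = []
    for (q, r), answers in answers_by_key.items():
        if len(answers) < set_size:
            continue  # not enough distinct answer concepts for a set
        chosen = rng.sample(answers, set_size)
        question_sets.append([(q, r, a) for a in chosen])
    return question_sets
```

With the Figure 1 example, the triplets (river, AtLocation, waterfall), (river, AtLocation, bridge) and (river, AtLocation, valley) form one question set.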

Crowdsourcing questions. We used Amazon Mechanical Turk (AMT) workers to generate and validate commonsense questions.

AMT workers saw, for every question set, the question concept and three answer concepts. They were asked to formulate three questions, where all questions contain the question concept. Each question should have as an answer one of the answer concepts, but not the other two. To discourage workers from providing simple surface clues for the answer, they were instructed to avoid using words that have a strong relation to the answer concept, for example, not to use the word 'open' when the answer is 'door'.

Formulating questions for our task is nontrivial. Thus, we only accept annotators for which at least 75% of the questions they formulate pass the verification process described below.

Adding additional distractors. To make the task more difficult, we ask crowd-workers to add two additional incorrect answers to each formulated question. One distractor is selected from a set of answer concepts with the same relation to the question concept in CONCEPTNET (Figure 1, in red). The second distractor is formulated manually by the workers themselves (Figure 1, in purple). Workers were encouraged to formulate a distractor that would seem plausible or related to the question but easy for humans to dismiss as incorrect. In total, each formulated question is accompanied by five candidate answers, including one correct answer and four distractors.

Verifying question quality. We train a disjoint group of workers to verify the generated questions. Verifiers annotate a question as unanswerable, or choose the right answer. Each question is verified by 2 workers, and only questions verified by at least one worker that answered correctly are used. This process filters out 15% of the questions.

Adding textual context. To examine whether web text is useful for answering commonsense questions, we add textual information to each question in the following way: We issue a web query to Google search for every question and candidate answer, concatenating the answer to the question, e.g., 'What does a parent tell their child to do after they've played with a lot of toys? + "clean room"'. We take the first 100 result snippets for each of the five answer candidates, yielding a context of 500 snippets per question. Using this context, we can investigate the performance of reading comprehension (RC) models on COMMONSENSEQA.

Overall, we generated 12,247 final examples, from a total of 16,242 that were formulated. The total cost per question is $0.33. Table 1 describes the key statistics of COMMONSENSEQA.

| Measurement | Value |
| # CONCEPTNET distinct question nodes | 2,254 |
| # CONCEPTNET distinct answer nodes | 12,094 |
| # CONCEPTNET distinct nodes | 12,107 |
| # CONCEPTNET distinct relation labels | 22 |
| average question length (tokens) | 13.41 |
| long questions (more than 20 tokens) | 10.3% |
| average answer length (tokens) | 1.5 |
| # answers with more than 1 token | 44% |
| # of distinct words in questions | 14,754 |
| # of distinct words in answers | 4,911 |

Table 1: Key statistics for COMMONSENSEQA
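The textual-context step described above (one query per candidate answer, 100 snippets each, 500 snippets per question) can be sketched as below. The exact query syntax is an assumption: here the answer is appended in double quotes, following the example in the text. The `search` argument is a hypothetical callable supplied by the caller; no real search API is implied.

```python
from typing import Callable, List

# Hypothetical signature: search(query, k) returns up to k result snippets.
SearchFn = Callable[[str, int], List[str]]


def build_queries(question: str, candidates: List[str]) -> List[str]:
    """One query per candidate answer: the question followed by the quoted answer."""
    return [f'{question} "{candidate}"' for candidate in candidates]


def collect_context(question: str,
                    candidates: List[str],
                    search: SearchFn,
                    snippets_per_answer: int = 100) -> List[str]:
    """Concatenate the first snippets retrieved for each candidate answer."""
    context: List[str] = []
    for query in build_queries(question, candidates):
        context.extend(search(query, snippets_per_answer)[:snippets_per_answer])
    return context
```

With five candidate answers and a search function that returns at least 100 snippets per query, `collect_context` yields the 500-snippet context used by the RC baseline.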

4 Dataset Analysis

CONCEPTNET concepts and relations. COMMONSENSEQA builds on CONCEPTNET, which contains concepts such as dog, house, or row boat, connected by relations such as Causes, CapableOf, or Antonym. The top-5 question concepts in COMMONSENSEQA are 'Person' (3.1%), 'People' (2.0%), 'Human' (0.7%), 'Water' (0.5%) and 'Cat' (0.5%). In addition, we present the main relations along with the percentage of questions generated from them in Table 2. It is worth noting that since question formulators were not shown the CONCEPTNET relation, they often asked questions that probe other relationships between the concepts. For example, the question "What do audiences clap for?" was generated from the AtLocation relation, but focuses on social conventions instead.

| Relation | Formulated question example | % |
| … | … | 47.3 |
| Causes | What is the hopeful result of going to see a play? A. being entertained, B. meet, C. sit, D. ... | 17.3 |
| CapableOf | Why would a person put flowers in a room with dirty gym socks? A. smell good, B. many colors, C. continue to grow, D. ... | 9.4 |
| Antonym | Someone who had a very bad flight might be given a trip in this to make up for it? A. first class, B. reputable, C. propitious, D. ... | 8.5 |
| HasSubevent | How ... | … |

Table 2: Top CONCEPTNET relations in COMMONSENSEQA, along with their frequency in the data and an example question. The first answer (A) is the correct answer

Question formulation. Question formulators were instructed to create questions with high language variation. 122 formulators contributed to question generation. However, 10 workers formulated more than 85% of the questions. We analyzed the distribution of first and second words in the formulated questions along with example questions. Figure 4 presents the breakdown. Interestingly, only 44% of the first words are WH-words. In about 5% of the questions, formulators used first names to create a context story, and in 7% they used the word "if" to present a hypothetical question.

To analyze the commonsense skills needed to answer questions in COMMONSENSEQA, we randomly sampled 100 examples from the development set and performed the following analysis. For each question, we explicitly annotated the types of commonsense skills that a human uses to answer the question. We allow multiple commonsense skills per question, with an average of 1.75 skills per question. Figure 3 provides three example annotations. Each annotation contains a node for the answer concept, and other nodes for concepts that appear in the question or latent concepts. Labeled edges describe the commonsense skill that relates the two nodes. We defined commonsense skills based on the analysis of LoBue and Yates (2011), with slight modifications to accommodate the phenomena in our data. Table 3 presents the skill categories we used, their definition and their frequency in the analyzed examples.

Figure 3: Examples of manually-annotated questions, with the required skills needed to arrive at the answers (red circles). Skills are labeled edges, and concepts are nodes.
Figure 4: Distribution of the first and second words in questions. The inner part displays words and their frequency and the outer part provides example questions.
Table 3: Skills and their frequency in the sampled data. As each example can be annotated with multiple skills, the total frequency does not sum to 100%.
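The first- and second-word breakdown discussed under question formulation can be reproduced with a simple prefix count. This is a sketch under assumptions: tokenization is lowercased whitespace splitting, and `prefix_distribution` is a hypothetical helper, not part of the released code.

```python
from collections import Counter
from typing import Dict, Iterable


def prefix_distribution(questions: Iterable[str], n_words: int = 2) -> Dict[str, float]:
    """Relative frequency of the first `n_words` words of each question."""
    counts: Counter = Counter()
    for question in questions:
        tokens = question.lower().split()
        if tokens:
            counts[" ".join(tokens[:n_words])] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    # most_common() orders prefixes from most to least frequent.
    return {prefix: count / total for prefix, count in counts.most_common()}


questions = [
    "Where was Simon when he heard the lawn mower?",
    "What do audiences clap for?",
    "What is usually next to a door?",
    "What is the hopeful result of going to see a play?",
]
first_words = prefix_distribution(questions, n_words=1)
first_two_words = prefix_distribution(questions, n_words=2)
```

On these four questions taken from the text, "what" accounts for three of the four first words and "what is" for two of the four two-word prefixes.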

5 Baseline Models

Our goal is to collect a dataset of commonsense questions that are easy for humans, but hard for current NLU models. To evaluate this, we experiment with multiple baselines. Table 4 summarizes the various baseline types and characterizes them based on (a) whether training is done on COMMONSENSEQA or the model is fully pre-trained, and (b) whether context (web snippets) is used. We now elaborate on the different baselines.

Table 4 rows (columns: Training, Context): VECSIM, LM1B, QABILINEAR, QACOMPARE, ESIM, GPT, BERT, BIDAF++
Table 4: Baseline models along with their characteristics. Training states whether the model was trained on COMMONSENSEQA, or was only trained on a different dataset. Context states whether the model uses extra context as input.

a VECSIM. A model that chooses the answer with highest cosine similarity to the question, where the question and answers are represented by an average of pre-trained word embeddings.

b LM1B. Inspired by Trinh and Le (2018), we employ a large language model (LM) from Jozefowicz et al. (2016), which was pre-trained on the One Billion Words Benchmark (Chelba et al., 2013). We use this model in two variations. In the first (LM1B-CONCAT), we simply concatenate each answer to the question. In the second (LM1B-REP), we first cluster questions according to their first two words. Then, we recognize five high-frequency prefixes that cover 35% of the development set (e.g., "what is"). We rephrase questions that fit into one of these prefixes as a declarative sentence that contains the answer. E.g., we rephrase "What is usually next to a door?" and the candidate answer "wall" to "Wall is usually next to a door". For questions that do not start with the above prefixes, we concatenate the answer as in LM1B-CONCAT. In both variations we return the answer with highest LM probability.

c QABILINEAR. This model, proposed by Yu et al. (2014) for QA, scores an answer $a_i$ with a bilinear model: $q W a_i$, where the question $q$ and answers $a_i$ are the average pre-trained word embeddings and $W$ is a learned parameter matrix. A softmax layer over the candidate answers is used to train the model with cross-entropy loss.

d QACOMPARE. This model is similar to an NLI model from Liu et al. (2016). The model represents the interaction between the question $q$ and a candidate answer $a_i$ as: $h = \mathrm{relu}([q; a_i; q \odot a_i; q - a_i] W_1 + b_1)$, where ';' denotes concatenation and $\odot$ is element-wise product. Then, the model predicts an answer score using a feed forward layer: $h W_2 + b_2$. Average pre-trained embeddings and softmax are used to train the model.

e ESIM. We use ESIM, a strong NLI model (Chen et al., 2016). Similar to Zellers et al. (2018b), we change the output layer size to the number of candidate answers, and apply softmax to train with cross-entropy loss.
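The scoring functions of the three embedding-based baselines (VECSIM, QABILINEAR, QACOMPARE) can be written out in a few lines. This is an illustrative pure-Python sketch, not the authors' implementation: the function names are hypothetical, matrices are plain lists of rows, and the parameters $W$, $W_1$, $b_1$, $W_2$, $b_2$ are assumed to be given (training with softmax and cross-entropy is not shown).

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]  # list of rows


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(u, v))


def average(vectors: Sequence[Vector]) -> Vector:
    """Mean of equally sized word vectors (question or answer representation)."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def vecsim_score(q: Vector, a: Vector) -> float:
    """VECSIM: cosine similarity between question and answer vectors."""
    norm = math.sqrt(dot(q, q)) * math.sqrt(dot(a, a))
    return dot(q, a) / norm if norm else 0.0


def bilinear_score(q: Vector, W: Matrix, a: Vector) -> float:
    """QABILINEAR: q W a_i."""
    return dot(q, [dot(row, a) for row in W])


def compare_score(q: Vector, a: Vector,
                  W1: Matrix, b1: Vector, W2: Vector, b2: float) -> float:
    """QACOMPARE: h = relu([q; a; q*a; q-a] W1 + b1), score = h W2 + b2.

    W1 has one row per input feature (4d rows) and one column per hidden unit.
    """
    x = q + a + [qi * ai for qi, ai in zip(q, a)] + [qi - ai for qi, ai in zip(q, a)]
    h = [max(0.0, sum(x[i] * W1[i][j] for i in range(len(x))) + b1[j])
         for j in range(len(b1))]
    return dot(h, W2) + b2


def predict(scores: Sequence[float]) -> int:
    """Index of the highest-scoring candidate answer."""
    return max(range(len(scores)), key=scores.__getitem__)
```

In each case a question is scored against every candidate answer and `predict` returns the index of the best one.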

f BIDAF++. A state-of-the-art RC model that uses the retrieved Google web snippets (Section 3) as context. We augment BIDAF (Seo et al., 2016) with a self-attention layer and ELMo representations (Peters et al., 2018; Huang et al., 2018). To adapt to the multiple-choice setting, we choose the answer with highest model probability.

g GENERATIVE PRE-TRAINED TRANSFORMER (GPT). Radford et al. (2018) proposed a method for adapting pre-trained LMs to perform a wide range of tasks. We applied their model to COMMONSENSEQA by encoding each question and its candidate answers as a series of delimiter-separated sequences. For example, the question "If you needed a lamp to do your work, where would you put it?", and the candidate answer "bedroom" would become "[start] If ... ? [sep] bedroom [end]". The hidden representations over each [end] token are converted to logits by a linear transformation and passed through a softmax to produce final probabilities for the answers. We used the same pre-trained LM and hyper-parameters for fine-tuning as Radford et al. (2018) on ROC Stories, except with a batch size of 10.

h BERT. Similarly to the GPT, BERT fine-tunes a language model and currently holds state-of-the-art results across a broad range of tasks (Devlin et al., 2018). BERT uses a masked language modeling objective, which predicts missing words masked from unlabeled text. To apply BERT to COMMONSENSEQA, we linearize each question-answer pair into a delimiter-separated sequence (i.e., "[CLS] If ... ? [SEP] bedroom [SEP]") then fine-tune the pre-trained weights from uncased BERT-LARGE. Similarly to the GPT, the hidden representations over each [CLS] token are run through a softmax layer to create the predictions. We used the same hyper-parameters as Devlin et al. (2018) for SWAG.
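The input layout used for GPT and BERT, and the softmax over per-candidate logits, can be illustrated as follows. This sketch only shows the string layout and the final normalization: real models operate on token ids, and the logits would come from a linear layer over the [end] or [CLS] hidden state, which is not reproduced here. The function names are hypothetical.

```python
import math
from typing import List


def linearize_gpt(question: str, answer: str) -> str:
    """Delimiter-separated sequence in the GPT layout described above."""
    return f"[start] {question} [sep] {answer} [end]"


def linearize_bert(question: str, answer: str) -> str:
    """Delimiter-separated sequence in the BERT layout described above."""
    return f"[CLS] {question} [SEP] {answer} [SEP]"


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over one logit per candidate answer."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


question = "If you needed a lamp to do your work, where would you put it?"
bert_input = linearize_bert(question, "bedroom")
gpt_input = linearize_gpt(question, "bedroom")
```

Each of the five candidates yields one such sequence and one logit; the predicted answer is the candidate with the highest softmax probability.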

6 Experiments

Experimental Setup. We split the data into a training/development/test set with an 80/10/10 split. We perform two types of splits: (a) random split, where questions are split uniformly at random, and (b) question concept split, where each of the three sets has disjoint question concepts. We empirically find (see below) that a random split is harder for models that learn from COMMONSENSEQA, because the same question concept appears in the training set and development/test set with different answer concepts, and networks that memorize might fail in such a scenario. Since the random split is harder, we consider it the primary split of COMMONSENSEQA.
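The two split types can be sketched as below. This is an illustration under assumptions: each example is a dict with a hypothetical `question_concept` key, and the question concept split is obtained by splitting the set of concepts 80/10/10, which only approximates an 80/10/10 split of the examples themselves.

```python
import random
from typing import Dict, List, Sequence, Tuple


def split_80_10_10(items: Sequence, seed: int = 0) -> Tuple[list, list, list]:
    """Shuffle and cut into 80% / 10% / remainder."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = (8 * len(shuffled)) // 10
    n_dev = len(shuffled) // 10
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_dev],
            shuffled[n_train + n_dev:])


def random_split(examples: Sequence[Dict], seed: int = 0) -> Tuple[list, list, list]:
    """(a) Questions are split uniformly at random."""
    return split_80_10_10(examples, seed)


def question_concept_split(examples: Sequence[Dict],
                           seed: int = 0) -> Tuple[list, list, list]:
    """(b) Each question concept appears in exactly one of the three sets."""
    concepts = sorted({ex["question_concept"] for ex in examples})
    train_c, dev_c, _ = split_80_10_10(concepts, seed)
    train_set, dev_set = set(train_c), set(dev_c)
    train: List[Dict] = []
    dev: List[Dict] = []
    test: List[Dict] = []
    for ex in examples:
        concept = ex["question_concept"]
        if concept in train_set:
            train.append(ex)
        elif concept in dev_set:
            dev.append(ex)
        else:
            test.append(ex)
    return train, dev, test
```

Under the random split the same question concept can occur in both training and test data with different answers, which is the situation the paragraph above identifies as harder for models that memorize.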

We evaluate all models on the test set using accuracy (proportion of examples for which prediction is correct), and tune hyper-parameters for all trained models on the development set. To understand the difficulty of the task, we add a SANITY mode, where we replace the hard distractors (that share a relation with the question concept and one formulated by a worker) with random CONCEPTNET distractors. We expect a reasonable baseline to perform much better in this mode.
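A possible construction of the SANITY mode and the accuracy metric is sketched below. The dict keys (`answer`, `distractors`) and the function names are hypothetical, and the sketch assumes that all distractors of an example are replaced by concepts drawn uniformly from a supplied pool, which is one reading of the description above.

```python
import random
from typing import Dict, List, Sequence


def to_sanity_mode(example: Dict, concept_pool: Sequence[str], seed: int = 0) -> Dict:
    """Return a copy of `example` whose distractors are random concepts.

    The correct answer and the original (hard) distractors are excluded from
    the pool so that every distractor is actually replaced.
    """
    rng = random.Random(seed)
    excluded = {example["answer"], *example["distractors"]}
    pool = sorted(set(concept_pool) - excluded)
    new_distractors = rng.sample(pool, len(example["distractors"]))
    return {**example, "distractors": new_distractors}


def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Proportion of examples for which the prediction is correct."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)
```

`random.Random.sample` raises `ValueError` if the pool is smaller than the number of distractors, so the pool should be much larger than four concepts in practice.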

For pre-trained word embeddings we consider 300d GloVe embeddings (Pennington et al., 2014) and 300d Numberbatch CONCEPTNET node embeddings (Speer et al., 2017), which are kept fixed at training time. We also combine ESIM with 1024d ELMo contextual representations, which are also fixed during training.

Human Evaluation. To test human accuracy, we created a separate task for which we did not use a qualification test, nor used AMT master workers. We sampled 100 random questions and for each question gathered answers from five workers that were not involved in question generation. Humans obtain 88.9% accuracy, taking a majority vote for each question.
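The majority-vote computation behind the human accuracy figure can be sketched as follows. The text does not say how ties among the five workers are broken; this sketch keeps the answer that reached the top count first in the input order, which is an assumption. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(answers: Sequence[str]) -> str:
    """Most frequent answer; ties go to the answer seen first (an assumption)."""
    return Counter(answers).most_common(1)[0][0]


def human_accuracy(votes_per_question: List[Sequence[str]], gold: List[str]) -> float:
    """Fraction of questions whose majority-vote answer matches the gold answer."""
    correct = sum(majority_vote(votes) == answer
                  for votes, answer in zip(votes_per_question, gold))
    return correct / len(gold)
```

For instance, five votes `["A", "B", "A", "C", "A"]` yield the majority answer `"A"`.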

Results. Table 5 presents test set results for all models and setups.

Table 5: Test set accuracy for all models.

The best baselines are BERT-LARGE and GPT with an accuracy of 55.9% and 45.5%, respectively, on the random split (63.6% and 55.5%, respectively, on the question concept split). This is well below human accuracy, demonstrating that the benchmark is much easier for humans. Nevertheless, this result is much higher than random (20%), showing the ability of language models to store large amounts of information related to commonsense knowledge.

Table 6: BERT-LARGE baseline analysis. For each category we provide two examples, the correct answer, one distractor, model accuracy and frequency in the dataset. The predicted answer is in bold.

The top part of Table 5 describes untrained models. We observe that performance is higher than random, but still quite low. The middle part describes models that were trained on COMMONSENSEQA, where BERT-LARGE obtains best performance, as mentioned above. ESIM models follow BERT-LARGE and GPT, and obtain much lower performance. We note that ELMo representations did not improve performance compared to GloVe embeddings, possibly because we were un-