# Unsupervised Commonsense Question Answering with Self-Talk

## Abstract

Natural language understanding involves reading between the lines with implicit background knowledge. Current systems either rely on pre-trained language models as the sole implicit source of world knowledge, or resort to external knowledge bases (KBs) to incorporate additional relevant knowledge. We propose an unsupervised framework based on *self-talk* as a novel alternative to multiple-choice commonsense tasks. Inspired by inquiry-based discovery learning (Bruner, 1961), our approach inquires language models with a number of information seeking questions such as "*what is the definition of ...*" to discover additional background knowledge. Empirical results demonstrate that the self-talk procedure substantially improves the performance of zero-shot language model baselines on four out of six commonsense benchmarks, and competes with models that obtain knowledge from external KBs. While our approach improves performance on several benchmarks, the self-talk induced knowledge, even when leading to correct answers, is not always seen as useful by human judges, raising interesting questions about the inner workings of pre-trained language models for commonsense reasoning.

## 1 Introduction

Human-level natural language understanding involves reading between the lines and relying on implicit background knowledge. Consider the scene in Figure 1: Alice let Bob stand in front of her at the concert. Using physical and social commonsense, namely (i) Bob and Alice want to see the stage, and (ii) if Bob is taller, they would block Alice's view, one can infer that Alice is taller than Bob. Such examples are ubiquitous across natural language understanding (NLU) tasks such as reading comprehension (Hirschman et al., 1999) and recognizing textual entailment (Dagan et al., 2013), and even more so in tasks dedicated to commonsense reasoning such as the Winograd schema challenge (WSC; Levesque et al., 2012).

Most current NLU models rely on pre-trained language models (LMs; e.g. Radford et al., 2019; Devlin et al., 2019). The standard practice is to use task-specific data to fine-tune a pre-trained LM in a supervised manner. Alternatively, the LM score is used to rank answer choices in a zero-shot setup (Sakaguchi et al., 2020). In both setups, pre-trained LMs yield improved performance over prior methods, largely due to the world knowledge that such LMs capture, having been trained on massive texts (Petroni et al., 2019; Davison et al., 2019).

Despite the performance boost, LMs as knowledge providers suffer from various shortcomings: (i) insufficient coverage: due to reporting bias, many trivial facts might not be captured by LMs (purple set in Figure 1), because they are rarely written about (Gordon and Van Durme, 2013). (ii) insufficient precision: the distributional training objective increases the probability of non-facts (light green set in Figure 1) that are semantically similar to true facts, as in negation ("birds cannot fly"; Kassner and Schütze, 2019). LMs excel in predicting the semantic category of a missing word, but might predict the wrong instance in that category (e.g., depending on the phrasing, BERT sometimes

arXiv:2004.05483v1 [cs.CL] 11 Apr 2020

| Dataset | Context + Question | Choices |
|---|---|---|
| COPA | The man broke his toe. What was the cause of this? | 1) He got a hole in his sock. 2) He dropped a hammer on his foot. |

Table 3: Ranking of the LMs according to their development accuracy, averaged across the different knowledge sources.

| Dataset | Rank (Mean Dev Acc.) |
|---|---|
| COPA | Distil-GPT2 (63.7) > GPT2-M (61.8) > GPT2-L (60.6) > GPT2 (59.7) > GPT (58.6) > GPT2-XL (57.9) > XLNet-base (51.9) > XLNet-L (49.5) |
| CSQA | GPT2-L (31.8) > GPT2-XL (31.2) > GPT2-M (27.7) > GPT (27.6) > GPT2 (25.6) > Distil-GPT2 (25.4) > XLNet-base (21.5) > XLNet-L (20.8) |
| MC-TACO | GPT2-XL (58.1) > GPT2-L (56.6) > GPT2-M (53) > GPT2 (50.1) > Distil-GPT2 (48.8) > GPT (47.7) > XLNet-L (37) > XLNet-base (34.2) |
| Social IQa | GPT2-XL (45.5) > GPT2-L (44.4) > GPT2-M (43.4) > GPT2 (41.8) > GPT (41.6) > Distil-GPT2 (40.4) > XLNet-L (33.6) > XLNet-base (33.1) |
| PIQA | GPT2-XL (69.6) > GPT2-L (67.9) > GPT2-M (65.6) > GPT2 (62) > Distil-GPT2 (59.6) > GPT (57.9) > XLNet-base (49.2) > XLNet-L (48.8) |
| WinoGrande | GPT2-XL (54) > GPT2-L (52.9) > GPT (52.2) > GPT2 (51.2) > Distil-GPT2 (50.9) > GPT2-M (50.2) > XLNet-base (49.1) > XLNet-L (48.7) |

Table 4: Ranking of the knowledge sources according to their development accuracy, averaged across LMs.

| Dataset | Rank (Mean Dev Acc.) |
|---|---|
| COPA | COMET (61.1) > GPT2-XL (58.6) > Google Ngrams (58.4) > GPT2-M (58.2) > XLNet-L (58.2) > GPT (58.1) > GPT2 (58.0) |
| CSQA | COMET (29.8) > Google Ngrams (29.1) > GPT2-M (26.3) > ConceptNet (26.1) > GPT2-L (26.1) > XLNet-L (25.8) > GPT2 (25.8) |
| MC-TACO | Google Ngrams (49.1) > ConceptNet (48.9) > GPT2 (48.7) > GPT2-L (48.6) > GPT2-XL (48.5) > Distil-GPT2 (48.1) > GPT2-M (48.1) |
| Social IQa | COMET (41.4) > GPT2-XL (40.9) > GPT2-L (40.6) > Distil-GPT2 (40. [...] |

2018), which works well for the simple sentences in COPA. The XLNet models perform poorly, perhaps due to their smaller training corpus (16GB vs 40GB in GPT-2, both using web text).

Best Knowledge Source. Among the knowledge-informed models, COMET achieves the best performance across tasks. This likely happens first because COMET can dynamically generate predictions for any context, while the other two knowledge sources are static and lack coverage. Second, as expected, COMET improves the predictions for Social IQa, which was built based on the ATOMIC resource on which COMET is trained. Table 4 sorts the knowledge sources based on the average development accuracy across LMs. PIQA and MC-TACO, tasks that require different types of knowledge from social commonsense, perform well with ConceptNet and Google Ngrams. With respect to self-talk models, there is a rather small difference in performance between the different LMs used as knowledge sources, with a slight preference for GPT-2 in most datasets.

We also experimented with combining the clarifications from all the knowledge sources, which didn't prove beneficial except for MC-TACO (where it added +7.9 points to the dev accuracy, bringing it to 66.7). We assume that some resources added noise, making the whole smaller than the sum of its parts.

| Dataset | Context + Question | Choices |
|---|---|---|
| CommonSenseQA | Where on a river can you hold a cup upright to catch water on a sunny day? | 1) waterfall 2) bridge 3) valley 4) pebble 5) mountain |
| MC-TACO | [...] dream of becoming a judge. How many years did it take for Mark to become a judge? | 1) 63 years 2) 7 weeks 3) 7 years 4) 7 seconds 5) 7 hours |
| Social IQa | In the school play, Robin played a hero in the struggle to the death with the angry villain. How would others feel as a result? | 1) sorry for the villain 2) hopeful that Robin will succeed 3) like Robin should lose the fight |
| PIQA | To separate egg whites from the yolk using a water bottle, you should | 1) [...] Release, which creates suction and lifts the yolk. 2) [...] Keep pushing, which creates suction and lifts the yolk. |
| WinoGrande | Katrina had the financial means to afford a new car while Monica did not, since _ had a high paying job. | 1) Katrina 2) Monica |

COMET: Bailey saved money because they wanted to buy something good. How would you describe Bailey? Bailey is seen as good with money.

predicts red as the color of a dove). Finally, (iii) it is unclear that LMs are capable of performing multiple reasoning steps involving implicit knowledge.

To increase the coverage of high-precision world knowledge and facilitate multi-hop reasoning by making intermediate reasoning steps explicit, prior work incorporated KBs (e.g. ConceptNet; Speer and Havasi, 2012) and knowledge-informed models into LM-based models (Xia et al., 2019).

In this paper, we study pre-trained LMs as an alternative to external KBs in providing knowledge to commonsense question answering tasks. We propose an unsupervised model that uses an LM as the answer scorer, and a (possibly different) LM as a knowledge source. We formulate the process of obtaining relevant knowledge as self-talk, inquiry-based discovery learning (Bruner, 1961), with the following steps: 1) seeking out knowledge by generating natural-language "clarification questions" conditioned on a given context; 2) generating their corresponding answers ("clarifications"); and 3) incorporating the clarifications as additional context.

Our model does not rely on external knowledge or additional supervision. Yet, we show that on 4 out of 6 tasks it substantially improves upon a zero-shot baseline that relies on LM score alone and performs on par with, and sometimes better than, models that use external knowledge sources.

Integrating external knowledge warrants discerning relevant and helpful facts for solving a particular instance. LMs further require identifying that a clarification is factually correct. We show that even among the clarifications that helped the prediction, humans perceived many as unhelpful or even incorrect, demonstrating that LM-based models often solve problems correctly for seemingly incorrect reasons. Our results call for future research on robust and correct knowledge integration to LM-based question answering systems.

COMET: Joel complained to Ian about the condition of the house. _ preferred a messy space.

We focused on the multiple-choice question answering tasks exemplified in Table 1 and detailed below. Each instance consists of an optional context, an optional question, and several answer choices. The development set sizes vary from 100 (COPA) to 1,954 (Social IQa).

COPA: Choice of Plausible Alternatives (Gordon et al., 2012). Asking about either a plausible cause or a plausible result, among two alternatives, of a certain event expressed in a simple sentence.

CommonSenseQA: commonsense Question Answering (Talmor et al., 2019). General questions about concepts from ConceptNet. To increase the challenge, the distractors are related to the target concept either by a relationship in ConceptNet or as suggested by crowdsourcing workers.

MC-TACO: Multiple Choice Temporal commonsense. Questions about temporal aspects of events such as ordering (Table 1), duration, stationarity, frequency, and typical time. The distractors were selected in an adversarial way using BERT.¹

Figure 2: Model illustration for WinoGrande. Each answer choice (Brett, Ian) is assigned to the concatenation of the context and a clarification. The score for each choice is the best LM score across clarifications (2 in this case).

Social IQa: Social Interaction Question Answering (Sap et al., 2019b). Questions regarding social interactions, based on the ATOMIC dataset (Sap et al., 2019a). Contexts describe social interactions and questions refer to one of a few aspects (e.g. the subject's motivation, following actions, etc.). The answers were crowdsourced.

PIQA: Physical Interaction Question Answering (Bisk et al., 2020). Questions regarding physical commonsense knowledge. Contexts are goals derived from an instruction website, typically involving less prototypical uses of everyday objects (e.g., using a bottle to separate eggs). The answers were crowdsourced, and an adversarial filtering algorithm was used to remove annotation artifacts.²

WinoGrande (Sakaguchi et al., 2020). A large-scale version of WSC that exhibits less bias thanks to adversarial filtering and use of placeholders instead of pronouns. As opposed to WSC, which was curated by experts, WinoGrande was crowdsourced with a carefully designed approach that produces diverse examples which are trivial for humans.

## 3 Models

A given instance consists of an optional context $c$, an optional question $q$, and answer choices $\{a_i\}_{i=1}^{k}$. We first describe the baseline model, which makes the prediction based on the instance alone (Section 3.1). We then describe a knowledge-informed model that relies on external resources (Section 3.2). Finally, we discuss the proposed inquiry-based model, which uses pre-trained LMs to produce clarifications (Section 3.3).

## 3.1 LM-only Baseline

We use a pre-trained language model $LM_s$ to score the plausibility of different text fragments. We experiment with the various LMs provided by the transformers package: GPT (Radford et al., 2018), GPT2 (Radford et al., 2019, all sizes), a distilled GPT2, and XLNet (Yang et al., 2019, both sizes).

We assign each of the answer choices $a_i$ into the combination of the context and the question, and obtain $opt_i = \text{combine}(c, q, a_i)$. The combine function is computed differently for each task. For example, in COPA, where the question might be either about the cause or the effect of the context, we create the following texts for cause: "[context]. The cause for it was that [choice]" and for effect: "[context]. As a result, [choice]".

We denote the score of each answer choice as $\text{score}(a_i) = CE(opt_i)$, where $CE$ is the cross-entropy loss defined as:

$$CE(t_1 \ldots t_n) = -\frac{1}{n} \sum_{i=1}^{n} \log_2 p_{LM_s}(t_i \mid t_1 \ldots t_{i-1}).$$

We predict the $a_i$ with the lowest score as the correct answer, which is the most likely option according to $LM_s$: $y = \operatorname{argmin}_i \text{score}(a_i)$.
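The scoring rule can be illustrated with a minimal sketch. It assumes a hypothetical `token_prob(prefix, token)` callable standing in for $LM_s$; the toy unigram table and its probabilities are invented for illustration and are not taken from any of the LMs used in the experiments.

```python
import math
from typing import Callable, List, Sequence

# Assumed interface for the scoring LM: p(token | preceding tokens).
TokenProb = Callable[[Sequence[str], str], float]


def cross_entropy(tokens: Sequence[str], token_prob: TokenProb) -> float:
    """Average negative log2-probability of the tokens under the LM."""
    total = 0.0
    for i, token in enumerate(tokens):
        total += math.log2(token_prob(tokens[:i], token))
    return -total / len(tokens)


def predict(options: List[Sequence[str]], token_prob: TokenProb) -> int:
    """Return the index of the option text with the lowest cross-entropy."""
    scores = [cross_entropy(opt, token_prob) for opt in options]
    return min(range(len(scores)), key=scores.__getitem__)


# Toy stand-in LM: a fixed unigram table with a floor for unseen tokens.
UNIGRAM = {"he": 0.2, "dropped": 0.1, "a": 0.2, "hammer": 0.1, "sock": 0.01}


def toy_lm(prefix: Sequence[str], token: str) -> float:
    return UNIGRAM.get(token, 0.001)


options = [["he", "got", "a", "sock"], ["he", "dropped", "a", "hammer"]]
best = predict(options, toy_lm)  # index of the most plausible option
```

With a real LM, `token_prob` would be replaced by the model's next-token probabilities over its own tokenization; the averaging and the argmin stay the same.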

## 3.2 Baseline Model With External Knowledge

In the setup illustrated in Figure 2, each instance consists of an additional clarification list: $CL = \{cl_1, \ldots, cl_m\}$. Those are text fragments containing potentially relevant knowledge for solving the instance. For instance, the clarification "The purpose of the internship is to help people find jobs" might help answer the question "which of Brett and Ian found a job less quickly after graduation?". We don't expect all the clarifications to be relevant and helpful for answering the main question. Instead, the model relies on the single clarification that increases its belief of a certain answer choice. Thus, the score of each answer choice is selected as the score of the text containing the clarification that most supports it, i.e., whose combination with it yields the minimal loss: $\text{score}(a_i) = \min_{cl \in CL} CE(opt_i + cl)$. Again we predict $y = \operatorname{argmin}_i \text{score}(a_i)$. We extract clarifications from the following sources, exemplified in Figure 3.
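A small sketch of this min-over-clarifications rule follows. The `ce` callable is an assumed stand-in for the cross-entropy under the scoring LM, the space-joined concatenation is one possible reading of $opt_i + cl$, and the loss values are made up.

```python
from typing import Callable, List, Tuple


def predict_with_clarifications(
    options: List[str],
    clarifications: List[str],
    ce: Callable[[str], float],
) -> Tuple[int, List[float]]:
    """Score each option by its best (lowest-loss) clarification, then take the argmin."""
    if not clarifications:
        raise ValueError("expected at least one clarification")
    scores = [min(ce(f"{opt} {cl}") for cl in clarifications) for opt in options]
    best = min(range(len(options)), key=scores.__getitem__)
    return best, scores


# Made-up losses standing in for CE(opt_i + cl) under the scoring LM.
TOY_LOSS = {
    "opt_1 cl_1": 3.0, "opt_1 cl_2": 2.5,
    "opt_2 cl_1": 1.9, "opt_2 cl_2": 4.0,
}

best, scores = predict_with_clarifications(
    ["opt_1", "opt_2"], ["cl_1", "cl_2"], TOY_LOSS.__getitem__
)
```

Because only the single best clarification per choice counts, irrelevant clarifications are ignored rather than averaged in, which is the behavior the paragraph above describes.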

ConceptNet. Similarly to previous work, we extract relation paths between words from the context and the question, and words from the answer choices. Since we incorporate the knowledge into the model as text, we convert each ConceptNet relation to a natural language template as in Davison et al. (2019). We limit the path length to 2 edges in order to maintain high precision.
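As a rough illustration of turning a relation path into text, the sketch below verbalizes paths of at most 2 edges. The template wordings mirror the ConceptNet clarifications shown in the appendix examples; the mapping from relation names to those wordings is an assumption made here, not the exact template set of Davison et al. (2019).

```python
from typing import List, Optional, Tuple

# Assumed relation-to-template mapping (illustrative only).
TEMPLATES = {
    "Antonym": "{head} is the opposite of {tail}",
    "UsedFor": "{head} is used for {tail}",
    "MannerOf": "{head} is a specific way of doing {tail}",
    "HasPrerequisite": "in order for {head} to happen, {tail} needs to happen",
}
MAX_EDGES = 2  # path length limit stated in the text

Edge = Tuple[str, str, str]  # (head, relation, tail)


def verbalize(path: List[Edge]) -> Optional[str]:
    """Convert a relation path into a clarification, or None if the path is too long."""
    if not path or len(path) > MAX_EDGES:
        return None
    sentences = []
    for head, relation, tail in path:
        text = TEMPLATES[relation].format(head=head, tail=tail)
        sentences.append(text[0].upper() + text[1:] + ".")
    return " ".join(sentences)


clarification = verbalize([("slice", "MannerOf", "cut"), ("knife", "UsedFor", "cut")])
```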

Corpus. For pairs of words from the context and question and from the answer choices, we extract their joint occurrences (with minimum frequency of 100) in Google N-grams (Brants and Franz). This yields text fragments of up to 5 words rather than well-formed sentences, with the potential of describing the relationship between the two words (Shwartz and Dagan, 2018).

COMET. COMET is a knowledge base construction model trained on the ATOMIC resource (Sap et al., 2019a), which consists of everyday situations along with multiple commonsense dimensions such as their causes, effects, pre- and post-conditions, etc. We generate all the dimensions unless we can generate specific relations that are more likely to help. Specifically, in Social IQa, we heuristically try to understand which type of relation in COMET the question asks for. In COPA, we use the pre-condition relations for cause questions (xIntent, xNeed) and the post-condition relations for effect questions (xEffect, xReact, xWant, oEffect, oReact, oWant). When possible, we replace personX with the syntactic subject of the context or the question.
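The COPA relation choice and the personX substitution can be sketched as below. The function names are hypothetical, the subject is assumed to be supplied by some external parser, and the Social IQa heuristic is not covered because the text does not spell it out.

```python
import re
from typing import List, Optional

# Relation lists for COPA as given in the text.
COPA_RELATIONS = {
    "cause": ["xIntent", "xNeed"],
    "effect": ["xEffect", "xReact", "xWant", "oEffect", "oReact", "oWant"],
}


def relations_for_copa(question_type: str) -> List[str]:
    """Pick pre-condition relations for cause questions, post-condition ones for effect."""
    return COPA_RELATIONS[question_type]


def replace_person_x(text: str, subject: Optional[str]) -> str:
    """Swap the personX placeholder for the syntactic subject when one is available."""
    if not subject:
        return text
    return re.sub(r"\bpersonx\b", lambda match: subject, text, flags=re.IGNORECASE)


example = replace_person_x("PersonX needed to go to the ball field", "the player")
```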

## 3.3 Self-Talk Model

Our proposed model makes the prediction identically to Figure 2, but extracts the clarifications from pre-trained LMs. We treat the knowledge extraction from LMs as a process of self-asking clarification questions about the context and "discovering" their answers. Figure 4 exemplifies this process for WinoGrande with a generator language model $LM_g$. For the sake of simplicity, the illustration depicts the process of generating a single pair of clarification question and answer. We start by generating multiple clarification questions conditioned on the context, by 1) concatenating one of several question prefixes, which we curated for each task (e.g. "What is the purpose of", see the appendix); and 2) generating 5 questions for each prefix using Nucleus sampling with p = 0.2, i.e., sampling from the smallest set of most probable tokens whose cumulative probability reaches 20% (Holtzman et al., 2019).³ We limit the question length to up to 6 tokens excluding the prefix.

Figure 4: Generating a clarification with LM: 1) Generate a question, conditioned on the context (pink) and question prefix (yellow). 2) Generate an answer, conditioned on the context, generated question and a corresponding answer prefix. The clarification is a concatenation of the answer prefix and generated text (green).
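For readers unfamiliar with Nucleus (top-p) sampling, here is a minimal sketch over a single next-token distribution: it keeps the smallest set of most probable tokens whose cumulative probability reaches p, then samples within that set. The three-token distribution is invented for illustration.

```python
import random
from typing import Dict, List


def nucleus(probs: Dict[str, float], p: float) -> List[str]:
    """Smallest set of most probable tokens whose cumulative probability reaches p."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    kept, mass = [], 0.0
    for token in ranked:
        kept.append(token)
        mass += probs[token]
        if mass >= p:
            break
    return kept


def nucleus_sample(probs: Dict[str, float], p: float, rng: random.Random) -> str:
    """Sample one token from the nucleus, proportionally to its probability."""
    kept = nucleus(probs, p)
    weights = [probs[token] for token in kept]  # rng.choices renormalizes the weights
    return rng.choices(kept, weights=weights, k=1)[0]


next_token = {"the": 0.5, "a": 0.3, "an": 0.2}  # toy distribution
low_p = nucleus(next_token, 0.2)   # only the single most probable token survives
high_p = nucleus(next_token, 0.9)  # nearly the whole vocabulary survives
```

A low p such as 0.2 therefore makes generation close to greedy decoding, while a value around 0.9 leaves most of the distribution available.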

³ This value was chosen in preliminary experiments and is significantly lower than the standard value for p in the literature, which is typically around 0.9. We use a low value because we optimize for factual correctness, and our preliminary experiments have shown that lower p values produce texts that are more "faithful" to their training corpus (at the price of being more bland).

Figure 5: Generating a clarification for Social IQa conditioned on the context, the given question (pink), and a heuristically matched answer prefix (yellow).

For each well-formed question that we obtained at the previous step, e.g. "What is the purpose of the internship?", we generate multiple answers using a similar method. Each question prefix corresponds to an answer prefix. We use the concatenation of the context, generated clarification question, and answer prefix as the prompt for generating an answer (clarification). We limit the answer length to 10 generated tokens, and use Nucleus sampling with p = 0.5. We generate 10 answers for each clarification question and keep all the well-formed clarifications. Note that the clarification questions themselves are only means to generate the clarifications, and they are not used by our model.
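The two generation steps can be sketched end to end as follows. This is a simplified illustration: `generate(prompt, p, max_tokens, n)` is a hypothetical interface to the generator LM, the well-formedness checks (a question ends with "?", an answer ends with ".") are assumptions, and "_" is assumed to mark the slot in the answer prefix that receives the generated question text.

```python
from typing import Callable, List, Tuple

# Assumed generator interface: returns n sampled continuations of the prompt.
Generate = Callable[[str, float, int, int], List[str]]


def self_talk(context: str, prefixes: List[Tuple[str, str]], generate: Generate) -> List[str]:
    """Generate clarifications: questions first (p=0.2), then their answers (p=0.5)."""
    clarifications = []
    for question_prefix, answer_template in prefixes:
        for q_tail in generate(f"{context} {question_prefix}", 0.2, 6, 5):
            q_tail = q_tail.strip()
            if not q_tail.endswith("?"):  # assumed well-formedness check
                continue
            question = f"{question_prefix} {q_tail}"
            answer_prefix = answer_template.replace("_", q_tail.rstrip("?").strip())
            prompt = f"{context} {question} {answer_prefix}"
            for a_tail in generate(prompt, 0.5, 10, 10):
                a_tail = a_tail.strip()
                if a_tail.endswith("."):  # assumed well-formedness check
                    clarifications.append(f"{answer_prefix} {a_tail}")
    return clarifications


def stub_generate(prompt: str, p: float, max_tokens: int, n: int) -> List[str]:
    """Canned outputs standing in for a real generator LM."""
    if prompt.endswith("What is the purpose of"):
        return ["the internship?", "the internship and"][:n]
    return ["to help people find jobs.", "to help people find"][:n]


result = self_talk("<context>", [("What is the purpose of", "The purpose of _ is")], stub_generate)
```

Only the clarifications are returned; the generated questions serve as intermediate prompts and are discarded, as noted in the paragraph above.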

In some datasets, an instance consists of both a context and a question. In this case, we can use the instance question as a "clarification" question and generate additional clarification questions similar to it. Figure 5 exemplifies this shortcut for Social IQa: instead of generating a clarification question, the given question "Why did Austin do this?" is used, and together with a heuristically matched answer prefix, the model can generate a potentially direct solution: "Austin did this because they wanted to keep him alive".

Since we did not train the clarification generator to ask sensical, relevant, and helpful questions, nor did we train the answer generator to generate coherent and factually correct answers, we can assume that some of the generated clarifications do not provide useful information to the model.

## 4 Results

Table 2 displays the performance of the best model in each category according to the development accuracy. We report the performance of the following models: majority baseline, LM baseline (Baseline), LM-based model with external knowledge (Ext. Knowledge), Self-talk, supervised models from prior work when applicable (Pre. Sup),⁴ and human performance. Our zero-shot models are highlighted in purple. As expected, the overall performance is worse for the zero-shot models compared to the state-of-the-art supervised models, but they perform substantially better than the majority baselines on most tasks, with the exception of WinoGrande where they only slightly outperform it. Among the LM-based models, self-talk performs on par or within a few points from the external knowledge model.

Best LM. Table 3 shows the ranking of the LMs according to their development accuracy averaged across the different knowledge sources. In general there is a preference for GPT-2, and in particular for the larger models, except for COPA in which the distilled version works best. A possible explanation might be that the language model distillation reduces the likelihood of rare words (Tang and Lin,

Dataset | Rank (Mean Dev Acc.)

## 5 Human Evaluation Of The Clarifications

While the performance on the end task serves as an extrinsic evaluation for the quality of the generated clarifications, we are also interested in evaluating it intrinsically. From preliminary experiments we know that there is a high ratio of noisy clarifications. Thus, we analyze the clarifications that help predict the correct answer, i.e. clarifications with the best LM score in their instance and whose existence changes the answer from an incorrect prediction by the baseline to a correct prediction by the model.

The annotation task was carried out in Amazon Mechanical Turk. To ensure the quality of annotations, we required that the workers be located in the US, UK, or Canada, and have a 99% approval rate for at least 5,000 prior tasks. We aggregated annotations from 3 workers using majority vote. The annotations yielded moderate levels of agreement, with Fleiss' kappa κ = 0.43 (Landis and Koch, 1977). Among the different categories of annotations we measured pairwise accuracy, which ranged from 60.41% (the answer is factually correct) to 92.26% (the question is completely not understandable).

For the sake of brevity, we focus on the analysis of the answers to the clarification questions. Figure 6 shows the human evaluation results for each combination of task and knowledge source. The top part of the figure shows that across tasks and resources, most clarifications are grammatical or at least understandable, with the exception of XLNet. The bottom part shows the percentage of clarifications considered relevant, correct, and helpful.⁶ Most clarifications were considered relevant to the context, around half of them were considered factually correct, and some 20-40% were considered helpful. Considering that these are all clarifications that indeed helped the model, this is an interesting though not completely unexpected finding: the model utilizes knowledge that humans wouldn't consider as helpful, and likely also vice versa.
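For reference, majority-vote aggregation and Fleiss' kappa can be computed as in the sketch below, which uses the standard textbook formula. The toy rating counts are invented and are unrelated to the annotations collected in this study.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(labels: Sequence[str]) -> str:
    """Return the most frequent label among the annotators of one item."""
    return Counter(labels).most_common(1)[0][0]


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = number of raters assigning item i to category j.

    Assumes every item has the same number of raters (at least 2) and that
    more than one category is used overall (otherwise the denominator is 0).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Overall proportion of assignments falling in each category.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    # Per-item agreement among rater pairs.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_cat)
    return (p_bar - p_expected) / (1 - p_expected)


toy_counts = [[3, 0], [0, 3], [2, 1], [1, 2]]  # 4 items, 3 raters, 2 categories
kappa = fleiss_kappa(toy_counts)
label = majority_vote(["helpful", "not helpful", "helpful"])
```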

Breaking down by knowledge source, we observe that when the datasets were created using a knowledge source (ConceptNet for CommonSenseQA, and ATOMIC, on which COMET is trained, for Social IQa), clarifications from that resource are considered correct. We also note that, somewhat surprisingly, relatively few ConceptNet clarifications were considered correct, despite limiting the relation paths to at most 2 edges.

## 6.1 External Knowledge In Neural Models

Approaches for incorporating external knowledge into a neural model consist of several components: (1) the task addressed; (2) neural model; (3) knowledge sources; and (4) incorporation method. Most models target tasks that require commonsense knowledge, such as the story cloze test (RocStories; Mostafazadeh et al., 2016) and machine comprehension tasks (Kočiský et al., 2018; Ostermann et al., 2018; Talmor et al., 2019). The neural component has recently shifted from biLSTM to transformer-based representations, specifically pre-trained LMs such as BERT (Devlin et al., 2019) and RoBERTa.

With respect to the knowledge source, the vast majority of papers rely on ConceptNet to extract relation paths between concepts and entities identified in the input (Speer and Havasi, 2012, see an example in Figure 3). Additional resources include WordNet (Lin et al., 2017; Wang and Jiang, 2019), mining scripts from corpora (Lin et al., 2017), knowledge base embeddings, hand-crafted rules (Lin et al., 2017; Tandon et al., 2018), and tools such as sentiment analyzers and knowledge-informed LMs.

The external knowledge is typically incorporated into the neural model by learning a vector representation of the symbolic knowledge (e.g. subgraphs from ConceptNet), and attending to it via an attention mechanism when representing the inputs (Bauer et al., 2018; Paul and Frank, 2019; Lin et al., 2019). Alternative approaches include using the knowledge to score answer candidates and prune implausible ones (Lin et al., 2017; Tandon et al., 2018), and training in a multi-task setup via auxiliary tasks pertaining to knowledge (Xia et al., 2019).

## 6.2 Extracting Knowledge From LMs

Pre-trained LMs such as GPT2 (Radford et al., 2019) and BERT (Devlin et al., 2019) capture various types of world knowledge. Petroni et al. (2019) showed that such LMs can be used in a KB completion task over ConceptNet and Wikidata (Vrandečić and Krötzsch, 2014) by converting KB relations into natural language templates and querying the LM for the missing part in the triplet (concept$_1$, relation, concept$_2$). For instance, querying BERT for suitable substitutes to the mask in "Dante was born in [MASK]" assigns the highest probability to Rome. Davison et al. (2019) similarly showed that BERT assigns higher scores to natural language fragments of true rather than fictitious ConceptNet triplets, and semi-automated the template creation by using GPT2 to score hand-crafted templates.

While both works have shown somewhat promising results, other work showed that knowledge extracted from LMs is, as expected, not always accurate. Specifically, Kassner and Schütze (2019) showed that negated facts are also considered likely by the LM, while Logan et al. (2019) pointed out that LMs may over-generalize and produce incorrect facts such as "Barack Obama's wife is Hillary".

## 6.3 Generating Questions And Explanations

There are numerous research directions investigating automatic question generation (Vanderwende, 2008). Motivations vary from data augmentation for QA tasks (Du et al., 2017; Dhingra et al., 2018; Du and Cardie, 2018; Sachan and Xing, 2018) through conversational machine reading (Saeidi et al., 2018; Pan et al., 2019) and simplifying questions to make them more easily answerable (Buck et al., 2018; Talmor and Berant, 2018; Perez et al., 2020), to using questions as a means for other purposes such as sentence representation and summarization (Guo et al., 2018; Potash and Suleman, 2019).

## 7 Discussion And Conclusion

We presented an unsupervised framework for multiple choice commonsense tasks that generates and integrates background knowledge from pre-trained LMs. On most tasks, it performs substantially better than the baseline and similarly to a model that had access to external knowledge resources.

By design, our model makes a single additional reasoning step explicit. A preliminary experiment in which we incorporated clarification pairs to facilitate two hops got mixed results. An interesting future direction is to generate each clarification in response to the previous ones, in a dialogue setup (Saeidi et al., 2018) . Another challenge is the "needle in a haystack" problem of the clarifications, and one way to address it is to develop a model that is capable of "introspection", specifically knowing what it doesn't know. A more structured knowledge generation might also make the combination of various knowledge sources more successful.

Filling in knowledge gaps and making implicit intermediate reasoning steps explicit is imperative going forward. We hope that our framework will facilitate future research in this area.

Our code and data are available at github.com/vered1986/self_talk.

The question and answer prefixes used for each task. "_" in the answer prefix is replaced with the generated question (excluding the question mark), e.g. "What is the definition of a cat?" yields the following answer prefix: "The definition of a cat is". The question and answer templates in Social IQa correspond to COMET dimensions. "[NAME]" is replaced with the syntactic subject of the sentence (see for example the first row in Table 6).

Her teammate threw it to her.

Before that, what did the player need?

Before, the player needed to go to the ball field.

ConceptNet

The boy skipped dinner. What was the cause for it?

He ate a big lunch.

What is the relationship between 'dinner' and 'lunch'?

Lunch is the opposite of dinner.

Working on the elaborate task was taxing, it require extreme what?

concentration

What is the relationship between 'working' and 'concentration'?

In order for working to happen, concentration needs to happen.

Jan had been on vacation in Hawaii for a week. Last night, she finally returned home. What will Jan want to do next?

As a result, Jan wanted to return to work.

What is the relationship between 'vacation' and 'work'?

Vacation is the opposite of work.

How can I slice up a banana?

Use a knife, on a cutting board, and slice the banana into quarter sized slices.

What is the relationship between 'slice' and 'knife'?

Slice is a specific way of doing cut. Knife is used for cut.

Distil-GPT2

What do professors primarily do?

Robin proposed another solution after the first one was not successful. What does Robin need to do before this?

Before, Robin needed to come up with other solutions.

What does Robin need to do before this?

Before doing that, Robin first had to do some research on how he could make it.

Teach courses

What is the main function of a professor's teaching career,?

The main function of a professor's teaching career, is to teach students how they can improve their knowledge.

MC-TACO

GPT2-XL

Growing up on a farm near St. Paul, L. Mark Bailey didn't dream of becoming a judge. How many years did it take for Mark to become a judge?

7 years

What happened first?

The first thing that happened was he went to law school and became an attorney.

GPT

After several years of private practice from 1982-90, he became the judge of Decatur County Court for a year. When did he get a law degree?

1981

What happened first?

The first thing that happened was that his wife, linda, had passed away.

How would you describe Bailey?

Bailey is seen as hardworking.

GPT2

Skylar always arrives 15 minutes early for class, takes extensive notes, and studies hard before tests. How would you describe Skylar?

Skylar is seen as a dedicated student.

How would you describe Skylar?

Skylar is a very good student who does well in the classroom.

The hotel had a terrible gym but a nice restaurant; the _ was something management attended to.

restaurant

What is the definition of 'restaurant'?

'restaurant' is defined as an establishment that serves food, drinks and services.

GPT2-M

Carson was at a friends house and they decided to play video games. What does Carson need to do before this?

Before, Carson needed to know about their friends house.

What does Carson need to do before this?

Before doing that, Carson first had to get his mother's permission before playing the game.

Mark wanted to lose some weight by starting a diet and workout routine. He focused mainly on the _ because he was lazy.

diet

What is the definition of 'workout routine'?

'workout routine' is defined as any kind of exercise that involves doing physical activity.

GPT2-L

Taylor gave a raise and a promotion to Kendall. How would you describe Taylor?

Taylor is seen as as someone who appreciates what Kendall does.

How would you describe Taylor?

Taylor is a good guy, but he's also an idiot.

COMET

To cut circles out of dough without a cookie cutter.

Use a glass by turning it upside down and using the rim.

What does you want as a result?

As a result, you wants to get the dough out of the oven.

GPT2-XL

What pan to use for frying eggs?

Nonstick pans are perfect for frying eggs.

What is the definition of a 'pan'?

The definition of a 'pan' is : A metal or ceramic vessel used in cooking.

XLNet-base

Sharpen a pencil.

Use a knife.

What are the properties of a you sharpen a pencil,?

The properties of a you sharpen a pencil, are that it will not break or be dulled.

The parakeet flew to Hunter but flew past Samuel because _ had some bird seed in their hand.

Ian

What does Joel want as a result?

As a result, Joel wants to get rid of the clutter.

Hunter

What does it mean to be a 'bird'?

Be a 'bird' means the same thing as it does for us humans.

GPT

Matt wanted to change either the plain bathroom or the colorful bedroom, but the _ was already colorful.

bedroom

What is the definition of 'bedroom'?

'bedroom' is defined as a room with a bed and nightstand in it.

XLNet-base

My employer offers a bonus of either a phone or a television, but unfortunately the _ is just way too large to be useful.

phone

What is the definition of 'phone'?

'phone' is defined as device that connects a person directly into the world.

Table 6: Example instances from each dataset and the clarifications generated for them in various resources. We only include clarifications that helped the model predict the correct answer.

¹ To make this task compatible with the other tasks, we only kept a single correct answer per instance, making our results not comparable to previously reported results.

² Word associations and dataset-specific features that are not informative for the task are identified by a strong baseline and removed (Gururangan et al., 2018; Zellers et al., 2018).