
Social IQA: Commonsense Reasoning about Social Interactions


Abstract

We introduce Social IQa, the first large-scale benchmark for commonsense reasoning about social situations. Social IQa contains 38,000 multiple choice questions for probing emotional and social intelligence in a variety of everyday situations (e.g., Q: "Jordan wanted to tell Tracy a secret, so Jordan leaned towards Tracy. Why did Jordan do this?" A: "Make sure no one else could hear"). Through crowdsourcing, we collect commonsense questions along with correct and incorrect answers about social interactions, using a new framework that mitigates stylistic artifacts in incorrect answers by asking workers to provide the right answer to a different but related question. Empirical results show that our benchmark is challenging for existing question-answering models based on pretrained language models, compared to human performance (>20% gap). Notably, we further establish Social IQa as a resource for transfer learning of commonsense knowledge, achieving state-of-the-art performance on multiple commonsense reasoning tasks (Winograd Schemas, COPA).

1 Introduction

Social and emotional intelligence enables humans to make pragmatic inferences about others' mental states and anticipate their behavior (Ganaie and Mudasir, 2015). For example, when someone spills food all over the floor, we can infer that they will likely want to clean up the mess, rather than taste the food off the floor or run around in the mess (Figure 1, middle). This example illustrates how Theory of Mind, i.e., the ability to reason about the implied emotions and behavior of others, enables humans to navigate social situations ranging from simple conversations with friends to complex negotiations in courtroom settings (Apperly, 2010).

Figure 1: Three context-question-answers triples from SOCIAL IQA, along with the type of reasoning required to answer them. In the top example, humans can trivially infer that Tracy pressed upon Austin because there was no room in the elevator. Similarly, in the bottom example, commonsense knowledge tells us that people typically root for the hero, not the villain.

[Figure 1, top panel (Reasoning About Motivation). Context: "Tracy had accidentally pressed upon Austin in the small elevator and it was awkward." Q: "Why did Tracy do this?" A: (a) get very close to Austin; (b) squeeze into the elevator; (c) get flirty with Austin]

While humans trivially acquire and develop such social reasoning skills (Moore, 2013), this is still a challenge for machine learning models, in part due to the lack of large-scale resources to train and evaluate modern AI systems' social and emotional intelligence. Although recent advances in pretraining large language models have yielded promising improvements on several commonsense inference tasks, these models still struggle to reason about social situations, as shown in this and previous work (Davis and Marcus, 2015; Nematzadeh et al., 2018; Talmor et al., 2019). This is partly due to language models being trained on written text corpora, where reporting bias of knowledge limits the scope of commonsense knowledge that can be learned (Gordon and Van Durme, 2013; Lucy and Gauthier, 2017).

In this work, we introduce Social Intelligence QA (SOCIAL IQA), the first large-scale resource to learn and measure social and emotional intelligence in computational models. 1 SOCIAL IQA contains 38k multiple choice questions regarding the pragmatic implications of everyday, social events (see Figure 1 ). To collect this data, we design a crowdsourcing framework to gather contexts and questions that explicitly address social commonsense reasoning. Additionally, by combining handwritten negative answers with adversarial question-switched answers (Section 3.3), we minimize annotation artifacts that can arise from crowdsourcing incorrect answers (Schwartz et al., 2017; Gururangan et al., 2018) .

This dataset remains challenging for AI systems, with our best performing baseline reaching 64.5% (BERT-large), significantly lower than human performance. We further establish SOCIAL IQA as a resource that enables transfer learning for other commonsense challenges, through sequential finetuning of a pretrained language model on SOCIAL IQA before other tasks. Specifically, we use SOCIAL IQA to set a new state-of-the-art on three commonsense challenge datasets: COPA (Roemmele et al., 2011) (83.4%) , the original Winograd (Levesque, 2011) (72.5%), and the extended Winograd dataset from Rahman and Ng (2012) (84.0%).

Our contributions are as follows: (1) We create SOCIAL IQA, the first large-scale QA dataset aimed at testing social and emotional intelligence, containing over 38k QA pairs. (2) We introduce question-switching, a technique to collect incorrect answers that minimizes stylistic artifacts due to annotator cognitive biases. (3) We establish baseline performance on our dataset, with BERT-large performing at 64.5%, well below human performance. (4) We achieve new state-of-the-art accuracies on COPA and Winograd through sequential finetuning on SOCIAL IQA, which implicitly endows models with social commonsense knowledge.

2 Task Description

SOCIAL IQA aims to measure the social and emotional intelligence of computational models through multiple choice question answering (QA). In our setup, models are confronted with a question explicitly pertaining to an observed context, where the correct answer can be found among three competing options. By design, the questions require inferential reasoning about the social causes and effects of situations, in line with the type of intelligence required for an AI assistant to interact with human users (e.g., know to call for help when an elderly person falls; Pollack, 2005) . As seen in Figure 1 , correctly answering questions requires reasoning about motivations, emotional reactions, or likely preceding and following actions. Performing these inferences is what makes us experts at navigating social situations, and is closely related to Theory of Mind, i.e., the ability to reason about the beliefs, motivations, and needs of others (Baron-Cohen et al., 1985) . 2 Endowing machines with this type of intelligence has been a longstanding but elusive goal of AI (Gunning, 2018) .
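To make this setup concrete, the sketch below shows how a single SOCIAL IQA instance (content taken from Figure 1) might be represented; the field names and dictionary layout are purely illustrative and are not the dataset's actual release schema.

```python
# An illustrative SOCIAL IQA instance (content from Figure 1, top example).
# Field names are hypothetical, chosen only for exposition.
example = {
    "context": "Tracy had accidentally pressed upon Austin in the small elevator "
               "and it was awkward.",
    "question": "Why did Tracy do this?",
    "answers": [
        "get very close to Austin",
        "squeeze into the elevator",   # correct: there was no room in the elevator
        "get flirty with Austin",
    ],
    "label": 1,                        # index of the correct answer
}
```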

ATOMIC

As a starting point for our task creation, we draw upon social commonsense knowledge from ATOMIC (Sap et al., 2019) to seed our contexts and question types. ATOMIC is a large knowledge graph that contains inferential knowledge about the causes and effects of 24k short events. Each triple in ATOMIC consists of an event phrase with person-centric variables, one of nine inference dimensions, and an inference object (e.g., "PersonX pays for PersonY's ___", "xAttr", "generous").
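For illustration, one such triple could be represented as below; the dictionary layout is our own sketch, not ATOMIC's actual release format.

```python
# A single ATOMIC-style triple from the example above: an event phrase with
# person-centric variables, one of nine inference dimensions, and an inference object.
triple = {
    "event": "PersonX pays for PersonY's ___",
    "dimension": "xAttr",      # attribute of PersonX implied by the event
    "inference": "generous",
}
```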

Given this base, we generate natural language contexts that represent specific instantiations of the event phrases found in the knowledge graph. Furthermore, the questions created probe the commonsense reasoning required to navigate such contexts. Critically, since these contexts are based on ATOMIC, they explore a diverse range of motivations and reactions, as well as likely preceding or following actions.

3 Dataset Creation

SOCIAL IQA contains 37,588 multiple choice questions with three answer choices per question. Questions and answers are gathered through three phases of crowdsourcing aimed at collecting the context, the question, and a set of positive and negative answers. We run crowdsourcing tasks on Amazon Mechanical Turk (MTurk) to create each of the three components, as described below.

3.1 Event Rewriting

In order to cover a variety of social situations, we use the base events from ATOMIC as prompts for context creation. As a pre-processing step, we run an MTurk task that asks workers to turn an ATOMIC event (e.g., "PersonX spills all over the floor") into a sentence by adding names, fixing potential grammar errors, and filling in placeholders (e.g., "Alex spilled food all over the floor."). 3

3.2 Context, Question, & Answer Creation

Next, we run a task where annotators create full context-question-answers triples. We automatically generate question templates covering the nine commonsense inference dimensions in ATOMIC. 4 Crowdworkers are prompted with an event sentence and an inference question, which they turn into a more detailed context 5 (e.g., "Alex spilled food all over the floor and it made a huge mess.") and, if needed for improved specificity, an edited version of the question (e.g., "What will Alex want to do next?"). Workers are also asked to contribute two potential correct answers.

3.3 Negative Answers

In addition to correct answers, we collect four incorrect answer options, of which we filter out two. To create incorrect options that are adversarial for models but easy for humans, we use two different approaches to the collection process. These two methods are specifically designed to avoid different types of annotation artifacts, thus making it more difficult for models to rely on data biases. We integrate and filter answer options and validate final QA tuples with human rating tasks.

Table 1: Data statistics for SOCIAL IQA.

Handwritten Incorrect Answers (HIA) The first method involves eliciting handwritten incorrect answers that require reasoning about the context. These answers are handwritten to be similar to the correct answers in terms of topic, length, and style, but are subtly incorrect. Two of these answers are collected during the same MTurk task as the original context, questions, and correct answers. We refer to these negative responses as handwritten incorrect answers (HIA).

Question-Switching Answers (QSA) The second method involves collecting correct answers to a different question about the same context, as shown in Figure 2. We do this to avoid cognitive biases and annotation artifacts in the answer candidates, such as those caused by writing incorrect answers or negations (Schwartz et al., 2017; Gururangan et al., 2018). In this crowdsourcing task, we provide the same context as the original question, as well as a question automatically generated from a different but similar ATOMIC dimension, 6 and ask workers to write two correct answers. We refer to these negative responses as question-switching answers (QSA). By including correct answers to a different question about the same context, we ensure that these adversarial responses have the stylistic qualities of correct answers and strongly relate to the context topic, while still being incorrect, making it difficult for models to simply perform pattern-matching. To verify this, we compare valence, arousal, and dominance (VAD) levels across answer types, computed using the VAD lexicon by Mohammad (2018). Table 2 shows effect sizes (Cohen's d) of the differences in VAD means. Indeed, QSA and correct answers differ substantially less than HIA answers (d≤.1). 7

Figure 2: Question-Switching Answers (QSA) are collected as the correct answers to the wrong question that targets a different type of inference (here, reasoning about what happens before instead of after an event).
Table 2: Effect sizes (Cohen’s d) when comparing average dominance, arousal and valence values of different answer types (d>0: answer type A has higher mean than answer type B). For valence (sentiment polarity) and dominance, effect sizes comparing QSA and correct answers are much smaller, indicating that these are more similar tonally. Notably, all three answer types have comparable levels of arousal (intensity).
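As a rough sketch of the comparison reported in Table 2, the snippet below computes per-answer valence under a word-level lexicon and the Cohen's d between two answer types. The toy lexicon dict and helper names are our own; the Mohammad (2018) VAD lexicon would supply the actual word scores.

```python
import statistics

def cohens_d(xs, ys):
    """Cohen's d: difference of means divided by the pooled standard deviation."""
    nx, ny = len(xs), len(ys)
    pooled_var = ((nx - 1) * statistics.variance(xs)
                  + (ny - 1) * statistics.variance(ys)) / (nx + ny - 2)
    return (statistics.mean(xs) - statistics.mean(ys)) / pooled_var ** 0.5

def answer_valence(answer, vad_lexicon):
    """Average valence of an answer's in-vocabulary words (0.5 if none found)."""
    scores = [vad_lexicon[w] for w in answer.lower().split() if w in vad_lexicon]
    return statistics.mean(scores) if scores else 0.5

# Example usage (qsa_answers, correct_answers, vad are placeholders):
#   qsa_vals = [answer_valence(a, vad) for a in qsa_answers]
#   cor_vals = [answer_valence(a, vad) for a in correct_answers]
#   effect = cohens_d(qsa_vals, cor_vals)
# d <= .1 between QSA and correct answers indicates near-identical tone;
# d < .20 is conventionally a small effect (Sawilowsky, 2009).
```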

3.4 QA Tuple Creation

As the final step of the pipeline, we aggregate the data into three-way multiple choice questions. For each created context-question pair contributed by crowdsourced workers, we select a random correct answer and the incorrect answers that are least entailed by the correct one, following inspiration from Zellers et al. (2019) .
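A minimal sketch of this selection step is shown below. The paper does not specify the entailment scorer, so entail_score here is a hypothetical stand-in for whatever model is used; only the "least entailed" logic comes from the text above.

```python
from typing import Callable, List

def pick_distractors(correct: str,
                     candidates: List[str],
                     entail_score: Callable[[str, str], float],
                     k: int = 2) -> List[str]:
    """Keep the k candidate answers least entailed by the correct answer.

    entail_score(premise, hypothesis) is a hypothetical scoring function
    (higher = more strongly entailed); any NLI-style model could back it.
    """
    ranked = sorted(candidates, key=lambda cand: entail_score(correct, cand))
    return ranked[:k]  # least entailed candidates first
```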

For the training data, we validate our QA tuples through a multiple-choice crowdsourcing task in which three workers are asked to select the right answer to the question provided. 8 To ensure even higher quality, we validate the dev and test data a second time with five workers. Our final dataset contains questions for which the correct answer was determined by human majority voting, discarding cases without a majority vote. We also apply filtering to make the task more challenging, using a deep stylistic classifier to remove easier examples from the dev and test sets. 9 To obtain human performance, we run a separate task asking three new workers to select the correct answer on a random subset of 400 dev and 400 test examples. Human performance on these subsets is 88% and 86%, respectively.
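The validation step amounts to simple majority voting over annotator choices; a minimal sketch (our own helper, not released code) is:

```python
from collections import Counter

def gold_answer(votes):
    """Return the majority-voted answer, or None if no strict majority exists
    (such questions are discarded). Train uses 3 validators, dev/test use 5."""
    answer, count = Counter(votes).most_common(1)[0]
    return answer if count > len(votes) / 2 else None

# gold_answer(["b", "b", "a"]) -> "b";  gold_answer(["a", "b", "c"]) -> None (discarded)
```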

3.5 Data Statistics

To keep contexts separate across the train/dev/test sets, we assign SOCIAL IQA contexts to the same partition as the ATOMIC event the context was based on. As shown in Table 1 (top), this yields around 33k training, 2k dev, and 2k test tuples. We additionally include statistics on word counts and vocabulary of the training data, reporting averages for correct and incorrect answers in terms of token length, number of unique tokens, and number of times a unique answer appears in the dataset. Note that due to our three-way multiple choice setup, there are twice as many incorrect answers as correct ones, which influences these statistics. We also include a breakdown (Figure 3) across question types, which we derive from ATOMIC inference dimensions. 10 In general, questions relating to what someone will feel afterwards or what they will likely do next are more common in SOCIAL IQA. Conversely, questions pertaining to the (potentially involuntary) effects of situations on people are less frequent.

Figure 3: SOCIAL IQA contains several question types which cover different types of inferential reasoning. Question types are derived from ATOMIC inference dimensions.

4 Methods

We establish baseline performance on SOCIAL IQA using large pretrained language models based on the Transformer architecture (Vaswani et al., 2017). Namely, we finetune OpenAI-GPT (Radford et al., 2018) and BERT (Devlin et al., 2019), which have both shown remarkable improvements on a variety of tasks. OpenAI-GPT is a uni-directional language model trained on the BookCorpus (Zhu et al., 2015), whereas BERT is a bidirectional language model trained on both the BookCorpus and English Wikipedia. As in previous work, we finetune the language model representations but fully learn the classifier-specific parameters described below.

Multiple choice classification To classify sequences using these language models, we follow the multiple-choice setup implemented by the respective authors, as described below. First, we concatenate the context, question, and answer, using the model-specific separator tokens. For OpenAI-GPT, the concatenated sequence is wrapped with the special start, delimiter, and classify function tokens. For BERT, the format is similar, but the classifier token comes before the context. 11 For each triple, we then compute a score $l$ by passing the hidden representation of the classifier token, $h_{\text{CLS}} \in \mathbb{R}^{H}$, through an MLP:

$$l = W_2 \tanh(W_1 h_{\text{CLS}} + b_1)$$

where $W_1 \in \mathbb{R}^{H \times H}$, $b_1 \in \mathbb{R}^{H}$, and $W_2 \in \mathbb{R}^{1 \times H}$. Finally, we normalize the scores across all triples for a given context-question pair using a softmax layer. The model's predicted answer corresponds to the triple with the highest probability.
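A minimal PyTorch sketch of this scoring head is given below, assuming the encoder's classifier-token states have already been computed; the module and variable names are ours, not the released implementation.

```python
import torch
import torch.nn as nn

class MultipleChoiceHead(nn.Module):
    """Scores each (context, question, answer) triple from its classifier-token
    representation: l = W2 tanh(W1 h_CLS + b1), then softmax across choices."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, hidden_size)     # W1, b1
        self.score = nn.Linear(hidden_size, 1, bias=False)  # W2

    def forward(self, h_cls: torch.Tensor) -> torch.Tensor:
        # h_cls: (num_choices, hidden_size), one row per answer option
        logits = self.score(torch.tanh(self.proj(h_cls))).squeeze(-1)
        return torch.softmax(logits, dim=-1)  # predicted answer = argmax

# e.g., for BERT-large (H=1024) and three answer options:
#   probs = MultipleChoiceHead(1024)(h_cls)   # h_cls shape: (3, 1024)
```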

5 Experiments

5.1 Experimental Set-Up

We train our models on the 33k SOCIAL IQA training instances, selecting hyperparameters based on the best-performing model on our dev set, for which we then report test results. Specifically, we perform finetuning through a grid search over the hyperparameter settings (learning rate in {1e−5, 2e−5, 3e−5}, batch size in {3, 4, 8}, and number of epochs in {3, 4, 10}) and report the maximum performance.
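The grid search itself is straightforward; a sketch is shown below, where finetune_and_eval is a hypothetical helper wrapping one finetuning run and returning dev accuracy.

```python
from itertools import product

LEARNING_RATES = [1e-5, 2e-5, 3e-5]
BATCH_SIZES = [3, 4, 8]
EPOCHS = [3, 4, 10]

best_acc, best_cfg = 0.0, None
for lr, bs, ep in product(LEARNING_RATES, BATCH_SIZES, EPOCHS):
    acc = finetune_and_eval(lr=lr, batch_size=bs, epochs=ep)  # hypothetical helper
    if acc > best_acc:
        best_acc, best_cfg = acc, {"lr": lr, "batch_size": bs, "epochs": ep}
```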


Models used in our experiments vary in size: OpenAI-GPT (117M parameters) has a hidden size of H=768, while BERT-base (110M parameters) and BERT-large (340M parameters) have hidden sizes of H=768 and H=1024, respectively. We train using the HuggingFace PyTorch (Paszke et al., 2017) implementation. 12

5.2 Results

Our results (Table 3) suggest SOCIAL IQA is still a challenging benchmark for existing computational models, compared to human performance. Our best performing model, BERT-large, outperforms other models by several points on the dev and test set. We additionally ablate our best model's representation by removing the context and question from the input, verifying that reasoning over both is necessary for this task.

Table 3: Experimental results. We additionally perform an ablation by removing contexts and questions, verifying that both are necessary for BERT-large’s performance. Human evaluation results are obtained using 400 randomly sampled examples.

Learning Curve To better understand the effect of dataset scale on model performance on our task, we simulate training situations with limited knowledge. We present the learning curve of BERT-large's performance on the dev set as it is trained on more training set examples (Figure 4) . Although the model does significantly improve over a random baseline of 33% with only a few hundred examples, the performance only starts to converge after around 20k examples, providing evidence that large-scale benchmarks are required for this type of reasoning.

Figure 4: Dev accuracy when training BERT-large with varying numbers of training examples (multiple runs per training size), with human performance (88.1%) shown in orange. In order to reach >80%, the model would require nearly 1 million training examples.

Error Analysis We include a breakdown of our best model's performance on various question types in Figure 5 and specific examples of errors in the last four rows of Table 4 . Overall, questions related to pre-conditions of the context (people's motivations, actions needed before the context) are less challenging for the model. Conversely, the model seems to struggle more with questions relating to (potentially involuntary) effects, stative descriptions, and what people will want to do next.

Figure 5: Average dev accuracy of BERT-large on different question types. While questions about effects and motivations are easier, the model still finds wants and descriptions more challenging.
Table 4: Example CQA triples from the SOCIAL IQA dev set with BERT-large's predictions (the model's selected answer and the true correct answer are marked). The model predicts correctly in (1) and (2) and incorrectly in the other four examples shown here. Examples (3) and (4) illustrate the model choosing answers that might have happened before, or that might happen much later after, the context, as opposed to right after the context situation. In Examples (5) and (6), the model chooses answers that may apply to people other than the ones being asked about.

In examples (3) and (4) of Table 4, the model selects answers that are incorrectly timed with respect to the context and question (e.g., "arrive at a hotel" is something Remy likely did before checking in with the concierge, not afterwards). Additionally, the model often chooses answers related to a person other than the one asked about. In (6), after the arm wrestling, it is likely that Aubrey will feel ashamed, but the question asks what Alex might feel, not Aubrey. This illustrates how challenging it is for models to make nuanced inferences about social situations, whereas humans can trivially reason about the causes and effects for multiple participants.

6 SOCIAL IQA for Transfer Learning

In addition to introducing the first large-scale benchmark for social commonsense, we show that SOCIAL IQA can improve performance on downstream tasks that require commonsense, namely the Winograd Schema Challenge and the Choice of Plausible Alternatives task. We improve the state of the art on both tasks by sequentially finetuning on SOCIAL IQA before the task itself.

COPA The Choice of Plausible Alternatives task (COPA; Roemmele et al., 2011) is a two-way multiple choice task which aims to measure the commonsense reasoning abilities of models. The dataset contains 1,000 questions (500 dev, 500 test) that ask about the causes and effects of a premise. This has been a challenging task for computational systems, partially due to the limited amount of training data available. As done previously (Goodwin et al., 2012; Luo et al., 2016), we finetune our models on the dev set and report performance only on the test set.

Winograd Schema The Winograd Schema Challenge (WSC; Levesque, 2011) is a well-known commonsense knowledge challenge framed as a coreference resolution task. It contains a collection of 273 short sentences in which a pronoun must be resolved to one of two antecedents (e.g., in "The city councilmen refused the demonstrators a permit because they feared violence", they refers to the councilmen). Because of data scarcity in WSC, Rahman and Ng (2012) created 943 Winograd-style sentence pairs (1,886 sentences in total), henceforth referred to as DPR, which has been shown to be slightly less challenging than WSC for computational models.

We evaluate on these two benchmarks. While the DPR dataset is split into train and test sets (Rahman and Ng, 2012), the WSC dataset consists of a single test set of only 273 instances, intended for evaluation only. Therefore, we use the DPR dataset as the training set when evaluating on WSC.

6.1 Sequential Finetuning

We first finetune BERT-large on SOCIAL IQA, which reaches 66% on our dev set (Table 3). We then finetune that model further on the task-specific datasets, considering the same set of hyperparameters as in §5.1. On each of the test sets, we report the best, mean, and standard deviation across all models, and compare sequential finetuning results to a BERT-large baseline.
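Schematically, the recipe is two finetuning stages run in sequence; the sketch below uses hypothetical train/evaluate helpers that stand in for the multiple-choice finetuning loop from Section 4.

```python
def sequential_finetune(pretrained_model, social_iqa, target_train, target_test,
                        train, evaluate, hparam_grid):
    """Two-stage finetuning: SOCIAL IQA first, then the target task.
    train/evaluate are hypothetical stand-ins for the finetuning loop."""
    model = train(pretrained_model, social_iqa, hparam_grid)   # stage 1: SOCIAL IQA (~66% on its dev set)
    model = train(model, target_train, hparam_grid)            # stage 2: e.g., COPA dev or DPR train
    return evaluate(model, target_test)
```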

Results As shown in Table 5, sequential finetuning on SOCIAL IQA yields substantial improvements over the BERT-only baseline (maximum performance increases of between 2.6% and 5.5%), as well as a general increase in performance stability (i.e., lower standard deviations). Echoing similar findings by Phang et al. (2019), this suggests that BERT-large can benefit from both the large scale and the QA format of the commonsense knowledge in SOCIAL IQA, which it struggles to learn from small benchmarks alone.

Table 5: Sequential finetuning of BERT-large on SOCIAL IQA before the task yields state-of-the-art results (bolded) on COPA (Roemmele et al., 2011), the Winograd Schema Challenge (Levesque, 2011), and DPR (Rahman and Ng, 2012). For comparison, we include previously reported state-of-the-art performance (* denotes unpublished work).

Finally, we find that the sequentially finetuned BERT-SOCIAL IQA model achieves state-of-the-art results on all three tasks, improving over the previous best-performing models. 13

7 Related Work

Commonsense Benchmarks: Commonsense benchmark creation has been well-studied by previous work. Notably, the Winograd Schema Challenge (WSC; Levesque, 2011) and the Choice of Plausible Alternatives dataset (COPA; Roemmele et al., 2011) are expert-curated collections of commonsense QA pairs that are trivial for humans to solve. Whereas WSC requires physical and social commonsense knowledge to solve, COPA targets knowledge of the causes and effects surrounding social situations. While both benchmarks are high-quality and created by experts, their small scale (150 and 1,000 examples, respectively) poses a challenge for modern modelling techniques, which require many training instances.

More recently, Talmor et al. (2019) introduce CommonsenseQA, containing 12k multiple-choice questions. Crowdsourced using ConceptNet (Speer and Havasi, 2012), these questions mostly probe knowledge related to factual and physical commonsense (e.g., "Where would I not want a fox?"). In contrast, SOCIAL IQA explicitly separates contexts from questions and focuses on the types of commonsense inferences humans perform when navigating social situations.

Commonsense Knowledge Bases: In addition to large-scale benchmarks, there is a wealth of work aimed at creating commonsense knowledge repositories (Speer and Havasi, 2012; Sap et al., 2019; Zhang et al., 2017; Lenat, 1995; Espinosa and Lieberman, 2005; Gordon and Hobbs, 2017) that can be used as resources in downstream reasoning tasks. While SOCIAL IQA is formatted as a natural language QA benchmark, rather than a taxonomic knowledge base, it also can be used as a resource for external tasks, as we have demonstrated experimentally.

Constrained or Adversarial Data Collection:

Various work has investigated ways to circumvent annotation artifacts that result from crowdsourcing. Sharma et al. (2018) extend the Story Cloze data by severely restricting the incorrect story ending generation task, reducing some of the sentiment and negation artifacts. Rajpurkar et al. (2018) create an adversarial version of the extractive question-answering challenge SQuAD (Rajpurkar et al., 2016) by creating 50k unanswerable questions. Instead of using human-generated incorrect answers, Zellers et al. (2018) use adversarial filtering of machine-generated incorrect answers to minimize their surface patterns. Our dataset also aims to reduce annotation artifacts by using a multi-stage annotation pipeline in which we collect negative responses with multiple methods, including a unique adversarial question-switching technique.

8 Conclusion

We present SOCIAL IQA, the first large-scale benchmark for social commonsense. Consisting of 38k multiple-choice questions, SOCIAL IQA covers various types of inference about people's actions as described in situational contexts. We design a crowdsourcing framework for collecting QA pairs that reduces stylistic artifacts in negative answers through an adversarial question-switching method. Despite human performance of close to 90%, computational approaches based on large pretrained language models only achieve accuracies of up to 65%, suggesting that these social inferences remain a challenge for AI systems. In addition to providing a new benchmark, we demonstrate how transfer learning from SOCIAL IQA to other commonsense challenges can yield significant improvements, achieving new state-of-the-art performance on both COPA and Winograd.

2 Theory of Mind is well developed in most neurotypical adults (Ganaie and Mudasir, 2015), but can be influenced by age, culture, or developmental disorders (Korkmaz, 2011).

4 We do not generate templates if the ATOMIC dimension is annotated as "none."
5 Workers were asked to contribute a context 7-25 words longer than the event sentence.

6 Using the following three groupings of ATOMIC dimensions: {xWant, oWant, xNeed, xIntent}, {xReact, oReact, xAttr}, and {xEffect, oEffect}.
7 Cohen's d<.20 is considered small (Sawilowsky, 2009). We find similarly small effect sizes using other sentiment/emotion lexicons.
8 Agreement on this task was high (Cohen's κ=.70).
9 We also tried filtering to remove easier examples from the training set but found it did not significantly change performance. We will release tags for the easier training examples with the full data.
10 We group agent and theme ATOMIC dimensions together (e.g., "xReact" and "oReact" become the "reactions" question type).

11 BERT's format is [CLS] [UNUSED] [SEP] [SEP].

12 https://github.com/huggingface/pytorch-pretrained-BERT

13 Note that unlike our model, Radford et al. (2019) obtained 70.7% on WSC in a zero-shot setting. Also note that OpenAI-GPT was reported to achieve 78.6% on COPA, but that result was not published, nor discussed in the OpenAI-GPT white paper (Radford et al., 2018).