
Social Bias Frames: Reasoning about Social and Power Implications of Language


Abstract

Warning: this paper contains content that may be offensive or upsetting. Language has the power to reinforce stereotypes and project social biases onto others. At the core of the challenge is that it is rarely what is stated explicitly, but rather the implied meanings, that frame people’s judgments about others. For example, given a statement that “we shouldn’t lower our standards to hire more women,” most listeners will infer the implicature intended by the speaker - that “women (candidates) are less qualified.” Most semantic formalisms, to date, do not capture such pragmatic implications in which people express social biases and power differentials in language. We introduce Social Bias Frames, a new conceptual formalism that aims to model the pragmatic frames in which people project social biases and stereotypes onto others. In addition, we introduce the Social Bias Inference Corpus to support large-scale modelling and evaluation with 150k structured annotations of social media posts, covering over 34k implications about a thousand demographic groups. We then establish baseline approaches that learn to recover Social Bias Frames from unstructured text. We find that while state-of-the-art neural models are effective at high-level categorization of whether a given statement projects unwanted social bias (80% F1), they are not effective at spelling out more detailed explanations in terms of Social Bias Frames. Our study motivates future work that combines structured pragmatic inference with commonsense reasoning on social implications.


Social Bias Inference Corpus, available at http://tinyurl.com/social-bias-frames.

Founta et al. (2018) find that the prevalence of toxic content online is <4%.

This study was approved by the University of Washington IRB.

https://github.com/huggingface/transformers

We direct workers to the Crisis Text Line (https://www.crisistextline.org/).

1 Introduction

Language has enormous power to project social biases and reinforce stereotypes on people (Fiske, 1993). The way such biases are projected is rarely in what is stated explicitly, but in all the implied layers of meaning that frame and influence people's judgements about others. For example, even with a seemingly innocuous statement that an all-Muslim movie was a "box office bomb", most people can instantly recognize the implied demonizing stereotype that "Muslims are terrorists" (Figure 1). Understanding these biases with accurate underlying explanations is necessary for AI systems to adequately interact in the social world (Pereira et al., 2016), and failure to do so can result in the deployment of harmful technologies (e.g., conversational AI systems turning sexist and racist; Vincent, 2016).

Figure 1: Understanding and explaining why a seemingly innocuous statement is potentially unjust requires reasoning about the conversational implicatures and commonsense implications with respect to the overall offensiveness, intent, and power differentials among different social groups. SOCIAL BIAS FRAMES aims to represent the various pragmatic meanings related to social bias implications, by combining categorical and free-text annotations, e.g., that "women are less qualified" is implied by the phrase "lowering our standards to hire more women." (Example posts shown in the figure: "What do you call a movie with an all-Muslim cast? A box office bomb." and "We shouldn't lower our standards just to hire more women.")

Most previous approaches to understanding the implied harm in statements have cast this task as a simple toxicity classification (e.g., Waseem and Hovy, 2016; Founta et al., 2018; Davidson et al., 2017). However, simple classifications run the risk of discriminating against minority groups, due to high variation and identity-based biases in annotations (e.g., which cause models to learn associations between dialect and toxicity; Sap et al., 2019a; Davidson et al., 2019). In addition, detailed explanations are much more informative for people to understand and reason about why a statement is potentially harmful to others (Ross et al., 2017).

Thus, we propose SOCIAL BIAS FRAMES, a novel conceptual formalism that aims to model pragmatic frames in which people project social biases and stereotypes onto others. Compared to semantic frames, the meanings projected by pragmatic frames are richer and thus cannot be easily formalized using only categorical labels. Therefore, as illustrated in Figure 1, our formalism combines hierarchical categories of biased implications, such as intent and offensiveness, with implicatures described in free-form text, such as groups referenced and implied statements. In addition, we introduce SBIC,1 a new corpus collected using a novel crowdsourcing framework. SBIC supports large-scale learning and evaluation with over 100k structured annotations of social media posts, spanning over 26k implications about a thousand demographic groups.

We then establish baseline approaches that learn to recover SOCIAL BIAS FRAMES from unstructured text. We find that while state-of-the-art neural models are effective at high-level categorization of whether a given statement projects unwanted social bias (86% F1), they are not effective at spelling out more detailed explanations by accurately decoding SOCIAL BIAS FRAMES. Our study motivates future research that combines structured pragmatic inference with commonsense reasoning on social implications.

Important Implications of Our Study We recognize that studying SOCIAL BIAS FRAMES necessarily requires us to confront online content that may be offensive or disturbing. However, deliberate avoidance does not make the problem go away. Therefore, the important premise we take in this study is that assessing social media content through the lens of SOCIAL BIAS FRAMES is important for automatic flagging or AI-augmented writing interfaces, where potentially harmful online content can be analyzed with detailed explanations for users to consider and verify. In addition, the collective analysis over large corpora can also be insightful for educating people to put more conscious effort into reducing the unconscious biases that they repeatedly project in their language use.

2 Social Bias Frames Definition

To better enable models to account for socially biased implications of language (in this work, we employ the U.S. socio-cultural lens when discussing bias and power dynamics among demographic groups), we design a new pragmatic formalism that distinguishes several related but distinct inferences, shown in Figure 1. Given a natural language utterance, henceforth post, we collect both categorical as well as free-text inferences (described below), inspired by recent efforts in knowledge graph creation (e.g., Speer and Havasi, 2012; Sap et al., 2019b). The free-text explanations are crucial to our formalism, as they can both increase trust in predictions made by the machine (Kulesza et al., 2012; Bussone et al., 2015; Nguyen et al., 2018) and encourage a poster's empathy towards the targeted group, thereby combating potential biases (Cohen-Almagor, 2014).

Offensiveness denotes the overall rudeness, disrespect, or toxicity of a post. We define it formally as whether it could be considered "offensive to anyone", as previous work has shown this to have higher recall of offensive content (Sap et al., 2019a) . This is a categorical variable with three possible answers (yes, maybe, no).

Intent to offend captures whether the perceived motivation of the author was to offend, which is key to understanding how it is received (Kasper, 1990; Dynel, 2015). This is a categorical variable with four possible answers (yes, probably, probably not, no).

Lewd or sexual references are a key subcategory of what constitutes potentially offensive material in many cultures, especially in the United States (Strub, 2008) . This is a categorical variable with three possible answers (yes, maybe, no).

Group implications are distinguished from individual-only attacks or insults that do not invoke power dynamics between groups (e.g., "F*ck you" vs. "F*ck you, f*ggot"). This is a categorical variable with two possible answers.

Targeted group describes the social or demographic group that is referenced or targeted by the post. Here we collect free-text answers, but provide a seed list of demographic or social groups to encourage consistency.

Implied statement represents the power dynamic or stereotype that is referenced in the post. We collect free-text answers in the form of simple Hearst-like patterns (e.g., "women are ADJ", "gay men VBP"; Hearst, 1992).

Minority speaker aims to flag posts for which the speaker may be part of the same social group referenced. This is motivated by previous work on how speaker identity influences how a statement is received (Greengross and Miller, 2008; Sap et al., 2019a).

3 Collecting Nuanced Annotations

To create SBIC, we design a crowdsourcing framework to seamlessly distill the biased implications of posts at a large scale.

Table 1: Examples of inference tuples in SBIC. The types of inferences captured by SOCIAL BIAS FRAMES cover (potentially subtle) offensive implications about various demographic groups.
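To make the structure of these inference tuples concrete, one post-level annotation could be represented as a small structured record combining the categorical and free-text variables defined above. The sketch below is purely illustrative: the field names, types, and example label values are our own, not the released SBIC schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SocialBiasFrame:
    """Illustrative container for one post-level annotation (hypothetical schema)."""
    post: str
    offensive: str                # "yes" / "maybe" / "no"
    intent_to_offend: str         # "yes" / "probably" / "probably not" / "no"
    lewd: str                     # "yes" / "maybe" / "no"
    group_implication: bool       # does the post target a group rather than an individual?
    targeted_groups: List[str] = field(default_factory=list)    # free text, e.g. ["women"]
    implied_statements: List[str] = field(default_factory=list)  # free text, e.g. ["women are less qualified"]
    minority_speaker: Optional[bool] = None  # may the speaker belong to the referenced group?


# Example based on the running example in Figure 1; the categorical values here are illustrative.
frame = SocialBiasFrame(
    post="We shouldn't lower our standards just to hire more women.",
    offensive="yes",
    intent_to_offend="probably",
    lewd="no",
    group_implication=True,
    targeted_groups=["women"],
    implied_statements=["women are less qualified"],
)
```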

3.1 Data Selection

We draw from two sources of online content, namely Reddit and Twitter, to select posts to annotate. To mitigate the challenge of the scarcity of online toxicity (Founta et al., 2018), 3 we start by annotating posts made in three intentionally offensive English subreddits (see Table 2). By nature, these are very likely to have harmful implications, as they are often posted with intent to deride adversity or social inequality (Bicknell, 2007). Additionally, we include posts from three existing English datasets annotated for toxic or abusive language, filtering out @-replies, retweets, and links. We mainly annotate tweets released by Founta et al. (2018), who use a bootstrapping approach to sample potentially offensive tweets. We also include tweets from Waseem and Hovy (2016) and Davidson et al. (2017), who collect datasets of tweets containing racist or sexist hashtags and slurs, respectively.

Table 2: Breakdown of origins of posts in SBIC (e.g., Davidson et al. (2017): 3,008 posts; Waseem and Hovy (2016): 1,816 posts; total: 16,689 posts).

3.2 Annotation Task Design

We design a hierarchical Amazon Mechanical Turk (MTurk) framework to collect biased implications of a given post (a snippet is shown in Figure 2; the full task is shown in the supplementary material, Figure 5). For each post, workers indicate whether the post is offensive, whether the intent was to offend, and whether it contains lewd or sexual content. Only if annotators indicate potential offensiveness do they answer the group implication question.

Figure 2: Snippet of the annotation task used to collect SBIC. Lewdness, group implication, and speaker minority questions are omitted for brevity but shown in larger format in Figure 5 (Appendix).

If the post targets or references a group or demographic, workers select or write which one(s); per selected group, they then write two to four stereotypes. Finally, workers are asked whether they think the speaker is part of one of the groups referenced by the post. Optionally, we ask workers for coarse-grained demographic information. 4 We collect three annotations per post, and restrict our worker pool to the U.S. and Canada.

Annotator demographics In our final annotations, our worker pool was relatively gender-balanced and age-balanced (55% women, 42% men, <1% non-binary; 36±10 years old), but racially skewed (82% White, 4% Asian, 4% Hispanic, 4% Black).

Figure 3: Breakdown of targeted group categories by domains. We show percentages within domains for the top 3 most represented categories, namely gender/sexuality (e.g., women, LGBTQ folks), race/ethnicity (e.g., Black, Latinx, and Asian folks), and culture/origin (e.g., Muslim, Jewish folks).

Annotator agreement We compute how well annotators agreed on the categorical questions, showing moderate agreement on average. Workers agreed on a post being offensive at a rate of 77% (Cohen's κ=0.53), on its intent being to offend at 76% (κ=0.48), and on it having group implications at 76% (κ=0.51). Workers marked posts as lewd with substantial agreement (94%, κ=0.66), but agreed less when marking the speaker a minority (94%, κ=0.18); low κ values are expected for highly skewed categories such as minority speaker (only 4% "yes").
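For reference, pairwise agreement figures of this kind can be computed with standard tooling; the snippet below is a small illustration using scikit-learn, with made-up annotator labels rather than actual SBIC annotations.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary "offensive" judgements from two annotators on the same posts
annotator_a = np.array([1, 1, 0, 1, 0, 0, 1, 1])
annotator_b = np.array([1, 0, 0, 1, 0, 1, 1, 1])

raw_agreement = (annotator_a == annotator_b).mean()  # fraction of matching labels
kappa = cohen_kappa_score(annotator_a, annotator_b)  # chance-corrected agreement

print(f"raw agreement: {raw_agreement:.2f}, Cohen's kappa: {kappa:.2f}")
```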

Figure 4: Overall architecture of our full multi-task model, which combines five classification tasks for categorical variables (in yellow; L1) with a generation task for the free-text variables (in dark blue; L2).


3.3 SBIC Description

After data collection, SBIC contains 100k structured inference tuples, covering 25k free text group-implication pairs (see Table 3 ). We show example inference tuples in Table 1 .

Table 3: Statistics of the SBIC dataset

Additionally, we show a breakdown of the types of targeted groups in Figure 3. While SBIC covers a variety of types of biases, gender-based, race-based, and culture-based biases are the most represented, which parallels the types of discrimination happening in the real world (RWJF, 2017).

4 Social Bias Inference

Given a post, our model aims to generate the implied power dynamics in textual form, as well as classify the post's offensiveness and other categorical variables. We show a general overview of the full model in Figure 4 .

As input, our model takes a post p, defined as a sequence of tokens p = {w_1, w_2, ...} delimited by a start token ([STR]) and a classifier token ([CLF]). Our encoder model then yields a contextualized representation of each token, h_i = f_e(w_i | p) ∈ R^H, where H is the hidden size of the encoder.

Classification For predicting the categorical variables ({y_i} for i = 1, ..., 5, with y_i ∈ R), our model combines five logistic classifiers that use the representation at the classifier token, h_[CLF], as input. The final predictions are computed through a projection and a sigmoid layer:

y_i = σ(h_[CLF] W_i^c + b_i^c)

where W_i^c ∈ R^H and b_i^c ∈ R. During training, we minimize the negative log-likelihood of the predictions:

L1 = − Σ_{i=1}^{5} log[ y_i p(ŷ_i) + (1 − y_i)(1 − p(ŷ_i)) ]

During inference, we simply predict the classes which have highest probability.
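As a rough illustration of this setup, the sketch below pairs an OpenAI-GPT encoder from HuggingFace Transformers with sigmoid heads over the [CLF] representation. It is our own minimal re-implementation under stated assumptions (the special-token strings, taking the final position as [CLF], and packing the five logistic classifiers into a single linear layer), not the authors' released code.

```python
import torch
import torch.nn as nn
from transformers import OpenAIGPTModel, OpenAIGPTTokenizer


class SBFClassifier(nn.Module):
    """Five logistic heads over the [CLF] token representation (illustrative re-implementation)."""

    def __init__(self, model_name: str = "openai-gpt", num_vars: int = 5):
        super().__init__()
        self.tokenizer = OpenAIGPTTokenizer.from_pretrained(model_name)
        # [STR] and [CLF] are the delimiter tokens described above; we add them to the vocabulary here.
        self.tokenizer.add_special_tokens({"additional_special_tokens": ["[STR]", "[CLF]"]})
        self.encoder = OpenAIGPTModel.from_pretrained(model_name)
        self.encoder.resize_token_embeddings(len(self.tokenizer))
        hidden = self.encoder.config.n_embd
        # One scalar logit per categorical variable (offensive, intent, lewd, group, minority speaker)
        self.heads = nn.Linear(hidden, num_vars)

    def forward(self, post: str) -> torch.Tensor:
        text = f"[STR] {post} [CLF]"
        enc = self.tokenizer(text, return_tensors="pt")
        hidden_states = self.encoder(**enc).last_hidden_state  # (1, seq_len, H)
        h_clf = hidden_states[:, -1, :]                         # representation at the final [CLF] token
        return torch.sigmoid(self.heads(h_clf))                 # five probabilities in [0, 1]


def classification_loss(probs: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """For binary labels, the L1 negative log-likelihood above reduces to binary cross-entropy."""
    return nn.functional.binary_cross_entropy(probs, labels)
```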

Generation For the free-text variables, we take inspiration from recent work on generative commonsense modelling. Specifically, we frame the inference as a conditional language modelling task, by appending the linearized targeted group (g) and implied statement (s) to the post (using the [SEP] delimiter token; see Figure 4). During training, we minimize the cross-entropy of the linearized (p, g, s) triple using a language modelling objective:

L2 = − log p(p, g, s)

During inference, we generate the group g and statement s conditioned on the post p, using greedy (argmax) or sampling decoding.
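A corresponding sketch of the generation side is given below, again using HuggingFace Transformers; the exact delimiter strings ([STR], [SEP], [END]) and the use of OpenAIGPTLMHeadModel with greedy decoding via generate() are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
from transformers import OpenAIGPTLMHeadModel, OpenAIGPTTokenizer

tokenizer = OpenAIGPTTokenizer.from_pretrained("openai-gpt")
tokenizer.add_special_tokens({"additional_special_tokens": ["[STR]", "[SEP]", "[END]"]})
model = OpenAIGPTLMHeadModel.from_pretrained("openai-gpt")
model.resize_token_embeddings(len(tokenizer))


def lm_loss(post: str, group: str, statement: str) -> torch.Tensor:
    """Cross-entropy of the linearized (post, group, statement) triple (the L2 loss)."""
    text = f"[STR] {post} [SEP] {group} [SEP] {statement} [END]"
    enc = tokenizer(text, return_tensors="pt")
    # Passing labels=input_ids makes the model return the shifted language-modelling cross-entropy.
    out = model(**enc, labels=enc["input_ids"])
    return out.loss


@torch.no_grad()
def generate_inference(post: str, max_new_tokens: int = 30) -> str:
    """Decode the group and implied statement conditioned on the post."""
    prompt = tokenizer(f"[STR] {post} [SEP]", return_tensors="pt")
    output_ids = model.generate(
        **prompt,
        max_new_tokens=max_new_tokens,
        do_sample=False,  # greedy (argmax) decoding; set do_sample=True to sample instead
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=False)
```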

4.1 Experimental Setup

In this work, we build on the pretrained OpenAI-GPT model by Radford et al. (2018) as our encoder f_e, which has yielded impressive classification and generation results (Radford et al., 2018; Gabriel et al., 2019).

Table 4: Experimental results (%-ages) of various models on the classification tasks. L1 corresponds to the multitask classification model, L1+L2 the full multitask model, and *-rnd the full multitask but randomly initialized model. We bold the best results. For easier interpretation, we also report the %-age of instances in the positive class in the dev set.

This model is a unidirectional language model, which means encoded token representations are only conditioned on past tokens (i.e., h_i = f_e(w_i | w_1, ..., w_{i−1})). OpenAI-GPT was trained on English fiction (Toronto Book Corpus; Zhu et al., 2015).

For baseline comparison, we consider a multitask classification-only model (L1). We also compare the full multitask model to a baseline generative inference model trained only on the language modelling loss (L2). Finally, we consider a model variant that uses a randomly initialized GPT model to observe the effect of pretraining.

4.2 Evaluation

We evaluate performance of our models in the following ways. For classification, we report precision, recall, and F1 scores of the positive class. Following previous generative inference work (Sap et al., 2019b), we use automated metrics to evaluate model generations. We use BLEU-2 and Rouge-L (F1) scores to capture word overlap between the generated inference and the references, which captures the quality of the generation (Galley et al., 2015; Hashimoto et al., 2019). We additionally compute Word Mover's Distance (WMD; Kusner et al., 2015), which uses distributed word representations to measure similarity between the generated and target text.
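These automatic metrics can be approximated with standard libraries; the snippet below uses NLTK's sentence-level BLEU, the rouge-score package, and gensim's Word Mover's Distance as stand-ins for whichever implementations were actually used, with a toy hypothesis/reference pair.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from rouge_score import rouge_scorer
import gensim.downloader

reference = "women are less qualified"
hypothesis = "women are not qualified"

# BLEU-2: uniform weights over unigrams and bigrams, with smoothing for short texts
bleu2 = sentence_bleu(
    [reference.split()], hypothesis.split(),
    weights=(0.5, 0.5), smoothing_function=SmoothingFunction().method1,
)

# Rouge-L F1
rougeL = rouge_scorer.RougeScorer(["rougeL"]).score(reference, hypothesis)["rougeL"].fmeasure

# Word Mover's Distance over pretrained embeddings (lower is better)
vectors = gensim.downloader.load("glove-wiki-gigaword-100")  # any pretrained word vectors would do
wmd = vectors.wmdistance(reference.split(), hypothesis.split())

print(f"BLEU-2={bleu2:.3f}  RougeL={rougeL:.3f}  WMD={wmd:.3f}")
```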

4.3 Training Details

As each post can contain multiple annotations, we define a training instance as containing one post-group-statement triple (along with the five categorical annotations). We then split our dataset into train/dev./test (75:12.5:12.5), ensuring that no post is present in multiple splits. For evaluation (dev., test), we combine the categorical variables by averaging, and compare the generated inferences (hypotheses) to all targeted groups and implied statements (references). All experiments are carried out using HuggingFace's Transformers library. 6 We tune hyperparameters on the dev. set, and report performance for the best performing setting (according to average F1). We train or finetune our models using a batch size of 4, a learning rate of 5e-5 (with linear warm-up), and consider training for e ∈ {1, 3, 4, 6} epochs.
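For reference, these reported hyperparameters map onto a conventional fine-tuning loop with the Transformers library roughly as sketched below; the warm-up step count, the batch format, and the assumption that the model returns a .loss attribute are ours, not details given in the paper.

```python
import torch
from torch.optim import AdamW
from torch.utils.data import DataLoader
from transformers import get_linear_schedule_with_warmup


def train(model: torch.nn.Module, train_dataset, num_epochs: int = 3, warmup_frac: float = 0.1):
    """Fine-tune with batch size 4 and learning rate 5e-5 with linear warm-up, as reported above.

    `model` is expected to return an object with a `.loss` attribute (e.g., the LM sketch earlier);
    the warm-up fraction is our assumption, not a value given in the paper.
    """
    loader = DataLoader(train_dataset, batch_size=4, shuffle=True)
    total_steps = num_epochs * len(loader)
    optimizer = AdamW(model.parameters(), lr=5e-5)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(warmup_frac * total_steps),
        num_training_steps=total_steps,
    )
    for _ in range(num_epochs):
        for batch in loader:
            loss = model(**batch).loss  # placeholder: combine the L1 and L2 losses as appropriate
            loss.backward()
            optimizer.step()
            scheduler.step()
            optimizer.zero_grad()
```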

5 Results

As shown in Tables 4 and 5, our modelling results indicate that making inferences about social biases in language remains challenging for models.

Classification Most notably for classification, the multitask model outperforms other variants substantially when predicting a post's offensiveness and intent to offend (+8% F1 on both). The classification-only model slightly outperforms the full multitask model on other categories. We hypothesize that correctly predicting those might require more lexical matching (e.g., detecting sexual words for the lewd category). In contrast, the offensiveness and intent gains from full multitasking suggest that for those more subtle semantic categories, more in-domain language model finetuning helps. Highly skewed categories pose a challenge for all models, due to the lack of positive instances. As expected, using the randomly initialized model performs significantly worse than the pretrained version.

Generation When we evaluate on our generation tasks, we find that model performance is comparable across automatic metrics between the full multitask variant (GPT L1+L2) and the free-text-only generation model (GPT L2). Surprisingly, the randomly initialized multitask variant performs better on BLEU and WMD for the group target inference, which is likely due to the small and constrained generation space (there are only 1.1k different groups in our corpus; see Table 3). When the generation space is larger (for the implied statement), pretrained variants perform better.

Table 5: Automatic evaluation of various models on the generation task. GPT L2 is the text-only model, GPT L1+L2 is the full multitask model. Bl: BLEU-2, RgL: Rouge-L, WMD: Word Mover’s Distance. Higher is better for BLEU and ROUGE scores, and lower is better for WMD.

Error analysis Since small differences in automated evaluation metrics for text generation sometimes only weakly correlate with human judgements (Liu et al., 2016) , we manually perform an error analysis on a select set of generated dev examples from the full multitask model (Table 6) . Overall, the model seems to struggle with generating textual implications that are relevant to the post, instead generating very generic stereotypes about the demographic groups (e.g., in examples b,c). The model generates the correct stereotypes when there is high lexical overlap with the post (e.g., examples d,e). This is in line with previous research showing that large language models rely on correlational patterns in data (Sakaguchi et al., 2019) .

Table 6: Examples of GPT L1+L2 model predictions. The model struggles to pick up on subtle biases (a), and tends to generate generic stereotypes rather than implications that are entailed by the post (b, c).

6 Related Work

Bias and Toxicity Detection Detection of hateful, abusive, or otherwise toxic language has received increased attention recently (Schmidt and Wiegand, 2017). Most dataset creation work has cast this detection problem as binary classification (Waseem and Hovy, 2016; Wulczyn et al., 2017; Davidson et al., 2017; Founta et al., 2018). Recently, Zampieri et al. (2019) collected a dataset of tweets with hierarchical categorical annotations of offensiveness and whether a group or individual is targeted. In contrast, SOCIAL BIAS FRAMES covers both hierarchical categorical and free-text annotations. Similar in spirit to our work, recent work has tackled more subtle bias in language, such as microaggressions (Breitfeller et al., 2019) and condescension (Wang and Potts, 2019). These types of biases are in line with, but more narrowly scoped than, the biases covered by SOCIAL BIAS FRAMES.

Inference about Social Dynamics Various work has tackled the task of making inferences about power and social dynamics.

Particularly, previous work has analyzed power dynamics about specific entities, either in conversation settings (Prabhakaran et al., 2014; Danescu-Niculescu-Mizil et al., 2012) or in narrative text (Sap et al., 2017; Field et al., 2019; Antoniak et al., 2019) . Additionally, recent work in commonsense inference has focused on mental states of participants of a situation (e.g., Rashkin et al., 2018; Sap et al., 2019b) . In contrast to reasoning about particular individuals, our work focuses on biased implications of social and demographic groups as a whole.

7 Ethical Considerations

Risks in deployment Determining offensiveness and reasoning about harmful implications of language should be done with care. When deploying such algorithms, several ethical aspects should be considered including the fairness of the model on speech by different demographic groups or in different varieties of English (Mitchell et al., 2019) . Additionally, practitioners should discuss potential nefarious side effects of deploying such technology, such as censorship (Ullmann and Tomalin, 2019) and dialect-based racial bias (Sap et al., 2019a; Davidson et al., 2019) . Finally, inferences about offensiveness could be paired with promotions of positive online interactions, such as emphasis of community standards (Does et al., 2011) or counter-speech (Chung et al., 2019; Qian et al., 2019) .

Risks in annotation Recent work has highlighted various negative side effects caused by annotating potentially abusive or harmful content (e.g., acute stress; Roberts, 2016). We mitigate these by limiting the number of posts that one worker can annotate in one day, paying workers above minimum wage ($7-$12), and providing crisis management resources to our annotators. 7

8 Conclusion

To help machines reason about and account for societal biases, we introduce SOCIAL BIAS FRAMES, a new structured commonsense formalism that distills knowledge about the biased implications of language. Our frames combine categorical knowledge about the offensiveness, intent, and targets of statements with free-text inferences about which groups are targeted and which biased implications or stereotypes are invoked. We collect a new dataset of 100k annotations on social media posts using a novel crowdsourcing framework, and establish baseline performance of models built on top of a large pretrained language model. We show that while classifying the intent or offensiveness of statements is easier, models struggle to generate relevant inferences about social biases, especially when implications have low lexical overlap with posts. This indicates that more sophisticated models are required for SOCIAL BIAS FRAMES inferences.