Go To:

Paper Title Paper Authors Table Of Contents Abstract References
Home
Report a problem with this paper

RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models

Authors

Abstract

Pretrained neural language models (LMs) are prone to generating racist, sexist, or otherwise toxic language which hinders their safe deployment. We investigate the extent to which pretrained LMs can be prompted to generate toxic language, and the effectiveness of controllable text generation algorithms at preventing such toxic degeneration. We create and release RealToxicityPrompts, a dataset of 100K naturally occurring, sentence-level prompts derived from a large corpus of English web text, paired with toxicity scores from a widely-used toxicity classifier. Using RealToxicityPrompts, we find that pretrained LMs can degenerate into toxic text even from seemingly innocuous prompts. We empirically assess several controllable generation methods, and find that while data- or compute-intensive methods (e.g., adaptive pretraining on non-toxic data) are more effective at steering away from toxicity than simpler solutions (e.g., banning “bad” words), no current method is failsafe against neural toxic degeneration. To pinpoint the potential cause of such persistent toxic degeneration, we analyze two web text corpora used to pretrain several LMs (including GPT-2; Radford et. al, 2019), and find a significant amount of offensive, factually unreliable, and otherwise toxic content. Our work provides a test bed for evaluating toxic generations by LMs and stresses the need for better data selection processes for pretraining.

1 Introduction

Table 1: Data statistics of prompts and continuations in REALTOXICITYPROMPTS.
Table 2: Toxicity of generations conditioned on REALTOXICITYPROMPTS. Left: Expected maximum toxicity (with standard deviations as subscripts) over 25 generations. Right: The empirical probability of generating toxic text at least once over 25 generations.
Table 3: Left: Average maximum toxicity (with standard deviations as subscripts) over 25 generations. Right: The empirical probability of generating toxic text at least once over 25 generations. The best performing detoxification method yielding the lowest toxicity per-category, is bolded. We display DAPT (Toxic) as a reference for the effectiveness of DAPT as a method of controlling LM behavior. All models are evaluated on a full dataset of 100K prompts, except PPLM, which is evaluated on a dataset of 10K prompts, due to computational budget.
Table 4: Examples of (purposefully uncensored) toxic documents that appear in GPT-2’s training corpus, that were also submitted to quarantined or banned subreddits. We highlight spans that contribute to the overall toxicity of the document, which we identify manually.
Table 5: Summary statistics of non-toxic and toxic data used for detoxification experiments.
Table 6: Computational resources used for experiments. Pretraining mostly took place on Graphics Card 1. Generations were completed on both.
Table 7: Hyperparameters for data-based detoxification pretraining. Effective batch size is calculated by multiplying the batch size by the number of gradient accumulation steps.
Table 8: Hyperparameters for generation with all models (with the exception of PPLM).
Table 9: Hyperparameters for generation with PPLM. A description of each hyperparameter can be found in Dathathri et al. (2020).
Table 10: Perplexities after detoxification on web text test set. For each model, we report perplexity scores on the test set and a non-toxic subset of the test set. For all models other than GPT-2, we calculate perplexity with steering mechanisms enabled (such as prepending attribute tokens).
Table 11: Toxicity of GPT-2-small and GPT-2-medium generations in unprompted settings and conditioned on REALTOXICITYPROMPTS.
Table 12: Estimated percentages of documents considered toxic (i.e. PERSPECTIVE API score ≥ 0.5) in OWTC and OPENAI-WT under each PERSPECTIVE API label. Refer to Table 13 for label descriptions.
Table 13: PERSPECTIVE API label descriptions.
Table 14: Examples of toxic documents from the BooksCorpus.
Table 15: Example unprompted toxic generations from GPT-2, GPT-1 and CTRL
Table 16: Example unprompted toxic generations from GPT-3 and CTRL-WIKI

Although they are the backbone of many modern NLP systems (Devlin et al., 2019; Radford et al., 2019; Raffel et al., 2019) , language models (LMs) pretrained on large web text corpora suffer from degenerate and biased behavior (Sheng et al., 2019; Wallace et al., 2019) . As illustrated in Figure 1 , they can easily degenerate into toxicity, even without explicitly toxic prompts, which hinders their Figure 1: Non-toxic examples from REALTOXICI-TYPROMPTS, a new testbed for evaluating neural generations and their toxicity. Despite not containing any toxic language as measured by PERSPECTIVE API, these prompts cause several pretrained LMs to systematically generate highly toxic text (shown in Table 17 in Appendix §E).

Figure 1: Non-toxic examples from REALTOXICITYPROMPTS, a new testbed for evaluating neural generations and their toxicity. Despite not containing any toxic language as measured by PERSPECTIVE API, these prompts cause several pretrained LMs to systematically generate highly toxic text (shown in Table 17 in Appendix §E).