DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts
Despite recent advances in natural language generation, it remains challenging to control attributes of generated text. We propose DEXPERTS: Decoding-time Experts, a decoding-time method for controlled text generation that combines a pretrained language model with “expert” LMs and/or “anti-expert” LMs in a product of experts. Intuitively, under the ensemble, tokens only get high probability if they are considered likely by the experts and unlikely by the anti-experts. We apply DEXPERTS to language detoxification and sentiment-controlled generation, where we outperform existing controllable generation methods on both automatic and human evaluations. Moreover, because DEXPERTS operates only on the output of the pretrained LM, it is effective with (anti-)experts of smaller size, including when operating on GPT-3. Our work highlights the promise of tuning small LMs on text with (un)desirable attributes for efficient decoding-time steering.
Controlling the output of pretrained language models (LMs) is crucial for achieving useful and safe language generation applications, such as non-offensive sentence completion or friendly conversation generation (See et al., 2019; Sheng et al., 2020; Gehman et al., 2020). For example, a safe completion to the prompt "When she rejected his advance, he grabbed..." requires avoiding word choices that could lead to continuations with gender-based violence (e.g., "her"; Figure 1).
Without such steering, these language models risk generating mindless and offensive content (Sheng et al., 2019; Holtzman et al., 2020), which hinders their safe deployment (Brockman et al., 2020; Bender et al., 2021). Importantly, as the scale of pretrained LMs increases (e.g., 175B and 1.6T parameters; Brown et al., 2020; Fedus et al., 2021), finetuning or re-training approaches are becoming increasingly computationally infeasible for most researchers.
Figure 1: Illustration of DEXPERTS, where a toxic LM acts as an "anti-expert" and a non-toxic LM acts as an "expert". In this toy example, given the prompt, "When she rejected his advance, he grabbed," the toxic LM assigns greater weight to "her" than "his", expressing subtle signals of toxicity that can be leveraged for effective attribute control. The difference in logits $z^+ - z^-$ output by the expert and anti-expert represents the perturbations to make to the logits $z$ of the pretrained "base" LM.
We propose DEXPERTS, a decoding-time method for controlled text generation based on a product of experts (Hinton, 2002). Our method combines an out-of-the-box pretrained ("base") LM with "expert" LMs and/or "anti-expert" LMs, which model text with desirable and undesirable attributes, respectively. By generatively modeling text with particular attributes and directly combining the output distributions from each LM, DEXPERTS leverages subtle signals expressible by language models for effective attribute control, without sacrificing generation fluency or diversity. Moreover, because it operates only on the output of the base LM, DEXPERTS can steer with (anti-)experts of smaller size, even in cases where we do not have full access to the base model (e.g., GPT-3 through an API).
We first apply DEXPERTS to the task of language detoxification (§3), by finetuning an expert and an anti-expert on public comments that are human-annotated for toxicity. Our experimental results show that DEXPERTS can successfully avoid toxicity in language generation while preserving output fluency, outperforming existing detoxification methods on both automatic and human evaluations. Moreover, we find that DEXPERTS continues to outperform baselines when employing only an anti-expert and re-using the base model as the expert, making it one of the only methods that can avoid toxicity without annotated examples of non-toxic content. In analysis, we also show that our method successfully avoids toxic degeneration while using just ~650 toxic comments, opening avenues for easily customizable anti-experts.
We then showcase the generalizability of DEXPERTS by tackling the task of controlling the sentiment of LMs' output (§4). To this end, we combine a pretrained LM with (anti-)experts modeling positive and negative sentiment. As with language detoxification, DEXPERTS outperforms existing sentiment steering methods on both automatic and human evaluations. Additionally, we show our method is especially effective in the adversarial setting of steering negative prompts toward positive continuations, and vice versa. Finally, we demonstrate a preliminary proof-of-concept using DEXPERTS for stylistic rewriting (§5).
Our work demonstrates the effectiveness of tuning small LMs on text with desirable and undesirable properties for efficient and effective steering of larger pretrained LMs, and highlights the promise of decoding-time methods for controlled language generation.
2 Experts And Anti-Experts For Controlled Generation
Given input text as a prompt, the task of controlled text generation is to generate a continuation that flows naturally from the prompt while having the desired attribute (e.g., positive sentiment) but not an undesired one (e.g., toxicity). Given a prompt $x_{<t}$, the language model computes the logits for the $t$-th token, denoted $z_t \in \mathbb{R}^{|\mathcal{V}|}$, where $\mathcal{V}$ is the vocabulary. A probability distribution over the vocabulary is obtained by exponentiating and normalizing $z_t$:
$$P(X_t \mid x_{<t}) = \mathrm{softmax}(z_t) \quad (1)$$
and the next token is generated by sampling $x_t \sim P(X_t \mid x_{<t})$.
2.1 DEXPERTS Formalization
DEXPERTS operates on a pretrained language model $M$ by combining its predictions with an expert $M^+$, which models text with a desirable attribute, and an anti-expert $M^-$, which models text with an undesirable attribute. At time step $t$, we condition each language model $M$, $M^+$, and $M^-$ on the prompt $x_{<t}$ to obtain $z_t$, $z_t^+$, and $z_t^-$, respectively. The product-of-experts ensemble is given by:
$$\tilde{P}(X_t \mid x_{<t}) = \mathrm{softmax}\left(z_t + \alpha\left(z_t^+ - z_t^-\right)\right) \quad (2)$$
where α is a hyperparameter that controls the amount of modification to $z_t$, and can be interpreted as the strength of control over the base model. Equivalently,
$$\tilde{P}(X_t \mid x_{<t}) \propto P(X_t \mid x_{<t}) \left(\frac{P^+(X_t \mid x_{<t})}{P^-(X_t \mid x_{<t})}\right)^{\alpha} \quad (3)$$
Intuitively, a token will only have high probability if it has high probability under both $P$ and $P^+$, and low probability under $P^-$. We can interpret the ratio $P^+(X_t \mid x_{<t}) / P^-(X_t \mid x_{<t})$ as a scaling coefficient for each token, which is used to modify the original probability predicted for that token.
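The equivalence between the logit form (Equation 2) and the product form (Equation 3) can be checked numerically. The following is a minimal sketch in NumPy, not the authors' implementation; the function names and the toy four-token logits are made up for illustration.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def dexperts_probs(z_base, z_expert, z_anti, alpha):
    """Logit form (Equation 2): softmax(z + alpha * (z_plus - z_minus))."""
    return softmax(z_base + alpha * (z_expert - z_anti))

def dexperts_probs_product_form(z_base, z_expert, z_anti, alpha):
    """Product form (Equation 3): P * (P_plus / P_minus) ** alpha, renormalized."""
    p = softmax(z_base)
    p_plus = softmax(z_expert)
    p_minus = softmax(z_anti)
    unnormalized = p * (p_plus / p_minus) ** alpha
    return unnormalized / unnormalized.sum()

# Toy 4-token vocabulary (hypothetical values): the anti-expert favors
# token 0, while the expert favors token 1.
z_base = np.array([2.0, 1.5, 0.5, 0.0])
z_expert = np.array([1.0, 2.0, 0.5, 0.0])
z_anti = np.array([2.5, 0.5, 0.5, 0.0])

p_base = softmax(z_base)
p_logit_form = dexperts_probs(z_base, z_expert, z_anti, alpha=2.0)
p_product_form = dexperts_probs_product_form(z_base, z_expert, z_anti, alpha=2.0)
```

In this toy setting the two forms agree up to floating-point error, and the token preferred by the anti-expert loses probability mass relative to the base distribution.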
2.2 Sampling From DEXPERTS
Sampling fluent output from language models commonly requires truncating the unreliable tail of the probability distribution, as in top-k (Fan et al., 2018) or nucleus sampling (Holtzman et al., 2020). We adapt this intuition to our method by truncating the logits $z$ output by the base model prior to combining with the experts. Formally, let $\mathcal{V}' \subset \mathcal{V}$ denote the set of tokens that are a part of the top-k/top-p vocabulary of the base LM at time step $t$. The truncated logits $z'$ are given by
$$z'[v] = \begin{cases} z[v] & \text{if } v \in \mathcal{V}' \\ -\infty & \text{otherwise} \end{cases} \quad (4)$$
By substituting $z$ with $z'$ in Equation 2, we have
$$\tilde{P}'(X_t \mid x_{<t}) = \mathrm{softmax}\left(z'_t + \alpha\left(z_t^+ - z_t^-\right)\right) \quad (5)$$
We obtain our next token $x_t$ via pure sampling from the probability distribution $\tilde{P}'(X_t \mid x_{<t})$, which has non-zero probability only on tokens in $\mathcal{V}'$. In this way, adding in the (anti-)experts can be interpreted as modifying the probability distribution over the candidate tokens in $\mathcal{V}'$, without any chance of reintroducing tokens $v \notin \mathcal{V}'$ from the tail of the original probability distribution.
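The truncate-then-combine procedure described above can be sketched as follows. This is an illustrative NumPy version under stated assumptions, not the released implementation: `nucleus_mask` and `sample_dexperts` are hypothetical helper names, and the five-token logits are invented so that the expert strongly prefers a token from the base model's tail.

```python
import numpy as np

def softmax(logits):
    """Stable softmax; entries equal to -inf receive probability exactly 0."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

def nucleus_mask(z_base, top_p):
    """Mask of the smallest set of tokens whose base-LM mass reaches top_p."""
    probs = softmax(z_base)
    order = np.argsort(-probs)                 # token ids, most likely first
    cumulative = np.cumsum(probs[order])
    # Index of the first token at which the cumulative mass reaches top_p.
    cutoff = int(np.searchsorted(cumulative, top_p)) + 1
    mask = np.zeros(len(probs), dtype=bool)
    mask[order[:cutoff]] = True
    return mask

def sample_dexperts(z_base, z_expert, z_anti, alpha, top_p, rng):
    """Truncate the base logits, add the (anti-)expert term, then sample."""
    keep = nucleus_mask(z_base, top_p)
    z_trunc = np.where(keep, z_base, -np.inf)  # truncated logits z'
    probs = softmax(z_trunc + alpha * (z_expert - z_anti))
    token = int(rng.choice(len(probs), p=probs))
    return token, probs

# Toy example: the expert strongly prefers token 3, which lies in the tail
# of the base distribution and is outside the p = 0.9 nucleus.
rng = np.random.default_rng(0)
z_base = np.array([3.0, 2.0, 1.0, -2.0, -3.0])
z_expert = np.array([0.0, 0.0, 0.0, 5.0, 0.0])
z_anti = np.zeros(5)
token, probs = sample_dexperts(z_base, z_expert, z_anti,
                               alpha=2.0, top_p=0.9, rng=rng)
```

Because the truncation is applied to the base logits before the (anti-)expert term is added, the tail token keeps probability zero here no matter how strongly the expert favors it.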
3 Toxicity Avoidance
Given that large pretrained LMs are at risk of producing toxic content (Sheng et al., 2019; Gehman et al., 2020) , steering away from toxic "degeneration" is crucial for their safe deployment. Our approach uses an anti-expert that models overt toxicity, as well as an expert that is finetuned on nontoxic data from the same domain.
Note that while obtaining an LM that is truly free from social biases is impossible (Fiske, 1993; Lakoff, 1973) , the "non-toxic" expert serves the purpose of modeling the same domain of comments as the toxic anti-expert, providing more effective contrast. Nonetheless, we provide an ablation using only a toxic anti-expert and show that it remains effective above all previous baselines.
We use GPT-2 Large as our base LM. For our expert and anti-expert, we finetune several sizes of GPT-2 (Small, Medium, Large) on a dataset of human-annotated comments from the Jigsaw Unintended Bias in Toxicity Classification Kaggle challenge. 3 We consider an example toxic if ≥ 50% of annotators marked it as toxic, and nontoxic if none of the annotators marked it as toxic. This toxic dataset has ~160K comments, and the nontoxic dataset ~1.4M comments. Note that our toxic dataset is human-annotated and out-of-domain with respect to the pretraining corpus (WebText for GPT-2).
3 https://bit.ly/3cvG5py
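The labeling rule above can be written as a small function. This is a sketch only: the function name and its vote-count inputs are hypothetical, and it makes no claim about how the Jigsaw data actually stores annotations.

```python
def label_comment(toxic_votes, total_votes):
    """Apply the rule from the text to one comment.

    Returns "toxic" if at least 50% of annotators marked the comment toxic,
    "nontoxic" if no annotator did, and None for comments in between, which
    belong to neither training set.
    """
    if total_votes <= 0:
        raise ValueError("total_votes must be positive")
    if toxic_votes == 0:
        return "nontoxic"
    if toxic_votes / total_votes >= 0.5:
        return "toxic"
    return None
```

Comments with some, but fewer than half, toxic votes fall into neither set under this rule, so the expert and anti-expert see cleanly separated data.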
We report results for α = 2.0, chosen after observing the tradeoff between detoxification and fluency, but show results for other values of α in Appendix D.
3.2.1 Generation Prompts
To evaluate the problem of toxic degeneration where a user might unexpectedly receive harmful output from a model, we use a random sample of 10K nontoxic prompts from the RealToxicityPrompts dataset (Gehman et al., 2020).
Domain-adaptive pretraining (DAPT; Gururangan et al., 2020) We further pretrain the base model on the non-toxic subset of OpenWebText. This dataset is obtained by scoring the full OpenWebText corpus with the toxicity classifier from Perspective API and keeping the least toxic 2 percent of documents, a corpus of about 150K documents, or 63M tokens, following the implementation of this baseline from Gehman et al. (2020).
Plug-and-play language models (PPLM; Dathathri et al., 2020) PPLM uses gradients from a toxicity classifier to update the LM's hidden representations. We retrain the classifier to be compatible with our larger base model size, on the same toxicity data used in the original paper. Due to the extreme computational expense of PPLM (runtimes are shown in Appendix A.4), we evaluate PPLM on a random subset of 1K prompts.
Generative discriminators (GeDi; Krause et al., 2020) GeDi uses a class-conditioned LM to provide classification probabilities for all possible next tokens via Bayes' rule. We use the toxicity class-conditioned LM released by the authors with the recommended generation hyperparameters.
We also explore an anti-expert-only ablation of DEXPERTS, by reusing the base model as the expert. To be clear, we substitute $z_t^+ = z_t$ in Equation 2, so that we have
$$\tilde{P}(X_t \mid x_{<t}) = \mathrm{softmax}\left((1 + \alpha) z_t - \alpha z_t^-\right) \quad (6)$$
We use the toxic anti-expert based on GPT-2 Large and the same hyperparameter value α = 2.0.
[Table 1 column headers: Model; Toxicity (↓); Fluency (↓); Diversity (↑)]
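For readers who want the intermediate step, the anti-expert-only form follows from the ensemble of Equation 2 by direct substitution. The derivation below is a worked restatement using the same symbols; the final proportionality is a consequence of the softmax rather than an additional claim from the paper.

```latex
\begin{align*}
z_t + \alpha\left(z_t^{+} - z_t^{-}\right)\Big|_{z_t^{+} = z_t}
  &= z_t + \alpha z_t - \alpha z_t^{-} \\
  &= (1 + \alpha)\, z_t - \alpha\, z_t^{-}, \\
\text{so that}\quad
\tilde{P}(X_t \mid x_{<t})
  &\propto \frac{P(X_t \mid x_{<t})^{\,1+\alpha}}{P^{-}(X_t \mid x_{<t})^{\,\alpha}}.
\end{align*}
```

In other words, the ablation sharpens the base distribution by the exponent $1 + \alpha$ and divides out what the anti-expert finds likely.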
Non-Toxic Expert Finally, we consider generating directly from the non-toxic expert based on GPT-2 Large.
For all baselines, we use nucleus sampling (Holtzman et al., 2020) with p = 0.9 to generate up to 20 tokens. Note that for our method, nucleus sampling is done as described in §2, by using the nucleus from the base LM. Other training and generation details (e.g., hyperparameters) are described in Appendix A.
3.2.3 Automatic Evaluation
We evaluate our generations for toxicity, fluency, and diversity. Following previous work (Gehman et al., 2020), we characterize generation toxicity using the toxicity score from Perspective API, along two axes: 1) the maximum toxicity over k = 25 generations, and 2) the empirical probability of generating a continuation with toxicity ≥ 0.5 at least once over k = 25 generations. Generation fluency is measured by the mean perplexity of generated continuations according to a larger pretrained LM, GPT-2 XL. Generation diversity is measured using the mean number of distinct n-grams, normalized by the length of text (Li et al., 2016), among the 25 generations for each prompt. We report Dist-1, Dist-2, and Dist-3 scores for distinct uni-, bi-, and trigrams, respectively.
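The toxicity and diversity metrics can be sketched in a few lines. This is one plausible reading of the description above, not the authors' evaluation script: whitespace tokenization, the function names, and the assumption that toxicity scores are already available as plain floats are all illustrative choices.

```python
def distinct_n(generations, n):
    """Dist-n for one prompt: distinct n-grams across its generations,
    divided by the total number of (whitespace) tokens generated."""
    ngrams = set()
    total_tokens = 0
    for text in generations:
        tokens = text.split()
        total_tokens += len(tokens)
        for i in range(len(tokens) - n + 1):
            ngrams.add(tuple(tokens[i:i + n]))
    return len(ngrams) / total_tokens if total_tokens else 0.0

def toxicity_summary(scores_per_prompt, threshold=0.5):
    """Return (average max toxicity, toxicity probability).

    scores_per_prompt holds one list of k toxicity scores per prompt.
    The probability is the fraction of prompts for which at least one of
    the k generations scores at or above the threshold.
    """
    max_scores = [max(scores) for scores in scores_per_prompt]
    avg_max = sum(max_scores) / len(max_scores)
    prob = sum(m >= threshold for m in max_scores) / len(max_scores)
    return avg_max, prob

# Tiny usage example with made-up generations and scores.
dist1 = distinct_n(["a b a", "a c"], n=1)
dist2 = distinct_n(["a b a", "a c"], n=2)
avg_max, tox_prob = toxicity_summary([[0.1, 0.6], [0.2, 0.3]])
```

Taking the maximum before thresholding is equivalent to asking whether any of the k generations crosses the threshold, which matches the "at least once" wording.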
Results According to automatic metrics shown in Table 1 , DEXPERTS substantially outperforms all existing baselines at detoxification. In particular, DEXPERTS (medium, large) are among the most fluent controllable generation methods, while fully preserving output diversity compared to the base model. Moreover, the DEXPERTS (anti-only) ablation continues to outperform baselines at detoxification, although with a loss in fluency and diversity that is likely due to the less effective contrast between the base model and anti-expert. We report the per-generation runtime of each method in Appendix A.4 to demonstrate DEXPERTS's efficiency compared to other decoding-time methods.
3.2.4 Human Evaluation
While automatic toxicity classifiers like Perspective API enable the kind of large-scale evaluation required for systematic comparison of methods, an abundance of work shows that their accuracy is far from ideal (Dixon et al., 2018; Sap et al., 2019; Davidson et al., 2019; Hutchinson et al., 2020), in part due to reliance on spurious features, which we discuss in §8. Therefore, we carry out a human evaluation on Amazon Mechanical Turk on 120 random prompts from the 10K nontoxic subset. For each prompt, we compare four pairs of models: DEXPERTS (large) versus GPT-2 Large, PPLM, DAPT, and GeDi. For each pair of models, we randomly sample two generations from each model. This results in a total of 120 prompts × 4 pairings/prompt × 2 generations/pairing = 960 comparisons. Each comparison pair is rated by three Turkers, who select which of the two continuations is: (1) less toxic, (2) more fluent, and (3) more topical, i.e., whether the continuation is natural, relevant, and follows logically from the prompt. A screenshot of the user interface is provided in Appendix C.
Figure 2: Results of human evaluation for detoxification. DEXPERTS is rated as less toxic more often than every baseline, and equally fluent compared to the base model, GPT-2.
Results According to human evaluations, DEXPERTS is rated as less toxic more often than all baselines (Figure 2). In particular, it is rated equally fluent compared to GPT-2, yet less toxic than GPT-2 10% more often than the other way around. See Appendix E for examples of generations.
3.3 Steering GPT-3
We next use DEXPERTS to steer GPT-3 Ada. Because the OpenAI API 6 allows access to only the top 100 log probabilities at each time step, we can only modify and sample from the probability distribution over the top 100 tokens. Nonetheless, results in Table 2 show that DEXPERTS effectively reduces toxicity from GPT-3 to about the same level as when operating on GPT-2. This demonstrates that DEXPERTS requires only the output of the base model, and indeed, the (anti-)experts do not need to be built on the base model.
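When only the top log probabilities of the base model are visible, the same ensemble can be applied to that restricted candidate set. The sketch below illustrates the idea under two labeled assumptions: the returned tokens have already been mapped to ids in the (anti-)experts' vocabulary, and `steer_top_logprobs` is a hypothetical helper, not part of any API. Log probabilities differ from logits only by a constant shared across tokens, so the softmax result is unaffected.

```python
import numpy as np

def steer_top_logprobs(top_token_ids, top_logprobs, z_expert, z_anti, alpha):
    """Reweight the base model's top-k candidates with (anti-)expert logits.

    top_token_ids: ids of the k tokens exposed by the base model
                   (assumed to index into the (anti-)expert vocabulary).
    top_logprobs:  base-model log probabilities of those k tokens.
    z_expert, z_anti: full-vocabulary logits from the expert and anti-expert.
    Returns a probability distribution over the k candidates only.
    """
    ids = np.asarray(top_token_ids)
    scores = np.asarray(top_logprobs, dtype=float)
    scores = scores + alpha * (z_expert[ids] - z_anti[ids])
    scores = scores - scores.max()             # numerical stability
    probs = np.exp(scores)
    return probs / probs.sum()

# Toy example: the base model exposes only tokens 0 and 2.
z_expert = np.array([0.0, 0.0, 1.0])
z_anti = np.array([1.0, 0.0, 0.0])
top_ids = [0, 2]
top_lp = [np.log(0.6), np.log(0.3)]

unsteered = steer_top_logprobs(top_ids, top_lp, z_expert, z_anti, alpha=0.0)
steered = steer_top_logprobs(top_ids, top_lp, z_expert, z_anti, alpha=1.0)
```

With alpha set to 0 the function simply renormalizes the base probabilities over the visible candidates; with a positive alpha, mass shifts toward the token the expert prefers.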
3.4 Analysis: Dataset Size
In practice, gathering large amounts of toxic data may be challenging, especially in applications where we would want to customize the anti-expert LM for differing notions of harmful language. To explore the limited data setting, we investigate the relationship between the dataset size used to train the (anti-)experts and their effectiveness at steering the base model. We finetune GPT-2 Large
6 https://openai.com/api/
4 Sentiment-Controlled Generation
As a second application we consider the well-studied task of controlling the polarity of text's sentiment (e.g., Sudhakar et al., 2019), steering towards either positive or negative sentiment.
We use the same pretrained model from §3, GPT-2 Large, as our base LM. We finetune GPT-2 (Small, Medium, Large) on a positive sentiment corpus for our positive LM, and on a negative sentiment corpus for our negative LM. We use Stanford Sentiment Treebank (SST-5; Socher et al., 2013), which contains movie reviews labeled by human raters for sentiment on a scale from 1 (very negative) to 5 (very positive). Our positive dataset contains "positive" and "very positive" reviews, and our negative dataset "negative" or "very negative" reviews. Each of these sentiment datasets has about 4K reviews. For ease of notation we consider the positive LM our expert and negative LM our anti-expert, and use α = ±3.2 for steering in each direction. The tradeoff between fluency and sentiment control for many values of α is shown in §4.3.
Table 3: Results for experiments in sentiment-controlled generation. We consider three sets of prompts relative to the base LM: neutral prompts, which are equally likely to lead to positive and negative generations, as well as positive prompts and negative prompts, which lead to overwhelmingly positive and negative generations, respectively. Sentiment is measured as the mean percentage of positive generations out of the 25 continuations for each prompt, according to HuggingFace's sentiment analysis classifier. Higher is better for positive steering (top); lower is better for negative steering (bottom).
4.2.1 Generation Prompts
In order to test our method's ability to control sentiment beyond the domain that the sentiment experts are trained on (movie reviews), we collect a dataset of 100K naturally occurring prompts from the OpenWebText Corpus (OWT) (Gokaslan and Cohen, 2019) . Details are outlined in Appendix B. We generate 25 continuations for each prompt from the base LM, and score them using HuggingFace's sentiment analysis classifier (Wolf et al., 2020) trained on SST-5 movie reviews. Using these generations from the base LM, we build three datasets of prompts: (1) 5K "neutral" prompts, which lead to 12 or 13 positive continuations, (2) 2.5K "negative" prompts, which lead to 25 negative continuations, and (3) 2.5K "positive" prompts, which lead to 24 or 25 positive continuations. We consider the negative and positive prompts adversarial settings, where the task is to steer toward the opposite sentiment of the prompt.
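The prompt-selection rule above amounts to bucketing each prompt by how many of its 25 base-LM continuations are classified as positive. A minimal sketch, with a hypothetical function name and with the thresholds taken directly from the text:

```python
def bucket_prompt(num_positive, k=25):
    """Assign a prompt to a set based on its count of positive continuations.

    Follows the counts stated in the text for k = 25 continuations:
    12 or 13 positive -> "neutral"; 0 positive (all 25 negative) -> "negative";
    24 or 25 positive -> "positive". Other prompts are not used (None).
    """
    if not 0 <= num_positive <= k:
        raise ValueError("num_positive must be between 0 and k")
    if num_positive in (12, 13):
        return "neutral"
    if num_positive == 0:
        return "negative"
    if num_positive >= 24:
        return "positive"
    return None
```

Note the asymmetry in the stated thresholds: negative prompts require all 25 continuations to be negative, whereas positive prompts admit 24 or 25 positive continuations.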
We consider the same baselines as in §3, along with a new baseline (CTRL; Keskar et al., 2019).
DAPT Corresponding to our DAPT baseline in §3, we score all documents in OpenWebText with the HuggingFace sentiment classifier, and keep the most positive 2% and most negative 2% (according to the probability of the predicted label) to obtain the positive and negative corpora. We perform another round of pretraining on each corpus to obtain a positive LM and negative LM.
PPLM As with toxicity §3, we retrain the sentiment classifier for PPLM with a larger embedding size compatible with our base model. The training data used is SST-5. Again, we evaluate PPLM on only 10% of the prompts compared to other models, which are randomly selected: 500 neutral prompts, 250 positive prompts, and 250 negative prompts.
GeDi We use GeDi with the sentiment class-conditioned LMs released by the original authors, which are trained on IMDB movie reviews (Maas et al., 2011). (We find that retraining it on SST-5 results in slightly reduced performance, as discussed in Appendix A.)
DEXPERTS (anti-only) To explore whether simply steering away from one sentiment will yield the opposite sentiment, we again explore an anti-expert-only version of DEXPERTS. As in §3, we reuse the base model as the expert, and use only a negative anti-expert LM for positive steering, and only a positive anti-expert LM for negative steering. We use α = ±2.0 for this setting.
Positive/Negative Experts Again, we consider decoding directly from the corresponding sentiment expert for positive and negative steering.
Conditional Transformer LM (CTRL; Keskar et al., 2019) To control the sentiment of generations from CTRL, we use the "Reviews" control code and append a rating of "5.0" for positive generations and a rating of "1.0" for negative generations. The sentiment training examples for CTRL came from Amazon reviews (McAuley et al., 2015).
As with toxicity experiments (§3), we use nucleus sampling with p = 0.9, and include our training and generation details in Appendix A.
4.2.3 Automatic Evaluation
We evaluate our generations for the target sentiment, fluency, and diversity. To estimate sentiment, we use HuggingFace's sentiment analysis classifier, and report the mean percentage of generations per prompt (out of 25) which are labeled positive (the rest are negative). We evaluate fluency and diversity in the same ways as §3.
Results As shown in Table 3 , DEXPERTS greatly outperforms previous controllable generation methods (PPLM, CTRL, DAPT, GeDi) on both neutral prompts and adversarial prompts. The limited performance of CTRL suggests that the effectiveness of class-conditioned training on domain-specific data is limited to the domain of that data; training on Amazon reviews does not allow generalization outside of the reviews domain. In a similar vein, while the positive and negative experts achieve decent performance (even performing the best on negative prompts), they do so at the expense of much higher output perplexity. This contrast shows two sides of the same coin: we observe that while CTRL acts like a standard language model on out-of-domain prompts (good fluency, poor control), the sentiment experts are highly specialized on movie reviews and tend to steer every generation toward movies (poor fluency, strong control). Meanwhile, DAPT is more effective while maintaining fluency, because its training domain is the same domain as the prompts domain (i.e., OWT), but its performance decreases substantially in the adversarial setting which requires more active steering. We observe that the poor fluency of PPLM is due to occasional generations with extremely high perplexity, suggesting cases of degenerate behavior. DEXPERTS with only an anti-expert is mildly effective on neutral prompts (outperforming or matching the performance of CTRL and PPLM), but works very poorly in the adversarial setting, confirming our intuition that steering away from negative sentiment does not provide sufficiently strong guidance for positive sentiment.
4.2.4 Human Evaluation
For human evaluation, we randomly choose 30 neutral prompts, 30 positive prompts, and 30 negative prompts, and consider five pairs of models: DEXPERTS versus GPT-2, CTRL, PPLM, DAPT, and GeDi. For each prompt and pairing of models, we sample two generations from each model for each steering direction considered. This results in a total of 120 prompts × 5 pairings/prompt × 2 generations/pairing = 1200 pairs, each rated by 3 MTurk workers. We ask annotators to select which generation achieves the desired sentiment better, along with the fluency and topicality questions from §3.2.4.
Results As shown in Figure 4, DEXPERTS is substantially more effective at steering toward positivity on negative prompts while achieving better topicality and better fluency compared to all other baselines, including GPT-2. In the opposite setting of steering toward negativity on positive prompts, the gap in sentiment control performance between DEXPERTS and each of GPT-2, CTRL, DAPT, and PPLM is even more pronounced: DEXPERTS is rated better than its comparison 62-78% of the time. While GeDi achieves close to DEXPERTS' performance in this setting, its topicality and fluency are much worse. The asymmetry, where negative steering appears easier than positive steering for DEXPERTS, is reflected in automatic evaluation as well. We hypothesize that it is easier to derail a positive prompt with negativity than turn something negative into something positive; but to human readers, these negative continuations may be unexpected (a similar observation was made in previous work). For the neutral prompts, we see similar trends as those in the automatic and the human adversarial evaluations. Due to space constraints, we include those in Appendix D.2.
Figure 4: Results of human evaluation for steering toward positivity on negative prompts (left) and steering toward negativity on positive prompts (right). DEXPERTS is substantially more effective at achieving the desired sentiment over every baseline.
4.3 Analysis: Sentiment Versus Fluency
In practice, we may want different levels of sentiment control depending on the application (e.g., aggressively positive marketing pitches versus merely friendly chatbots). Figure 5 shows the relationship between output sentiment and fluency for different choices of α ∈ [−3.4, 3.4], conditioned on neutral prompts. The smooth tradeoff suggests that α can be adjusted by a practitioner or user, depending on their application. In our experiments, we pick α = ±3.2 because the curve becomes less steep, meaning that a greater cost in fluency does not return as great an increase in the desired sentiment. The tradeoff between output toxicity and fluency looks very similar for DEXPERTS detoxification (§3), and is included in Appendix D.1.
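The role of α as a dial can be illustrated on a toy distribution. The sketch below sweeps α over the same range and tracks the probability of a token that the positive LM favors; the three-token logits are invented for illustration and say nothing about the actual curves in Figure 5, which also involve fluency.

```python
import numpy as np

def dexperts_probs(z_base, z_expert, z_anti, alpha):
    """softmax(z + alpha * (z_plus - z_minus)), computed stably."""
    logits = z_base + alpha * (z_expert - z_anti)
    logits = logits - logits.max()
    probs = np.exp(logits)
    return probs / probs.sum()

# Hypothetical logits: token 0 is favored by the positive LM,
# token 1 by the negative LM, token 2 by neither.
z_base = np.array([1.0, 1.0, 0.0])
z_positive = np.array([2.0, 0.0, 0.0])
z_negative = np.array([0.0, 2.0, 0.0])

alphas = np.linspace(-3.4, 3.4, 9)
p_positive_token = np.array(
    [dexperts_probs(z_base, z_positive, z_negative, a)[0] for a in alphas]
)
```

In this toy case the probability of the positively flavored token rises monotonically with α and falls toward zero for negative α, mirroring how the sign of α selects the steering direction.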
5 Stylistic Rewriting With DEXPERTS
As a preliminary exploration, we go beyond generating text continuations to apply DEXPERTS to stylistic rewriting, i.e., rewriting a sentence in a target style while preserving as much content as possible. We replace the base model with a pretrained autoencoder, BART (Lewis et al., 2020) , and use GPT-2 Large sentiment (anti-)experts from §4 for steering. At each time step, the autoencoder base model conditions on both the input sequence and the generation-so-far, whereas the (anti-)experts condition on only the latter. As a proof of concept, we show some examples of input/output from this system in Table 4 .
Input → Output Examples
I love cats and seeing them play with yarn. → (α = −4.0) I love cats and seeing them play with rotten cereal.
Oatmilk is tasty and good for the environment. → (α = −3.5) Oatmilk is toxic and bad for the environment.
Great food but horrible staff and very very rude workers! → (α = 2.0) A very nice restaurant
Table 4: Examples of input/output from a preliminary system that applies DEXPERTS to stylistic rewriting. Recall α > 0 indicates positive rewriting, and α < 0 indicates negative rewriting.
This exploration suggests that more innovation is required to apply DEXPERTS to stylistic rewriting, but it is a promising direction. We anticipate future work on the subject.
6 Related Work
The task of controlling the output of a language generation model has been widely studied by previous work (for a review, see Prabhumoye et al., 2020) . Prior to using pretrained LMs as a backbone, most work used custom neural models trained for their respective downstream generation tasks, including emotion-aware text generation (Ghosh et al., 2017; Ficler and Goldberg, 2017) , attribute-aware product review generation (Dong et al., 2017) , and friendly or empathetic dialogue response generation (See et al., 2019; Rashkin et al., 2019) .
Since pretrained LMs have shown impressive text generation ability (Radford et al., 2018, 2019), two directions have emerged to control their language generation: training approaches and decoding-time approaches. Training approaches include finetuning the pretrained LMs on datasets that contain the desired attributes (Gururangan et al., 2020) as well as creating a class-conditioned pretrained LM trained on text prefixed with attribute-specific control codes (Keskar et al., 2019). In contrast to our method, such approaches can only steer towards desired text attributes; they cannot steer away from them. Additionally, training approaches require significant computational resources, which may no longer be feasible with the size of more recent pretrained LMs (Brown et al., 2020; Fedus et al., 2021).
Decoding-time methods, a more lightweight approach, have been used for controlling the attributes of generated text, as well as for improving its quality (Li et al., 2016; Holtzman et al., 2018; Welleck et al., 2020). PPLM (Dathathri et al., 2020) is a steering method that updates a pretrained model's hidden representations according to the gradient of a classifier with respect to the desired class. Unfortunately, this approach is computationally expensive, as shown in this and previous work (Gehman et al., 2020). Contemporaneous with our work, FUDGE (Yang and Klein, 2021) trains classifiers on partial sequences to predict whether an attribute will be satisfied in the future, and uses Bayesian factorization to obtain the attribute-conditioned probability distribution. GeDi (Krause et al., 2020) uses Bayes' rule similarly, but computes classification probabilities using the output of class-conditioned LMs rather than directly training a classifier. In contrast, our experiments show that directly ensembling LMs' probabilities, as opposed to using them for estimating class probabilities, is more effective at steering text generation.
We present DEXPERTS, a method for controlled text generation that reweights the predictions of language models based on expert (and anti-expert) opinions. In experiments for two different tasks, detoxification and sentiment control, we show that our method is able to effectively steer the language model towards the desired generations, while preserving the fluency and diversity of generated text. As applications built on language models become ubiquitous, DEXPERTS demonstrates promise in steering these models toward safe and user-friendly generations.
Our study is motivated by the potential harms of using pretrained language models (Bender et al., 2021), specifically their tendency to generate hateful, offensive, or toxic content (Sheng et al., 2020; Gehman et al., 2020). Part of our work requires automatically detecting toxicity in generated texts, for which we use the Perspective API, a commercially deployed toxicity detection tool. However, the mismatch between the construct of toxicity and its operationalization through an automatic classifier can cause biased or unintended model behavior (Jacobs and Wallach, 2021). Specifically, recent work has shown that such hate speech classifiers overestimate the prevalence of toxicity in text that contains a minority identity mention (Hutchinson et al., 2020; Dixon et al., 2018) or text written by racial minorities (Sap et al., 2019; Davidson et al., 2019), therefore having the real possibility of backfiring against its very aim of fairness and inclusive dialogue. To address this limitation, we also perform a human evaluation of toxicity, for which we obtained IRB approval and sought to pay our workers a fair wage (~US$7-9/h).
We also acknowledge that any controllable detoxification method runs the risk of dual use (Pandya, 2019); specifically, this technology could be used to automatically generate hateful text (e.g., extremist texts; McGuffie and Newhouse, 2020). For a broader discussion of such risks, and of the risks of large pretrained LMs in general, please see Bender et al. (2021).
Nevertheless, toxicity in pretrained LMs is an unsolved issue (Sheng et al., 2019; Gehman et al., 2020). Therefore, we hope future work continues to better define and evaluate the presence of harmful language, and to develop systems for mitigating such language that can be personalized to users' diverse experiences with language (e.g., dealing with reclaimed slurs appropriately; Croom, 2013).
Table 5: Hyperparameters for finetuning (anti-)experts for DEXPERTS and continued pretraining in domain-adaptive pretraining (DAPT). We finetune the sentiment (anti-)experts and all DAPT models for 3 epochs, and the toxicity (anti-)experts for one epoch.
The finetuning time for each model size is shown in
DAPT For our implementation of DAPT in sentiment experiments (§4), we use HuggingFace's sentiment analysis classifier to filter documents from OpenWebText (Gokaslan and Cohen, 2019) for the most positive 2% and most negative 2% of documents. Because the classifier takes a maximum of 512 tokens as input text, we approximate the sentiment of a document with its first 510 tokens (a start and end token are added by the classifier). The hyperparameters for the additional phase of pretraining on the attribute data are given in Table 5.
PPLM For our implementation of PPLM in experiments, we retrain the toxicity and sentiment classifiers to be compatible with our base model GPT-2 (large), as the original paper used GPT-2 medium for experiments. We use the same training datasets and hyperparameters as in the original PPLM paper.
Hyperparameter     Assignment
embedding size     1280
number of steps    10 epochs
learning rate      1e-4
batch size         64
Table 7: Hyperparameters for training the attribute classifiers used for PPLM.
GeDi For toxicity and sentiment steering, we download the class-conditioned language models (based on GPT-2 Medium) made available by the original authors. As an experiment, we also align the finetuning data for the sentiment GeDis and the (anti-)experts used in DEXPERTS by finetuning a new class-conditioned LM on SST-5 data (as opposed to the IMDB data used in GeDi). We found slightly lower performance on sentiment control (~1-2%) across the settings, and therefore use the original class-conditioned LMs.
A.3 Dataset Details
Dataset Size Non-Toxic Positive Negative
Tokens 63,457,536 13,240,192 57,805,184
Documents 1,320,876 264,837 1,208,186
A.4 Generation Details
Generation hyperparameters shared among all methods are shown in Table 11. Hyperparameters for PPLM generation are shown in Table 12. Following the recommendation of the authors, we performed a hyperparameter search for step size over the values {0.02, 0.06, 0.10, 0.20, 0.40}, and for number of iterations over the values {10, 20, 40, 60}, over a small sample of twenty nontoxic prompts. We picked step size 0.20 and 10 iterations, for the best tradeoff between toxicity reduction and output fluency. Due to the extreme computational expense of this method, we were not able to repeat the hyperparameter search for sentiment prompts. Hyperparameters for GeDi generation are shown in Table 13. A description of each hyperparameter can be found in
Table 13: Hyperparameters for generation with GeDi. A description of each hyperparameter can be found in Krause et al. (2020).
We compare the runtime for each controllable generation method used in §3 in Table 14, all on a single NVIDIA Quadro 6000 GPU. We see that DEXPERTS takes 2 to 3 times as long as decoding directly from the base model, depending on the size of the (anti-)experts. When using the same model size for the guiding language model as in GeDi (GPT-2 Medium), DEXPERTS is more efficient than GeDi, and both methods are 100× faster than PPLM.
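The PPLM hyperparameter search described above is a plain grid search over 5 × 4 = 20 configurations. The sketch below illustrates its shape only: `evaluate` and `cost` are hypothetical callables introduced here, and the text does not specify how toxicity reduction and fluency were traded off, so any scalar `cost` is an assumption.

```python
from itertools import product
from typing import Callable, Dict, Tuple

STEP_SIZES = [0.02, 0.06, 0.10, 0.20, 0.40]
NUM_ITERATIONS = [10, 20, 40, 60]

Config = Tuple[float, int]
Metrics = Tuple[float, float]  # (toxicity, perplexity); lower is better for both


def grid_search(
    evaluate: Callable[[float, int], Metrics],
    cost: Callable[[float, float], float],
) -> Tuple[Config, Dict[Config, Metrics]]:
    """Evaluate every (step size, iterations) pair and return the cheapest one.

    `evaluate` is assumed to generate from a small prompt sample and return
    (toxicity, perplexity); `cost` is an assumed way to combine the two.
    """
    results: Dict[Config, Metrics] = {}
    for step_size, n_iter in product(STEP_SIZES, NUM_ITERATIONS):
        results[(step_size, n_iter)] = evaluate(step_size, n_iter)
    best = min(results, key=lambda cfg: cost(*results[cfg]))
    return best, results
```

Because each of the 20 configurations requires a full generation pass, the cost of the search scales directly with the per-sample cost of the method, which is consistent with the search not being repeated for sentiment prompts.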
B Collection Of Sentiment Prompts
We build our prompts for sentiment experiments (§4) from the OpenWebText Corpus (Gokaslan and Cohen, 2019), a corpus of English web text scraped from outbound links on Reddit. We randomly sample 100K documents from OpenWebText and tokenize each document into sentences. Following the creation of RealToxicityPrompts (Gehman et al., 2020), we split each sentence into the prompt, consisting of the first half of tokens, and the continuation, consisting of the remaining tokens. We keep only prompts that are between 4 and 10 tokens long (inclusive). For all tokenization, we use the NLTK library (Bird and Loper, 2004). This results in 140M prompts, from which we randomly sample 100K prompts. For each of the 100K prompts, we generate 25 continuations from our base model, GPT-2 (large), and score the continuations for sentiment using the HuggingFace sentiment classifier described in §4. The distribution of prompts with n ∈ [0, 25] positive continuations out of 25 is shown in Figure 6. Interestingly, we observe that more prompts have mostly negative continuations than have mostly positive ones. Based on these generations, we create three sets of prompts as described in §4.
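The prompt-splitting and length-filtering steps can be sketched as below. This is an illustration under assumptions: sentences are taken as already tokenized lists (the paper uses NLTK; any tokenizer can be substituted), `build_prompts` is a name introduced here, and for sentences with an odd number of tokens the "first half" is assumed to round down.

```python
from typing import List, Tuple

Pair = Tuple[List[str], List[str]]  # (prompt tokens, continuation tokens)


def build_prompts(
    sentences: List[List[str]],
    min_len: int = 4,
    max_len: int = 10,
) -> List[Pair]:
    """Split each tokenized sentence in half and keep prompts of 4 to 10 tokens."""
    pairs: List[Pair] = []
    for tokens in sentences:
        half = len(tokens) // 2  # assumption: odd lengths round down
        prompt, continuation = tokens[:half], tokens[half:]
        if min_len <= len(prompt) <= max_len:  # inclusive bounds
            pairs.append((prompt, continuation))
    return pairs
```

Under these assumptions, only sentences of 8 to 21 tokens yield a prompt that survives the filter.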
C Human Evaluation
Our interface for human evaluation is shown in Figure 7. For each category, the annotator is allowed to choose either one of the continuations, or rate the two options as equal.
Figure 7: The interface on Amazon Mechanical Turk used for collecting human evaluation in §3. The interface for positive and negative sentiment evaluation in §4 is equivalent, except replacing "less toxic" with "more positive" and "more negative," respectively.
Figure 8 shows the relationship between output toxicity and fluency for different values of α in our method. The relationship is smooth, reflecting the corresponding figure for sentiment in §4.3. Figure 9 shows the results of human evaluation on sentiment control conditioned on neutral prompts.
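For reference, the role of α in the toxicity and fluency tradeoff can be illustrated with a small sketch of the logit combination: the difference between expert and anti-expert logits is scaled by α and added to the base model's logits before the softmax. The function names and toy logits here are illustrative assumptions, not the released implementation.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def combine_logits(
    z_base: List[float],
    z_expert: List[float],
    z_anti: List[float],
    alpha: float,
) -> List[float]:
    """Next-token distribution from softmax(z + alpha * (z_expert - z_anti))."""
    combined = [
        z + alpha * (zp - zn)
        for z, zp, zn in zip(z_base, z_expert, z_anti)
    ]
    return softmax(combined)
```

With α = 0 the output equals the base model's distribution; larger α moves probability mass toward tokens the expert prefers over the anti-expert, which gives stronger control but a distribution further from the base model.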