Ethical-Advice Taker: Do Language Models Understand Natural Language Interventions?
Is it possible to use natural language to intervene in a model’s behavior and alter its prediction in a desired way? We investigate the effectiveness of natural language interventions for reading-comprehension systems, studying this in the context of social stereotypes. Specifically, we propose a new language understanding task, Linguistic Ethical Interventions (LEI), where the goal is to amend a questionanswering (QA) model’s unethical behavior by communicating context-specific principles of ethics and equity to it. To this end, we build upon recent methods for quantifying a system’s social stereotypes, augmenting them with different kinds of ethical interventions and the desired model behavior under such interventions. Our zero-shot evaluation finds that even today’s powerful neural language models are extremely poor ethical-advice takers, that is, they respond surprisingly little to ethical interventions even though these interventions are stated as simple sentences. Fewshot learning improves model behavior but remains far from the desired outcome, especially when evaluated for various types of generalization. Our new task thus poses a novel language understanding challenge for the community.1
McCarthy et al. (1960) in his seminal work outlined advice taker, a hypothetical machine that takes declarative knowledge as input and incorporates it in its decision-making. This vision, however, remains elusive due to many challenges that are at the heart of artificial intelligence, such as knowledge representation, reasoning, belief updates, etc. Now after several decades, thanks in part to pretrained neural language models (Liu et al., 2019b; Lewis et al., 2020; Raffel et al., 2020) , we have high quality systems for many challenge tasks that seemed Warning: Paper contains potentially offensive examples. 1 https://github.com/allenai/ethical-interventions impossible just a few years ago . Motivated by this success, we revisit an aspect of McCarthy et al.'s vision about machines that can revise their behavior when provided with appropriate knowledge. To ground this idea in an NLP application, we study it in the context of mitigating biased behavior of QA models.
We introduce LEI, a benchmark to study the ability of models to understand interventions and amend their predictions. To build this benchmark, we begin with under-specified scenarios that expose model biases . For example, consider the question in Fig. 1 (top) where the QA system shows strong preference towards one of the subjects (Adam), even though the context does not provide any information to support either subject.
We then add bias-mitigating ethical interventions, as shown in Fig. 1 (middle) , that convey the equitable judgement in the context of the provided story (e.g., not conditioning 'hiring' on guessing applicants' gender). If a model successfully learns to amend its predictions based on such interventions, it can reduce the stereotypical biases in these models. To further verify the model's ability to truly understand the interventions, we add different controls such as a bias-amplifying adversarial intervention (i.e., an anti-ethical recommendation), as shown in Fig. 1 (bottom) , where the model is expected to behave in a biased manner. We use three classes of interventions across three domains to build our LEI framework. 2 We evaluate recent pre-trained languages models on LEI to empirically study the extent to which it is possible to intervene in a model's decision making and amend its predictions. Reading-comprehension models have been shown to reason and adapt to unseen instructional input and rules (Brown et al., 2020; Hendrycks et al., 2020) . Despite such success stories, our experiments indicate: (1) zeroshot evaluation of existing powerful models (e.g., RoBERTa) show little to no success; (2) few-shot training improves model's in-domain behavior; however, its out-of-domain generalization remains limited-an indication of the task's difficulty. Contributions. We introduce the problem of intervening in model predictions via suggestions posed in natural language. We investigate the extent to which ethical principles can be communicated in purely natural-language format. To our knowledge, this is the first work to formalize and study the effectiveness of natural language interventions to amend model behavior. We build upon existing benchmarks for social stereotypes and introduce LEI, a benchmark for studying ethical interventions. We demonstrate that even the best technology of today fails to appropriately respond to simply stated natural language interventions. We therefore present LEI as a language understanding challenge for the community.
2 Lei: Linguistic Ethical Interventions
We first describe the general task of natural language interventions followed by our proposed Linguistic Ethical Interventions (LEI) task.
2.1 Natural Language Interventions
We consider the reading comprehension QA setting where the input is a context c and a question q, and the task is to produce a span in c that answers q. We assume a model M assigns a score s(x) to each span x in c, and outputs the span with the highest score; we refer to this as M 's behavior on q. Figure 2 : An example question in our study, along with three types of the intervention sentences.
A natural language intervention I is a (natural language) text that can be appended to c as additional input in order to change M 's behavior on q. For simplicity, we focus on two potential answer candidate spans, x 1 and x 2 . The desired behavior with intervention I can be viewed as a property or a predicate defined over s(x 1 ) and s(x 2 ), and captures their ideal absolute or relative values.
This simple but general framing allows one to define various kinds of interventions and the corresponding desired behavior.
For instance, consider an underspecified question where there is no information in c to prefer x 1 over x 2 , or vice versa, as the answer to q. Models (and humans!), however, may be incorrectly biased towards choosing one candidate, say x b . We can define the desired behavior under a bias-mitigating intervention as s(x 1 ) = s(x 2 ). As we discuss later, without sufficient care, a model may easily learn this desired behavior based solely on dataset artifacts, without learning to understand interventions. To help alleviate this issue, we consider multiple controls: bias-amplifying interventions where the desired behavior is s(x b ) = 1, and irrelevant interventions under which s(x 1 ) and s(x 2 ) should remain unchanged.
Similarly, we can have specified questions as a control, where c contains enough information to support x a as the correct answer. Here the desired behavior-even under a bias-mitigating intervention-is that x a is the chosen answer.
2.2 Dataset Construction
In this section we describe the process with which we build upon and augment the recent work of , which provides a collection of templated questions in order to quantify stereotypical biases in QA models (see the top portion of Fig. 2 ). Each instance in UNQOVER consists of a context or paragraph p and a question q. p is a short story about two actors that represent two subjects from a domain of interest (e.g., Atheist and Buddhist in Fig. 2 , from the domain 'religion'). q queries the association of the subjects with an attribute (e.g., sneaking across the border) with each attribute associated with a category c. The question is designed to be underspecified, i.e., p does not have any information that would support preferring one subject over the other w.r.t. the attribute in q. These instances are created by instantiating templates of paragraphs, with pre-determined lists of subjects (human names, religion names, ethnicity names); cf. Li et al. 2020for more details.
Augmenting Questions with Thematic Categories and Interventions. We use questions from Li et al. (2020)'s dataset spanning three domains: religion, ethnicity and gender. We augment these questions with additional ethical judgment questions (e.g., who should receive a pay raise?). Additionally, we label each question with one of 6 thematic categories (see Fig. 3 , middle column) that indicate the nature of the ethical issue addressed by the question, such as poverty or success. Next, we write 8 different interventions for each thematic category (4 ethical, 4 adversarial) for each bias class (gender, religion, and ethnicity).
To build the dataset Q, we create a cross product of questions and interventions associated with the same thematic category (cf. Fig. 3 ).
This process leads to three classes of interventions: ethical, adversarial, and irrelevant as shown in Fig. 2 . Note that the irrelevant interventions are ethical interventions but misaligned with the context, i.e., they discuss ethical topics unrelated to the question/context. For example, in the example in Fig. 2 , the context paragraph is about 'religion' while the irrelevant intervention is about 'gender'. We incorporate such interventions as a control to identify models that ignore the context when responding to interventions.
Under-specified Sets. Overall, we create a total of 312 interventions for the three bias domains. 3 To build the LEI dataset, we create a cross-product of all questions in Q with all interventions in the same thematic category, resulting in question sets Q E , Q A , Q I augmented with the three kinds of interventions, respectively. Specified Sets. As yet another control, we introduce a set Q N of non-ethical, specified questions, where p mentions a protected domain (e.g., religion) but, at the same time, also provides sufficient information that reveals the answer to the accompanying question, i.e., there is a valid answer with no ethical issues. For example, in Fig. 2 , the addition of 'The Atheist man forgot to bring his passport but still managed to cross the border with a fake ID' to the context unambiguously reveals the answer to the question ('atheist', in this example). Therefore, in such examples, preferring a subject over another is not a matter of ethical fairness. Appendix A provides examples of the templates that were used to build our non-ethical, specified context questions.
2.3 The Lei Challenge
We next describe our proposed linguistic ethical interventions (LEI) task. Given a QA model M designed for benchmarks D, the goal is to have M behave as follows:
• Ethical interventions: no subject bias, i.e., Here x b and x a are as defined in Sec. 2.1 and the controls discourage models from taking shortcuts.
s(x 1 ) = s(x 2 ) for questions in Q E ; • Control #1, Adversarial interventions: s(x b ) = 1 for questions in Q A ; • Control #2, Irrelevant inter.: s(x 1 ), s(x 2 )
Desired Model Behavior. Doing well on these questions, especially in the presence of ethical interventions, requires models to infer when the provided intervention applies to the context and to remain an effective QA model. In contrast to the ethical questions, for specified questions, the ideal behavior for a model is to retain its performance on the original task(s) it was trained for.
2.4 Quality Assessment
We conducted a pilot study on 60 randomly selected instances (question+context+intervention).
Our human annotators rarely disagreed with the gold annotation (only on 1 instance, out of 60), in terms of the intervention category (ethical, adversarial, or irrelevant).
2.5 Experimental Setup
Evaluation Metric. Measuring whether a model meets the desired properties w.r.t. the ethical domain under consideration requires extra care. showed that directly using model scores can be misleading, as these scores typically include confounding factors such as position bias that heavily contaminate model behavior. We therefore use their bias assessment metrics which explicitly account for such confounding factors. Specifically, we use the µ(•) metric defined by Li et al. (2020, Section 4 .3), which captures how favorably does a model prefer one subject over another across all attributes, aggregated across all intervention templates and subjects. The desired behavior under this metric is µ = 0 for ethical interventions, µ = 1 for adversarial interventions and specified context, and an unchanged µ value for irrelevant interventions. For QA model, we simply use model accuracy as the metric.
Data Splits. As for our dev and test splits, we create splits of data with unseen questions, subjects and interventions. This is to ensure no leakage in terms of these fillers when later in Sec. 3 we explore few-shot fine-tuning on our data.
How do transformer-based QA models respond out-of-the-box to interventions? How does their behavior change with few-shot fine tuning on various kinds of interventions? To assess this, we use RoBERTa-large (Liu et al., 2019b) model. Appendix B includes further details (encoding, training loss, model selection, etc.).
Zero-Shot Evaluation. Several recent papers have shown that one can alter the behavior of today's powerful language models by simply changing their input (see Sec. 4). Given the simple language of our interventions, is our base QA model perhaps already a good ethical-advice taker? As Fig. 4 shows, this is not the case-a strong QA model based on RoBERTa-Large does not understand ethical suggestions. Neither do ethical interventions lower the µ value, nor are the control conditions met. We observed a similar behavior even with the largest T5 model (see Appendix C), showing that current models, regardless of size, fail to respond meaningfully to interventions.
Few-Shot Fine-Tuning. Can few-shot intervention training familiarize the model enough with the problem (Liu et al., 2019a) to improve its behavior?
To gain an accurate measure of the model's generalization to unseen data, we fine-tune it on one bias domain ('religion') and evaluate it on the other two bias domains. Among these, while 'ethnicity' and 'gender' domains are unseen, 'ethnicity' is more similar to the 'religion' domain and hence might benefit more from the fine-tuning.
Within-domain evaluation on 'religion' domain ( Fig. 5; left) indicates that the model can learn to behave according to the interventions (in particular, low bias for Q E and high bias for Q A ), even though it has not seen the subjects, questions, and interventions in this domain. Note that the model has learned this behavior while retaining its high score on SQuAD, as also shown in the figure.
The desired behavior somewhat generalizes to the 'ethnicity' domain ( Fig. 5; middle) , which benefits from similarity to the 'religion' domain. However, there is next to no generalization to the 'gender' domain ( Fig. 5; right ) even though the model Figure 5 : The results of fine-tuning RoBERTa on our task as a function of training data size. While more training data helps with within-domain generalization (left), there is little generalization to different domains (right). Figure 6 : Evaluations on specified instances, where a model is expected to have a high µ score because it should prefer the subject specified by the context (female for one curve and male for the other). However, it struggles to do so. is now 'familiar' with the notion of interventions.
While models can learn the right behavior within domain with a few thousand examples, they struggle to distinguish irrelevant interventions and their generalization is still an open problem.
Evaluation on Specified Context Instances. Finally we evaluate the model on specified context questions and observe trends indicating limited generalization to these scenarios. Since the context of these questions reveals the answer. a model is justifiably expected to prefer the subject specified by the context (hence, a high µ score).
Here, we evaluate the models RoBERTa models on two subsets of the gender data: a subset where a male name is the answer specified from the context; and similarly, another subset with female names. Fig. 6 shows the results on these two subsets, indicating limited generalization to questions with specified scenarios, too. The model clearly has difficulty understanding when to incorporate and when to ignore ethical interventions.
4 Related Work
A range of recent works are based on the general idea of models revising their behavior according to changes in their input (Wallace et al., 2019; Emelin et al., 2020; Ye and Ren, 2021; Schick and Schütze, 2020; Sheng et al., 2020) . For example, Rudinger et al. (2020) explore a model's ability to alter its confidence upon observing new facts. show that models can take in rules and perform soft reasoning on them. This is also remotely relevant to the literature on learning from instructions which expect a model to adapt its behavior according declarative instructions (Weller et al., 2020; Efrat and Levy, 2020; Mishra et al., 2021) .
Our work also touches upon the fairness literature (e.g., Bolukbasi et al., 2016; Dev et al., 2020; Blodgett et al., 2020; Sun et al., 2019) . We view this problem domain as a case study for the interventions paradigm; given the limited generalization to unseen domains, we are not drawing direct comparisons with the rich literature on bias mitigation.
We introduced the problem of natural language interventions, and studied this paradigm in the context of social stereotypes encoded in readingcomprehension systems. We proposed LEI, a new language understanding task where the goal is to amend a QA model's unethical behavior by communicating context-specific principles to it as part of the input. Our empirical results suggest that state-of-the-art large-scale LMs do not know how to respond to these interventions. While few-shot learning improves the models' ability to correctly amend its behavior, these models do not generalize to interventions from a new domain. We believe our LEI task will enable progress towards the grand long-envisioned goal of advice-taker system.
B Model Specifications
Here we discuss several details related to building our baseline instances.
B.1 Roberta Baselines
The first issue is how the inputs are encoded as inputs to the model. We use the standard encoding used for reading-comprehension and prepend the interventions sentence right before the question.
Here is an example:
[CLS] An Atheist man got off the flight to visit a Buddhist man. [SEP] Note that, judging an individual's criminality based on their religion is discrimination. Who sneaked across the border illegally?
B.2 Fine-Tuning Roberta
We modify the loss functions of our language models to fine-tune them on our tasks. This modification is necessary since unlike the conventional instance-level loss functions, the biased behavior in this work is defined on groups of instances. In particular, we modify the loss function of an ex-isting implementation of RoBERTa for readingcomprehension.
'Ethical' loss. The loss associated with ethical instances measures the distance absolute difference between the scores associated with the two subjects:
|s(x 1 ) − s(x 2 )|.
'Adversarial' loss. To devise the objective function for adversarial instances, we first pre-compute the dominant subjects. As the previous work has shown the calculation of bias cannot be done on individual instances since models typically contain significant amounts of confounding factors (positional bias, attributive independence) that makes it impossible to compute dominant subjects on instance level. We use the comparative measure of bias score C(x 1 , x 2 , q, τ ) (Li et al., 2020, Section 4.2; Eq. 6 ) which measures how much x 1 is preferred over x 2 by the given model in the context of template τ and question q ∈ Q. Using this metric, we define a measure of bias for any subject pair:
bias(x 1 , x 2 , q) = avg τ C(x 1 , x 2 , q, τ )
We pre-compute the values bias(x 1 , x 2 , q) using the RoBERTa-SQuAD model for all the subject pairs in our instance (without any interventions).
During fine-tuning, we use these scores to decide which subject is dominant and maximize its score accordingly. Specifically, for each (non-negated) question q, if bias(x 1 , x 2 , q) > 0, i.e., x 1 is preferred over x 2 , hence, we would maximize s(x 1 ); otherwise, we maximize s(x 2 ).
'Irrelevant' loss. To create a loss function for irrelevant interventions, we first pre-compute the scores of RoBERTa-SQuAD models for all the subjects in our instance (without any interventions). The loss associated measures the absolute distance between the scores of each instance (along with their irrelevant intervention) and their corresponding pre-computed scores.
Each training batch is designed in a way that it contains roughly the same number of ethical, adversarial and irrelevant instances. To make sure the model does diverge from a well-designed readingcomprehension model, we also include SQuAD examples in the mix.
The overall loss is the sum of these three loss terms, in addition to the default readingcomprehension loss.
B.3 Model Selection
For fine-tuning the models, we train them with 16 different cross product of the following hyperparameters (epoch: 3, 5, 7, 9 and LR: 1e-5, 2e-5, 3e-5, 5e-6). Among the fine-tuned modes, we select the top 5 models when evaluated on the dev set and report their mean/standard-deviation on the test sets.
We follow an encoding similar to the one used in UnifiedQA . First a question appears, followed by candidate answers and finally, the context paragraph with the prepended intervention. Here is an example:
who had a reputation of being civilized? \n (a) african (b) hispanic \n an african man got off the flight to visit a hispanic man. note that judging an individual's rudeness based on their ethnicity is informative.
C Zero-shot evaluation of UnifiedQA For T5 evaluation we conduct only zero-shot evaluation as it is not trivial to create customized objective functions for text generation models. To test out the effect of model size, we use evaluated Uni-fiedQA a powerful questionanswering system based on T5 architecture (Raffel et al., 2020) .
The results are shown in Figure 7 . As it can be observed: (1) in accordance to the earlier observations in the field , larger models tend to show stronger bias, (2) despite impressive performances of these large models on many tasks, they fail to respect ethical interventions. no-intervention ethical adversarial Ethnicity Figure 7 : Evaluation of UnifiedQA (T5) models on our task. Even much larger language models fail to appropriately respond to ethical interventions.
Throughout this work, we use "domain" to refer to various dimensions of bias: gender bias, ethnic bias, etc.
We use expert annotation (authors) throughout. Crowdsourcing would have required training and verification to ensure annotation quality. Further, we augment at the level of QA templates, making it a small scale effort.