
Human-Aware Loss Functions (HALOs)

Authors

  • Kawin Ethayarajh
  • Winnie Xu
  • Niklas Muennighoff
  • Dan Jurafsky

Abstract

From Kahneman & Tversky's seminal work on prospect theory (1992), we know that humans perceive random variables in a systematically distorted manner; for example, humans are famously loss-averse. We show that existing methods for aligning LLMs with human feedback implicitly model some of these distortions, making them human-aware loss functions (HALOs). However, the utility functions these methods impute to humans still differ in some ways from those in the prospect theory literature. By bridging this gap, we derive a HALO that directly maximizes the utility of LLM generations instead of maximizing the log-likelihood of preferences, as current methods do. We call our approach Kahneman-Tversky Optimization (KTO). KTO matches or exceeds the performance of direct preference optimization methods at scales from 1B to 30B. Moreover, because KTO does not need preference pairs, only knowledge of whether an output is desirable or undesirable for a given input, it is much easier to deploy in the real world, where the latter kind of data is far more abundant.

1. Introduction

Aligning models with human feedback has quickly become one of the most pressing questions in ML research. Yet the connection between this line of research and related work in behavioral economics has been under-explored. In this technical report:

1. We show that alignment methods work in part because they are human-aware loss functions (HALOs): they impute to humans a utility function that possesses many qualities of the utility functions that have been empirically derived in prospect theory. Through a series of experiments on the Pythia (Biderman et al., 2023) and Llama (Touvron et al., 2023) model families, we identify which HALOs yield more performant models and at what scales the improvements emerge.

2. Based on prospect theory (Tversky and Kahneman, 1992), we derive a new HALO called the Kahneman-Tversky Optimization (KTO) loss. Unlike existing state-of-the-art methods, KTO does not require paired preference data (x, y_w, y_l); it only needs (x, y) and knowledge of whether y is desirable or undesirable. KTO-aligned models are as good or better than DPO-aligned models at scales from 1B to 30B, despite not using paired preferences.

KTO is also far easier to use in the real world than preference optimization methods, as the kind of data it requires is far more abundant. For example, every retail company has a lot of customer interaction data along with knowledge of whether each interaction was successful (e.g., a purchase was made) or unsuccessful (e.g., no purchase was made). Such companies have little to no counterfactual data (i.e., what would have turned an unsuccessful customer interaction into a successful one).

3. To validate KTO and understand how alignment scales across model sizes, we are releasing Archangel, the largest-ever suite of human-feedback-aligned LLMs. It comprises 77 models: {7 pretrained models from 1B to 30B} × {11 different alignment methods}, all aligned on a mixture of the Anthropic HH (Ganguli et al., 2022), Stanford Human Preferences (Ethayarajh et al., 2022), and OpenAssistant (Köpf et al., 2023) datasets under nearly identical training settings.

2. Background

Large language models are traditionally trained in three stages:

1. Pretraining: Given some large corpus, train the model to predict the next token given the preceding text. The loss function is the cross-entropy loss (also called the "negative log-likelihood loss" or "standard loss"). Let's call the pretrained model π.

2. Supervised Finetuning: Still using the standard loss, finetune the model to predict the next token on data that is more relevant to the downstream task. Let's call this version π_ref.

3. Reinforcement Learning from Human Feedback:

Given a dataset D of human preferences (x, y_w, y_l), where x is an input, y_w and y_l are the preferred and dispreferred outputs, and r* is the "true" reward function, first assume that the probability humans will prefer y_w to y_l can be captured with a Bradley-Terry model of preferences (Bradley and Terry, 1952). Where σ is the logistic function:

p*(y_w ≻ y_l | x) = σ(r*(x, y_w) − r*(x, y_l))    (1)

Since getting the true reward from a human would be intractably expensive, we have to learn a reward model r_θ that can serve as a proxy, which is done by minimizing the negative log-likelihood of the human preference data:

L(r_θ) = E_{x, y_w, y_l ∼ D}[−log σ(r_θ(x, y_w) − r_θ(x, y_l))]
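For concreteness, here is a minimal PyTorch sketch of this reward-modelling loss, assuming the reward model has already produced scalar rewards for the preferred and dispreferred outputs (the function and variable names are ours, not from the report):

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of preferences under a Bradley-Terry model.

    chosen_rewards, rejected_rewards: shape (batch,), the scalars
    r_theta(x, y_w) and r_theta(x, y_l) produced by the reward model.
    """
    # -log sigmoid(r_theta(x, y_w) - r_theta(x, y_l)), averaged over the batch
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```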

Now we have a human proxy whose judgments we can use to critique the generations of π.

But solely maximizing the reward might come at the expense of things like generating grammatical text. To avoid such outcomes, we need a term to restrict how far the language model can drift from the useful version π_ref that already exists after finetuning. Where π is the model we are optimizing and π* is the model that optimally trades off these two concerns,

π* = arg max_π E_{x∈D, y∈π}[r_θ(x, y)] − β·KL(π(y|x) ∥ π_ref(y|x))    (2)

where KL is the KL-divergence between the two distributions and β > 0 is a hyperparameter. Since this objective is not differentiable, we need to use an RL algorithm like PPO (Schulman et al., 2017).
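To make the objective in (2) concrete, here is a sketch (our own, with illustrative names and an illustrative β default) of how it can be estimated from samples; note that because y is sampled from π itself, this estimate cannot simply be differentiated with respect to the policy parameters, which is why an RL algorithm is used:

```python
import torch

def rlhf_objective_estimate(rewards: torch.Tensor,
                            policy_logprobs: torch.Tensor,
                            ref_logprobs: torch.Tensor,
                            beta: float = 0.1) -> torch.Tensor:
    """Monte-Carlo estimate of the KL-regularized reward objective in (2).

    rewards:         (batch,) reward-model scores r_theta(x, y) for samples y ~ pi(.|x)
    policy_logprobs: (batch,) log pi(y|x), summed over tokens
    ref_logprobs:    (batch,) log pi_ref(y|x), summed over tokens
    beta:            weight on the KL penalty (illustrative default)
    """
    # For y sampled from pi, log pi(y|x) - log pi_ref(y|x) is a single-sample
    # estimate of KL(pi(.|x) || pi_ref(.|x)).
    kl_estimate = policy_logprobs - ref_logprobs
    return (rewards - beta * kl_estimate).mean()
```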

3. Do We Need Rl?

RLHF is not the only way to align LLMs, however. In fact, given the unstable nature of RLHF in a distributed setting, the research community is increasingly turning to closed-form loss functions that can be directly optimized on a dataset of human preferences. As we will see in the next section, these methods also have a connection to prospect theory (Tversky and Kahneman, 1992) .

3.1 Direct Preference Optimization

We know from earlier work (Peng et al., 2019) that the optimal language model for the objective in (2) would have the distribution:

π*(y|x) = (1/Z(x)) · π_ref(y|x) · exp((1/β) · r*(x, y))

where Z(x) is a partition function that turns the right-hand side into a probability. In a recent paper, Rafailov et al. (2023) rewrote the above in terms of the optimal reward:

r*(x, y) = β log (π*(y|x) / π_ref(y|x)) + β log Z(x)    (3)

They then plugged this back into equation (1) to express the preference probability only in terms of the optimal language model distribution π* and the reference distribution π_ref. This clever idea allows us to avoid calculating an explicit reward:

p*(y_w ≻ y_l | x) = 1 / (1 + exp(−(β log (π*(y_w|x) / π_ref(y_w|x)) − β log (π*(y_l|x) / π_ref(y_l|x)))))

Although we don't know what π* is, we know that the more aligned our language model is with human preferences, the greater p(y_w ≻ y_l | x) will be. This means that we can directly optimize our language model to minimize the negative log-likelihood of the observed human preferences, which is called the direct preference optimization (DPO) loss:

L_DPO(π_θ, π_ref) = E_{x, y_w, y_l ∼ D}[−log σ(β log (π_θ(y_w|x) / π_ref(y_w|x)) − β log (π_θ(y_l|x) / π_ref(y_l|x)))]    (4)
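A minimal PyTorch sketch of the loss in (4), assuming the sequence log-probabilities of y_w and y_l have already been computed under both π_θ and π_ref (names and the β default are illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO loss from equation (4); each argument has shape (batch,) and is
    a sequence log-probability log pi(y|x) summed over the tokens of y."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log pi_theta/pi_ref for y_w
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log pi_theta/pi_ref for y_l
    # -log sigmoid(beta * (log-ratio of y_w minus log-ratio of y_l))
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```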

According to the authors, their method works as well as traditional RLHF in theory and better in practice because it does not suffer from the former's training instabilities.

Figure 1: LLM alignment involves supervised finetuning followed by optimizing a human-aware loss (HALO). However, the paired preferences that existing approaches need are hard to get. Kahneman-Tversky Optimization (KTO) uses a far more abundant kind of data, making it much easier to use in the real world.


3.2 Sequence-Likelihood Calibration

Zhao et al. (2023) took a simpler approach: just make sure that the log probability of the preferred output is greater than that of the dispreferred output by a margin of at least δ:

L_cal(π_θ) = E_{x, y_w, y_l ∼ D}[max(0, δ − log π_θ(y_w|x) + log π_θ(y_l|x))]    (5)

As mentioned before, we don't want to drift too far from the reference model, which the authors enforce by adding a λ_reg-weighted cross-entropy term for samples generated from the reference model π_ref. This gives us the Sequence-Likelihood Calibration (SLiC) loss:

L_SLiC(π_θ, π_ref) = L_cal(π_θ) + λ_reg · E_{x ∼ D, y ∼ π_ref(·|x)}[−log π_θ(y|x)]

Notice that this doesn't have the neat equivalence to RLHF that DPO does; even if we only consider L_cal(π_θ), the implied preference model looks like

p*(y_w ≻ y_l | x) = min(0, (1/β)·r*(x, y_w) − (1/β)·r*(x, y_l) − δ − log (π_ref(y_w|x) / π_ref(y_l|x)))

which does not look like any conventional preference model. Since sampling from π_ref is slow, for the experiments in this paper we assume that the reference distribution recovers the SFT distribution and treat the λ_reg-weighted term as a standard language modelling loss.

As the standard loss is already incorporated, we just do a single stage of alignment; otherwise, the models would effectively undergo two epochs of supervised finetuning, precluding an apples-to-apples comparison.
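Under the simplification above (treating the regularization term as a standard language-modelling loss on SFT targets), the SLiC loss can be sketched as follows; the names and the default values of δ and λ_reg are assumptions of ours:

```python
import torch

def slic_loss(policy_chosen_logps: torch.Tensor,
              policy_rejected_logps: torch.Tensor,
              policy_sft_logps: torch.Tensor,
              delta: float = 1.0,
              lambda_reg: float = 0.1) -> torch.Tensor:
    """SLiC: rank-calibration term (5) plus a weighted cross-entropy regularizer.

    policy_chosen_logps / policy_rejected_logps: (batch,) log pi_theta(y_w|x), log pi_theta(y_l|x)
    policy_sft_logps: (batch,) log pi_theta(y|x) on SFT targets, standing in for
                      samples from pi_ref as described above
    """
    # max(0, delta - log pi_theta(y_w|x) + log pi_theta(y_l|x))
    cal = torch.clamp(delta - policy_chosen_logps + policy_rejected_logps, min=0.0)
    # lambda_reg-weighted standard language-modelling loss
    reg = -policy_sft_logps
    return (cal + lambda_reg * reg).mean()
```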

3.3 PPO (Offline, One-Step)

The standard RLHF objective in (2) is typically optimized with a variant of Proximal Policy Optimization (PPO) (Schulman et al., 2017), which works by "clipping" how far our language model can drift from the version π_old at the previous step. PPO is an online algorithm: generations are sampled from the current model, judged by a reward model, and then used to update the current version. However, this process is slow (largely due to sampling generations) and quite unstable in practice (especially in a distributed setting), so we can:

1. Never update π_old and keep it as π_ref, instead clipping less conservatively than we traditionally would.

2. Use preferences from an existing dataset instead of inferring them on-the-go.

Baheti et al. (2023) found that these changes, along with treating the entire output sequence as a single action (as opposed to treating the generation of each token as a separate action), greatly improve stability; they called their approach ALoL. However, since language model alignment has historically treated each token as a separate action, we omit the third change (sequence-level actions) and only preserve the first two. To make this even simpler, we won't even bother learning a reward and will just use +1 for y_w and −1 for y_l. The resulting loss looks like:

L_PPO (offline)(π_θ, π_ref) = −E_{x, y ∼ D, t}[min(q_t · A(x, y_{<t}, y_t), clip(q_t, 1 − ε, 1 + ε) · A(x, y_{<t}, y_t))]

where q_t = log (π_θ(y_t | x, y_{<t}) / π_ref(y_t | x, y_{<t})) and A(x, y_{<t}, y_t) is the per-token advantage (i.e., the surplus benefit from producing a given token in a given state).

Figure 2: Many alignment approaches work similarly well up to 7B parameters. Surprisingly, despite using +1/-1 dummy rewards, our offline PPO variant matches DPO at scales up to 13B. However, SFT+DPO is uniquely performant at the 30B scale, though it's possible that using less noisy data might cause this to be apparent at smaller scales. The bars denote the win rate minus 0.5, with a 90% binomial confidence interval.

Note that calling this method PPO is a misnomer because of these changes, but to avoid introducing too many new terms, we will call it "PPO (offline)".
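A rough PyTorch sketch of this offline PPO variant follows; it is our reading of the loss above rather than the released implementation, and the tensor names, the ε default, and the use of per-token log-ratios are assumptions:

```python
import torch

def offline_ppo_loss(policy_token_logps: torch.Tensor,
                     ref_token_logps: torch.Tensor,
                     mask: torch.Tensor,
                     labels: torch.Tensor,
                     eps: float = 0.2) -> torch.Tensor:
    """Offline PPO with dummy +1/-1 rewards.

    policy_token_logps, ref_token_logps: (batch, seq_len) per-token log-probs of y
                                         under pi_theta and the frozen pi_ref
    mask:   (batch, seq_len), 1 for output tokens and 0 for prompt/padding
    labels: (batch,), +1 if y is the preferred output and -1 if dispreferred
    eps:    clipping range
    """
    # per-token log-ratio of the current policy to the frozen reference
    q = policy_token_logps - ref_token_logps
    # dummy per-token advantage: +1 or -1, broadcast over the sequence
    adv = labels.unsqueeze(-1).expand_as(q).float()
    unclipped = q * adv
    clipped = torch.clamp(q, 1.0 - eps, 1.0 + eps) * adv
    # PPO takes the pessimistic (elementwise minimum) objective; we maximize it,
    # i.e. minimize its negative, averaged over output tokens
    per_token = torch.minimum(unclipped, clipped)
    return -(per_token * mask).sum() / mask.sum()
```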

3.4 Which Existing Method Works Best?

To benchmark these methods, we aligned Pythia-{1.4, 2.8, 6.9, 12.0}B (Biderman et al., 2023) and Llama-{7, 13, 30}B (Touvron et al., 2023) models on three well-known human-feedback datasets: Anthropic HH (Ganguli et al., 2022), OpenAssistant (Köpf et al., 2023), and the subset of SHP recommended in the original release (Ethayarajh et al., 2022). Because the Pythia models were pretrained on 0.3T tokens compared to 1.0T tokens for Llama, they are categorically under-performant; any cross-family comparisons should keep this in mind. All models were aligned under identical settings (e.g., same effective batch size, same optimizer, etc.), save for configurations unique to them. When applicable, we also did supervised finetuning (SFT), where the SFT targets are a subset of the generations used to subsequently align the model, following the precedent set by Rafailov et al. (2023). We then used GPT-4 to judge whether the aligned model's response was better than the SFT target for the given context with respect to helpfulness, harmlessness, and conciseness. Note that while the SFT target is considered a desirable output for input x, it is by no means the best output, meaning that it can be improved upon by an aligned model.

As seen in Figure 2 , some of our findings are surprising:

1. At the <7B scale, aligning the model offers no advantage over doing SFT alone. The only models that show a significant improvement from being aligned after SFT are Llama-{13B, 30B}, and this is only true when aligned with either our PPO variant or DPO.

2. DPO does not offer a significant advantage over PPO (off-policy, offline, one-step) until the 30B scale. This is quite surprising because this PPO variant does not use a learned reward model; it just uses a dummy reward of +1 for y_w and −1 for y_l. The fact that it works so well suggests that learning a good reward model is not as crucial as previously thought, and that a noisy reward may actually be helpful as an implicit regularizer.

Figure 3: Supervised finetuning makes LLM generations much shorter by preventing models from hallucinating several turns of a multi-turn conversation.

3. Both our PPO variant and DPO perform significantly better when you do SFT first, as is usually recommended. The biggest difference that SFT makes is that the outputs get a lot shorter because the LLM stops hallucinating an entire multi-turn conversation (Figure 3 ).

4. Human-Aware Losses

The economists Kahneman & Tversky are best known for their work on prospect theory, a theory of how humans make decisions about uncertain outcomes (Tversky and Kahneman, 1992). Most famously, this theory formalized notions such as loss aversion, the tendency of humans to be more sensitive to losses than gains of the same magnitude. The two points of prospect theory most relevant to this work are the findings that:

1. The utility of some outcome is always relative to some reference point (e.g., the money one has to begin with or is guaranteed to receive).

2. Human utility is not linear in the relative gain or loss; the rate of change in utility diminishes the further you move from the reference point.

Where z is the monetary reward from an outcome and z_ref is the baseline, Tversky and Kahneman (1992) proposed the following functional form for human utility, also called the human value function:

v(z; z_ref) = (z − z_ref)^α          if z ≥ z_ref
v(z; z_ref) = −λ · (z_ref − z)^α     if z < z_ref    (6)

where the median value of α = 0.88 and λ = 2.25 across individuals. These values were determined via experiments that asked people for the certainty equivalent of a gamble (e.g., the minimum amount of guaranteed compensation someone would take in place of a particular gamble). For example, for a gamble that returned $100 with 80% probability and $0 with 20% probability, a person might say their certainty equivalent is $60, which is lower than the expected value of $80 because of humans' tendency to be loss-averse. There are other functional forms that have been proposed in later work as well (Gurevich et al., 2009). The salient qualities of a human value function are:

1. the existence of a reference point that is added or subtracted to get the relative gain or loss;

2. convexity of the value function in relative losses and concavity in gains (i.e., diminishing sensitivity the further you are from the reference point);

3. loss aversion (a greater rate of change in utility in the loss regime).
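For reference, the value function in (6) with the median parameters is only a few lines of Python (this is just equation (6) restated as code):

```python
def kahneman_tversky_value(z: float, z_ref: float,
                           alpha: float = 0.88, lam: float = 2.25) -> float:
    """Median human value function from Tversky and Kahneman (1992), equation (6):
    concave in gains, convex in losses, and loss-averse since lam > 1."""
    if z >= z_ref:
        return (z - z_ref) ** alpha          # diminishing sensitivity to gains
    return -lam * (z_ref - z) ** alpha       # losses loom larger than gains

# A $100 gain is worth about +57.5 in utility, while a $100 loss is worth
# about -129.5: the hallmark asymmetry of loss aversion.
print(kahneman_tversky_value(100, 0), kahneman_tversky_value(-100, 0))
```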

Figure 4: The utility functions (a.k.a. human value functions) implied by alignment methods are similar to those empirically derived by Tversky and Kahneman (1992) to describe the way people make decisions about uncertain monetary outcomes.

In Figure 4, we plot the value functions that the alignment methods impute to humans:

h_RLHF(x, y_w, y_l) = σ(r_RLHF(x, y_w) − r_RLHF(x, y_l))
h_DPO(x, y_w, y_l) = log σ(r_DPO(x, y_w) − r_DPO(x, y_l))
h_SLiC(x, y_w, y_l) = min(0, r_SLiC(x, y_w) − r_SLiC(x, y_l) − δ)

All of them have qualities of a Kahneman-Tversky value function: all of them acknowledge the existence of a reference point (namely the reward of the dispreferred output y_l); most are both concave in gains and convex in losses; most demonstrate loss aversion. For this reason, we call these methods human-aware loss functions (HALOs). The fact that DPO performance can be matched with offline PPO on dummy rewards (up to 13B parameters), as discussed in section 3.4, challenges the conventional wisdom in LLM alignment that places heavy emphasis on reward learning, instead suggesting that the implicit modeling of human biases plays a significant role in the success of HALOs.
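To see how these implied value functions compare, one can evaluate each as a function of the reward gap r(x, y_w) − r(x, y_l); the sketch below (a simplification of ours that collapses the formulas above to a single scalar argument) reproduces the qualitative shapes in Figure 4:

```python
import math

def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))

def h_rlhf(reward_gap: float) -> float:
    """Implied value as a function of r(x, y_w) - r(x, y_l)."""
    return sigmoid(reward_gap)

def h_dpo(reward_gap: float) -> float:
    return math.log(sigmoid(reward_gap))

def h_slic(reward_gap: float, delta: float = 1.0) -> float:
    return min(0.0, reward_gap - delta)

# Sweeping the gap from losses (negative) to gains (positive): the RLHF value
# saturates on both sides, the DPO value saturates in gains but is roughly
# linear in losses, and the SLiC value is linear in losses and flat in gains.
for gap in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(gap, round(h_rlhf(gap), 3), round(h_dpo(gap), 3), h_slic(gap))
```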

5. Kahneman-Tversky Optimization

If the usefulness of alignment methods is largely predicated on them being HALOs, then preference pairs may not be required. Instead of maximizing the likelihood of preferences, we can directly maximize the utility of outputs. We can do so by adapting the Kahneman-Tversky human value function (6) to the LLM setting:

1. The exponent α in the original value function makes it difficult to optimize, so we set h to be

h(z, z_ref) = σ(z − z_ref)

given that the logistic function σ is also concave in gains and convex in losses. We replace the loss-aversion coefficient λ with two hyperparameters λ_D, λ_U that weight the losses for desirable and undesirable examples respectively.

2. Since LLM generations do not have a monetary value associated with them, we replace the monetary reward z with the implicit reward under the RLHF objective, given in (3).

3. Humans have some sense of all the probable generations that can follow x, not just y_w and y_l. Thus it makes more sense for the reference point z_ref to be the expected reward under the optimal policy, not just for generations following x but for generations following any input x′:

E_{x′ ∼ D, y′ ∼ π*}[r*(x′, y′)].

Combining these changes, and assuming that Z(x) in (3) is the same for all inputs, we get a new objective:

h(x, y; β) = σ(r*(x, y) − E_{x′ ∼ D, y′ ∼ π*}[r*(x′, y′)])
           = σ(β log (π*(y|x) / π_ref(y|x)) − β · E_{x′ ∼ D}[KL(π* ∥ π_ref)])

Figure 5: Kahneman-Tversky Optimization (KTO) is as good or better than DPO at all scales, both when preceded or not preceded by supervised finetuning (SFT). For the Llama models, KTO does not need to be preceded by SFT to generate outputs that match SFT+DPO in quality. Error bars denote a 90% binomial confidence interval.


where

π* and π_ref are shorthand for π*(y′|x′) and π_ref(y′|x′) respectively.

We do not know what π* is, but we know that the more aligned our language model π_θ is, the greater the value h(x, y; β) will be. Therefore, based on whether a given generation y is considered "desirable" or "undesirable", we can optimize the following loss:

L_KTO(π_θ, π_ref) = E_{x, y ∼ D}[w(y) · (1 − v_KTO(x, y; β))]    (7)

where

v_KTO(x, y; β) = σ(r_KTO(x, y) − z_ref)    if y is desirable given x
v_KTO(x, y; β) = σ(z_ref − r_KTO(x, y))    if y is undesirable given x    (8)

with r_KTO(x, y) = β log (π_θ(y|x) / π_ref(y|x)), z_ref = E_{x′ ∼ D}[β · KL(π_θ(y′|x′) ∥ π_ref(y′|x′))], and w(y) equal to λ_D for desirable y and λ_U for undesirable y.

Implementation In practice, we estimate the KL term by matching inputs x′ with unrelated outputs y in the same batch (of size m) and then calculating max(0, (1/m) Σ log (π_θ(y|x′) / π_ref(y|x′))). However, we do not backpropagate through the KL term (i.e., we detach it from the computational graph), as this makes training much more stable. This means that the KL term purely serves to control how saturated the loss function is. Thus for a minibatch of m examples, each with a different input x and a corresponding output y that is (un)desirable, we get m losses that all share a single KL term.
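Putting the pieces together, here is a minimal PyTorch sketch of one plausible KTO implementation consistent with (7), (8), and the description above; the function signature, the β default, and the exact clamping convention are our assumptions, and the released code under ContextualAI/HALOs should be treated as authoritative:

```python
import torch

def kto_loss(policy_logps: torch.Tensor,
             ref_logps: torch.Tensor,
             policy_kl_logps: torch.Tensor,
             ref_kl_logps: torch.Tensor,
             desirable: torch.Tensor,
             beta: float = 0.1,
             lambda_d: float = 1.0,
             lambda_u: float = 1.0) -> torch.Tensor:
    """Sketch of the KTO loss.

    policy_logps, ref_logps:       (batch,) log-probs of each output y given its own input x
    policy_kl_logps, ref_kl_logps: (batch,) log-probs of the same outputs paired with
                                   unrelated inputs x' from the batch (for the KL estimate)
    desirable: (batch,) bool, True if y is desirable for x
    """
    # implicit reward r(x, y) = beta * log pi_theta(y|x) / pi_ref(y|x)
    rewards = beta * (policy_logps - ref_logps)

    # shared reference point: in-batch estimate of beta * KL(pi_theta || pi_ref),
    # clamped at zero and detached so that no gradient flows through it
    z_ref = beta * (policy_kl_logps - ref_kl_logps).mean().clamp(min=0.0).detach()

    # sigmoid(r - z_ref) for desirable y, sigmoid(z_ref - r) for undesirable y
    values = torch.where(desirable,
                         torch.sigmoid(rewards - z_ref),
                         torch.sigmoid(z_ref - rewards))
    weights = torch.where(desirable,
                          torch.full_like(values, lambda_d),
                          torch.full_like(values, lambda_u))
    # maximizing the weighted value is the same as minimizing weight * (1 - value)
    return (weights * (1.0 - values)).mean()
```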

Figure 6: On Llama-7B, we can randomly discard up to 90% of the desirable data before aligning with KTO and still exceed DPO performance. This is made possible by the fact that you can upweight the losses of the more scarce kind of data using the hyperparameters λ_D, λ_U so that their effective impact on the total data is similar for both groups.

By default, the loss weights λ_D = λ_U = 1. However, if there is an imbalance (i.e., there is more desirable than undesirable data or vice versa), then the hyperparameters should be set such that

λ_D · n_desirable / (λ_U · n_undesirable) ∈ [1, 4/3]

In other words, the effective ratio of desirable to undesirable losses in the data should be somewhere from 1:1 to 1.33:1. If we randomly discarded 90% of the desirable data, for example, then n_desirable / n_undesirable = 0.1, so λ_D (with λ_U = 1) should be between 10 and 13.33.
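As a small worked example of this weighting rule (a helper of our own; fixing one of the two weights to 1 is just one way to satisfy the constraint):

```python
def kto_weights(n_desirable: int, n_undesirable: int,
                target_ratio: float = 1.0) -> tuple:
    """Choose (lambda_D, lambda_U) so that lambda_D*n_D / (lambda_U*n_U) equals
    a target in [1, 4/3], fixing one of the two weights to 1."""
    ratio = n_desirable / n_undesirable
    if ratio >= 1.0:
        # more desirable data: upweight the scarcer undesirable examples
        return 1.0, ratio / target_ratio
    # more undesirable data: upweight the scarcer desirable examples
    return target_ratio / ratio, 1.0

# Example from the text: keep 10% of the desirable data, so n_D/n_U = 0.1.
print(kto_weights(1_000, 10_000, target_ratio=1.0))      # (10.0, 1.0)
print(kto_weights(1_000, 10_000, target_ratio=4 / 3))    # (13.33..., 1.0)
```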

Results We align the same suite of models as in section 3 on the same data with the KTO loss (see Figure 5 ). We find that:

1. SFT+KTO is competitive with SFT+DPO at all scales, despite not using pairs of preferences.

2. KTO alone is significantly better than DPO alone for the Llama-{7B, 13B, 30B} models. In fact, a KTO-aligned Llama-{13B, 30B} model is competitive with its SFT+KTO counterpart, despite not undergoing supervised finetuning first, and is the only alignment method of the ones we tested to show this behavior.

3. We can randomly discard up to 90% of the desirable data before aligning with KTO and still exceed DPO performance (the same holds for undesirable data, as seen in Figure 6 ).

It is worth noting that these results understate the practical improvement that KTO has over DPO. In real-world settings, KTO will have access to far more data than DPO-like methods because it does not rely on paired preference data. For example, a retail company will have a lot of customer interactions and knowledge of whether they went well or poorly (i.e., (x, y, 1[y is desirable])); they will have little counterfactual data of the type (x, y_w, y_l).

6. Archangel

We are releasing all 77 models we trained as the Archangel suite: {4 Pythia models + 3 Llama models} × {SFT, SLiC, SFT+SLiC, DPO, SFT+DPO, PPO (offline), SFT+PPO (offline), KTO, SFT+KTO, CSFT, SFT+CSFT}. [1] The models were all trained and sampled under nearly identical settings (e.g., same random seed, same optimizer, same learning rate scheduler, effective batch size of 32, etc.). Hyperparameters unique to a model were set according to a sweep. Unsurprisingly, values of hyperparameters that had the same meaning across different loss functions (e.g., β in KTO and DPO) ended up having the same value. Because some methods relied on pairs of preferences and others did not, the order in which the training data was seen was different across the two kinds of losses (i.e., preference-based vs. preference-free) but identical within the same type of loss. The prompts used to sample generations for GPT-4 judgments were identical across all models. The model prompts follow the chat format of TÜLU 2 (Ivison et al., 2023). Additionally, models trained with conditional tokens should have either <|good|> or <|bad|> appended to the prompt. By aligning these 77 models in close-to-identical settings, we hope that the research community can better understand how the effectiveness of alignment evolves across different methods and across different scales.

7. Future Work

The existence of HALOs as a distinct class of functions raises many interesting questions:

• Is there a human value function -and corresponding HALO -that better describes how humans see language? The KTO loss is based on the median human value function for monetary gains and losses, which is almost certainly different from how humans perceive the relative goodness/badness of text. So what does a human value function for language specifically look like? What is its median form and how does it vary across individuals?

• What differences in helpfulness/harmfulness emerge at different scales? All else constant, are feedback-aligned LLMs more likely to be sycophantic when they are larger (Perez et al., 2022), as some others have pointed out? Or is harmfulness more of an issue with smaller models, simply because they have a worse sense of what is good and bad?

[1] Models are available on Huggingface and our code is available on Github under ContextualAI/HALOs.

• Given that the data that KTO needs is much more accessible, how far can we push synthetic data? For example, if we wanted to create a toxicity dataset to align our models to be less toxic, creating a tuple (x, y_w, y_l) where y_l is more toxic than y_w is tricky. However, with KTO, we can easily create a dataset (x, y, 1[y is desirable]) where desirability is determined by some black-box toxicity detection API. The ability to align models with score-based data is a huge appeal of PPO, and KTO permits a binary version of this.