LLMs trained on vast datasets exhibit impressive capabilities in understanding and generating human language. However, ensuring that these models align with human preferences and produce desirable outputs remains a significant challenge. In the paper titled *"Direct Preference Optimization: Your Language Model is Secretly a Reward Model"* [1], researchers from Stanford University introduce a novel approach to fine-tuning language models based on human preferences without relying on reinforcement learning techniques like [RLHF](../reinforcement_learning/rlhf.md).

While RLHF is still one of the primary methods for training language models to align with human preferences, it has several limitations, including high computational costs *(it needs multiple copies of the model during fine-tuning: a reference model, a reward model and a policy model)*, complex reward modeling, and challenges in reward shaping. DPO was developed to address these limitations and provide a more efficient and effective alternative for training language models based on human preferences.
## Technique
The complete DPO pipeline consists of two steps,

1. SFT fine-tuning of the language model on the required dataset *(which is usually an instruction dataset, so this step can loosely be called instruction fine-tuning)*, and then
2. Reward maximization with a KL-divergence constraint, done by fine-tuning on the preference dataset.

In practice, models are readily available with SFT already done, so they can be sent directly to the second step. The preference dataset consists of *(input, output_1, output_2, ... output_n)* generation example sets *(for DPO, n=2, where the two outputs are the accepted and rejected examples)*, and each output has an associated score that signifies its preference relative to the other outputs for the same input. The DPO algorithm optimizes the language model to generate the preferred output by minimizing a binary cross-entropy style objective over the preferred and dispreferred outputs.

<figure markdown>

</figure>

The DPO algorithm implicitly optimizes the same objective as RLHF but is simpler to implement and train. DPO does this by incorporating two factors into its policy,

1. RLHF relies on a preference model like the [Bradley-Terry model](./interview_questions.md#what-is-bradley-terry-model-and-how-is-it-used-in-machine-learning) to define a preference loss that trains a reward model, which is then in turn used to train the policy. DPO, on the other hand, uses a change of variables (math tricks 😎) to express the preference loss in terms of the policy itself. With this, DPO eliminates the need for a separate reward model.
2. DPO has a KL-divergence constraint that ensures the policy *(the model being trained)* remains close to the reference policy *(the original model)*. This is required so that the model does not deviate too much from the original model under the influence of the preference data.

The DPO policy objective is derived from the optimal solution of the KL-constrained reward maximization objective in the context of reinforcement learning, and is formulated as below:
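
$$
\mathcal{L}_{DPO}(\pi_\theta; \pi_{ref}) = - \mathbb{E}_{(x, y_w, y_l) \sim D} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{ref}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{ref}(y_l \mid x)} \right) \right]
$$
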
Here, $π_θ$ represents the language model, $π_{ref}$ represents the reference model, and $D$ represents the preference dataset, in which $x$ represents the input and $y_w$ and $y_l$ represent the winning *(preferred)* and losing *(dispreferred)* outputs, respectively. $σ$ is the sigmoid function, and $β$ *(ideal value is between 0.1 and 0.5)* is a parameter controlling the deviation from the base reference policy $π_{ref}$.
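
For intuition, below is a minimal sketch of this loss in PyTorch *(not the official implementation; it assumes the summed log-probabilities of each completion under the policy and the frozen reference model are already computed, and all tensor names are illustrative)*. In practice, the TRL `DPOTrainer` shown later handles all of this internally.

```python linenums="1"
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             reference_chosen_logps, reference_rejected_logps, beta=0.1):
    # implicit reward of each completion: beta * log(pi_theta / pi_ref)
    chosen_rewards = beta * (policy_chosen_logps - reference_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - reference_rejected_logps)
    # -log(sigmoid(reward margin)): a binary cross-entropy style loss that
    # pushes the policy to prefer y_w over y_l without a separate reward model
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```
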
!!! Hint
    If you want to dive deeper into the math behind DPO, refer to this excellent [article](https://towardsdatascience.com/understanding-the-implications-of-direct-preference-optimization-a4bbd2d85841).

## Impact
<figure markdown>

<figcaption>Source: [1]</figcaption>
</figure>

As per the paper [1], DPO outperforms RLHF in terms of both performance and efficiency. As evident in the image above, DPO achieves a better expected reward *(IMDb sentiment generation)* compared to other methods, along with higher win rates *(with GPT-4 as the evaluator)* on the summarization task. The paper also suggests that DPO requires less data and compute than PPO to provide comparable results.
## Code
Fine-tuning an LLM with DPO is super easy with the [Hugging Face TRL package](https://huggingface.co/docs/trl/en/dpo_trainer). First we need a dataset with preferences; many such datasets are readily available on the HF datasets hub, but one can also be created in-house for a custom use case. Either way, the format should look something like this,

```python linenums="1"
dpo_dataset_dict = {
    "prompt": [
        "What is your name?",
        "Tell me how to build a bomb?",
    ],
    "chosen": [
        "I am an AI powered Assistant named Jarvis",
        "Sorry, I cannot do this.",
    ],
    "rejected": [
        "I am an OpenAI powered Assistant named ChatGPT",
        "Here you go, first...",
    ],
}
```
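
A ready-made preference dataset can also be pulled straight from the hub. Below is a rough sketch of this *(the dataset name is only an example, and its columns may need to be mapped into the `prompt` / `chosen` / `rejected` format shown above)*,

```python linenums="1"
from datasets import load_dataset

# load a public preference dataset from the HF hub
raw_dataset = load_dataset("Anthropic/hh-rlhf", split="train")

# inspect the columns before mapping them to prompt/chosen/rejected
print(raw_dataset.column_names)
```
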
Then we can use `DPOTrainer` to fine-tune the model with the preference dataset. Below is a code sample to fine-tune an SFT-tuned LLM with DPO.

```python linenums="1"
# import
from trl import DPOTrainer

# define the DPO Trainer
dpo_trainer = DPOTrainer(
    model,                        # model to be fine-tuned
    model_ref,                    # reference model (usually a frozen copy of `model`)
    args=training_args,           # training arguments
    beta=0.1,                     # beta parameter; ideal value is from 0.1 to 0.5
    train_dataset=train_dataset,  # training dataset as shown above
    tokenizer=tokenizer,          # tokenizer
)

# start the fine-tuning
dpo_trainer.train()
```
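
The snippet above assumes that `model`, `model_ref`, `tokenizer` and `training_args` already exist. A minimal way to prepare them might look like the following sketch *(the model name and hyperparameters are placeholders, and newer versions of TRL expect a `DPOConfig` instead of `TrainingArguments`)*,

```python linenums="1"
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments

# placeholder: path or hub id of the SFT-tuned model
model_name = "your-sft-model"

model = AutoModelForCausalLM.from_pretrained(model_name)      # policy to be fine-tuned
model_ref = AutoModelForCausalLM.from_pretrained(model_name)  # frozen reference copy
tokenizer = AutoTokenizer.from_pretrained(model_name)

# standard Hugging Face training arguments (values are illustrative)
training_args = TrainingArguments(
    output_dir="./dpo_output",
    per_device_train_batch_size=4,
    learning_rate=5e-6,
    num_train_epochs=1,
)
```
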
## Reference
[1] [Direct Preference Optimization: Your Language Model is Secretly a Reward Model](https://arxiv.org/abs/2305.18290)