
Commit 009712d

Merge pull request #2042 from mrbullwinkle/mrb_12_17_2024_preference_fine_tuning
[Azure OpenAI] Direct Preference Optimization (fine-tuning)
2 parents 9bcb7c0 + d861c62 commit 009712d

4 files changed: +64 -1 lines changed

articles/ai-services/openai/how-to/fine-tuning.md

Lines changed: 52 additions & 0 deletions
@@ -97,6 +97,58 @@ Images containing the following will be excluded from your dataset and not used
Azure OpenAI fine-tuning supports prompt caching with select models. Prompt caching allows you to reduce overall request latency and cost for longer prompts that have identical content at the beginning of the prompt. To learn more about prompt caching, see [getting started with prompt caching](./prompt-caching.md).

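For example, repeated requests benefit most when they share a long, identical prefix. The following is a minimal sketch using the `openai` Python SDK; the endpoint, key, API version, and deployment name are placeholders, and the long static system prompt is kept at the start of every request so the shared prefix is a candidate for caching:

```python
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder endpoint
    api_key="<your-api-key>",                                    # placeholder key
    api_version="<api-version>",                                 # placeholder API version
)

# A long, static system prompt placed at the start of every request is what
# makes the shared prefix reusable across calls.
LONG_STATIC_SYSTEM_PROMPT = "You are a support assistant for Contoso. " + "(policy text) " * 500

def ask(question: str) -> str:
    response = client.chat.completions.create(
        model="<your-fine-tuned-deployment-name>",  # placeholder deployment name
        messages=[
            {"role": "system", "content": LONG_STATIC_SYSTEM_PROMPT},  # identical prefix
            {"role": "user", "content": question},                     # variable suffix
        ],
    )
    return response.choices[0].message.content
```
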
## Direct preference optimization (DPO) (preview)

Direct preference optimization (DPO) is an alignment technique for large language models, used to adjust model weights based on human preferences. It differs from reinforcement learning from human feedback (RLHF) in that it doesn't require fitting a reward model and uses simpler binary preference data for training. DPO is computationally lighter weight and faster than RLHF, while being equally effective at alignment.

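For background, the DPO objective as described in the research literature (a general formulation, not an Azure-specific detail) trains the model directly on pairs of preferred and non-preferred completions. The temperature β is the same value exposed later as the `beta` hyperparameter:

```math
\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) =
-\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}
\right)\right]
```

Here `x` is the prompt, `y_w` the preferred completion, `y_l` the non-preferred completion, `π_ref` the reference (starting) model, and `σ` the logistic function. Smaller values of β allow the tuned model to drift further from the reference model.
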
### Why is DPO useful?

DPO is especially useful in scenarios where there's no clear-cut correct answer, and subjective elements like tone, style, or specific content preferences are important. This approach also enables the model to learn from both positive examples (what's considered correct or ideal) and negative examples (what's less desired or incorrect).

DPO is also expected to make it easier for customers to generate high-quality training datasets. While many customers struggle to generate sufficiently large datasets for supervised fine-tuning, they often already have preference data collected from user logs, A/B tests, or smaller manual annotation efforts.

### Direct preference optimization dataset format

Direct preference optimization files have a different format than supervised fine-tuning. Customers provide a "conversation" containing the system message and the initial user message, and then "completions" with paired preference data. Users can only provide two completions.

There are three top-level fields: `input`, `preferred_output`, and `non_preferred_output`.

- Each element in `preferred_output`/`non_preferred_output` must contain at least one assistant message.
- Each element in `preferred_output`/`non_preferred_output` can only contain messages with the `assistant` or `tool` role.

```json
{
  "input": {
    "messages": [{"role": "system", "content": ...}],
    "tools": [...],
    "parallel_tool_calls": true
  },
  "preferred_output": [{"role": "assistant", "content": ...}],
  "non_preferred_output": [{"role": "assistant", "content": ...}]
}
```

Training datasets must be in `jsonl` format:

```jsonl
{"input": {"messages": [{"role": "system", "content": "You are a chatbot assistant. Given a user question with multiple choice answers, provide the correct answer."}, {"role": "user", "content": "Question: Janette conducts an investigation to see which foods make her feel more fatigued. She eats one of four different foods each day at the same time for four days and then records how she feels. She asks her friend Carmen to do the same investigation to see if she gets similar results. Which would make the investigation most difficult to replicate? Answer choices: A: measuring the amount of fatigue, B: making sure the same foods are eaten, C: recording observations in the same chart, D: making sure the foods are at the same temperature"}]}, "preferred_output": [{"role": "assistant", "content": "A: Measuring The Amount Of Fatigue"}], "non_preferred_output": [{"role": "assistant", "content": "D: making sure the foods are at the same temperature"}]}
```

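As a quick sanity check before uploading, a minimal sketch like the following (not an official validator; it assumes a local file named `train.jsonl`) verifies each line against the structural rules above:

```python
import json

ALLOWED_OUTPUT_ROLES = {"assistant", "tool"}  # only roles allowed in the output lists

def validate_dpo_record(line: str) -> None:
    """Check one JSONL line against the preference-format rules described above."""
    record = json.loads(line)

    # The three required top-level fields.
    for field in ("input", "preferred_output", "non_preferred_output"):
        if field not in record:
            raise ValueError(f"missing top-level field: {field}")

    for field in ("preferred_output", "non_preferred_output"):
        roles = [message.get("role") for message in record[field]]
        if "assistant" not in roles:
            raise ValueError(f"{field} must contain at least one assistant message")
        if not set(roles) <= ALLOWED_OUTPUT_ROLES:
            raise ValueError(f"{field} may only contain assistant or tool messages")

with open("train.jsonl", encoding="utf-8") as dataset:  # hypothetical file name
    for line in dataset:
        if line.strip():
            validate_dpo_record(line)
```
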
### Direct preference optimization model support

- `gpt-4o-2024-08-06` supports direct preference optimization in its respective fine-tuning regions. The latest region availability is updated in the [models page](../concepts/models.md#fine-tuning-models).

You can use preference fine-tuning with base models as well as models that have already been fine-tuned using supervised fine-tuning, as long as they're a supported model/version.

### How to use direct preference optimization fine-tuning?

1. Prepare `jsonl` datasets in the [preference format](#direct-preference-optimization-dataset-format).
2. Select the model and then select the customization method **Direct Preference Optimization**.
3. Upload datasets – training and validation. Preview as needed.
4. Select hyperparameters; the defaults are recommended for initial experimentation.
5. Review the selections and create a fine-tuning job. A code-based sketch of the same flow follows this list.

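The steps above describe the Studio experience. The following is a sketch only, assuming an `openai` Python package version and Azure OpenAI API version that expose the fine-tuning `method` parameter; the endpoint, key, API version, and file names are placeholders:

```python
from openai import AzureOpenAI  # assumes a recent openai package that supports the `method` parameter

client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",  # placeholder endpoint
    api_key="<your-api-key>",                                    # placeholder key
    api_version="<api-version>",                                 # placeholder API version
)

# Step 1: upload the preference-format training and validation datasets.
training_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
validation_file = client.files.create(file=open("validation.jsonl", "rb"), purpose="fine-tune")

# Steps 2-5: create the fine-tuning job with the DPO method; defaults are kept
# for every hyperparameter except an explicit beta.
job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",
    training_file=training_file.id,
    validation_file=validation_file.id,
    method={
        "type": "dpo",
        "dpo": {"hyperparameters": {"beta": 0.1}},
    },
)
print(job.id, job.status)
```
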
## Troubleshooting

### How do I enable fine-tuning?

articles/ai-services/openai/includes/fine-tuning-openai-in-ai-studio.md

Lines changed: 1 addition & 0 deletions
@@ -221,6 +221,7 @@ Optionally, configure parameters for your fine-tuning job. The following are ava
| `learning_rate_multiplier` | number | The learning rate multiplier to use for training. The fine-tuning learning rate is the original learning rate used for pre-training multiplied by this value. Larger learning rates tend to perform better with larger batch sizes. We recommend experimenting with values in the range 0.02 to 0.2 to see what produces the best results. A smaller learning rate may be useful to avoid overfitting. |
|`n_epochs` | integer | The number of epochs to train the model for. An epoch refers to one full cycle through the training dataset. If set to -1, the number of epochs is determined dynamically based on the input data. |
|`seed` | integer | The seed controls the reproducibility of the job. Passing in the same seed and job parameters should produce the same results, but may differ in rare cases. If a seed isn't specified, one will be generated for you. |
| `beta` | number | Temperature parameter for the DPO loss, typically in the range 0.1 to 0.5. This controls how much attention we pay to the reference model: the smaller the beta, the more the model is allowed to drift away from the reference model. |

You can choose to leave the default configuration or customize the values to your preference. After you finish making your configurations, select **Next**.

articles/ai-services/openai/includes/fine-tuning-studio.md

Lines changed: 3 additions & 0 deletions
@@ -288,6 +288,9 @@ The **Create custom model** wizard shows the parameters for training your fine-t
| `learning_rate_multiplier` | number | The learning rate multiplier to use for training. The fine-tuning learning rate is the original learning rate used for pre-training multiplied by this value. Larger learning rates tend to perform better with larger batch sizes. We recommend experimenting with values in the range 0.02 to 0.2 to see what produces the best results. A smaller learning rate may be useful to avoid overfitting. |
|`n_epochs` | integer | The number of epochs to train the model for. An epoch refers to one full cycle through the training dataset. |
| `seed` | integer | The seed controls the reproducibility of the job. Passing in the same seed and job parameters should produce the same results, but may differ in rare cases. If a seed isn't specified, one will be generated for you. |
| `beta` | number | Temperature parameter for the DPO loss, typically in the range 0.1 to 0.5. This controls how much attention we pay to the reference model: the smaller the beta, the more the model is allowed to drift away from the reference model. |

:::image type="content" source="../media/fine-tuning/studio-advanced-options.png" alt-text="Screenshot of the Advanced options pane for the Create custom model wizard, with default options selected." lightbox="../media/fine-tuning/studio-advanced-options.png":::

articles/ai-services/openai/whats-new.md

Lines changed: 8 additions & 1 deletion
@@ -11,7 +11,7 @@ ms.custom:
- references_regions
- ignite-2024
ms.topic: whats-new
-ms.date: 11/16/2024
+ms.date: 11/17/2024
recommendations: false
---

@@ -21,6 +21,13 @@ This article provides a summary of the latest releases and major documentation u
## December 2024

### Preference fine-tuning (preview)

[Direct preference optimization (DPO)](./how-to/fine-tuning.md#direct-preference-optimization-dpo-preview) is a new alignment technique for large language models, designed to adjust model weights based on human preferences. Unlike reinforcement learning from human feedback (RLHF), DPO doesn't require fitting a reward model and uses simpler binary preference data for training. This method is computationally lighter and faster than RLHF, while being equally effective at alignment. DPO is especially useful in scenarios where subjective elements like tone, style, or specific content preferences are important. We're excited to announce the public preview of DPO in Azure OpenAI Service, starting with the `gpt-4o-2024-08-06` model.

For fine-tuning model region availability, see the [models page](./concepts/models.md#fine-tuning-models).

### GPT-4o 2024-11-20

`gpt-4o-2024-11-20` is now available for [global standard deployment](./how-to/deployment-types.md) in:
