Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
Signed-off-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Co-authored-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
Co-authored-by: Terry Kong <terryk@nvidia.com>
Each DPO dataset class is expected to have the following attributes:

1. `formatted_ds`: The dictionary of formatted datasets, where each dataset should be formatted like:

   ```json
   {
       "context": [], // list of dicts — The prompt message (including previous turns, if any)
       "completions": [ // list of dicts — The list of completions
           {
               "rank": 0, // int — The rank of the completion (lower rank is preferred)
               "completion": [] // list of dicts — The completion message(s)
           },
           {
               "rank": 1, // int — The rank of the completion (lower rank is preferred)
               "completion": [] // list of dicts — The completion message(s)
           }
       ]
   }
   ```

2. `task_spec`: The `TaskDataSpec` for this dataset. This should specify the name you choose for this dataset.
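For illustration, here is a minimal sketch of a class exposing these two attributes, assuming pre-split JSONL files whose records already use the `context`/`completions` format above. The class name and paths are hypothetical, and it assumes `TaskDataSpec` accepts a `task_name` argument:

```python
from datasets import load_dataset

from nemo_rl.data.interfaces import TaskDataSpec


class CustomDPODataset:
    """Hypothetical sketch: wrap pre-formatted JSONL preference data."""

    def __init__(self, train_data_path: str, val_data_path: str):
        # load_dataset("json", ...) yields one example per JSONL line;
        # the default split name for local JSON files is "train".
        self.formatted_ds = {
            "train": load_dataset("json", data_files=train_data_path, split="train"),
            "validation": load_dataset("json", data_files=val_data_path, split="train"),
        }
        self.task_spec = TaskDataSpec(task_name="custom_dpo")
```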
DPO training supports only two completions (where the lowest rank is preferred and the highest one is rejected), with each completion being a single response. For example:

```json
{
    "context": [
        {
            "role": "user",
            "content": "What's the capital of France?"
        },
        {
            "role": "assistant",
            "content": "The capital of France is Paris."
        },
        {
            "role": "user",
            "content": "Thanks! And what's the capital of Germany?"
        }
    ],
    "completions": [
        {
            "rank": 0,
            "completion": [
                {
                    "role": "assistant",
                    "content": "The capital of Germany is Berlin."
                }
            ]
        },
        {
            "rank": 1,
            "completion": [
                {
                    "role": "assistant",
                    "content": "The capital of Germany is Munich."
                }
            ]
        }
    ]
}
```
NeMo RL provides a DPO-compatible implementation of the [HelpSteer3](https://github.com/NVIDIA-NeMo/RL/blob/main/nemo_rl/data/hf_datasets/helpsteer3.py) dataset as an example. This dataset is downloaded from Hugging Face and preprocessed on-the-fly, so there's no need to provide a path to any datasets on disk.
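A quick way to inspect the formatted data, assuming the module exposes a `HelpSteer3Dataset` class (the class name is unverified; check the linked file for the actual API):

```python
from nemo_rl.data.hf_datasets.helpsteer3 import HelpSteer3Dataset

dataset = HelpSteer3Dataset()  # downloads and formats HelpSteer3 on first use
example = dataset.formatted_ds["train"][0]
print(example["context"])      # the prompt messages
print(example["completions"])  # two entries with ranks 0 and 1
```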
We also provide a [PreferenceDataset](../../nemo_rl/data/hf_datasets/preference_dataset.py) class that is compatible with JSONL-formatted preference datasets. You can modify your config as follows to use such a custom preference dataset:
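The snippet below is only a sketch of what such a config override might look like; the key names (`dataset_name`, `train_data_path`, `val_data_paths`) are assumptions, not the verified schema, so consult the shipped example configs:

```yaml
data:
  dataset_name: PreferenceDataset   # assumed key/value, not the verified schema
  train_data_path: /path/to/train.jsonl
  val_data_paths:                   # named validation sets, referenced by the notes below
    MyValidationSet: /path/to/val.jsonl
```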
- If you are using a logger, the prefix used for each validation set will be `validation-<NameOfValidationDataset>`. The total validation time, summed across all validation sets, is reported under `timing/validation/total_validation_time`.
- If you are doing checkpointing, the `metric_name` value in your `checkpointing` config should reflect the metric and validation set to be tracked. For example, `validation-<NameOfValidationDataset1>_loss`.
The older [DPODataset](../../nemo_rl/data/hf_datasets/dpo.py) class is deprecated. This class is also compatible with JSONL-formatted preference datasets. It assumes train and validation datasets have been split and processed into the expected format offline. The JSONL files should consist of examples with `prompt`, `chosen_response`, and `rejected_response` keys.
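For reference, one line of such a JSONL file would look like this (values are illustrative):

```json
{"prompt": "What is 2+2?", "chosen_response": "4", "rejected_response": "5"}
```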
`docs/guides/rm.md` (81 additions, 1 deletion):

## Datasets
Each RM dataset class is expected to have the following attributes:

1. `formatted_ds`: The dictionary of formatted datasets, where each dataset should be formatted like:

   ```json
   {
       "context": [], // list of dicts — The prompt message (including previous turns, if any)
       "completions": [ // list of dicts — The list of completions
           {
               "rank": 0, // int — The rank of the completion (lower rank is preferred)
               "completion": [] // list of dicts — The completion message(s)
           },
           {
               "rank": 1, // int — The rank of the completion (lower rank is preferred)
               "completion": [] // list of dicts — The completion message(s)
           }
       ]
   }
   ```

2. `task_spec`: The `TaskDataSpec` for this dataset. This should specify the name you choose for this dataset.
Currently, RM training supports only two completions (where the lowest rank is preferred and the highest one is rejected), with each completion being a single response. For example:

```json
{
    "context": [
        {
            "role": "user",
            "content": "What's the capital of France?"
        },
        {
            "role": "assistant",
            "content": "The capital of France is Paris."
        },
        {
            "role": "user",
            "content": "Thanks! And what's the capital of Germany?"
        }
    ],
    "completions": [
        {
            "rank": 0,
            "completion": [
                {
                    "role": "assistant",
                    "content": "The capital of Germany is Berlin."
                }
            ]
        },
        {
            "rank": 1,
            "completion": [
                {
                    "role": "assistant",
                    "content": "The capital of Germany is Munich."
                }
            ]
        }
    ]
}
```
NeMo RL provides a RM-compatible implementation of the [HelpSteer3](https://github.com/NVIDIA-NeMo/RL/blob/main/nemo_rl/data/hf_datasets/helpsteer3.py) dataset as an example. This dataset is downloaded from Hugging Face and preprocessed on-the-fly, so there's no need to provide a path to any datasets on disk.
We also provide a [PreferenceDataset](../../nemo_rl/data/hf_datasets/preference_dataset.py) class that is compatible with JSONL-formatted preference datasets. You can modify your config as follows to use such a custom preference dataset:
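As with the DPO guide above, the following is only a sketch with assumed key names (`dataset_name`, `train_data_path`, `val_data_paths`); consult the shipped RM example configs for the actual schema:

```yaml
data:
  dataset_name: PreferenceDataset   # assumed key/value, not the verified schema
  train_data_path: /path/to/train.jsonl
  val_data_paths:
    MyValidationSet: /path/to/val.jsonl
```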
- If you are using a logger, the prefix used for each validation set will be `validation-<NameOfValidationDataset>`. The total validation time, summed across all validation sets, is reported under `timing/validation/total_validation_time`.
- If you are doing checkpointing, the `metric_name` value in your `checkpointing` config should reflect the metric and validation set to be tracked. For example, `validation-<NameOfValidationDataset1>_loss`.