# NeMo-Reinforcer

- [Features](#features)
- [Prerequisites](#prerequisites)
- [Quick start](#quick-start)
- [GRPO](#grpo)
  - [Single Node](#single-node)
  - [Multi-node](#multi-node)
- [SFT](#sft)
  - [Single Node](#single-node-1)
  - [Multi-node](#multi-node-1)
- [DPO](#dpo)
  - [Single Node](#single-node-2)
  - [Multi-node](#multi-node-2)
- [Cluster Start](#cluster-start)
**NeMo-Reinforcer** is a scalable and efficient post-training library that supports models from tiny to over 100 billion parameters, and scales from a single GPU to thousands.
What you can expect:

- ✅ **Environment Support** - Support for multi-environment training.
- ✅ **Worker Isolation** - Process isolation between RL Actors (no worries about global state)
- ✅ **DPO Algorithm** - Direct Preference Optimization for alignment
- 🔜 **Larger Model Support** - Native PyTorch support for models up to 70B parameters
- 🔜 **Advanced Parallelism** - FSDP2, TP, SP, and sequence packing for efficient training
- 🔜 **Environment Isolation** - Dependency isolation between components
## Prerequisites

Install `uv` (`pip install uv`) to manage dependencies and run the examples.
**Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.
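For example, a typical shell setup might look like the following (the paths and key below are placeholders):

```sh
# Hugging Face cache locations (placeholder paths; pick ones that suit your system)
export HF_HOME=/path/to/hf_home
export HF_DATASETS_CACHE=$HF_HOME/datasets   # only if you need a separate datasets cache
# Weights & Biases API key for logging (placeholder)
export WANDB_API_KEY=<your-wandb-api-key>
# Required for gated models such as Llama
huggingface-cli login
```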
### GRPO
We provide a reference GRPO experiment configuration, trained for math benchmarks on the [OpenMathInstruct-2](https://huggingface.co/datasets/nvidia/OpenMathInstruct-2) dataset.
#### Single Node
To run GRPO on a single GPU for `Llama-3.2-1B-Instruct`:
```sh
# Run the GRPO math example using a 1B parameter model
uv run python examples/run_grpo_math.py
```
By default, this uses the configuration in `examples/configs/grpo_math_1B.yaml`. You can customize parameters with command-line overrides. For example, to run on 8 GPUs:
```sh
# Run the GRPO math example with a 1B parameter model on 8 GPUs
uv run python examples/run_grpo_math.py \
    cluster.gpus_per_node=8
```
You can override any of the parameters listed in the YAML configuration file.
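As an illustration, the run below combines override keys that appear elsewhere in this README; whether every key applies to `grpo_math_1B.yaml` is an assumption, so consult the config file for the authoritative list:

```sh
# Illustrative combination of command-line overrides
uv run python examples/run_grpo_math.py \
    cluster.gpus_per_node=8 \
    checkpointing.checkpoint_dir='results/grpo_math_1b' \
    logger.wandb_enabled=True \
    logger.wandb.name='grpo-math-1b-dev'
```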
### DPO

We provide a sample DPO experiment that uses the [HelpSteer3 dataset](https://huggingface.co/datasets/nvidia/HelpSteer3) for preference-based training.
#### Single Node
The default DPO experiment is configured to run on a single GPU. To launch the experiment:
```sh
uv run python examples/run_dpo.py
```
This trains `Llama-3.2-1B-Instruct` on one GPU.
If you have access to more GPUs, you can update the experiment accordingly. To run on 8 GPUs, update the cluster configuration and switch to an 8B Llama-3.1 Instruct model, as sketched below.
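The sketch below assumes the overrides follow the same pattern as the single-GPU run; the `policy.model_name` key and the model identifier are illustrative, so check `examples/configs/dpo.yaml` for the actual parameter names:

```sh
# Hypothetical 8-GPU DPO run with an 8B model (override names are illustrative)
uv run python examples/run_dpo.py \
    cluster.gpus_per_node=8 \
    policy.model_name="meta-llama/Llama-3.1-8B-Instruct"
```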
Refer to [dpo.yaml](examples/configs/dpo.yaml) for a full list of parameters that can be overridden. For an in-depth explanation of how to add your own DPO dataset, refer to the [DPO documentation](docs/guides/dpo.md).
#### Multi-node
For distributed DPO training across multiple nodes, modify the following script for your use case:
```sh
# Run from the root of NeMo-Reinforcer repo
# Number of nodes to use for your job
NUM_ACTOR_NODES=2
# Add a timestamp to make each job name unique
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
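# A sketch of the remaining launch steps, assuming run_dpo.py accepts the same
# style of overrides as the GRPO examples above; the job name and override keys
# here are illustrative, not the repository's exact values.
COMMAND="uv run ./examples/run_dpo.py --config examples/configs/dpo.yaml cluster.num_nodes=${NUM_ACTOR_NODES} logger.wandb_enabled=True logger.wandb.name='dpo-llama-${TIMESTAMP}'"
# Submit COMMAND through the Slurm/Ray launcher described in the cluster documentation.
```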
# Direct Preference Optimization (DPO)

[Direct Preference Optimization (DPO)](https://arxiv.org/pdf/2305.18290) is an RL-free alignment algorithm that operates on preference data. Given a prompt and a pair of chosen and rejected responses, DPO aims to increase the probability of the chosen response and decrease the probability of the rejected response relative to a frozen reference model. The actor is initialized from the reference model. For more details, refer to the [DPO paper](https://arxiv.org/pdf/2305.18290).
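For reference, the objective from the DPO paper is

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) =
-\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
\log \sigma\!\left(
\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
- \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
\right)\right],
$$

where $y_w$ and $y_l$ are the chosen and rejected responses, $\pi_{\mathrm{ref}}$ is the frozen reference model, and $\beta$ controls how far the policy may move from the reference.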
## Launch a DPO Run
The script [examples/run_dpo.py](../../examples/run_dpo.py) can be used to launch a DPO experiment. This script can either be launched locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](../cluster.md).
Be sure to launch the job using `uv`. The command to launch a DPO job is as follows:
```bash
uv run examples/run_dpo.py --config <PATH TO YAML CONFIG> <OVERRIDES>
```
If not specified, `config` will default to [examples/configs/dpo.yaml](../../examples/configs/dpo.yaml).
## Configuration
Reinforcer allows users to configure DPO experiments using `yaml` config files. An example DPO configuration file can be found [here](../../examples/configs/dpo.yaml).
To override a value in the config, either update the value in the `yaml` file directly, or pass the override via the command line. For example:
```bash
uv run examples/run_dpo.py \
    cluster.gpus_per_node=8 \
    dpo.sft_loss_weight=0.1 \
    dpo.preference_average_log_probs=True \
    logger.wandb.name="dpo-dev-8-gpu"
```
**Reminder**: Don't forget to set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). You'll need to do a `huggingface-cli login` as well for Llama models.
## Datasets
Each class representing a Reinforcer DPO dataset is expected to have the following attributes:
1. `formatted_ds`: The dictionary of formatted datasets. This dictionary should contain `train` and `validation` splits, and each split should conform to the format described below.
2. `task_spec`: The `TaskDataSpec` for this dataset. This should specify the name you choose for this dataset.
DPO datasets are expected to follow a specific format with three key fields:
- `prompt`: The input prompt/context
- `chosen_response`: The preferred/winning response
- `rejected_response`: The non-preferred/losing response
[data/hf_datasets/helpsteer3.py](../../nemo_reinforcer/data/hf_datasets/helpsteer3.py) provides an example of how to format data for DPO:
```python
def format_helpsteer3(data):
    response_1 = data["response1"]
    response_2 = data["response2"]
    overall_preference = data["overall_preference"]

    # A negative preference score means response 1 is preferred,
    # a positive score means response 2 is preferred, and 0 is a tie
    # (ties keep response 1 for both fields).
    if overall_preference < 0:
        chosen = response_1
        rejected = response_2
    elif overall_preference == 0:
        chosen = response_1
        rejected = response_1
    else:
        chosen = response_2
        rejected = response_1

    return {
        "prompt": data["context"],
        "chosen_response": chosen,
        "rejected_response": rejected,
    }
```
We also provide a [DPODataset](../../nemo_reinforcer/data/hf_datasets/dpo.py) class that is compatible with jsonl-formatted preference datasets. This class assumes the train and validation datasets have been split and processed into the expected format offline. The jsonl files should consist of examples with `prompt`, `chosen_response`, and `rejected_response` keys.
## Adding Custom DPO Datasets
Adding a new DPO dataset is straightforward. Your custom dataset class should:
1. Implement the required format conversion in the constructor
2. Set up the appropriate `task_spec`
Here's a minimal example which simply re-keys an existing jsonl dataset:
```{testcode}
from datasets import load_dataset
from nemo_reinforcer.data.interfaces import TaskDataSpec