
Commit aaf4266

jveronvialard, odelalleau, and terrykong authored
feat: adding support for Bradley-Terry reward model training (NVIDIA-NeMo#609)
Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
Signed-off-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
Signed-off-by: Julien Veron Vialard <50602890+jveronvialard@users.noreply.github.com>
Co-authored-by: Olivier Delalleau <507137+odelalleau@users.noreply.github.com>
Co-authored-by: Terry Kong <terrycurtiskong@gmail.com>
1 parent 1f6aa5d commit aaf4266

File tree: 14 files changed, +1479 −50 lines changed


.github/workflows/cicd-main.yml

Lines changed: 1 addition & 0 deletions
```diff
@@ -205,6 +205,7 @@ jobs:
 time uv run --no-sync bash ./tests/functional/grpo_multiturn.sh
 time uv run --no-sync bash ./tests/functional/grpo_non_colocated.sh
 time uv run --no-sync bash ./tests/functional/dpo.sh
+time uv run --no-sync bash ./tests/functional/rm.sh
 time uv run --no-sync bash ./tests/functional/eval.sh
 time uv run --no-sync bash ./tests/functional/eval_async.sh
 time uv run --no-sync bash ./tests/functional/test_mcore_extra_installed_correctly.sh
```

README.md

Lines changed: 47 additions & 1 deletion
````diff
@@ -27,6 +27,9 @@
 - [DPO](#dpo)
   - [DPO Single Node](#dpo-single-node)
   - [DPO Multi-node](#dpo-multi-node)
+- [RM](#rm)
+  - [RM Single Node](#rm-single-node)
+  - [RM Multi-node](#rm-multi-node)
 - [Evaluation](#evaluation)
   - [Convert Model Format (Optional)](#convert-model-format-optional)
   - [Run Evaluation](#run-evaluation)
@@ -338,7 +341,50 @@ For distributed DPO training across multiple nodes, modify the following script
 NUM_ACTOR_NODES=2
 
 COMMAND="uv run ./examples/run_dpo.py --config examples/configs/dpo.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 dpo.val_global_batch_size=32 checkpointing.checkpoint_dir='results/dpo_llama81_2nodes' logger.wandb_enabled=True logger.wandb.name='dpo-llama1b'" \
-RAY_DEDUP_LOGS=0 \
+CONTAINER=YOUR_CONTAINER \
+MOUNTS="$PWD:$PWD" \
+sbatch \
+    --nodes=${NUM_ACTOR_NODES} \
+    --account=YOUR_ACCOUNT \
+    --job-name=YOUR_JOBNAME \
+    --partition=YOUR_PARTITION \
+    --time=4:0:0 \
+    --gres=gpu:8 \
+    ray.sub
+```
+
+## RM
+
+We provide a sample RM experiment that uses the [HelpSteer3 dataset](https://huggingface.co/datasets/nvidia/HelpSteer3) for preference-based training.
+
+### RM Single Node
+
+The default RM experiment is configured to run on a single GPU. To launch the experiment:
+
+```sh
+uv run python examples/run_rm.py
+```
+
+This trains an RM based on `meta-llama/Llama-3.2-1B-Instruct` on one GPU.
+
+If you have access to more GPUs, you can update the experiment accordingly. To run on 8 GPUs, update the cluster configuration:
+
+```sh
+uv run python examples/run_rm.py cluster.gpus_per_node=8
+```
+
+Refer to the [RM documentation](docs/guides/rm.md) for more information.
+
+### RM Multi-node
+
+For distributed RM training across multiple nodes, modify the following script for your use case:
+
+```sh
+# Run from the root of NeMo RL repo
+## number of nodes to use for your job
+NUM_ACTOR_NODES=2
+
+COMMAND="uv run ./examples/run_rm.py --config examples/configs/rm.yaml cluster.num_nodes=2 cluster.gpus_per_node=8 checkpointing.checkpoint_dir='results/rm_llama1b_2nodes' logger.wandb_enabled=True logger.wandb.name='rm-llama1b-2nodes'" \
 CONTAINER=YOUR_CONTAINER \
 MOUNTS="$PWD:$PWD" \
 sbatch \
````
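The RM experiment above trains on HelpSteer3 preference annotations. Conceptually, each annotated pair must be reduced to a (chosen, rejected) ordering before Bradley-Terry training. A minimal sketch of that step follows; the field names (`score`, `response_a`, `response_b`) are illustrative placeholders, not the actual HelpSteer3 schema, and this is not NeMo RL's preprocessing code.

```python
def to_preference_pair(record):
    """Order one preference-annotated record into a (chosen, rejected) pair.

    Hypothetical schema: `score` > 0 means the annotator preferred
    `response_a`, `score` < 0 means they preferred `response_b`, and a
    tie (score == 0) carries no pairwise training signal, so we skip it.
    """
    if record["score"] == 0:
        return None  # ties contribute nothing to a Bradley-Terry loss
    if record["score"] > 0:
        return record["response_a"], record["response_b"]
    return record["response_b"], record["response_a"]
```

A pipeline would typically apply this per example and drop the `None` ties before batching chosen/rejected sequences.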

docs/guides/rm.md

Lines changed: 24 additions & 0 deletions
````markdown
# Reward Model Training in NeMo RL

This document explains how to train reward models (RMs) within NeMo RL. Currently, only Bradley-Terry reward models are supported, and only on the DTensor backend. Megatron backend support is tracked [here](https://github.com/NVIDIA-NeMo/RL/issues/720).

## Launch a Training Job

The script [examples/run_rm.py](../../examples/run_rm.py) is used to train a Bradley-Terry reward model. This script can be launched either locally or via Slurm. For details on how to set up Ray and launch a job using Slurm, refer to the [cluster documentation](../cluster.md).

Be sure to launch the job using `uv`. The command to launch a training job is as follows:

```bash
uv run examples/run_rm.py

# Can also add overrides on the CLI, like changing the config or the model
uv run examples/run_rm.py --config examples/configs/rm.yaml policy.model_name=Qwen/Qwen2.5-1.5B
```

The default YAML config shares the same base template as the SFT config but includes a new `reward_model_cfg` section with `enabled: true` to load the model as a reward model. You can find an example RM config file at [examples/configs/rm.yaml](../../examples/configs/rm.yaml).

**Reminder**: Set your `HF_HOME`, `WANDB_API_KEY`, and `HF_DATASETS_CACHE` (if needed). Make sure to log in using `huggingface-cli` if you're working with Llama models.

## Datasets

By default, NeMo RL supports the `HelpSteer3` dataset. This dataset is downloaded from Hugging Face and preprocessed on the fly, so there's no need to provide a path to any datasets on disk.
````
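The guide above refers to Bradley-Terry reward models without stating the objective. As background: a Bradley-Terry RM models P(chosen ≻ rejected) = sigmoid(r_chosen − r_rejected), so each preference pair contributes a loss of −log sigmoid(r_chosen − r_rejected). A framework-free sketch of that loss (for intuition only; this is not NeMo RL's actual implementation, and the function name is illustrative):

```python
import math

def bradley_terry_loss(chosen_rewards, rejected_rewards):
    """Mean negative log-likelihood of the Bradley-Terry preference model.

    For each pair, P(chosen > rejected) = sigmoid(r_c - r_r), so the
    per-pair loss is -log(sigmoid(r_c - r_r)) = log(1 + exp(-(r_c - r_r))).
    """
    losses = []
    for r_c, r_r in zip(chosen_rewards, rejected_rewards):
        x = r_c - r_r
        # Numerically stable -log(sigmoid(x)) for either sign of x
        loss = math.log1p(math.exp(-x)) if x >= 0 else -x + math.log1p(math.exp(x))
        losses.append(loss)
    return sum(losses) / len(losses)
```

With equal rewards the loss is log 2, and it shrinks as the chosen reward pulls ahead of the rejected one, which is the gradient signal that teaches the model to rank preferred responses higher.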

docs/index.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -28,6 +28,7 @@ guides/sft.md
 guides/dpo.md
 guides/grpo.md
 guides/grpo-deepscaler.md
+guides/rm.md
 guides/eval.md
 guides/deepseek.md
 model-quirks.md
```

examples/configs/rm.yaml

Lines changed: 146 additions & 0 deletions
```yaml
# Bradley-Terry (BT) Reward Model Training Configuration
rm:
  ## total number of steps to train will equal
  ## min((max_num_epochs * len(train_dataloader)), max_num_steps)
  max_num_epochs: 1
  max_num_steps: -1 # by default, train for 1 epoch

  val_period: 16
  val_batches: -1
  val_global_batch_size: 32
  val_micro_batch_size: 1
  val_at_start: false
  seed: 42

checkpointing:
  enabled: true
  checkpoint_dir: "results/rm"
  metric_name: "val_loss"
  higher_is_better: false
  keep_top_k: 3
  save_period: ${rm.val_period}

policy:
  model_name: "meta-llama/Llama-3.2-1B-Instruct"
  tokenizer:
    name: ${policy.model_name} ## specify if you'd like to use a tokenizer different from the model's default
    # We don't use the "default" chat template because the Llama tokenizer inserts the current
    # date in the system prompt, which could make the reward model's output date-dependent.
    chat_template: "{{- bos_token }}\n\n{#- This block extracts the system message, so we can slot it into the right place. #}\n{%- if messages[0]['role'] == 'system' %}\n    {%- set system_message = messages[0]['content']|trim %}\n    {%- set messages = messages[1:] %}\n{%- else %}\n    {%- set system_message = '' %}\n{%- endif %}\n\n{#- System message #}\n{{- '<|start_header_id|>system<|end_header_id|>\n\n' }}\n{{- system_message }}\n{{- '<|eot_id|>' }}\n\n{%- for message in messages %}\n    {{- '<|start_header_id|>' + message['role'] + '<|end_header_id|>\n\n' + message['content'] | trim + '<|eot_id|>' }}\n{%- endfor %}\n{%- if add_generation_prompt %}\n    {{- '<|start_header_id|>assistant<|end_header_id|>\n\n' }}\n{%- endif %}"
  train_global_batch_size: 128
  train_micro_batch_size: 1
  max_total_sequence_length: 8192
  precision: "bfloat16"
  fsdp_offload_enabled: false
  activation_checkpointing_enabled: false

  reward_model_cfg:
    enabled: true # loads model as a Reward Model (do not change)
    reward_model_type: "bradley_terry" # only "bradley_terry" is currently supported

  dtensor_cfg:
    enabled: true
    cpu_offload: false
    sequence_parallel: false
    activation_checkpointing: false
    tensor_parallel_size: 1
    context_parallel_size: 1
    custom_parallel_plan: null

  dynamic_batching:
    enabled: false

  sequence_packing:
    enabled: false

  # makes the training sequence length divisible by the tensor parallel size
  # this is useful for sequence parallel training
  make_sequence_length_divisible_by: ${policy.dtensor_cfg.tensor_parallel_size}
  max_grad_norm: 1.0

  optimizer:
    name: "torch.optim.AdamW"
    kwargs:
      lr: 2.0e-6
      weight_decay: 0.1
      betas: [0.9, 0.98]
      eps: 1e-5
      # when using DTensor, we need to set `foreach` and `fused` to false
      foreach: false
      fused: false

  ## ignored since enabled=false, but needed for testing purposes
  megatron_cfg:
    enabled: false
    empty_unused_memory_level: 1
    activation_checkpointing: false
    tensor_model_parallel_size: 2
    pipeline_model_parallel_size: 2
    context_parallel_size: 1
    pipeline_dtype: ${policy.precision}
    num_layers_in_first_pipeline_stage: null
    num_layers_in_last_pipeline_stage: null
    sequence_parallel: false

    optimizer:
      optimizer: "adam"
      lr: 2.0e-6
      min_lr: 1.9999e-6
      weight_decay: 0.1
      bf16: false
      fp16: false
      params_dtype: "float32"

      # adam
      adam_beta1: 0.9
      adam_beta2: 0.98
      adam_eps: 1e-5

      # sgd
      sgd_momentum: 0.9

      # distributed optimizer
      use_distributed_optimizer: true
      use_precision_aware_optimizer: true

      clip_grad: ${policy.max_grad_norm}

    scheduler:
      start_weight_decay: ${policy.megatron_cfg.optimizer.weight_decay}
      end_weight_decay: ${policy.megatron_cfg.optimizer.weight_decay}
      weight_decay_incr_style: "constant"
      lr_decay_style: "constant"
      lr_decay_iters: null
      lr_warmup_iters: 50
      lr_warmup_init: 1.9999e-6

    distributed_data_parallel_config:
      grad_reduce_in_fp32: false
      overlap_grad_reduce: true
      overlap_param_gather: false
      average_in_collective: true
      data_parallel_sharding_strategy: "optim_grads_params"

data:
  max_input_seq_length: ${policy.max_total_sequence_length}
  dataset_name: "HelpSteer3"

logger:
  log_dir: "logs" # Base directory for all logs
  wandb_enabled: true # Make sure you do a `wandb login [Your API key]` before running
  tensorboard_enabled: true
  mlflow_enabled: false
  monitor_gpus: true # If true, will monitor GPU usage and log to wandb and/or tensorboard
  wandb:
    project: "rm-dev"
    name: "rm-dev-${data.dataset_name}"
  tensorboard:
    log_dir: "tb_logs-rm-dev-${data.dataset_name}"
  gpu_monitoring:
    collection_interval: 10 # How often to collect GPU usage metrics (in seconds)
    flush_interval: 10 # How often to flush GPU usage metrics to the loggers (in seconds)

cluster:
  gpus_per_node: 1
  num_nodes: 1
```
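The config above relies on OmegaConf-style interpolations such as `${policy.model_name}` and `${rm.val_period}`, which resolve to the value found at the referenced dotted path when the config is loaded. To make that mechanism concrete, here is a toy resolver over plain dicts (a sketch for intuition only, not OmegaConf itself; it handles only whole-value `${...}` references, not nested or chained ones):

```python
import re

def resolve(cfg, root=None):
    """Resolve ${dotted.path} interpolations in a nested dict.

    Mimics, in miniature, how references like ${policy.model_name}
    in the YAML above are replaced by the value at that path.
    """
    root = cfg if root is None else root

    def lookup(path):
        node = root
        for key in path.split("."):
            node = node[key]  # walk the dotted path from the root
        return node

    out = {}
    for key, value in cfg.items():
        if isinstance(value, dict):
            out[key] = resolve(value, root)  # recurse, keeping the same root
        elif isinstance(value, str):
            match = re.fullmatch(r"\$\{([^}]+)\}", value)
            out[key] = lookup(match.group(1)) if match else value
        else:
            out[key] = value
    return out
```

For example, `checkpointing.save_period: ${rm.val_period}` resolves to the integer 16, so checkpoints are saved every validation period without repeating the number in two places.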
