Commits (36):
- f5905f5 update prompt duplication logic (cmunley1, Feb 12, 2026)
- c650e15 switch to fsdp2 (cmunley1, Feb 12, 2026)
- 2e71cab switch to fsdp2 (cmunley1, Feb 12, 2026)
- b33f1df update nemo gym docs with sudoku results (cmunley1, Feb 12, 2026)
- 7142050 Merge branch 'main' into cmunley1/nemo-gym-fix (cmunley1, Feb 13, 2026)
- 6c81017 Merge branch 'main' into cmunley1/nemo-gym-fix (cmunley1, Feb 16, 2026)
- 80b1110 remove slurm from docs, rename to train.py (cmunley1, Feb 16, 2026)
- cd49530 Merge branch 'cmunley1/nemo-gym-fix' of github.com:cmunley1/trl into … (cmunley1, Feb 16, 2026)
- b3f2fc2 tighten doc (cmunley1, Feb 16, 2026)
- 0a4475a simplify (cmunley1, Feb 16, 2026)
- 5a3115c remove prompt from agent call (cmunley1, Feb 16, 2026)
- 56da739 Merge remote-tracking branch 'upstream/main' into cmunley1/nemo_gym (cmunley1, Feb 19, 2026)
- 7bdeb2c simplify user experience (cmunley1, Feb 19, 2026)
- 1b5e41c updates (cmunley1, Feb 23, 2026)
- 753d1a0 remove extras (cmunley1, Feb 23, 2026)
- 051a13e Merge remote-tracking branch 'upstream/main' into cmunley1/nemo_gym (cmunley1, Feb 23, 2026)
- 28929e2 ruff (cmunley1, Feb 24, 2026)
- c1b87d4 update doc (cmunley1, Feb 24, 2026)
- d143486 update doc (cmunley1, Feb 24, 2026)
- 127ab77 update nemo gym slurm script (cmunley1, Feb 24, 2026)
- 9dff291 update nemo gym ref (cmunley1, Feb 24, 2026)
- bebc9a7 Merge branch 'main' into cmunley1/nemo_gym (cmunley1, Feb 24, 2026)
- 682096d update doc (cmunley1, Feb 24, 2026)
- 89059a2 Merge branch 'cmunley1/nemo_gym' of github.com:cmunley1/trl into cmun… (cmunley1, Feb 24, 2026)
- 05d81de simplify doc (cmunley1, Feb 24, 2026)
- 0d8c9a3 update recipe (cmunley1, Feb 24, 2026)
- 2566327 remove slurm (cmunley1, Feb 24, 2026)
- aa63e43 remove extra links (cmunley1, Feb 24, 2026)
- ae4714f remove extra logging (cmunley1, Feb 24, 2026)
- d0c21aa remove retries (cmunley1, Feb 24, 2026)
- bbb830d dont mask failures (cmunley1, Feb 24, 2026)
- a469cd4 remove extra check (cmunley1, Feb 24, 2026)
- 4ef1446 remove hparam run name (cmunley1, Feb 24, 2026)
- 69cbd97 remove unused (cmunley1, Feb 24, 2026)
- 94a942e utils (cmunley1, Feb 24, 2026)
- ebb3625 docs (cmunley1, Feb 24, 2026)
2 changes: 1 addition & 1 deletion docs/source/example_overview.md
@@ -62,7 +62,7 @@ Scripts are maintained in the [`trl/scripts`](https://github.com/huggingface/trl
| [`examples/scripts/kto.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/kto.py) | This script shows how to use the [`experimental.kto.KTOTrainer`] to fine-tune a model. |
| [`examples/scripts/mpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/mpo_vlm.py) | This script shows how to use MPO via the [`DPOTrainer`] to align a model based on preferences using the [HuggingFaceH4/rlaif-v_formatted](https://huggingface.co/datasets/HuggingFaceH4/rlaif-v_formatted) dataset and a set of loss weights. |
| [`examples/scripts/nash_md.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/nash_md.py) | This script shows how to use the [`experimental.nash_md.NashMDTrainer`] to fine-tune a model. |
| [`examples/scripts/nemo_gym/train_multi_environment.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/nemo_gym/train_multi_environment.py) | This script shows how to use the [`GRPOTrainer`] to train language models in NVIDIA NeMo-Gym environments. Supports multi-turn and tool calling environments, and multi-environment training. See the [NeMo-Gym Integration](nemo_gym) guide for setup and usage. |
| [`examples/scripts/nemo_gym/grpo_nemo_gym.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/nemo_gym/grpo_nemo_gym.py) | This script shows how to use the [`GRPOTrainer`] to train language models in NVIDIA NeMo Gym environments. Supports multi-turn and tool calling environments, and multi-environment training. See the [NeMo Gym Integration](nemo_gym) guide for setup and usage. |
| [`examples/scripts/online_dpo.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/online_dpo.py) | This script shows how to use the [`experimental.online_dpo.OnlineDPOTrainer`] to fine-tune a model. |
| [`examples/scripts/online_dpo_vlm.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/online_dpo_vlm.py) | This script shows how to use the [`experimental.online_dpo.OnlineDPOTrainer`] to fine-tune a Vision Language Model. |
| [`examples/scripts/openenv/browsergym.py`](https://github.com/huggingface/trl/blob/main/examples/scripts/openenv/browsergym.py) | Simple script to run GRPO training via the [`GRPOTrainer`] with OpenEnv's BrowserGym environment and vLLM for VLMs |
273 changes: 48 additions & 225 deletions docs/source/nemo_gym.md
@@ -1,14 +1,14 @@
# NeMo Gym Integration

NVIDIA NeMo Gym is a library for building RL environments for large language models. This integration enables training models in NeMo Gym environments using TRL's GRPOTrainer with vLLM server mode.
NVIDIA NeMo Gym is a library for building RL environments for large language models. This integration enables training models in NeMo Gym environments using TRL's GRPOTrainer. Multi-turn and multi-environment training are both supported!

The integration supports multi-step and multi-turn rollouts, multi-environment training, and any NeMo Gym environment (thoroughly tested: workplace assistant, reasoning gym, MCQA, and math with judge).
Note that a minimum of 2 GPUs is currently required, as this integration relies on TRL's vLLM server mode.

## Why NeMo Gym

- **Production-Ready Scale**: Tested for frontier model training with diverse environments running in parallel across math, coding, tool use, reasoning, and more.
- **Multi-Verifier Training**: Supports algorithmic verification, LLM-as-a-judge, and custom verification logic in a single training run.
- **Decoupled Architecture**: Build agents and environments independently from the training loop; no RL framework expertise required.
- **Tested at scale**: Battle-tested RL infra used in Nemotron post-training.
- **Multi-environment training**: Supports parallel training of complex agents in diverse environments, such as coding agents, deep research, workplace tasks, math, science, and more.
- **Decoupled architecture**: Build agents and environments independently from the training loop; no RL framework expertise required.
- **OpenAI-Compatible API**: All environments use the standardized OpenAI Responses API for seamless integration with vLLM, OpenAI models, and other endpoints.

## Available Environments
@@ -24,270 +24,93 @@ NeMo Gym provides training-ready environments across multiple domains, including
| Instruction Following | Instruction Following | IFEval/IFBench style tasks |
| Reasoning Gym | Multiple | Single-step procedurally generated verifiable tasks across domains |

For a complete list of available training environments, refer to the [NeMo Gym repository](https://github.com/NVIDIA-NeMo/Gym#-available-resource-servers).
For a complete list of available training environments, refer to the [NeMo Gym repository](https://github.com/NVIDIA-NeMo/Gym).

## Before You Start
## Quickstart

Complete these one-time setup steps before running training.
First, install TRL and NeMo Gym along with a few extra packages:

### Install TRL and NeMo Gym

1. **Install TRL with vLLM extras**
```bash
cd trl/
uv venv --python 3.12
source .venv/bin/activate
uv sync --extra vllm
uv pip install fastapi uvicorn accelerate deepspeed wandb omegaconf

```bash
cd trl/
uv venv
source .venv/bin/activate
uv sync --extra vllm
```

1. **Install NeMo Gym**

```bash
# deactivate trl venv
deactivate
git clone https://github.com/NVIDIA-NeMo/Gym.git
cd Gym
uv venv --python 3.12
source .venv/bin/activate
uv sync
```
git clone https://github.com/NVIDIA-NeMo/Gym
uv pip install -e Gym/
```

### Prepare a Dataset

Many NeMo Gym datasets used to train Nemotron models are available on Hugging Face. Use `ng_prepare_data` to download and prepare datasets. This command:

- Downloads the dataset from Hugging Face
- Validates the data format
- Adds an `agent_ref` field to each example that tells NeMo Gym which agent server should handle that example

> **Note**: `train_multi_environment.py` adds the `agent_ref` field when loading datasets, so this step is optional if datasets are created another way.
In this example, we will train a model on the workplace assistant environment, a multi-step tool-use environment covering common office scenarios. The dataset is available on Hugging Face. Use `ng_prepare_data` to download and prepare it:

1. **Set Hugging Face Token**
```bash
cd Gym
echo 'hf_token: YOUR_HF_TOKEN' > env.yaml
ng_prepare_data \
"+config_paths=[responses_api_models/vllm_model/configs/vllm_model.yaml,resources_servers/workplace_assistant/configs/workplace_assistant.yaml]" \
+output_dirpath=resources_servers/workplace_assistant/data \
+mode=train_preparation \
+should_download=true \
+data_source=huggingface

Create `env.yaml` in `Gym/` with your HF token:

```yaml
hf_token: <your_hf_token>
```

1. **Prepare Dataset**

```bash
# Enter Gym and activate the venv
cd Gym
source .venv/bin/activate

# Set config paths
config_paths="responses_api_models/vllm_model/configs/vllm_model.yaml,\
resources_servers/workplace_assistant/configs/workplace_assistant.yaml"

# Download data and prep for training
ng_prepare_data "+config_paths=[${config_paths}]" \
+output_dirpath=data/workplace_assistant \
+mode=train_preparation \
+should_download=true \
+data_source=huggingface
```

This creates `train.jsonl` and `validation.jsonl` files in `data/workplace_assistant/`.

To create a new environment, refer to the [environment creation guide](https://docs.nvidia.com/nemo/gym/latest/contribute/environments/new-environment.html). We suggest running an existing one first!

#### Dataset Format

NeMo Gym datasets are stored as JSONL. Each line contains a task with input messages, tool definitions, metadata such as ground truth for verification, and an agent server reference. The following example shows the workplace dataset structure. Metadata fields can differ between datasets, as long as the corresponding resources server uses the fields appropriately.

```json
{
"responses_create_params": {
"input": [
{"role": "system", "content": "..."},
{"role": "user", "content": "Move any of jinsoo's tasks that are in review to completed"}
],
"tools": [...],
"parallel_tool_calls": false,
"temperature": 1
},
"ground_truth": [
{"name": "project_management_update_task", "arguments": "{...}"},
...
],
"category": "workbench_project_management",
"environment_name": "workbench",
"agent_ref": {
"type": "responses_api_agents",
"name": "workplace_assistant_simple_agent"
}
}
tail -n 100 resources_servers/workplace_assistant/data/validation.jsonl > resources_servers/workplace_assistant/data/validation_100.jsonl
```

## Interactive Training

For development and testing on a single node.
Make sure you now have `train.jsonl` and `validation_100.jsonl` in `resources_servers/workplace_assistant/data/`.
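A quick sanity check might look like this (a sketch; the paths assume the `ng_prepare_data` output directory used above, so adjust them to your setup):

```shell
# Check that both prepared data files exist and are non-empty.
DATA=resources_servers/workplace_assistant/data
for f in "$DATA/train.jsonl" "$DATA/validation_100.jsonl"; do
  if [ -s "$f" ]; then echo "found $f"; else echo "missing $f"; fi
done
```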

### Set Up

1. **Update Environment Config**

Update `env.yaml` in `Gym/` to include model information:

```yaml
policy_base_url: http://127.0.0.1:8000/v1
policy_api_key: EMPTY
policy_model_name: Qwen/Qwen2.5-1.5B-Instruct
hf_token: ...
```

2. **Update Training Config**

Update `examples/scripts/nemo_gym/config.yaml` to point to the dataset generated above, and any other optional modifications.

### Run Training

The following steps run in three terminals. They can also be run with background processes or in tmux.

1. **Start NeMo Gym Servers** (Terminal 1)
## Interactive Training

```bash
cd Gym/
source .venv/bin/activate
### Setup

config_paths="resources_servers/workplace_assistant/configs/workplace_assistant.yaml,\
responses_api_models/vllm_model/configs/vllm_model_for_training.yaml"
Update the dataset paths in the config: `examples/scripts/nemo_gym/config.yaml`.

ng_run "+config_paths=[${config_paths}]"
```
### Run Training

This starts:
- **Agent server**: Orchestrates rollouts using resource servers and model servers
- **Resources server**: Supports environment logic such as state-management, tool implementations, and task verification
- **Model server**: Adapts vLLM server requests to support NeMo Gym agents and on-policy RL training while ensuring OpenAI API compatibility
- **Head server**: Manages the servers used in training and enables their discovery
Training with NeMo Gym and TRL requires vLLM server mode. First, start the vLLM server:

1. **Start TRL vLLM Server on GPU 0** (Terminal 2)
1. **Start TRL vLLM Server on GPU 0**

```bash
cd trl/
source .venv/bin/activate
CUDA_VISIBLE_DEVICES=0 trl vllm-serve \
--model Qwen/Qwen2.5-1.5B-Instruct \
--max-model-len 16384 \
--host 0.0.0.0 \
--port 8000
```

1. **Run Training on GPU 1** (Terminal 3)

```bash
source trl/.venv/bin/activate
cd trl/examples/scripts/nemo_gym
export WANDB_API_KEY=...
uv add omegaconf

CUDA_VISIBLE_DEVICES=1 python train_multi_environment.py --config config.yaml
```
Now launch training!

## Multi-Node Training with Slurm

An example five-node training script is provided in `submit.sh`. Nodes one through four run the training algorithm, while node five runs vLLM inference for NeMo Gym agent rollouts.

1. **Configure the Script**

Update `submit.sh` with your Slurm account, partition, paths to your project directory, and updated training configs.

1. **Submit the Job**

```bash
sbatch submit.sh
```

1. **Monitor Training**

```bash
tail -f logs/<job_id>/*
```

> **Tip**: Set up wandb logging for detailed training metrics. For more details on TRL's vLLM integration, refer to the vLLM integration page.

## Multi-Environment Training

Train on multiple NeMo Gym environments simultaneously. This allows learning diverse capabilities, such as tool calling and math reasoning, in a single training run.

1. **Prepare Individual Datasets**

Prepare datasets for each environment. The workplace assistant dataset was prepared above. Now let's create a dataset for the mini sudoku environment implemented by the reasoning gym resources server in NeMo Gym:
1. **Run Training on GPU 1**

```bash
cd Gym
source .venv/bin/activate
uv add reasoning-gym
cd resources_servers/reasoning_gym
python scripts/create_dataset.py \
--task mini_sudoku \
--size 2000 \
--seed 42 \
--output data/reasoning_gym/train_mini_sudoku.jsonl

python scripts/create_dataset.py \
--task mini_sudoku \
--size 50 \
--seed 24 \
--output data/reasoning_gym/val_mini_sudoku.jsonl
CUDA_VISIBLE_DEVICES=1 python3 examples/scripts/nemo_gym/grpo_nemo_gym.py
```

1. **Create Combined Dataset**
You should see training progress with completions logged to the terminal! Set up WandB or Trackio to monitor detailed metrics.
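For example, enabling WandB logging might look like this in `config.yaml` (a sketch; the field names match the config shipped in this PR, where `report_to` defaults to `"none"`):

```yaml
# Sketch: switch logging from "none" to WandB.
report_to: "wandb"
project_name: "trl-nemo-gym"
log_completions: true
```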

Combine datasets into a single file with tasks from both environments:
Note that the workplace assistant environment is difficult for `Qwen/Qwen2.5-1.5B-Instruct`. To see quicker improvement, try an easier environment, such as `mini_sudoku` from the Reasoning Gym integration in NeMo Gym, or a larger model, such as `Qwen/Qwen3-4B-Instruct-2507`, with a larger global batch size.

```bash
cat data/workplace_assistant/train_workplace.jsonl data/reasoning_gym/train_mini_sudoku.jsonl | shuf > train_multi_env.jsonl
```

> **Tip**: Ensure datasets are the same size before shuffling for an even blend of tasks. Repeat for the validation dataset.

1. **Update Training Config**

Update the config to point to the combined dataset:
## Using Other Environments

```yaml
model_name: "Qwen/Qwen3-4B-Instruct-2507"

dataset_path: "/path/to/data/train_multi_env.jsonl"
eval_dataset_path: "/path/to/data/val_multi_env.jsonl"

task: "workplace-sudoku" # used in wandb run name
output_dir: "outputs/nemo_gym_multi_env"

# ... rest of config same
```
Using other NeMo Gym environments in TRL is simple. First, update `gym_configs` in `config.yaml` to point to the new NeMo Gym config file. Next, download or create a new dataset. Note that NeMo Gym datasets require an `agent_ref` field so that rollouts are generated in the correct environment for each task. Visit the [NeMo Gym documentation](https://docs.nvidia.com/nemo/gym/latest/) to learn more about configuration files, datasets, and creating new NeMo Gym environments.
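As a sketch, pointing the quickstart at the Reasoning Gym environment might look like this in `examples/scripts/nemo_gym/config.yaml` (the config path mirrors the other resources servers in the Gym repo; the dataset paths are placeholders for files you generate yourself):

```yaml
# Illustrative fragment -- adjust paths to your Gym checkout and data.
gym_configs:
  - resources_servers/reasoning_gym/configs/reasoning_gym.yaml
  - responses_api_models/vllm_model/configs/vllm_model_for_training.yaml

dataset_path: "/path/to/reasoning_gym/train_mini_sudoku.jsonl"
eval_dataset_path: "/path/to/reasoning_gym/val_mini_sudoku.jsonl"
```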

1. **Update ng_run**

Whether training interactively or via Slurm, update the `ng_run` command to include config files from each resources server:

```bash
cd Gym
source .venv/bin/activate

config_paths="responses_api_models/vllm_model/configs/vllm_model.yaml,\
resources_servers/workplace_assistant/configs/workplace_assistant.yaml,\
resources_servers/reasoning_gym/configs/reasoning_gym.yaml"

ng_run "+config_paths=[${config_paths}]"
```
## Multi-Environment Training

This starts servers for both environments. The training script automatically routes each example to the correct agent server based on its `agent_ref` field.
To train on multiple environments simultaneously, create a dataset with tasks from each environment and add every environment config to the `gym_configs` list in your training config. NeMo Gym automatically routes each example to the correct agent server based on its `agent_ref` field, so a single run can train across all environments in the batch.
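The dataset step can be sketched as follows (file names and contents are illustrative; in practice each line is a full task produced by NeMo Gym's data tooling, including its `agent_ref` field, and the two source files should be similar in size for an even blend):

```shell
# Tiny stand-in datasets -- real ones come from ng_prepare_data and
# create_dataset.py, one JSON task per line.
printf '{"env":"workplace","task":%s}\n' 1 2 3 > workplace_train.jsonl
printf '{"env":"sudoku","task":%s}\n' 1 2 3 > sudoku_train.jsonl

# Interleave both environments into a single shuffled training file.
cat workplace_train.jsonl sudoku_train.jsonl | shuf > train_multi_env.jsonl
wc -l < train_multi_env.jsonl   # 6 tasks total
```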

1. **Run Training**
Visit the NeMo Gym documentation to learn more about existing environments and how to build a new one!

Update the Slurm submission script to use the new training config and both `ng_run` resources server configs, then submit the job as before.
## Multi-Node Training with Slurm

The training script reads `agent_ref` from each example's metadata, routes requests to the correct NeMo Gym agent server, and handles different agents and environments in the same batch.
An example Slurm submission script is provided in `submit.sh`. Update it with your Slurm account, partition, and local paths, then submit with `sbatch submit.sh`.

## Resources

- [NeMo Gym GitHub](https://github.com/NVIDIA-NeMo/Gym)
- [NeMo Gym Documentation](https://docs.nvidia.com/nemo/gym/latest/)
- [Training Script](https://github.com/huggingface/trl/blob/main/examples/scripts/nemo_gym/train_multi_environment.py)
- [Training Script](https://github.com/huggingface/trl/blob/main/examples/scripts/nemo_gym/grpo_nemo_gym.py)
- [TRL GRPO Trainer](grpo_trainer)
31 changes: 17 additions & 14 deletions examples/scripts/nemo_gym/config.yaml
@@ -1,14 +1,19 @@
# Model
model_name: "Qwen/Qwen2.5-1.5B-Instruct"

# Data
dataset_path: "/home/ubuntu/Gym/resources_servers/workplace_assistant/data/train.jsonl"
eval_dataset_path: "/home/ubuntu/Gym/resources_servers/workplace_assistant/data/validation.jsonl"
# Data - Update these to your own paths!
dataset_path: "/home/ubuntu/trl/Gym/resources_servers/workplace_assistant/data/train.jsonl"
eval_dataset_path: "/home/ubuntu/trl/Gym/resources_servers/workplace_assistant/data/validation_100.jsonl"

# NeMo Gym server configs (relative to Gym repo root)
gym_configs:
- resources_servers/workplace_assistant/configs/workplace_assistant.yaml
- responses_api_models/vllm_model/configs/vllm_model_for_training.yaml

# Logging
output_dir: "outputs/nemo_gym"
task: "workplace" # just used in wandb run name
report_to: "wandb"
task: "workplace"
report_to: "none"
project_name: "trl-nemo-gym"
log_completions: true
num_completions_to_print: 2
@@ -18,20 +18,23 @@ learning_rate: 1.0e-5
max_steps: 1000
num_generations: 8
per_device_train_batch_size: 1
gradient_accumulation_steps: 32
max_completion_length: 16384
warmup_steps: 5
per_device_eval_batch_size: 1
gradient_accumulation_steps: 64
max_completion_length: 10000
warmup_steps: 3
lr_scheduler_type: "linear"
optim: "adamw_torch_fused"
weight_decay: 0.0
weight_decay: 0.01
vllm_importance_sampling_correction: true

# Inference sampling parameters
temperature: 1.0
top_p: 0.999

# Checkpointing and Eval
# Checkpointing and eval
save_steps: 10
eval_strategy: "steps"
eval_steps: 10

eval_steps: 10