
Commit 9ea9f1a

[docs] Add gym + rl design integration

Signed-off-by: Ananth Subramaniam <[email protected]>

1 parent dacac7e

2 files changed: +244 -0 lines changed

Lines changed: 243 additions & 0 deletions
@@ -0,0 +1,243 @@

# NeMo Gym Integration

This document describes how NeMo RL integrates with [NeMo Gym](https://docs.nvidia.com/nemo/gym/latest/index.html) for multi-step and multi-turn reinforcement learning training.

## Overview

NeMo Gym provides HTTP-based training environments for LLMs. NeMo RL exposes its vLLM generation engine as an OpenAI-compatible HTTP server, which NeMo Gym calls during rollouts, enabling:

- **Decoupled architecture**: Environments don't need direct access to model internals
- **Multi-step/multi-turn support**: Agents can orchestrate complex interactions with tools
- **Refit compatibility**: NeMo RL's weight synchronization works transparently

## Configuration

To enable NeMo Gym integration, add the following to your NeMo RL config:

```yaml
policy:
  generation:
    backend: vllm
    vllm_cfg:
      async_engine: true        # Required for HTTP server support
      expose_http_server: true  # Exposes /v1/chat/completions endpoint

env:
  should_use_nemo_gym: true  # Enables NeMo Gym integration
  nemo_gym:
    # NeMo Gym config paths and settings
    config_paths:
      - resources_servers/math/configs/math.yaml
      - responses_api_agents/simple_agent/configs/simple_agent.yaml
```

### Version Requirements

NeMo Gym runs as a Ray actor within NeMo RL's Ray cluster, so the same Ray and Python versions must be used in both environments.
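
Because the actor joins NeMo RL's existing cluster, a mismatch surfaces at connect time. A minimal illustration; the check shown is Ray's own connect-time behavior, not NeMo RL code:

```python
import ray

# Joining the existing cluster raises at ray.init() time if the driver's
# Ray or Python version differs from what the cluster was started with.
ray.init(address="auto")
print(ray.__version__)
```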

## Architecture Overview

```{mermaid}
%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
flowchart LR
    subgraph RL["NeMo RL"]
        GRPO["GRPO Loop"]
        vLLM["vLLM + HTTP"]
        Bridge["NemoGym<br/>Actor"]
    end

    subgraph Gym["NeMo Gym"]
        Agent["Agent"]
        Model["Model<br/>(Proxy)"]
        Resources["Resources"]
    end

    GRPO -->|refit| vLLM
    GRPO -->|run_rollouts| Bridge
    Bridge -->|spawns| Gym
    Agent <--> Model
    Agent <--> Resources
    Model -->|HTTP| vLLM

    style RL fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style Gym fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
```

**Color coding**: Blue = NeMo RL code (`nemo_rl/`), Orange = NeMo Gym code (`nemo_gym/`)

## The NemoGym Actor

The integration is handled by the `NemoGym` Ray actor at `nemo_rl/environments/nemo_gym.py`:

1. **Created by NeMo RL** during training setup via `NemoGym.remote(config)`
2. **Joins the existing Ray cluster** that NeMo RL already initialized
3. **Spawns NeMo Gym servers** as OS subprocesses (Head, Agent, Model, Resources)
4. **Injects vLLM base URLs** so NeMo Gym's Model Server knows where to proxy requests
5. **Exposes `run_rollouts()`** as the entry point for the training loop (see the sketch after this list)
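
A condensed view of that lifecycle from the training script's side. The constructor arguments and the variables `nemo_gym_config`, `vllm_base_urls`, and `batch` are hypothetical stand-ins for the values NeMo RL assembles:

```python
import ray
from nemo_rl.environments.nemo_gym import NemoGym

# Steps 1-2: create the actor inside the Ray cluster NeMo RL already initialized.
gym = NemoGym.remote(nemo_gym_config, vllm_base_urls)

# Steps 3-4 happen inside the actor: it spawns the NeMo Gym servers as OS
# subprocesses and points the Model Server at the vLLM base URLs.

# Step 5: the training loop drives rollouts through a single remote entry point.
results = ray.get(gym.run_rollouts.remote(batch))
```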

```{mermaid}
%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
flowchart LR
    subgraph RL["NeMo RL"]
        GRPO["GRPO Loop<br/><i>grpo.py</i>"]
        Actor["NemoGym Actor<br/><i>nemo_rl/environments/nemo_gym.py</i>"]
    end

    subgraph Gym["NeMo Gym"]
        RCH["RolloutCollectionHelper<br/><i>nemo_gym/rollout_collection.py</i>"]
        Agent["Agent Server<br/><i>responses_api_agents/*/app.py</i>"]
    end

    GRPO -->|"1. run_rollouts.remote(batch)"| Actor
    Actor -->|"2. POST /run"| Agent
    Agent -->|"3. orchestrates rollout"| RCH
    RCH -->|"4. returns results"| Actor
    Actor -->|"5. returns to training"| GRPO

    style RL fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style Gym fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
```

## vLLM HTTP Server

**NeMo Gym does not run its own vLLM engine.** The Model Server is purely an HTTP proxy:

| Aspect | NeMo RL vLLM Worker | NeMo Gym Model Server |
|--------|---------------------|-----------------------|
| **Engine** | Runs actual vLLM `AsyncLLM` | No engine; HTTP proxy only |
| **GPU** | Holds model weights | No GPU required |
| **Endpoints** | `/v1/chat/completions`, `/tokenize` | `/v1/responses` |
| **Role** | Inference | API translation, forwards requests |

Data-parallel vLLM workers each expose their own HTTP server. NeMo Gym's Model Server load-balances requests across them; a minimal round-robin sketch follows.
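
This sketch illustrates the fan-out idea only; the worker addresses are made up and it is not NeMo Gym's actual implementation:

```python
import itertools
import httpx

# One base URL per data-parallel vLLM worker (addresses are illustrative).
VLLM_BASE_URLS = ["http://10.0.0.1:8000", "http://10.0.0.2:8000"]
_next_url = itertools.cycle(VLLM_BASE_URLS)

async def forward_chat_completion(payload: dict) -> dict:
    """Round-robin an OpenAI-style chat completion request across workers."""
    url = f"{next(_next_url)}/v1/chat/completions"
    async with httpx.AsyncClient(timeout=None) as client:
        resp = await client.post(url, json=payload)
        resp.raise_for_status()
        return resp.json()
```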

## Initialization Sequence

```{mermaid}
%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
sequenceDiagram
    autonumber
    box rgb(227, 242, 253) NeMo RL
        participant RL as Training Script
        participant Ray as Ray Cluster
        participant vLLM as vLLM Workers
        participant Bridge as NemoGym Actor
    end
    box rgb(255, 243, 224) NeMo Gym
        participant Servers as NeMo Gym Servers
    end

    RL->>Ray: Initialize Ray cluster
    RL->>vLLM: Create vLLM workers with HTTP servers
    vLLM-->>RL: Return base URLs (one per DP rank)
    RL->>Bridge: NemoGym.remote(config, base_urls)
    Note over Bridge: Reuses existing Ray cluster
    Bridge->>Servers: Spawn subprocess servers
    Servers-->>Bridge: Health check OK
    Bridge-->>RL: Ready for rollouts
```
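
The "health check OK" handshake can be pictured as a simple polling loop. The `/health` path and timing below are assumptions, not NeMo Gym's exact endpoints:

```python
import time
import httpx

def wait_until_healthy(base_urls: list[str], timeout_s: float = 120.0) -> None:
    """Block until every spawned server answers its (assumed) health check."""
    deadline = time.monotonic() + timeout_s
    pending = set(base_urls)
    while pending and time.monotonic() < deadline:
        for url in list(pending):
            try:
                if httpx.get(f"{url}/health", timeout=2.0).status_code == 200:
                    pending.discard(url)
            except httpx.HTTPError:
                pass  # server still starting; retry on the next pass
        time.sleep(1.0)
    if pending:
        raise TimeoutError(f"servers never became healthy: {sorted(pending)}")
```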

## Training Loop Control Flow

```{mermaid}
%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
sequenceDiagram
    autonumber
    box rgb(227, 242, 253) NeMo RL
        participant GRPO as GRPO Loop
        participant Policy as Policy Workers
        participant vLLM as vLLM HTTP
        participant Bridge as NemoGym Actor
    end
    box rgb(255, 243, 224) NeMo Gym
        participant Agent as Agent Server
        participant Model as Model Server
    end

    GRPO->>vLLM: Refit (sync weights)
    GRPO->>Bridge: run_rollouts.remote(batch)
    Bridge->>Agent: POST /run
    Agent->>Model: POST /v1/responses
    Model->>vLLM: POST /v1/chat/completions
    vLLM-->>Model: Response
    Model-->>Agent: Responses API format
    Agent-->>Bridge: Results + rewards
    Bridge-->>GRPO: Token IDs, logprobs, rewards
    GRPO->>Policy: Compute loss and train
```

### Key Steps

| Step | Location | Description |
|------|----------|-------------|
| **Refit** | NeMo RL | Synchronizes policy weights to vLLM workers. For async RL, refit timing may differ; see {doc}`generation` for details. |
| **run_rollouts.remote()** | NeMo RL | Ray remote call from the GRPO loop to the NemoGym actor |
| **POST /run** | NeMo RL → NeMo Gym | HTTP request from the NemoGym actor to the Agent Server subprocess |
| **Rollout orchestration** | NeMo Gym | Agent calls the Model Server and Resources Server via HTTP |
| **POST /v1/chat/completions** | NeMo Gym → NeMo RL | Model Server proxies to NeMo RL's vLLM HTTP endpoint |
| **Result processing** | NeMo RL | NemoGym actor extracts token IDs, logprobs, and rewards |

### Async Result Processing

The NemoGym actor uses an **as-completed** pattern to overlap waiting with post-processing:

1. **Results return out of order**: Rollouts complete at different times depending on conversation length and tool calls. Rather than waiting for the full batch, the actor processes each result as soon as it completes.
2. **Immediate post-processing**: As each rollout completes, the actor immediately extracts token IDs and logprobs. This overlaps CPU work with network I/O from slower rollouts still in flight.
3. **Reordering at the end**: Each example carries an index. After all results are collected, they are reordered to match the original batch order before returning to the training loop.

This pattern maximizes throughput by keeping the CPU busy while waiting for network responses; a condensed sketch follows.
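
The sketch below uses `asyncio.as_completed` with hypothetical helpers (`run_one_rollout`, `extract_token_ids_and_logprobs`) standing in for the actor's real rollout and extraction code:

```python
import asyncio

async def run_rollouts(examples: list[dict]) -> list[dict]:
    # Tag each in-flight rollout with its original batch index;
    # run_one_rollout(i, ex) is assumed to return (i, raw_result).
    tasks = [asyncio.create_task(run_one_rollout(i, ex)) for i, ex in enumerate(examples)]

    processed: list[tuple[int, dict]] = []
    for fut in asyncio.as_completed(tasks):
        idx, raw = await fut  # whichever rollout finishes next
        # Post-process immediately, overlapping this CPU work with
        # network I/O from the rollouts still in flight.
        processed.append((idx, extract_token_ids_and_logprobs(raw)))

    # Restore the original batch order before handing back to the training loop.
    processed.sort(key=lambda pair: pair[0])
    return [result for _, result in processed]
```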

## Data Format Translation

```{mermaid}
%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
flowchart LR
    subgraph RL1["NeMo RL (Input)"]
        Datum["DatumSpec"]
    end

    subgraph Gym["NeMo Gym"]
        Example["Example Dict"]
        ReqResp["Responses API"]
        ReqChat["Chat Completions"]
    end

    subgraph RL2["NeMo RL (Output)"]
        Result["Result"]
    end

    Datum -->|"convert"| Example
    Example --> ReqResp
    ReqResp -->|"translate"| ReqChat
    ReqChat -->|"vLLM"| ReqResp
    ReqResp --> Example
    Example -->|"extract"| Result

    style RL1 fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style RL2 fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style Gym fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
```

### Format Differences

| Format | Owner | Contents |
|--------|-------|----------|
| **DatumSpec** | NeMo RL | Training-focused: `prompt`, `prompt_token_ids`, task metadata for loss computation |
| **Example Dict** | NeMo Gym | Environment-focused: `responses_create_params` (OpenAI format), `expected` answer for verification |
| **Responses API** | NeMo Gym | OpenAI Responses API format with `input`, `tools`, multi-turn conversation |
| **Chat Completions** | NeMo RL vLLM | OpenAI Chat Completions format, the actual inference call |

The Model Server handles Responses API ↔ Chat Completions translation (sketched after this list), including:
- Converting message formats
- Extracting reasoning content from think tags
- Attaching token ID information for training
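
A deliberately simplified version of the request-side translation. Real Responses API payloads carry more item types (tool calls, reasoning items, plain-string `input`) than this handles:

```python
def responses_to_chat_completions(responses_req: dict) -> dict:
    """Map a (simplified) OpenAI Responses API request onto Chat Completions."""
    messages = []
    # The Responses API carries the system prompt as "instructions".
    if responses_req.get("instructions"):
        messages.append({"role": "system", "content": responses_req["instructions"]})
    for item in responses_req.get("input", []):
        # Assumes plain role/content message items only.
        messages.append({"role": item["role"], "content": item["content"]})
    return {
        "model": responses_req.get("model"),
        "messages": messages,
        "tools": responses_req.get("tools"),
    }
```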

## Tokenization and On-Policy Corrections

Token IDs are extracted at the NeMo RL vLLM layer via the `/tokenize` endpoint. This ensures:
- Tokenization matches the exact model and tokenizer used for generation
- No re-tokenization drift between generation and training
For details on on-policy token ID handling, see {doc}`../guides/environments` and the [NeMo Gym on-policy corrections documentation](https://docs.nvidia.com/nemo/gym/latest/contribute/rl-framework-integration/openai-compatible-http-server-on-policy-correction.html).

docs/index.md

Lines changed: 1 addition & 0 deletions
@@ -258,6 +258,7 @@ design-docs/fsdp2-parallel-plan.md
 design-docs/training-backends.md
 design-docs/sequence-packing-and-dynamic-batching.md
 design-docs/env-vars.md
+design-docs/nemo-gym-integration.md
 ```
 
 ```{toctree}