# NeMo Gym Integration

This document describes how NeMo RL integrates with [NeMo Gym](https://docs.nvidia.com/nemo/gym/latest/index.html) for multi-step and multi-turn reinforcement learning training.

## Overview

NeMo Gym provides HTTP-based training environments for LLMs. NeMo RL exposes its vLLM generation engine as an OpenAI-compatible HTTP server, which NeMo Gym calls during rollouts, enabling:

- **Decoupled architecture**: Environments don't need direct access to model internals
- **Multi-step/multi-turn support**: Agents can orchestrate complex interactions with tools
- **Refit compatibility**: NeMo RL's weight synchronization works transparently

## Configuration

To enable NeMo Gym integration, add the following to your NeMo RL config:

```yaml
policy:
  generation:
    backend: vllm
    vllm_cfg:
      async_engine: true        # Required for HTTP server support
      expose_http_server: true  # Exposes /v1/chat/completions endpoint

env:
  should_use_nemo_gym: true  # Enables NeMo Gym integration
  nemo_gym:
    # NeMo Gym config paths and settings
    config_paths:
      - resources_servers/math/configs/math.yaml
      - responses_api_agents/simple_agent/configs/simple_agent.yaml
```

### Version Requirements

NeMo Gym runs as a Ray actor within NeMo RL's Ray cluster, so the same Ray and Python versions must be used in both environments.

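A quick way to confirm this in each environment is to print both versions, for example:

```python
import sys

import ray

# Run this in both the NeMo RL and NeMo Gym environments; the reported
# Ray and Python versions must match.
print("ray:", ray.__version__)
print("python:", ".".join(map(str, sys.version_info[:3])))
```
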
## Architecture Overview

```{mermaid}
%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
flowchart LR
    subgraph RL["NeMo RL"]
        GRPO["GRPO Loop"]
        vLLM["vLLM + HTTP"]
        Bridge["NemoGym<br/>Actor"]
    end

    subgraph Gym["NeMo Gym"]
        Agent["Agent"]
        Model["Model<br/>(Proxy)"]
        Resources["Resources"]
    end

    GRPO -->|refit| vLLM
    GRPO -->|run_rollouts| Bridge
    Bridge -->|spawns| Gym
    Agent <--> Model
    Agent <--> Resources
    Model -->|HTTP| vLLM

    style RL fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style Gym fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
```

**Color coding**: Blue = NeMo RL code (`nemo_rl/`), Orange = NeMo Gym code (`nemo_gym/`)

## The NemoGym Actor

The integration is handled by the `NemoGym` Ray actor at `nemo_rl/environments/nemo_gym.py`:

1. **Created by NeMo RL** during training setup via `NemoGym.remote(config)`
2. **Joins the existing Ray cluster** that NeMo RL already initialized
3. **Spawns NeMo Gym servers** as OS subprocesses (Head, Agent, Model, Resources)
4. **Injects vLLM base URLs** so NeMo Gym's Model Server knows where to proxy requests
5. **Exposes `run_rollouts()`** as the entry point for the training loop

```{mermaid}
%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
flowchart LR
    subgraph RL["NeMo RL"]
        GRPO["GRPO Loop<br/><i>grpo.py</i>"]
        Actor["NemoGym Actor<br/><i>nemo_rl/environments/nemo_gym.py</i>"]
    end

    subgraph Gym["NeMo Gym"]
        RCH["RolloutCollectionHelper<br/><i>nemo_gym/rollout_collection.py</i>"]
        Agent["Agent Server<br/><i>responses_api_agents/*/app.py</i>"]
    end

    GRPO -->|"1. run_rollouts.remote(batch)"| Actor
    Actor -->|"2. POST /run"| Agent
    Agent -->|"3. orchestrates rollout"| RCH
    RCH -->|"4. returns results"| Actor
    Actor -->|"5. returns to training"| GRPO

    style RL fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style Gym fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
```

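From the training script's perspective, this lifecycle looks roughly like the sketch below. The import path and the `NemoGym.remote(config, base_urls)` call mirror the description above, while the config contents, base URLs, and batch format are simplified placeholders rather than the actual NeMo RL data structures.

```python
import ray

from nemo_rl.environments.nemo_gym import NemoGym  # the actor described above

# Placeholder values for illustration; in practice the config comes from the
# env.nemo_gym section of the YAML config, and the base URLs are reported by
# the vLLM workers (one per data-parallel rank).
nemo_gym_config = {"config_paths": ["resources_servers/math/configs/math.yaml"]}
vllm_base_urls = ["http://127.0.0.1:8000"]

ray.init(address="auto")  # join the Ray cluster NeMo RL already initialized

# Steps 1-4: create the actor; it spawns the NeMo Gym server subprocesses and
# learns where to proxy model requests.
gym_actor = NemoGym.remote(nemo_gym_config, vllm_base_urls)

# Step 5: the training loop drives rollouts through a single remote call.
batch = [{"idx": 0, "prompt": "What is 2 + 2?"}]  # placeholder batch format
results = ray.get(gym_actor.run_rollouts.remote(batch))
```
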
## vLLM HTTP Server

**NeMo Gym does not run its own vLLM engine.** The Model Server is purely an HTTP proxy:

| Aspect | NeMo RL vLLM Worker | NeMo Gym Model Server |
|--------|---------------------|-----------------------|
| **Engine** | Runs the actual vLLM `AsyncLLM` | No engine; HTTP proxy only |
| **GPU** | Holds model weights | No GPU required |
| **Endpoints** | `/v1/chat/completions`, `/tokenize` | `/v1/responses` |
| **Role** | Inference | API translation, forwards requests |

Data-parallel vLLM workers each expose their own HTTP server. NeMo Gym's Model Server load-balances requests across them.

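Because the exposed endpoint speaks the standard OpenAI Chat Completions protocol, it can be exercised directly. A minimal sketch, assuming a placeholder host, port, and model name:

```python
import requests

# Placeholder base URL; in practice NeMo RL reports one base URL per DP rank.
base_url = "http://127.0.0.1:8000"

# Standard OpenAI-style Chat Completions payload; "policy" is a placeholder
# model name, not a value NeMo RL requires.
payload = {
    "model": "policy",
    "messages": [{"role": "user", "content": "What is 2 + 2?"}],
    "max_tokens": 64,
}

resp = requests.post(f"{base_url}/v1/chat/completions", json=payload, timeout=60)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```
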
## Initialization Sequence

```{mermaid}
%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
sequenceDiagram
    autonumber
    box rgb(227, 242, 253) NeMo RL
        participant RL as Training Script
        participant Ray as Ray Cluster
        participant vLLM as vLLM Workers
        participant Bridge as NemoGym Actor
    end
    box rgb(255, 243, 224) NeMo Gym
        participant Servers as NeMo Gym Servers
    end

    RL->>Ray: Initialize Ray cluster
    RL->>vLLM: Create vLLM workers with HTTP servers
    vLLM-->>RL: Return base URLs (one per DP rank)
    RL->>Bridge: NemoGym.remote(config, base_urls)
    Note over Bridge: Reuses existing Ray cluster
    Bridge->>Servers: Spawn subprocess servers
    Servers-->>Bridge: Health check OK
    Bridge-->>RL: Ready for rollouts
```

## Training Loop Control Flow

```{mermaid}
%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
sequenceDiagram
    autonumber
    box rgb(227, 242, 253) NeMo RL
        participant GRPO as GRPO Loop
        participant Policy as Policy Workers
        participant vLLM as vLLM HTTP
        participant Bridge as NemoGym Actor
    end
    box rgb(255, 243, 224) NeMo Gym
        participant Agent as Agent Server
        participant Model as Model Server
    end

    GRPO->>vLLM: Refit (sync weights)
    GRPO->>Bridge: run_rollouts.remote(batch)
    Bridge->>Agent: POST /run
    Agent->>Model: POST /v1/responses
    Model->>vLLM: POST /v1/chat/completions
    vLLM-->>Model: Response
    Model-->>Agent: Responses API format
    Agent-->>Bridge: Results + rewards
    Bridge-->>GRPO: Token IDs, logprobs, rewards
    GRPO->>Policy: Compute loss and train
```

### Key Steps

| Step | Location | Description |
|------|----------|-------------|
| **Refit** | NeMo RL | Synchronizes policy weights to vLLM workers. For async RL, refit timing may differ; see {doc}`generation` for details. |
| **run_rollouts.remote()** | NeMo RL | Ray remote call from the GRPO loop to the NemoGym actor |
| **POST /run** | NeMo RL → NeMo Gym | HTTP request from the NemoGym actor to the Agent Server subprocess |
| **Rollout orchestration** | NeMo Gym | Agent calls the Model Server and Resources Server via HTTP |
| **POST /v1/chat/completions** | NeMo Gym → NeMo RL | Model Server proxies to NeMo RL's vLLM HTTP endpoint |
| **Result processing** | NeMo RL | NemoGym actor extracts token IDs, logprobs, and rewards |

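Putting the table together, one GRPO step can be sketched as below. Only `run_rollouts.remote()` corresponds to the documented entry point; `refit_generation()` and `train_step()` are placeholder names standing in for NeMo RL internals, and `policy`, `gym_actor`, and `batch` are assumed to be set up as in the earlier sketches.

```python
import ray


def grpo_step(policy, gym_actor, batch):
    # 1. Refit: push the latest policy weights into the vLLM workers
    #    (placeholder method name).
    policy.refit_generation()

    # 2-5. Rollouts: NeMo Gym drives the Agent/Model/Resources servers and
    #    ultimately calls back into NeMo RL's vLLM HTTP endpoint.
    rollouts = ray.get(gym_actor.run_rollouts.remote(batch))

    # 6. Result processing and training: token IDs, logprobs, and rewards
    #    feed the loss computation on the policy workers (placeholder method).
    return policy.train_step(rollouts)
```
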
### Async Result Processing

The NemoGym actor uses an **as-completed** pattern to overlap waiting with post-processing:

1. **Results return out of order**: Rollouts complete at different times depending on conversation length and tool calls. Rather than waiting for all results, the actor processes each result as soon as it completes.

2. **Immediate post-processing**: As each rollout completes, the actor immediately extracts token IDs and logprobs. This overlaps CPU work with network I/O from slower rollouts still in flight.

3. **Reordering at the end**: Each example carries an index. After all results are collected, results are reordered to match the original batch order before returning to the training loop.

This pattern maximizes throughput by keeping the CPU busy while waiting for network responses.

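A minimal, self-contained sketch of this pattern using `asyncio.as_completed`; the rollout and post-processing functions are placeholders for the actual HTTP round trip and token extraction:

```python
import asyncio


async def run_one_rollout(idx, example):
    """Placeholder for the HTTP round trip through the Agent Server."""
    await asyncio.sleep(0.1 * (idx % 3))  # rollouts finish out of order
    return idx, {"tokens": [1, 2, 3], "logprobs": [-0.1, -0.2, -0.3], "reward": 1.0}


def postprocess(raw):
    """Placeholder for extracting token IDs and logprobs from a finished rollout."""
    return raw


async def run_rollouts(batch):
    tasks = [run_one_rollout(i, example) for i, example in enumerate(batch)]
    results = [None] * len(batch)
    # Handle each rollout as soon as it finishes, overlapping post-processing
    # (CPU) with rollouts still waiting on network I/O.
    for finished in asyncio.as_completed(tasks):
        idx, raw = await finished
        results[idx] = postprocess(raw)  # place by the carried index
    return results  # back in original batch order


batch = [{"prompt": f"question {i}"} for i in range(4)]
print(asyncio.run(run_rollouts(batch)))
```
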
## Data Format Translation

```{mermaid}
%%{init: {'theme': 'default', 'themeVariables': { 'lineColor': '#5c6bc0', 'primaryTextColor': '#333'}}}%%
flowchart LR
    subgraph RL1["NeMo RL (Input)"]
        Datum["DatumSpec"]
    end

    subgraph Gym["NeMo Gym"]
        Example["Example Dict"]
        ReqResp["Responses API"]
        ReqChat["Chat Completions"]
    end

    subgraph RL2["NeMo RL (Output)"]
        Result["Result"]
    end

    Datum -->|"convert"| Example
    Example --> ReqResp
    ReqResp -->|"translate"| ReqChat
    ReqChat -->|"vLLM"| ReqResp
    ReqResp --> Example
    Example -->|"extract"| Result

    style RL1 fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style RL2 fill:#e3f2fd,stroke:#1565c0,stroke-width:2px
    style Gym fill:#fff3e0,stroke:#ef6c00,stroke-width:2px
```

### Format Differences

| Format | Owner | Contents |
|--------|-------|----------|
| **DatumSpec** | NeMo RL | Training-focused: `prompt`, `prompt_token_ids`, task metadata for loss computation |
| **Example Dict** | NeMo Gym | Environment-focused: `responses_create_params` (OpenAI format), `expected` answer for verification |
| **Responses API** | NeMo Gym | OpenAI Responses API format with `input`, `tools`, multi-turn conversation |
| **Chat Completions** | NeMo RL vLLM | OpenAI Chat Completions format, the actual inference call |

The Model Server handles Responses API ↔ Chat Completions translation, including:

- Converting message formats
- Extracting reasoning content from think tags
- Attaching token ID information for training

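As a rough illustration of the request-side translation, the sketch below maps a simplified Responses API payload onto a Chat Completions payload. The field handling is an assumption for a single-turn text exchange; the real Model Server also deals with tools, multi-turn state, and reasoning content as noted above.

```python
def responses_to_chat_completions(responses_request: dict) -> dict:
    """Illustrative translation of a simplified Responses API request into a
    Chat Completions request; real requests carry tools, prior turns, etc."""
    messages = []
    for item in responses_request.get("input", []):
        # Each input item is assumed to be a plain role/content message here.
        messages.append({"role": item["role"], "content": item["content"]})
    return {
        "model": responses_request.get("model", "policy"),
        "messages": messages,
        "max_tokens": responses_request.get("max_output_tokens", 256),
    }


# Simplified example request in (assumed) Responses API shape.
request = {
    "model": "policy",
    "input": [{"role": "user", "content": "What is 2 + 2?"}],
    "max_output_tokens": 64,
}
print(responses_to_chat_completions(request))
```
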
## Tokenization and On-Policy Corrections

Token IDs are extracted at the NeMo RL vLLM layer via the `/tokenize` endpoint. This ensures:

- Tokenization matches the exact model and tokenizer used for generation
- No re-tokenization drift between generation and training

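As an illustration, querying a worker's `/tokenize` route directly might look like the snippet below; the URL and model name are placeholders, and the exact payload shape follows vLLM's OpenAI-compatible server, so it can vary by version.

```python
import requests

# Placeholder base URL for one vLLM worker's HTTP server.
base_url = "http://127.0.0.1:8000"

# Payload shape assumed from vLLM's OpenAI-compatible /tokenize route;
# "policy" is a placeholder model name.
resp = requests.post(
    f"{base_url}/tokenize",
    json={"model": "policy", "prompt": "What is 2 + 2?"},
    timeout=30,
)
resp.raise_for_status()
print(resp.json().get("tokens"))  # token IDs from the serving tokenizer
```
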
For details on on-policy token ID handling, see {doc}`../guides/environments` and the [NeMo Gym on-policy corrections documentation](https://docs.nvidia.com/nemo/gym/latest/contribute/rl-framework-integration/openai-compatible-http-server-on-policy-correction.html).