
Nemo gym integration #3516

Open
winglian wants to merge 10 commits into main from nemo-gym-integration

Conversation

@winglian
Collaborator

@winglian winglian commented Mar 19, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added NeMo Gym integration for reinforcement learning training with customizable reward environments and multi-turn support
    • Added OpenAI-compatible API endpoints (/v1/models, /v1/chat/completions) to vLLM server
  • Bug Fixes

    • Fixed GRPO configuration handling to prevent invalid parameters in standard (non-async) training mode
    • Improved LoRA adapter synchronization with automatic fallback mechanism
    • Fixed entropy computation in fused decoding scenarios
  • Documentation

    • Added comprehensive NeMo Gym integration guide with setup instructions and example configurations

@coderabbitai
Contributor

coderabbitai bot commented Mar 19, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cb486e25-2331-4f84-8af1-a6393db8abcf

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This PR introduces a comprehensive NeMo Gym integration for GRPO RL training with argument validation, data producer orchestration, reward functions, and server lifecycle management. It also fixes GRPO builder field filtering for async configs, improves LoRA synchronization and entropy handling in async trainers, and adds OpenAI-compatible API endpoints to vLLM serving.

Changes

Cohort / File(s) / Summary

  • GRPO Configuration & Builder (src/axolotl/core/builders/rl.py): Conditionally filters FastAsyncGRPOConfig-only dataclass fields from the standard GRPO training args kwargs in the non-async case to prevent unsupported parameters.
  • GRPO Async Trainer Updates (src/axolotl/core/trainers/grpo/async_trainer.py): Enhanced LoRA synchronization with a fallback from the standardized /v1/load_lora_adapter endpoint to the legacy /set_lora_adapter/ endpoint; wrapped reset_prefix_cache() in exception handling; fixed entropy tensor handling when using the Liger fused softmax path.
  • NeMo Gym Integration - Core Package (src/axolotl/integrations/nemo_gym/__init__.py, args.py, plugin.py): New integration package with argument validation (enforcing config paths and datasets when enabled) and a main plugin orchestrating server lifecycle, dataset loading, reward injection, and trainer integration.
  • NeMo Gym Integration - Data & Rewards (data_producer.py, dataset.py, rewards.py): Data producer extending GRPODataProducer to generate rollouts via NeMo Gym agent /run endpoints; dataset loader parsing JSONL with server/agent routing; dual reward functions (reward_env for multi-turn passthrough, reward_nemo_gym_verify for single-turn verification).
  • NeMo Gym Integration - Server & Multi-turn (server.py, multi_turn.py): Server lifecycle management (repo setup, venv initialization, process startup, readiness checks, endpoint discovery); multi-turn rollout orchestration via async agent HTTP calls with response parsing and tensor padding.
  • NeMo Gym Documentation & Examples (README.md, nemo_gym_*.yaml): Integration guide with setup/execution steps, dependency pinning, data format specs, and three example configurations (multi-environment, multi-turn, Sudoku).
  • vLLM Server Script Enhancements (src/axolotl/scripts/vllm_serve_lora.py): Added OpenAI-compatible HTTP endpoints (GET /v1/models, POST /v1/chat/completions) with logprobs mapping; made GuidedDecodingParams optional with a fallback for older vLLM versions.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~90 minutes

Possibly related PRs

  • Support for Async GRPO #3486: Addresses FastAsyncGRPOConfig dataclass field filtering in GRPO builder for async training argument construction.

Suggested reviewers

  • djsaunde
  • NanoCode012
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Nemo gym integration' directly describes the main change—adding a comprehensive NeMo Gym integration to the Axolotl GRPO training framework.
Docstring Coverage ✅ Passed Docstring coverage is 80.43% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@winglian winglian requested a review from NanoCode012 March 19, 2026 09:28
Collaborator

@NanoCode012 NanoCode012 left a comment

Is this work an addition on top of async GRPO?

I'm not clear on the architecture: is it a replacement for vLLM generation?

Also, how do the uv packages interact with axolotl if we were to use pip?

@winglian winglian added the scheduled_release This PR is slated for the upcoming release label Mar 21, 2026
@winglian winglian force-pushed the nemo-gym-integration branch from 4eaaf8a to d8da708 on March 22, 2026 02:48
@winglian winglian marked this pull request as ready for review March 22, 2026 04:48
@github-actions
Contributor

github-actions bot commented Mar 22, 2026

📖 Documentation Preview: https://69c02f066e224299aeaf1509--resonant-treacle-0fd729.netlify.app

Deployed on Netlify from commit 64746db

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

🧹 Nitpick comments (3)
src/axolotl/integrations/nemo_gym/README.md (1)

231-252: Consider adding language specifiers to fenced code blocks for diagram sections.

The ASCII diagrams at lines 231-235 and 238-252 could use text as the language identifier to satisfy markdown linters, though this is a minor stylistic preference.

♻️ Optional fix
-```
+```text
 axolotl train → GRPO Trainer generates completions
-```
+```text
 ┌─────────────┐     ┌──────────────┐     ┌──────────────────┐
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `src/axolotl/integrations/nemo_gym/README.md` around lines 231 - 252, Update
the two fenced code blocks that render the ASCII diagrams (the block containing
"axolotl train → GRPO Trainer generates completions" and the multi-turn diagram
block beginning with the box diagram) to include a language specifier by
changing the opening backticks to ```text so markdown linters recognize them as
plain text; keep the diagram content unchanged and only add the "text" tag to
the existing triple-backtick delimiters.
src/axolotl/core/trainers/grpo/async_trainer.py (1)

985-988: Consider logging the suppressed exception for debuggability.

The static analysis correctly flags the bare try-except-pass pattern. While reset_prefix_cache() failure may not be critical, silently swallowing all exceptions makes debugging harder when issues do occur.

♻️ Proposed fix to log the exception
             # Reset prefix cache after adapter update
             try:
                 vllm_client.reset_prefix_cache()
-            except Exception:
-                pass  # Not critical
+            except Exception as exc:
+                logger.debug("reset_prefix_cache failed (non-critical): %s", exc)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `src/axolotl/core/trainers/grpo/async_trainer.py` around lines 985 - 988, The
bare except/pass around vllm_client.reset_prefix_cache() in async_trainer.py
hides errors; change it to catch Exception as e and log the failure (use the
module/class logger, e.g., self.logger or the existing logger in AsyncTrainer) —
call logger.exception or logger.warning with the exception details and a short
context message referencing reset_prefix_cache() so failures are visible for
debugging.
src/axolotl/integrations/nemo_gym/dataset.py (1)

80-81: Specify explicit UTF-8 encoding for cross-platform compatibility.

Opening files without explicit encoding uses the system default, which may cause issues on Windows systems or with non-ASCII content.

♻️ Proposed fix
-        with open(path) as f:
+        with open(path, encoding="utf-8") as f:
             lines = f.readlines()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `src/axolotl/integrations/nemo_gym/dataset.py` around lines 80 - 81, The file
read uses bare open(path) which relies on system default encoding; update the
call to explicitly set UTF-8 (e.g. change with open(path) as f: to with
open(path, encoding="utf-8") as f:) to ensure cross-platform handling of
non-ASCII content; optionally include errors="replace" if you want to tolerate
malformed bytes. Ensure you modify the open call that produces lines =
f.readlines() in src/axolotl/integrations/nemo_gym/dataset.py.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `src/axolotl/integrations/nemo_gym/args.py`:
- Around line 17-40: The validator runs too early/late leaving nullable fields
unresolved; add a root validator that runs before field validation to compute
effective defaults: create a @root_validator(pre=True) method (e.g.,
resolve_nemo_gym_defaults) in the same Pydantic model that checks incoming
values and sets nemo_gym_enabled=True if None, and sets nemo_gym_auto_clone=True
when nemo_gym_auto_clone is None and the effective nemo_gym_enabled is True (and
ensure nemo_gym_dir stays None if unset). Apply the same root_validator pattern
to the other similar block (the fields around 120-136) so the effective defaults
are resolved consistently before validation.

In `src/axolotl/integrations/nemo_gym/data_producer.py`:
- Around line 99-107: The code is dropping item["agent_ref"] by taking only
verify_extra for a prompt, causing _call_agents() to lose routing metadata;
change the logic in data_producer.py so you first fetch the full dataset entry
(e.g., full_item = self._dataset_lookup.get(prompt_text, {})), then set item =
full_item.get("verify_extra", {}) and if full_item contains "agent_ref" attach
it to item (item["agent_ref"] = full_item["agent_ref"]); when using the fallback
default payload for responses_create_params also ensure you preserve or include
agent_ref from full_item so _call_agents() can correctly choose the /run
endpoint.

In `src/axolotl/integrations/nemo_gym/multi_turn.py`:
- Around line 66-84: The loop uses fuzzy substring matching and then strips
agent routing info by appending item.get("verify_extra", item); change the
matching to exact/full match (replace "if isinstance(key, str) and key in
prompt_str" with equality or full-text comparison such as "if isinstance(key,
str) and prompt_str == key") to avoid selecting wrong dataset rows, and when
building expanded_items preserve agent routing by appending the original item
(or merging item["agent_ref"] into the verify_extra payload) instead of blindly
using item.get("verify_extra", item) so that agent_ref stays in the dispatched
payload used by _call_agents().

In `src/axolotl/integrations/nemo_gym/plugin.py`:
- Around line 149-154: The dataset lookup currently keys entries by
row["prompt"][0]["content"], but NemoGymDataProducer.produce() looks up using
the last prompt message, causing misses for multi-message prompts and loss of
verify_extra/agent_ref; update the keying in the __init__ (the _dataset_lookup
population loop that uses dataset and row) to use the last prompt message
content (e.g., row["prompt"][-1]["content"] or the same extraction logic
NemoGymDataProducer.produce() uses) instead of prompt[0], and make the identical
change for the second occurrence (the block around lines 175-179) so both lookup
tables use the same key format as produce().
- Around line 325-337: The code incorrectly wires cfg.nemo_gym_verify_timeout
into the NemoGymDataProducer.request_timeout causing multi-turn /run calls to be
aborted by the short verify timeout; change the constructor call in plugin.py
where NemoGymDataProducer is instantiated so it does not use
nemo_gym_verify_timeout for request_timeout—either remove the request_timeout
argument to let NemoGymDataProducer use its 10800s default, or use a dedicated
config key (e.g., cfg.nemo_gym_run_timeout) with fallback to 10800; update the
argument passed to request_timeout accordingly.
- Around line 121-123: start_servers() reads nemo_gym_server_timeout into the
global config but the subsequent call to
wait_for_resource_servers(self._global_config) still uses the helper's 180s
default; update the call to pass the configured timeout from the global config
(e.g. extract nemo_gym_server_timeout from the object returned by
get_server_configs or self._global_config and call
wait_for_resource_servers(self._global_config, timeout=nemo_gym_server_timeout)
or the equivalent parameter name expected by wait_for_resource_servers) so the
explicit server timeout is honored.

In `src/axolotl/integrations/nemo_gym/rewards.py`:
- Around line 66-67: The public hook reward_nemo_gym_verify and helper
_get_verify_urls currently hard-code the head port (11000), post payload
("model": "axolotl"), and a 30s timeout; update both functions to read and use
the configured values (nemo_gym_head_port, nemo_gym_model_name,
nemo_gym_verify_timeout) passed into or available to reward_nemo_gym_verify so
discovery and POST use the configured head port, the configured model name in
the JSON body, and the configured timeout for HTTP requests instead of the
hard-coded defaults; ensure _get_verify_urls accepts/use the provided head_port
and reward_nemo_gym_verify passes the configured timeout into the requests call
and uses nemo_gym_model_name for payload construction.

In `src/axolotl/integrations/nemo_gym/server.py`:
- Around line 186-200: Update get_server_base_url (and thus get_verify_endpoint)
to normalize bind-all and loopback addresses the same way get_agent_servers
does: if srv_cfg['host'] is "0.0.0.0" replace with the external host (e.g.,
global_config["host"] or the agent/cluster-facing host used elsewhere) and if it
is "localhost" or "127.0.0.1" map it to the machine-facing hostname/IP used for
reachability; ensure wait_for_resource_servers uses this normalized URL for its
readiness probes. Locate get_server_base_url and get_verify_endpoint and apply
the same host-rewrite logic/utility used by get_agent_servers (or extract that
normalization into a helper and reuse it) so /verify URLs are routable when
services advertise 0.0.0.0 or localhost.
- Around line 101-108: The current code builds a shell command using "bash -c"
with a spliced config_paths string (vulnerable to spaces/quotes and shell
injection); instead construct a subprocess argument list and invoke the venv's
ng_run binary directly so no shell interpretation occurs. Locate where
_ng_process is created and replace the ["bash", "-c", ...] invocation by
computing the ng_run executable from the venv under gym_dir (the same venv used
earlier), then call subprocess.Popen with shell=False and an argument list that
passes config paths as separate arguments (or a single safely-constructed
argument) rather than interpolating them into a shell string; keep _ng_log_file
for stdout/stderr and preserve existing nosec/permission handling.
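The shell-free launch described in this entry can be sketched as a pure command builder plus a `shell=False` Popen call. The venv layout and binary name are assumptions taken from the comment above, not verified against the PR.

```python
from pathlib import Path

def build_ng_command(gym_dir: Path, config_paths: list[str]) -> list[str]:
    """Build an argv list for the venv's ng_run binary.

    Each config path becomes its own argv entry, so spaces or quotes in
    paths are never interpreted by a shell and no injection is possible.
    """
    ng_run = gym_dir / ".venv" / "bin" / "ng_run"  # hypothetical venv layout
    return [str(ng_run), *config_paths]

cmd = build_ng_command(Path("/tmp/Gym"), ["cfg a.yaml", "b.yaml"])
print(cmd)  # → ['/tmp/Gym/.venv/bin/ng_run', 'cfg a.yaml', 'b.yaml']
# Launch with shell=False (the Popen default), reusing the existing log file:
# subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)
```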

In `src/axolotl/scripts/vllm_serve_lora.py`:
- Around line 520-528: The usage.total_tokens is hardcoded to 0; compute it from
the actual counts instead. Inside the usage dict in vllm_serve_lora.py (the
block that builds "usage"), assign prompt_tokens =
len(all_outputs[0].prompt_token_ids) if all_outputs else 0, compute
completion_tokens by summing len(out.token_ids) for o in all_outputs for out in
o.outputs, and set total_tokens = prompt_tokens + completion_tokens so the
"usage" object reflects the real totals.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 66d9db69-40e4-426d-8f72-9160a41eba31

📥 Commits

Reviewing files that changed from the base of the PR and between fc3b3d1 and d8da708.

📒 Files selected for processing (15)
  • src/axolotl/core/builders/rl.py
  • src/axolotl/core/trainers/grpo/async_trainer.py
  • src/axolotl/integrations/nemo_gym/README.md
  • src/axolotl/integrations/nemo_gym/__init__.py
  • src/axolotl/integrations/nemo_gym/args.py
  • src/axolotl/integrations/nemo_gym/data_producer.py
  • src/axolotl/integrations/nemo_gym/dataset.py
  • src/axolotl/integrations/nemo_gym/examples/nemo_gym_multi_env.yaml
  • src/axolotl/integrations/nemo_gym/examples/nemo_gym_multi_turn.yaml
  • src/axolotl/integrations/nemo_gym/examples/nemo_gym_sudoku.yaml
  • src/axolotl/integrations/nemo_gym/multi_turn.py
  • src/axolotl/integrations/nemo_gym/plugin.py
  • src/axolotl/integrations/nemo_gym/rewards.py
  • src/axolotl/integrations/nemo_gym/server.py
  • src/axolotl/scripts/vllm_serve_lora.py

Comment on lines +17 to +40
nemo_gym_enabled: bool | None = Field(
    default=True,
    json_schema_extra={
        "description": "Enable NeMo Gym integration for environment-based RL rewards."
    },
)
nemo_gym_dir: str | None = Field(
    default=None,
    json_schema_extra={
        "description": (
            "Path to the NeMo Gym repository clone. "
            "If not set and nemo_gym_auto_clone is True, clones to ~/Gym."
        )
    },
)
nemo_gym_auto_clone: bool | None = Field(
    default=None,
    json_schema_extra={
        "description": (
            "Automatically clone the NeMo Gym repository if not present. "
            "Defaults to True when nemo_gym_enabled is set."
        )
    },
)
Contributor

⚠️ Potential issue | 🟠 Major

Resolve the effective defaults before validating this config.

With mode="before", omitting nemo_gym_enabled skips this validator, but the model still finalizes to nemo_gym_enabled=True later. The same nullable-default pattern leaves flags like nemo_gym_auto_clone as None, which downstream code treats as false, so NeMo Gym can come up enabled without its required settings and without the documented auto-clone behavior.

One way to make the defaults consistent
 class NemoGymArgs(BaseModel):
@@
-    nemo_gym_enabled: bool | None = Field(
-        default=True,
+    nemo_gym_enabled: bool = Field(
+        default=False,
@@
-    nemo_gym_auto_clone: bool | None = Field(
-        default=None,
+    nemo_gym_auto_clone: bool = Field(
+        default=True,
@@
-    nemo_gym_auto_start: bool | None = Field(
+    nemo_gym_auto_start: bool = Field(
         default=True,
@@
-    @model_validator(mode="before")
-    @classmethod
-    def check_nemo_gym_config(cls, data):
-        if data.get("nemo_gym_enabled"):
-            if not data.get("nemo_gym_config_paths") and data.get(
-                "nemo_gym_auto_start", True
-            ):
+    @model_validator(mode="after")
+    def check_nemo_gym_config(self):
+        if self.nemo_gym_enabled:
+            if not self.nemo_gym_config_paths and self.nemo_gym_auto_start:
                 raise ValueError(
                     "nemo_gym_config_paths is required when nemo_gym_enabled=True "
                     "and nemo_gym_auto_start is not False."
                 )
-            if not data.get("nemo_gym_datasets"):
+            if not self.nemo_gym_datasets:
                 raise ValueError(
                     "nemo_gym_datasets is required when nemo_gym_enabled=True. "
                     "Provide at least one dataset with 'path' and 'server_name'."
                 )
-        return data
+        return self

Also applies to: 120-136


Comment on lines +99 to +107
# Find the full dataset item
item = self._dataset_lookup.get(prompt_text, {}).get("verify_extra", {})
if not item:
    item = {
        "responses_create_params": {
            "input": [{"role": "user", "content": prompt_text}]
        }
    }
dataset_items.append(item)
Contributor

⚠️ Potential issue | 🟠 Major

Preserve agent_ref when building the agent request payload.

This path keeps only verify_extra, but _call_agents() picks the /run endpoint from item["agent_ref"]["name"]. In multi-environment datasets that strips away the routing metadata and causes samples to fall back to the first agent server.

Suggested fix
-            item = self._dataset_lookup.get(prompt_text, {}).get("verify_extra", {})
-            if not item:
-                item = {
-                    "responses_create_params": {
-                        "input": [{"role": "user", "content": prompt_text}]
-                    }
-                }
+            row = self._dataset_lookup.get(prompt_text, {})
+            item = dict(row.get("verify_extra", {}))
+            if "agent_ref" in row:
+                item["agent_ref"] = row["agent_ref"]
+            if not item:
+                item = {
+                    "responses_create_params": {
+                        "input": [{"role": "user", "content": prompt_text}]
+                    }
+                }
             dataset_items.append(item)

Comment on lines +66 to +84
for prompt_str in prompts:
    # Prompts from TRL are chat-templated strings. Find the dataset item
    # by matching against dataset_lookup keys (raw user message text).
    item = None
    for key, val in dataset_lookup.items():
        if isinstance(key, str) and key in prompt_str:
            item = val
            break

    if item is None:
        item = {
            "responses_create_params": {
                "input": [{"role": "user", "content": prompt_str}]
            }
        }

    for _ in range(num_generations):
        expanded_items.append(item.get("verify_extra", item))
        expanded_prompt_indices.append(prompt_str)
Contributor

⚠️ Potential issue | 🟠 Major

Avoid fuzzy prompt matching and keep agent_ref in the dispatched payload.

if key in prompt_str can select the wrong dataset row when prompts share a system or user prefix, and item.get("verify_extra", item) then strips the agent_ref that _call_agents() uses for routing. In multi-environment runs that silently sends requests to the wrong agent.
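A minimal sketch of the suggested behavior, combining exact-key lookup with agent_ref preservation; the helper name and lookup shape are hypothetical stand-ins for the PR's internals.

```python
def select_item(prompt_str: str, dataset_lookup: dict) -> dict:
    """Exact-match the prompt and keep routing metadata in the payload."""
    row = dataset_lookup.get(prompt_str)  # equality, not substring containment
    if row is None:
        return {"responses_create_params":
                {"input": [{"role": "user", "content": prompt_str}]}}
    item = dict(row.get("verify_extra", {}))
    if "agent_ref" in row:
        item["agent_ref"] = row["agent_ref"]  # preserved for agent routing
    return item

lookup = {"solve 2+2": {"verify_extra": {"answer": "4"},
                        "agent_ref": {"name": "math_agent"}}}
print(select_item("solve 2+2", lookup)["agent_ref"]["name"])  # → math_agent
```

Note that exact matching assumes the lookup keys are built from the same (chat-templated or raw) form as the incoming prompts; that consistency is the subject of the next comment.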


Comment on lines +121 to +123
self._global_config = get_server_configs(head_port=head_port)
wait_for_resource_servers(self._global_config)

Contributor

⚠️ Potential issue | 🟠 Major

Pass the configured timeout into wait_for_resource_servers().

start_servers() honors nemo_gym_server_timeout, but this follow-up wait still uses the helper’s 180s default. Slow boots will fail after 3 minutes even when the config explicitly allows longer.

Suggested fix
         self._global_config = get_server_configs(head_port=head_port)
-        wait_for_resource_servers(self._global_config)
+        wait_for_resource_servers(self._global_config, timeout=server_timeout)

Comment on lines +149 to +154
self._dataset_lookup = {}
for i in range(len(dataset)):
    row = dataset[i]
    prompt_text = row["prompt"][0]["content"]
    self._dataset_lookup[prompt_text] = row
LOG.info(f"Built dataset lookup with {len(self._dataset_lookup)} entries")
Contributor

⚠️ Potential issue | 🟠 Major

Use the same lookup key that the producer searches for.

These maps are keyed by row["prompt"][0]["content"], but NemoGymDataProducer.produce() looks them up with the last prompt message. Any prompt shaped like [system, user] will miss the lookup, drop its verify_extra/agent_ref, and then fall back to the wrong agent or resource server.

Also applies to: 175-179
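The keying fix amounts to using the last prompt message. A tiny sketch with a hypothetical helper name:

```python
def lookup_key(row: dict) -> str:
    """Key rows by the last prompt message so [system, user] prompts still match."""
    return row["prompt"][-1]["content"]

row = {"prompt": [{"role": "system", "content": "be terse"},
                  {"role": "user", "content": "solve the puzzle"}]}
print(lookup_key(row))  # → solve the puzzle
```

Sharing one such helper between the lookup-table builder and produce() would keep the two sides consistent by construction.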


Comment on lines +325 to +337
nemo_producer = NemoGymDataProducer(
    config=inner.config,
    prompt_dataset=inner._dataset,
    num_generations=inner._num_generations,
    generation_batch_size=inner._generation_batch_size,
    train_batch_size=inner._train_batch_size,
    steps_per_generation=inner._steps_per_generation,
    shuffle_dataset=inner._shuffle_dataset,
    seed=inner._seed,
    agent_servers=self._agent_servers,
    dataset_lookup=self._dataset_lookup or {},
    request_timeout=float(cfg.nemo_gym_verify_timeout or 10800),
)

⚠️ Potential issue | 🟠 Major

Do not reuse the /verify timeout for multi-turn /run calls.

nemo_gym_verify_timeout defaults to 30s, but agent /run requests can legitimately take much longer than that. Wiring it into NemoGymDataProducer.request_timeout will abort long rollouts even though the producer’s own default is 10800s.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/integrations/nemo_gym/plugin.py` around lines 325 - 337, The code
incorrectly wires cfg.nemo_gym_verify_timeout into the
NemoGymDataProducer.request_timeout causing multi-turn /run calls to be aborted
by the short verify timeout; change the constructor call in plugin.py where
NemoGymDataProducer is instantiated so it does not use nemo_gym_verify_timeout
for request_timeout—either remove the request_timeout argument to let
NemoGymDataProducer use its 10800s default, or use a dedicated config key (e.g.,
cfg.nemo_gym_run_timeout) with fallback to 10800; update the argument passed to
request_timeout accordingly.
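
A sketch of the second option from the prompt above: resolve the `/run` timeout from a dedicated key with a fallback to the producer's 10800s default. `nemo_gym_run_timeout` is the hypothetical config key suggested in the comment, not an existing setting:

```python
# Resolve the /run timeout independently of the short /verify timeout.
DEFAULT_RUN_TIMEOUT = 10800.0  # matches NemoGymDataProducer's own default


def resolve_run_timeout(cfg: dict) -> float:
    # Fall back to the producer default when the dedicated key is unset,
    # rather than inheriting nemo_gym_verify_timeout.
    value = cfg.get("nemo_gym_run_timeout")
    return float(value) if value is not None else DEFAULT_RUN_TIMEOUT


print(resolve_run_timeout({}))                            # 10800.0
print(resolve_run_timeout({"nemo_gym_run_timeout": 60}))  # 60.0
```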

Comment on lines +66 to +67
def _get_verify_urls(head_port: int = 11000) -> dict[str, str]:
"""Discover verify endpoints from the NeMo Gym head server.

⚠️ Potential issue | 🟠 Major

Make the public reward hook honor the configured NeMo Gym settings.

reward_nemo_gym_verify() is documented as a ready-to-use entrypoint, but it always discovers endpoints from port 11000, posts "model": "axolotl", and uses a hard-coded 30s timeout. Any run that overrides nemo_gym_head_port, nemo_gym_model_name, or nemo_gym_verify_timeout will silently talk to the wrong backend or return 0.0 on slow verifies.

Also applies to: 103-167

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/integrations/nemo_gym/rewards.py` around lines 66 - 67, The
public hook reward_nemo_gym_verify and helper _get_verify_urls currently
hard-code the head port (11000), post payload ("model": "axolotl"), and a 30s
timeout; update both functions to read and use the configured values
(nemo_gym_head_port, nemo_gym_model_name, nemo_gym_verify_timeout) passed into
or available to reward_nemo_gym_verify so discovery and POST use the configured
head port, the configured model name in the JSON body, and the configured
timeout for HTTP requests instead of the hard-coded defaults; ensure
_get_verify_urls accepts/use the provided head_port and reward_nemo_gym_verify
passes the configured timeout into the requests call and uses
nemo_gym_model_name for payload construction.
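
One way to thread those settings through, sketched without the HTTP call itself. The function name, payload shape, and parameter names here are illustrative assumptions; only the three config keys come from the comment above:

```python
def build_verify_request(
    completion: str,
    head_port: int = 11000,
    model_name: str = "axolotl",
    timeout: float = 30.0,
) -> tuple[str, dict, float]:
    """Assemble the /verify request from configured values instead of constants.

    The defaults mirror the current hard-coded behavior; callers pass
    nemo_gym_head_port, nemo_gym_model_name, and nemo_gym_verify_timeout
    through these parameters.
    """
    url = f"http://localhost:{head_port}/verify"
    payload = {"model": model_name, "response": completion}
    return url, payload, timeout


# A run that overrides all three settings now reaches the right backend:
url, payload, timeout = build_verify_request(
    "42", head_port=12000, model_name="my-model", timeout=120.0
)
print(url)  # http://localhost:12000/verify
```

The returned `timeout` would then be passed to the actual `requests.post(..., timeout=timeout)` call so slow verifies no longer fall back to 0.0.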

Comment on lines +101 to +108
config_str = ",".join(config_paths)
_ng_log_file = open(os.path.join(gym_dir, "ng_run.log"), "w") # noqa: SIM115
_ng_process = subprocess.Popen( # nosec
[
"bash",
"-c",
f'source .venv/bin/activate && ng_run "+config_paths=[{config_str}]"',
],

⚠️ Potential issue | 🟠 Major

Build the ng_run command without bash -c.

config_paths comes from config and is spliced into a shell string here. That makes paths with spaces or quotes brittle, and shell metacharacters become executable input. Calling the venv’s ng_run binary directly avoids both problems.

Suggested fix
     config_str = ",".join(config_paths)
     _ng_log_file = open(os.path.join(gym_dir, "ng_run.log"), "w")  # noqa: SIM115
+    ng_run = os.path.join(gym_dir, ".venv", "bin", "ng_run")
     _ng_process = subprocess.Popen(  # nosec
-        [
-            "bash",
-            "-c",
-            f'source .venv/bin/activate && ng_run "+config_paths=[{config_str}]"',
-        ],
+        [ng_run, f"+config_paths=[{config_str}]"],
         cwd=gym_dir,
         stdout=_ng_log_file,
         stderr=subprocess.STDOUT,
     )
🧰 Tools
🪛 Ruff (0.15.6)

[error] 103-103: subprocess call: check for execution of untrusted input

(S603)


[error] 104-108: Starting a process with a partial executable path

(S607)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/integrations/nemo_gym/server.py` around lines 101 - 108, The
current code builds a shell command using "bash -c" with a spliced config_paths
string (vulnerable to spaces/quotes and shell injection); instead construct a
subprocess argument list and invoke the venv's ng_run binary directly so no
shell interpretation occurs. Locate where _ng_process is created and replace the
["bash", "-c", ...] invocation by computing the ng_run executable from the venv
under gym_dir (the same venv used earlier), then call subprocess.Popen with
shell=False and an argument list that passes config paths as separate arguments
(or a single safely-constructed argument) rather than interpolating them into a
shell string; keep _ng_log_file for stdout/stderr and preserve existing
nosec/permission handling.

Comment on lines +186 to +200
def get_server_base_url(global_config: dict, server_name: str) -> str:
"""Get the base URL for a given resource server."""
try:
srv_cfg = global_config[server_name]["resources_servers"][server_name]
return f"http://{srv_cfg['host']}:{srv_cfg['port']}"
except (KeyError, TypeError) as exc:
raise ValueError(
f"Could not find resource server config for '{server_name}' in NeMo Gym. "
f"Available servers: {list(global_config.keys())}"
) from exc


def get_verify_endpoint(global_config: dict, server_name: str) -> str:
"""Get the /verify endpoint URL for a given resource server."""
return f"{get_server_base_url(global_config, server_name)}/verify"

⚠️ Potential issue | 🟠 Major

Normalize bind-all and loopback hosts for resource servers too.

get_server_base_url() and wait_for_resource_servers() use the raw host from NeMo Gym config. If a resource server advertises 0.0.0.0 or localhost, the generated /verify URL can be unroutable and the readiness probe can time out even though the service is up. This should mirror the host-rewrite logic already used in get_agent_servers().

Also applies to: 211-215

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/integrations/nemo_gym/server.py` around lines 186 - 200, Update
get_server_base_url (and thus get_verify_endpoint) to normalize bind-all and
loopback addresses the same way get_agent_servers does: if srv_cfg['host'] is
"0.0.0.0" replace with the external host (e.g., global_config["host"] or the
agent/cluster-facing host used elsewhere) and if it is "localhost" or
"127.0.0.1" map it to the machine-facing hostname/IP used for reachability;
ensure wait_for_resource_servers uses this normalized URL for its readiness
probes. Locate get_server_base_url and get_verify_endpoint and apply the same
host-rewrite logic/utility used by get_agent_servers (or extract that
normalization into a helper and reuse it) so /verify URLs are routable when
services advertise 0.0.0.0 or localhost.
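
The extracted helper could look like this sketch. The rewrite targets (bind-all to an external host, loopback aliases collapsed) follow the comment above; the function names and the `external_host` parameter are assumptions, not the PR's actual API:

```python
def normalize_host(host: str, external_host: str = "127.0.0.1") -> str:
    # 0.0.0.0 is a bind address, not a routable destination; rewrite it
    # to the host clients should actually reach.
    if host == "0.0.0.0":  # nosec B104 - comparison, not a bind
        return external_host
    # Collapse loopback aliases to one canonical form.
    if host in ("localhost", "127.0.0.1"):
        return "127.0.0.1"
    return host


def server_base_url(srv_cfg: dict, external_host: str = "127.0.0.1") -> str:
    host = normalize_host(srv_cfg["host"], external_host)
    return f"http://{host}:{srv_cfg['port']}"


print(server_base_url({"host": "0.0.0.0", "port": 8000}, external_host="10.0.0.5"))
# http://10.0.0.5:8000
```

Both `get_server_base_url()` and the readiness probe in `wait_for_resource_servers()` would call the same helper, so the `/verify` URL and the probe target stay consistent.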

Comment on lines +520 to +528
"usage": {
"prompt_tokens": len(all_outputs[0].prompt_token_ids)
if all_outputs
else 0,
"completion_tokens": sum(
len(out.token_ids) for o in all_outputs for out in o.outputs
),
"total_tokens": 0,
},

⚠️ Potential issue | 🟡 Minor

Populate usage.total_tokens from the actual counts.

Line 527 hardcodes total_tokens to 0, so any downstream accounting that reads the usage block will undercount every response.

Suggested fix
         return {
             "id": f"chatcmpl-{uuid.uuid4().hex[:8]}",
             "object": "chat.completion",
             "model": script_args.model,
             "choices": choices,
             "usage": {
                 "prompt_tokens": len(all_outputs[0].prompt_token_ids)
                 if all_outputs
                 else 0,
                 "completion_tokens": sum(
                     len(out.token_ids) for o in all_outputs for out in o.outputs
                 ),
-                "total_tokens": 0,
+                "total_tokens": (
+                    (len(all_outputs[0].prompt_token_ids) if all_outputs else 0)
+                    + sum(len(out.token_ids) for o in all_outputs for out in o.outputs)
+                ),
             },
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/scripts/vllm_serve_lora.py` around lines 520 - 528, The
usage.total_tokens is hardcoded to 0; compute it from the actual counts instead.
Inside the usage dict in vllm_serve_lora.py (the block that builds "usage"),
assign prompt_tokens = len(all_outputs[0].prompt_token_ids) if all_outputs else
0, compute completion_tokens by summing len(out.token_ids) for o in all_outputs
for out in o.outputs, and set total_tokens = prompt_tokens + completion_tokens
so the "usage" object reflects the real totals.

@winglian force-pushed the nemo-gym-integration branch from f717e46 to 64746db on March 22, 2026 at 17:56
Labels

scheduled_release This PR is slated for the upcoming release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants