
Nemo gym integration #3516

Open
winglian wants to merge 10 commits into main from nemo-gym-integration

Conversation

@winglian
Collaborator

@winglian winglian commented Mar 19, 2026

Summary by CodeRabbit

Release Notes

  • New Features

    • Added NeMo Gym integration for reinforcement learning training with customizable reward environments and multi-turn support
    • Added OpenAI-compatible API endpoints (/v1/models, /v1/chat/completions) to vLLM server
  • Bug Fixes

    • Fixed GRPO configuration handling to prevent invalid parameters in standard (non-async) training mode
    • Improved LoRA adapter synchronization with automatic fallback mechanism
    • Fixed entropy computation in fused decoding scenarios
  • Documentation

    • Added comprehensive NeMo Gym integration guide with setup instructions and example configurations

@coderabbitai
Contributor

coderabbitai bot commented Mar 19, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: cb486e25-2331-4f84-8af1-a6393db8abcf

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This PR introduces a comprehensive NeMo Gym integration for GRPO RL training with argument validation, data producer orchestration, reward functions, and server lifecycle management. It also fixes GRPO builder field filtering for async configs, improves LoRA synchronization and entropy handling in async trainers, and adds OpenAI-compatible API endpoints to vLLM serving.

Changes

Cohort / File(s) / Summary

  • GRPO Configuration & Builder (src/axolotl/core/builders/rl.py): Conditionally filters FastAsyncGRPOConfig-only dataclass fields from the standard GRPO training args kwargs in the non-async case to prevent unsupported parameters.
  • GRPO Async Trainer Updates (src/axolotl/core/trainers/grpo/async_trainer.py): Enhanced LoRA synchronization with a fallback from the standardized /v1/load_lora_adapter endpoint to the legacy /set_lora_adapter/ endpoint; wrapped reset_prefix_cache() in exception handling; fixed entropy tensor handling when using the Liger fused softmax path.
  • NeMo Gym Integration - Core Package (src/axolotl/integrations/nemo_gym/__init__.py, args.py, plugin.py): New integration package with argument validation (enforcing config paths and datasets when enabled) and a main plugin orchestrating server lifecycle, dataset loading, reward injection, and trainer integration.
  • NeMo Gym Integration - Data & Rewards (data_producer.py, dataset.py, rewards.py): Data producer extending GRPODataProducer to generate rollouts via NeMo Gym agent /run endpoints; dataset loader parsing JSONL with server/agent routing; dual reward functions (reward_env for multi-turn passthrough, reward_nemo_gym_verify for single-turn verification).
  • NeMo Gym Integration - Server & Multi-turn (server.py, multi_turn.py): Server lifecycle management (repo setup, venv initialization, process startup, readiness checks, endpoint discovery); multi-turn rollout orchestration via async agent HTTP calls with response parsing and tensor padding.
  • NeMo Gym Documentation & Examples (README.md, nemo_gym_*.yaml): Integration guide with setup/execution steps, dependency pinning, data format specs, and three example configurations (multi-environment, multi-turn, Sudoku).
  • vLLM Server Script Enhancements (src/axolotl/scripts/vllm_serve_lora.py): Added OpenAI-compatible HTTP endpoints (GET /v1/models, POST /v1/chat/completions) with logprobs mapping; made GuidedDecodingParams optional with a fallback for older vLLM versions.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~90 minutes

Possibly related PRs

  • Support for Async GRPO #3486: Addresses FastAsyncGRPOConfig dataclass field filtering in GRPO builder for async training argument construction.

Suggested reviewers

  • djsaunde
  • NanoCode012
🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'Nemo gym integration' directly describes the main change—adding a comprehensive NeMo Gym integration to the Axolotl GRPO training framework.
Docstring Coverage ✅ Passed Docstring coverage is 80.43% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@winglian winglian requested a review from NanoCode012 March 19, 2026 09:28
Collaborator

@NanoCode012 NanoCode012 left a comment

Is this work an addition on top of async GRPO?

I'm not clear on the architecture: is it a replacement for vLLM generation?

Also, how do the uv packages interact with axolotl if we were to use pip?

@winglian winglian added the scheduled_release This PR is slated for the upcoming release label Mar 21, 2026
@winglian winglian force-pushed the nemo-gym-integration branch from 4eaaf8a to d8da708 on March 22, 2026 02:48
@winglian winglian marked this pull request as ready for review March 22, 2026 04:48
@github-actions
Contributor

github-actions bot commented Mar 22, 2026

📖 Documentation Preview: https://69c02f066e224299aeaf1509--resonant-treacle-0fd729.netlify.app

Deployed on Netlify from commit 64746db

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 10

🧹 Nitpick comments (3)
src/axolotl/integrations/nemo_gym/README.md (1)

231-252: Consider adding language specifiers to fenced code blocks for diagram sections.

The ASCII diagrams at lines 231-235 and 238-252 could use text as the language identifier to satisfy markdown linters, though this is a minor stylistic preference.

♻️ Optional fix
-```
+```text
 axolotl train → GRPO Trainer generates completions
-```
+```text
 ┌─────────────┐     ┌──────────────┐     ┌──────────────────┐
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `src/axolotl/integrations/nemo_gym/README.md` around lines 231 - 252, Update
the two fenced code blocks that render the ASCII diagrams (the block containing
"axolotl train → GRPO Trainer generates completions" and the multi-turn diagram
block beginning with the box diagram) to include a language specifier by
changing the opening backticks to ```text so markdown linters recognize them as
plain text; keep the diagram content unchanged and only add the "text" tag to
the existing triple-backtick delimiters.
src/axolotl/core/trainers/grpo/async_trainer.py (1)

985-988: Consider logging the suppressed exception for debuggability.

The static analysis correctly flags the bare try-except-pass pattern. While reset_prefix_cache() failure may not be critical, silently swallowing all exceptions makes debugging harder when issues do occur.

♻️ Proposed fix to log the exception
             # Reset prefix cache after adapter update
             try:
                 vllm_client.reset_prefix_cache()
-            except Exception:
-                pass  # Not critical
+            except Exception as exc:
+                logger.debug("reset_prefix_cache failed (non-critical): %s", exc)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `src/axolotl/core/trainers/grpo/async_trainer.py` around lines 985 - 988, The
bare except/pass around vllm_client.reset_prefix_cache() in async_trainer.py
hides errors; change it to catch Exception as e and log the failure (use the
module/class logger, e.g., self.logger or the existing logger in AsyncTrainer) —
call logger.exception or logger.warning with the exception details and a short
context message referencing reset_prefix_cache() so failures are visible for
debugging.
src/axolotl/integrations/nemo_gym/dataset.py (1)

80-81: Specify explicit UTF-8 encoding for cross-platform compatibility.

Opening files without explicit encoding uses the system default, which may cause issues on Windows systems or with non-ASCII content.

♻️ Proposed fix
-        with open(path) as f:
+        with open(path, encoding="utf-8") as f:
             lines = f.readlines()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `src/axolotl/integrations/nemo_gym/dataset.py` around lines 80 - 81, The file
read uses bare open(path) which relies on system default encoding; update the
call to explicitly set UTF-8 (e.g. change with open(path) as f: to with
open(path, encoding="utf-8") as f:) to ensure cross-platform handling of
non-ASCII content; optionally include errors="replace" if you want to tolerate
malformed bytes. Ensure you modify the open call that produces lines =
f.readlines() in src/axolotl/integrations/nemo_gym/dataset.py.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `src/axolotl/integrations/nemo_gym/args.py`:
- Around line 17-40: The validator runs too early/late leaving nullable fields
unresolved; add a root validator that runs before field validation to compute
effective defaults: create a @root_validator(pre=True) method (e.g.,
resolve_nemo_gym_defaults) in the same Pydantic model that checks incoming
values and sets nemo_gym_enabled=True if None, and sets nemo_gym_auto_clone=True
when nemo_gym_auto_clone is None and the effective nemo_gym_enabled is True (and
ensure nemo_gym_dir stays None if unset). Apply the same root_validator pattern
to the other similar block (the fields around 120-136) so the effective defaults
are resolved consistently before validation.

In `src/axolotl/integrations/nemo_gym/data_producer.py`:
- Around line 99-107: The code is dropping item["agent_ref"] by taking only
verify_extra for a prompt, causing _call_agents() to lose routing metadata;
change the logic in data_producer.py so you first fetch the full dataset entry
(e.g., full_item = self._dataset_lookup.get(prompt_text, {})), then set item =
full_item.get("verify_extra", {}) and if full_item contains "agent_ref" attach
it to item (item["agent_ref"] = full_item["agent_ref"]); when using the fallback
default payload for responses_create_params also ensure you preserve or include
agent_ref from full_item so _call_agents() can correctly choose the /run
endpoint.

In `src/axolotl/integrations/nemo_gym/multi_turn.py`:
- Around line 66-84: The loop uses fuzzy substring matching and then strips
agent routing info by appending item.get("verify_extra", item); change the
matching to exact/full match (replace "if isinstance(key, str) and key in
prompt_str" with equality or full-text comparison such as "if isinstance(key,
str) and prompt_str == key") to avoid selecting wrong dataset rows, and when
building expanded_items preserve agent routing by appending the original item
(or merging item["agent_ref"] into the verify_extra payload) instead of blindly
using item.get("verify_extra", item) so that agent_ref stays in the dispatched
payload used by _call_agents().

In `src/axolotl/integrations/nemo_gym/plugin.py`:
- Around line 149-154: The dataset lookup currently keys entries by
row["prompt"][0]["content"], but NemoGymDataProducer.produce() looks up using
the last prompt message, causing misses for multi-message prompts and loss of
verify_extra/agent_ref; update the keying in the __init__ (the _dataset_lookup
population loop that uses dataset and row) to use the last prompt message
content (e.g., row["prompt"][-1]["content"] or the same extraction logic
NemoGymDataProducer.produce() uses) instead of prompt[0], and make the identical
change for the second occurrence (the block around lines 175-179) so both lookup
tables use the same key format as produce().
- Around line 325-337: The code incorrectly wires cfg.nemo_gym_verify_timeout
into the NemoGymDataProducer.request_timeout causing multi-turn /run calls to be
aborted by the short verify timeout; change the constructor call in plugin.py
where NemoGymDataProducer is instantiated so it does not use
nemo_gym_verify_timeout for request_timeout—either remove the request_timeout
argument to let NemoGymDataProducer use its 10800s default, or use a dedicated
config key (e.g., cfg.nemo_gym_run_timeout) with fallback to 10800; update the
argument passed to request_timeout accordingly.
- Around line 121-123: start_servers() reads nemo_gym_server_timeout into the
global config but the subsequent call to
wait_for_resource_servers(self._global_config) still uses the helper's 180s
default; update the call to pass the configured timeout from the global config
(e.g. extract nemo_gym_server_timeout from the object returned by
get_server_configs or self._global_config and call
wait_for_resource_servers(self._global_config, timeout=nemo_gym_server_timeout)
or the equivalent parameter name expected by wait_for_resource_servers) so the
explicit server timeout is honored.

In `src/axolotl/integrations/nemo_gym/rewards.py`:
- Around line 66-67: The public hook reward_nemo_gym_verify and helper
_get_verify_urls currently hard-code the head port (11000), post payload
("model": "axolotl"), and a 30s timeout; update both functions to read and use
the configured values (nemo_gym_head_port, nemo_gym_model_name,
nemo_gym_verify_timeout) passed into or available to reward_nemo_gym_verify so
discovery and POST use the configured head port, the configured model name in
the JSON body, and the configured timeout for HTTP requests instead of the
hard-coded defaults; ensure _get_verify_urls accepts/use the provided head_port
and reward_nemo_gym_verify passes the configured timeout into the requests call
and uses nemo_gym_model_name for payload construction.

In `src/axolotl/integrations/nemo_gym/server.py`:
- Around line 186-200: Update get_server_base_url (and thus get_verify_endpoint)
to normalize bind-all and loopback addresses the same way get_agent_servers
does: if srv_cfg['host'] is "0.0.0.0" replace with the external host (e.g.,
global_config["host"] or the agent/cluster-facing host used elsewhere) and if it
is "localhost" or "127.0.0.1" map it to the machine-facing hostname/IP used for
reachability; ensure wait_for_resource_servers uses this normalized URL for its
readiness probes. Locate get_server_base_url and get_verify_endpoint and apply
the same host-rewrite logic/utility used by get_agent_servers (or extract that
normalization into a helper and reuse it) so /verify URLs are routable when
services advertise 0.0.0.0 or localhost.
- Around line 101-108: The current code builds a shell command using "bash -c"
with a spliced config_paths string (vulnerable to spaces/quotes and shell
injection); instead construct a subprocess argument list and invoke the venv's
ng_run binary directly so no shell interpretation occurs. Locate where
_ng_process is created and replace the ["bash", "-c", ...] invocation by
computing the ng_run executable from the venv under gym_dir (the same venv used
earlier), then call subprocess.Popen with shell=False and an argument list that
passes config paths as separate arguments (or a single safely-constructed
argument) rather than interpolating them into a shell string; keep _ng_log_file
for stdout/stderr and preserve existing nosec/permission handling.
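The shell-free launch described in this entry can be sketched as a pure command builder plus a `shell=False` Popen call. The venv layout and binary name are assumptions taken from the comment above, not verified against the PR.

```python
from pathlib import Path

def build_ng_command(gym_dir: Path, config_paths: list[str]) -> list[str]:
    """Build an argv list for the venv's ng_run binary.

    Each config path becomes its own argv entry, so spaces or quotes in
    paths are never interpreted by a shell and no injection is possible.
    """
    ng_run = gym_dir / ".venv" / "bin" / "ng_run"  # hypothetical venv layout
    return [str(ng_run), *config_paths]

cmd = build_ng_command(Path("/tmp/Gym"), ["cfg a.yaml", "b.yaml"])
print(cmd)  # → ['/tmp/Gym/.venv/bin/ng_run', 'cfg a.yaml', 'b.yaml']
# Launch with shell=False (the Popen default), reusing the existing log file:
# subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)
```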

In `src/axolotl/scripts/vllm_serve_lora.py`:
- Around line 520-528: The usage.total_tokens is hardcoded to 0; compute it from
the actual counts instead. Inside the usage dict in vllm_serve_lora.py (the
block that builds "usage"), assign prompt_tokens =
len(all_outputs[0].prompt_token_ids) if all_outputs else 0, compute
completion_tokens by summing len(out.token_ids) for o in all_outputs for out in
o.outputs, and set total_tokens = prompt_tokens + completion_tokens so the
"usage" object reflects the real totals.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 66d9db69-40e4-426d-8f72-9160a41eba31

📥 Commits

Reviewing files that changed from the base of the PR and between fc3b3d1 and d8da708.

📒 Files selected for processing (15)
  • src/axolotl/core/builders/rl.py
  • src/axolotl/core/trainers/grpo/async_trainer.py
  • src/axolotl/integrations/nemo_gym/README.md
  • src/axolotl/integrations/nemo_gym/__init__.py
  • src/axolotl/integrations/nemo_gym/args.py
  • src/axolotl/integrations/nemo_gym/data_producer.py
  • src/axolotl/integrations/nemo_gym/dataset.py
  • src/axolotl/integrations/nemo_gym/examples/nemo_gym_multi_env.yaml
  • src/axolotl/integrations/nemo_gym/examples/nemo_gym_multi_turn.yaml
  • src/axolotl/integrations/nemo_gym/examples/nemo_gym_sudoku.yaml
  • src/axolotl/integrations/nemo_gym/multi_turn.py
  • src/axolotl/integrations/nemo_gym/plugin.py
  • src/axolotl/integrations/nemo_gym/rewards.py
  • src/axolotl/integrations/nemo_gym/server.py
  • src/axolotl/scripts/vllm_serve_lora.py

Comment on lines +17 to +40
nemo_gym_enabled: bool | None = Field(
    default=True,
    json_schema_extra={
        "description": "Enable NeMo Gym integration for environment-based RL rewards."
    },
)
nemo_gym_dir: str | None = Field(
    default=None,
    json_schema_extra={
        "description": (
            "Path to the NeMo Gym repository clone. "
            "If not set and nemo_gym_auto_clone is True, clones to ~/Gym."
        )
    },
)
nemo_gym_auto_clone: bool | None = Field(
    default=None,
    json_schema_extra={
        "description": (
            "Automatically clone the NeMo Gym repository if not present. "
            "Defaults to True when nemo_gym_enabled is set."
        )
    },
)
Contributor

⚠️ Potential issue | 🟠 Major

Resolve the effective defaults before validating this config.

With mode="before", omitting nemo_gym_enabled skips this validator, but the model still finalizes to nemo_gym_enabled=True later. The same nullable-default pattern leaves flags like nemo_gym_auto_clone as None, which downstream code treats as false, so NeMo Gym can come up enabled without its required settings and without the documented auto-clone behavior.

One way to make the defaults consistent
 class NemoGymArgs(BaseModel):
@@
-    nemo_gym_enabled: bool | None = Field(
-        default=True,
+    nemo_gym_enabled: bool = Field(
+        default=False,
@@
-    nemo_gym_auto_clone: bool | None = Field(
-        default=None,
+    nemo_gym_auto_clone: bool = Field(
+        default=True,
@@
-    nemo_gym_auto_start: bool | None = Field(
+    nemo_gym_auto_start: bool = Field(
         default=True,
@@
-    @model_validator(mode="before")
-    @classmethod
-    def check_nemo_gym_config(cls, data):
-        if data.get("nemo_gym_enabled"):
-            if not data.get("nemo_gym_config_paths") and data.get(
-                "nemo_gym_auto_start", True
-            ):
+    @model_validator(mode="after")
+    def check_nemo_gym_config(self):
+        if self.nemo_gym_enabled:
+            if not self.nemo_gym_config_paths and self.nemo_gym_auto_start:
                 raise ValueError(
                     "nemo_gym_config_paths is required when nemo_gym_enabled=True "
                     "and nemo_gym_auto_start is not False."
                 )
-            if not data.get("nemo_gym_datasets"):
+            if not self.nemo_gym_datasets:
                 raise ValueError(
                     "nemo_gym_datasets is required when nemo_gym_enabled=True. "
                     "Provide at least one dataset with 'path' and 'server_name'."
                 )
-        return data
+        return self

Also applies to: 120-136


Comment on lines +99 to +107
# Find the full dataset item
item = self._dataset_lookup.get(prompt_text, {}).get("verify_extra", {})
if not item:
    item = {
        "responses_create_params": {
            "input": [{"role": "user", "content": prompt_text}]
        }
    }
dataset_items.append(item)
Contributor

⚠️ Potential issue | 🟠 Major

Preserve agent_ref when building the agent request payload.

This path keeps only verify_extra, but _call_agents() picks the /run endpoint from item["agent_ref"]["name"]. In multi-environment datasets that strips away the routing metadata and causes samples to fall back to the first agent server.

Suggested fix
-            item = self._dataset_lookup.get(prompt_text, {}).get("verify_extra", {})
-            if not item:
-                item = {
-                    "responses_create_params": {
-                        "input": [{"role": "user", "content": prompt_text}]
-                    }
-                }
+            row = self._dataset_lookup.get(prompt_text, {})
+            item = dict(row.get("verify_extra", {}))
+            if "agent_ref" in row:
+                item["agent_ref"] = row["agent_ref"]
+            if not item:
+                item = {
+                    "responses_create_params": {
+                        "input": [{"role": "user", "content": prompt_text}]
+                    }
+                }
             dataset_items.append(item)

Comment on lines +66 to +84
for prompt_str in prompts:
    # Prompts from TRL are chat-templated strings. Find the dataset item
    # by matching against dataset_lookup keys (raw user message text).
    item = None
    for key, val in dataset_lookup.items():
        if isinstance(key, str) and key in prompt_str:
            item = val
            break

    if item is None:
        item = {
            "responses_create_params": {
                "input": [{"role": "user", "content": prompt_str}]
            }
        }

    for _ in range(num_generations):
        expanded_items.append(item.get("verify_extra", item))
        expanded_prompt_indices.append(prompt_str)
Contributor

⚠️ Potential issue | 🟠 Major

Avoid fuzzy prompt matching and keep agent_ref in the dispatched payload.

if key in prompt_str can select the wrong dataset row when prompts share a system or user prefix, and item.get("verify_extra", item) then strips the agent_ref that _call_agents() uses for routing. In multi-environment runs that silently sends requests to the wrong agent.
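A minimal sketch of the suggested behavior, combining exact-key lookup with agent_ref preservation; the helper name and lookup shape are hypothetical stand-ins for the PR's internals.

```python
def select_item(prompt_str: str, dataset_lookup: dict) -> dict:
    """Exact-match the prompt and keep routing metadata in the payload."""
    row = dataset_lookup.get(prompt_str)  # equality, not substring containment
    if row is None:
        return {"responses_create_params":
                {"input": [{"role": "user", "content": prompt_str}]}}
    item = dict(row.get("verify_extra", {}))
    if "agent_ref" in row:
        item["agent_ref"] = row["agent_ref"]  # preserved for agent routing
    return item

lookup = {"solve 2+2": {"verify_extra": {"answer": "4"},
                        "agent_ref": {"name": "math_agent"}}}
print(select_item("solve 2+2", lookup)["agent_ref"]["name"])  # → math_agent
```

Note that exact matching assumes the lookup keys are built from the same (chat-templated or raw) form as the incoming prompts; that consistency is the subject of the next comment.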


Comment on lines +121 to +123
self._global_config = get_server_configs(head_port=head_port)
wait_for_resource_servers(self._global_config)

Contributor

⚠️ Potential issue | 🟠 Major

Pass the configured timeout into wait_for_resource_servers().

start_servers() honors nemo_gym_server_timeout, but this follow-up wait still uses the helper’s 180s default. Slow boots will fail after 3 minutes even when the config explicitly allows longer.

Suggested fix
         self._global_config = get_server_configs(head_port=head_port)
-        wait_for_resource_servers(self._global_config)
+        wait_for_resource_servers(self._global_config, timeout=server_timeout)

Comment on lines +149 to +154
self._dataset_lookup = {}
for i in range(len(dataset)):
    row = dataset[i]
    prompt_text = row["prompt"][0]["content"]
    self._dataset_lookup[prompt_text] = row
LOG.info(f"Built dataset lookup with {len(self._dataset_lookup)} entries")
Contributor

⚠️ Potential issue | 🟠 Major

Use the same lookup key that the producer searches for.

These maps are keyed by row["prompt"][0]["content"], but NemoGymDataProducer.produce() looks them up with the last prompt message. Any prompt shaped like [system, user] will miss the lookup, drop its verify_extra/agent_ref, and then fall back to the wrong agent or resource server.

Also applies to: 175-179
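The keying fix amounts to using the last prompt message. A tiny sketch with a hypothetical helper name:

```python
def lookup_key(row: dict) -> str:
    """Key rows by the last prompt message so [system, user] prompts still match."""
    return row["prompt"][-1]["content"]

row = {"prompt": [{"role": "system", "content": "be terse"},
                  {"role": "user", "content": "solve the puzzle"}]}
print(lookup_key(row))  # → solve the puzzle
```

Sharing one such helper between the lookup-table builder and produce() would keep the two sides consistent by construction.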


Comment on lines +325 to +337
nemo_producer = NemoGymDataProducer(
    config=inner.config,
    prompt_dataset=inner._dataset,
    num_generations=inner._num_generations,
    generation_batch_size=inner._generation_batch_size,
    train_batch_size=inner._train_batch_size,
    steps_per_generation=inner._steps_per_generation,
    shuffle_dataset=inner._shuffle_dataset,
    seed=inner._seed,
    agent_servers=self._agent_servers,
    dataset_lookup=self._dataset_lookup or {},
    request_timeout=float(cfg.nemo_gym_verify_timeout or 10800),
)

⚠️ Potential issue | 🟠 Major

Do not reuse the /verify timeout for multi-turn /run calls.

nemo_gym_verify_timeout defaults to 30s, but agent /run requests can legitimately take much longer than that. Wiring it into NemoGymDataProducer.request_timeout will abort long rollouts even though the producer’s own default is 10800s.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/integrations/nemo_gym/plugin.py` around lines 325 - 337, The code
incorrectly wires cfg.nemo_gym_verify_timeout into the
NemoGymDataProducer.request_timeout causing multi-turn /run calls to be aborted
by the short verify timeout; change the constructor call in plugin.py where
NemoGymDataProducer is instantiated so it does not use nemo_gym_verify_timeout
for request_timeout—either remove the request_timeout argument to let
NemoGymDataProducer use its 10800s default, or use a dedicated config key (e.g.,
cfg.nemo_gym_run_timeout) with fallback to 10800; update the argument passed to
request_timeout accordingly.
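
A sketch of the second option from the prompt above: resolve the `/run` timeout from a dedicated key with a fallback to the producer's 10800s default. `nemo_gym_run_timeout` is the hypothetical config key suggested in the comment, not an existing setting:

```python
# Resolve the /run timeout independently of the short /verify timeout.
DEFAULT_RUN_TIMEOUT = 10800.0  # matches NemoGymDataProducer's own default


def resolve_run_timeout(cfg: dict) -> float:
    # Fall back to the producer default when the dedicated key is unset,
    # rather than inheriting nemo_gym_verify_timeout.
    value = cfg.get("nemo_gym_run_timeout")
    return float(value) if value is not None else DEFAULT_RUN_TIMEOUT


print(resolve_run_timeout({}))                            # 10800.0
print(resolve_run_timeout({"nemo_gym_run_timeout": 60}))  # 60.0
```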

Comment on lines +66 to +67
def _get_verify_urls(head_port: int = 11000) -> dict[str, str]:
"""Discover verify endpoints from the NeMo Gym head server.

⚠️ Potential issue | 🟠 Major

Make the public reward hook honor the configured NeMo Gym settings.

reward_nemo_gym_verify() is documented as a ready-to-use entrypoint, but it always discovers endpoints from port 11000, posts "model": "axolotl", and uses a hard-coded 30s timeout. Any run that overrides nemo_gym_head_port, nemo_gym_model_name, or nemo_gym_verify_timeout will silently talk to the wrong backend or return 0.0 on slow verifies.

Also applies to: 103-167

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/integrations/nemo_gym/rewards.py` around lines 66 - 67, The
public hook reward_nemo_gym_verify and helper _get_verify_urls currently
hard-code the head port (11000), post payload ("model": "axolotl"), and a 30s
timeout; update both functions to read and use the configured values
(nemo_gym_head_port, nemo_gym_model_name, nemo_gym_verify_timeout) passed into
or available to reward_nemo_gym_verify so discovery and POST use the configured
head port, the configured model name in the JSON body, and the configured
timeout for HTTP requests instead of the hard-coded defaults; ensure
_get_verify_urls accepts/use the provided head_port and reward_nemo_gym_verify
passes the configured timeout into the requests call and uses
nemo_gym_model_name for payload construction.
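
One way to thread those settings through, sketched without the HTTP call itself. The function name, payload shape, and parameter names here are illustrative assumptions; only the three config keys come from the comment above:

```python
def build_verify_request(
    completion: str,
    head_port: int = 11000,
    model_name: str = "axolotl",
    timeout: float = 30.0,
) -> tuple[str, dict, float]:
    """Assemble the /verify request from configured values instead of constants.

    The defaults mirror the current hard-coded behavior; callers pass
    nemo_gym_head_port, nemo_gym_model_name, and nemo_gym_verify_timeout
    through these parameters.
    """
    url = f"http://localhost:{head_port}/verify"
    payload = {"model": model_name, "response": completion}
    return url, payload, timeout


# A run that overrides all three settings now reaches the right backend:
url, payload, timeout = build_verify_request(
    "42", head_port=12000, model_name="my-model", timeout=120.0
)
print(url)  # http://localhost:12000/verify
```

The returned `timeout` would then be passed to the actual `requests.post(..., timeout=timeout)` call so slow verifies no longer fall back to 0.0.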

Comment on lines +101 to +108
config_str = ",".join(config_paths)
_ng_log_file = open(os.path.join(gym_dir, "ng_run.log"), "w") # noqa: SIM115
_ng_process = subprocess.Popen( # nosec
[
"bash",
"-c",
f'source .venv/bin/activate && ng_run "+config_paths=[{config_str}]"',
],

⚠️ Potential issue | 🟠 Major

Build the ng_run command without bash -c.

config_paths comes from config and is spliced into a shell string here. That makes paths with spaces or quotes brittle, and shell metacharacters become executable input. Calling the venv’s ng_run binary directly avoids both problems.

Suggested fix
     config_str = ",".join(config_paths)
     _ng_log_file = open(os.path.join(gym_dir, "ng_run.log"), "w")  # noqa: SIM115
+    ng_run = os.path.join(gym_dir, ".venv", "bin", "ng_run")
     _ng_process = subprocess.Popen(  # nosec
-        [
-            "bash",
-            "-c",
-            f'source .venv/bin/activate && ng_run "+config_paths=[{config_str}]"',
-        ],
+        [ng_run, f"+config_paths=[{config_str}]"],
         cwd=gym_dir,
         stdout=_ng_log_file,
         stderr=subprocess.STDOUT,
     )
🧰 Tools
🪛 Ruff (0.15.6)

[error] 103-103: subprocess call: check for execution of untrusted input

(S603)


[error] 104-108: Starting a process with a partial executable path

(S607)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/integrations/nemo_gym/server.py` around lines 101 - 108, The
current code builds a shell command using "bash -c" with a spliced config_paths
string (vulnerable to spaces/quotes and shell injection); instead construct a
subprocess argument list and invoke the venv's ng_run binary directly so no
shell interpretation occurs. Locate where _ng_process is created and replace the
["bash", "-c", ...] invocation by computing the ng_run executable from the venv
under gym_dir (the same venv used earlier), then call subprocess.Popen with
shell=False and an argument list that passes config paths as separate arguments
(or a single safely-constructed argument) rather than interpolating them into a
shell string; keep _ng_log_file for stdout/stderr and preserve existing
nosec/permission handling.

Comment on lines +186 to +200
def get_server_base_url(global_config: dict, server_name: str) -> str:
"""Get the base URL for a given resource server."""
try:
srv_cfg = global_config[server_name]["resources_servers"][server_name]
return f"http://{srv_cfg['host']}:{srv_cfg['port']}"
except (KeyError, TypeError) as exc:
raise ValueError(
f"Could not find resource server config for '{server_name}' in NeMo Gym. "
f"Available servers: {list(global_config.keys())}"
) from exc


def get_verify_endpoint(global_config: dict, server_name: str) -> str:
"""Get the /verify endpoint URL for a given resource server."""
return f"{get_server_base_url(global_config, server_name)}/verify"

⚠️ Potential issue | 🟠 Major

Normalize bind-all and loopback hosts for resource servers too.

get_server_base_url() and wait_for_resource_servers() use the raw host from NeMo Gym config. If a resource server advertises 0.0.0.0 or localhost, the generated /verify URL can be unroutable and the readiness probe can time out even though the service is up. This should mirror the host-rewrite logic already used in get_agent_servers().

Also applies to: 211-215

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/integrations/nemo_gym/server.py` around lines 186 - 200, Update
get_server_base_url (and thus get_verify_endpoint) to normalize bind-all and
loopback addresses the same way get_agent_servers does: if srv_cfg['host'] is
"0.0.0.0" replace with the external host (e.g., global_config["host"] or the
agent/cluster-facing host used elsewhere) and if it is "localhost" or
"127.0.0.1" map it to the machine-facing hostname/IP used for reachability;
ensure wait_for_resource_servers uses this normalized URL for its readiness
probes. Locate get_server_base_url and get_verify_endpoint and apply the same
host-rewrite logic/utility used by get_agent_servers (or extract that
normalization into a helper and reuse it) so /verify URLs are routable when
services advertise 0.0.0.0 or localhost.
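
The extracted helper could look like this sketch. The rewrite targets (bind-all to an external host, loopback aliases collapsed) follow the comment above; the function names and the `external_host` parameter are assumptions, not the PR's actual API:

```python
def normalize_host(host: str, external_host: str = "127.0.0.1") -> str:
    # 0.0.0.0 is a bind address, not a routable destination; rewrite it
    # to the host clients should actually reach.
    if host == "0.0.0.0":  # nosec B104 - comparison, not a bind
        return external_host
    # Collapse loopback aliases to one canonical form.
    if host in ("localhost", "127.0.0.1"):
        return "127.0.0.1"
    return host


def server_base_url(srv_cfg: dict, external_host: str = "127.0.0.1") -> str:
    host = normalize_host(srv_cfg["host"], external_host)
    return f"http://{host}:{srv_cfg['port']}"


print(server_base_url({"host": "0.0.0.0", "port": 8000}, external_host="10.0.0.5"))
# http://10.0.0.5:8000
```

Both `get_server_base_url()` and the readiness probe in `wait_for_resource_servers()` would call the same helper, so the `/verify` URL and the probe target stay consistent.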

Comment on lines +520 to +528
"usage": {
"prompt_tokens": len(all_outputs[0].prompt_token_ids)
if all_outputs
else 0,
"completion_tokens": sum(
len(out.token_ids) for o in all_outputs for out in o.outputs
),
"total_tokens": 0,
},

⚠️ Potential issue | 🟡 Minor

Populate usage.total_tokens from the actual counts.

Line 527 hardcodes total_tokens to 0, so any downstream accounting that reads the usage block will undercount every response.

Suggested fix
         return {
             "id": f"chatcmpl-{uuid.uuid4().hex[:8]}",
             "object": "chat.completion",
             "model": script_args.model,
             "choices": choices,
             "usage": {
                 "prompt_tokens": len(all_outputs[0].prompt_token_ids)
                 if all_outputs
                 else 0,
                 "completion_tokens": sum(
                     len(out.token_ids) for o in all_outputs for out in o.outputs
                 ),
-                "total_tokens": 0,
+                "total_tokens": (
+                    (len(all_outputs[0].prompt_token_ids) if all_outputs else 0)
+                    + sum(len(out.token_ids) for o in all_outputs for out in o.outputs)
+                ),
             },
         }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/axolotl/scripts/vllm_serve_lora.py` around lines 520 - 528, The
usage.total_tokens is hardcoded to 0; compute it from the actual counts instead.
Inside the usage dict in vllm_serve_lora.py (the block that builds "usage"),
assign prompt_tokens = len(all_outputs[0].prompt_token_ids) if all_outputs else
0, compute completion_tokens by summing len(out.token_ids) for o in all_outputs
for out in o.outputs, and set total_tokens = prompt_tokens + completion_tokens
so the "usage" object reflects the real totals.

@winglian force-pushed the nemo-gym-integration branch from f717e46 to 64746db on March 22, 2026 at 17:56
Labels

scheduled_release This PR is slated for the upcoming release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants