Skip to content

fix: colocated.resources.gpus_per_node is now required for colocated setups#1273

Merged
terrykong merged 4 commits intomainfrom
tk/noncolocated-grpo-test-typo
Oct 5, 2025
Merged

fix: colocated.resources.gpus_per_node is now required for colocated setups#1273
terrykong merged 4 commits intomainfrom
tk/noncolocated-grpo-test-typo

Conversation

@terrykong
Copy link
Copy Markdown
Collaborator

@terrykong terrykong commented Oct 4, 2025

Initially I set out to fix the failing config (examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml) since it was now missing colocated.resources.gpus_per_node, which is now mandatory.

I also cleaned up the code to properly give an error for this improper configuration and added tests for it

Summary by CodeRabbit

  • Bug Fixes

    • Enforce explicit gpus_per_node for non-colocated inference. Single-node requires >0; multi-node must equal cluster gpus_per_node. Improved, clearer error messages on misconfiguration.
  • Documentation

    • Updated example recipe to include gpus_per_node under colocated resources to clarify GPU allocation per node.
  • Tests

    • Added unit tests for single- and multi-node scenarios verifying the explicit gpus_per_node requirement and corresponding error messages.

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong requested review from a team as code owners October 4, 2025 07:09
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests r0.4.0 labels Oct 4, 2025
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Oct 4, 2025

📝 Walkthrough

Walkthrough

The PR tightens validation for non-colocated inference GPU allocation by requiring an explicit gpus_per_node value matching cluster gpus_per_node in multi-node cases and >0 in single-node cases. It updates error messages, removes implicit defaults, adds corresponding unit tests, and adds gpus_per_node to a vLLM colocated config example.

Changes

Cohort / File(s) Summary
Algorithms: validation tightening
nemo_rl/algorithms/distillation.py, nemo_rl/algorithms/grpo.py
Enforce explicit inference_gpus_per_node: for single-node, must be >0; for multi-node, must equal cluster gpus_per_node. Remove implicit defaults and update error messages with actual vs expected values.
Unit tests: non-colocated inference requirements
tests/unit/algorithms/test_distillation.py, tests/unit/algorithms/test_grpo.py
Add tests asserting failures when gpus_per_node is None for non-colocated inference in single-node and multi-node setups, checking specific error messages.
Config example: colocated resources
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml
Add policy.vllm_cfg.colocated.resources.gpus_per_node: 8 to the example config.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant Trainer as Trainer.setup()
  participant Algo as GRPO/Distillation.setup
  participant Cfg as master_config

  User->>Trainer: start()
  Trainer->>Algo: setup(master_config)
  Alt non-colocated inference
    Algo->>Cfg: read cluster.num_nodes, cluster.gpus_per_node, policy.generation.colocated.resources.gpus_per_node
    Alt cluster.num_nodes == 1
      Note over Algo: Require explicit gpus_per_node > 0
      Alt gpus_per_node valid
        Algo-->>Trainer: proceed
      Else invalid/missing
        Algo-->>User: AssertionError (must be explicitly set > 0)
      End
    Else cluster.num_nodes > 1
      Note over Algo: Require explicit gpus_per_node == cluster.gpus_per_node
      Alt equals
        Algo-->>Trainer: proceed
      Else mismatch/missing
        Algo-->>User: AssertionError (actual vs expected)
      End
    End
  Else colocated inference
    Algo-->>Trainer: proceed (unchanged path)
  End
Loading

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

CI:L0

Suggested reviewers

  • chtruong814

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Test Results For Major Changes ⚠️ Warning After reviewing PR #1273’s description via gh pr view, the summary notes the configuration fix, code cleanup, and newly added tests but does not report any executed test results or evidence of testing. Given that the change tightens configuration validation and enforces a previously optional parameter—which constitutes a notable behavioral change—the absence of documented test execution means the requirement for major changes is not satisfied. Please update the PR description to include the relevant test results or other verification evidence demonstrating the change has been validated, then rerun this check.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title succinctly highlights the core change—making the colocated.resources.gpus_per_node field mandatory—and directly reflects the adjustments in both configuration and code enforcement, using clear and specific language without extraneous detail.
Docstring Coverage ✅ Passed Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch tk/noncolocated-grpo-test-typo

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
tests/unit/algorithms/test_distillation.py (2)

354-416: Test correctly validates explicit gpus_per_node requirement.

The test properly verifies that non-colocated inference requires explicit gpus_per_node configuration when cluster.num_nodes=1.

Address static analysis warnings by removing unused mock assignment and escaping the regex pattern:

     with (
-        patch("nemo_rl.algorithms.distillation.Logger") as mock_logger,
+        patch("nemo_rl.algorithms.distillation.Logger"),
         patch("nemo_rl.algorithms.distillation.CheckpointManager") as mock_checkpointer,
         patch("nemo_rl.algorithms.distillation.StatefulDataLoader"),
         pytest.raises(
             AssertionError,
-            match="policy.generation.colocated.resources.gpus_per_node must be explicitly set",
+            match=r"policy\.generation\.colocated\.resources\.gpus_per_node must be explicitly set",
         ),
     ):

418-479: Test correctly validates explicit gpus_per_node requirement.

The test properly verifies that non-colocated inference requires explicit gpus_per_node configuration when cluster.num_nodes>1.

Address static analysis warnings by removing unused mock assignment and escaping the regex pattern:

     with (
-        patch("nemo_rl.algorithms.distillation.Logger") as mock_logger,
+        patch("nemo_rl.algorithms.distillation.Logger"),
         patch("nemo_rl.algorithms.distillation.CheckpointManager") as mock_checkpointer,
         patch("nemo_rl.algorithms.distillation.StatefulDataLoader"),
         pytest.raises(
             AssertionError,
-            match="policy.generation.colocated.resources.gpus_per_node must be explicitly set",
+            match=r"policy\.generation\.colocated\.resources\.gpus_per_node must be explicitly set",
         ),
     ):
tests/unit/algorithms/test_grpo.py (2)

215-269: Test correctly validates explicit gpus_per_node requirement.

The test properly verifies that non-colocated inference requires explicit gpus_per_node configuration when policy_nodes=1.

Address static analysis warnings by removing unused mock assignment and escaping the regex pattern:

     with (
-        patch("nemo_rl.algorithms.grpo.Logger") as mock_logger,
+        patch("nemo_rl.algorithms.grpo.Logger"),
         patch("nemo_rl.algorithms.grpo.CheckpointManager") as mock_checkpointer,
         patch("nemo_rl.algorithms.grpo.StatefulDataLoader"),
         pytest.raises(
             AssertionError,
-            match="policy.generation.colocated.resources.gpus_per_node must be explicitly set",
+            match=r"policy\.generation\.colocated\.resources\.gpus_per_node must be explicitly set",
         ),
     ):

271-324: Test correctly validates explicit gpus_per_node requirement.

The test properly verifies that non-colocated inference requires explicit gpus_per_node configuration when policy_nodes>1.

Address static analysis warnings by removing unused mock assignment and escaping the regex pattern:

     with (
-        patch("nemo_rl.algorithms.grpo.Logger") as mock_logger,
+        patch("nemo_rl.algorithms.grpo.Logger"),
         patch("nemo_rl.algorithms.grpo.CheckpointManager") as mock_checkpointer,
         patch("nemo_rl.algorithms.grpo.StatefulDataLoader"),
         pytest.raises(
             AssertionError,
-            match="policy.generation.colocated.resources.gpus_per_node must be explicitly set",
+            match=r"policy\.generation\.colocated\.resources\.gpus_per_node must be explicitly set",
         ),
     ):
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d0e203c and 2b16d3a.

📒 Files selected for processing (5)
  • examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml (1 hunks)
  • nemo_rl/algorithms/distillation.py (2 hunks)
  • nemo_rl/algorithms/grpo.py (2 hunks)
  • tests/unit/algorithms/test_distillation.py (1 hunks)
  • tests/unit/algorithms/test_grpo.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
examples/configs/recipes/**/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

examples/configs/recipes/**/*.yaml: Recipe YAMLs under examples/configs/recipes/** are runnable snapshots and may omit documentation
When adding support for a new model, add a recipe YAML under examples/configs/recipes/ in the appropriate domain (llm/ or vlm/) with the correct name

Files:

  • examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml
examples/configs/recipes/llm/*.yaml

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

LLM recipe YAML filenames must follow: --ng-[-modifiers][-long][.vN].yaml

Files:

  • examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml
examples/configs/recipes/**/*.{yaml,sh}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Known exception: Deepscaler recipes may encode context length in place of the cluster tuple (e.g., grpo-deepscaler-1.5b-8K.*); allowed but document intended hardware in the script

Files:

  • examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml
examples/configs/recipes/**

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

Place recipe YAMLs under examples/configs/recipes//

Files:

  • examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts

Files:

  • tests/unit/algorithms/test_distillation.py
  • nemo_rl/algorithms/distillation.py
  • tests/unit/algorithms/test_grpo.py
  • nemo_rl/algorithms/grpo.py
nemo_rl/**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)

Files:

  • nemo_rl/algorithms/distillation.py
  • nemo_rl/algorithms/grpo.py
🧬 Code graph analysis (2)
tests/unit/algorithms/test_distillation.py (2)
tests/unit/algorithms/test_grpo.py (2)
  • test_noncolocated_inference_requires_explicit_gpus_per_node_single_node (215-268)
  • test_noncolocated_inference_requires_explicit_gpus_per_node_multi_node (271-324)
nemo_rl/algorithms/distillation.py (1)
  • setup (150-450)
tests/unit/algorithms/test_grpo.py (2)
tests/unit/algorithms/test_distillation.py (2)
  • test_noncolocated_inference_requires_explicit_gpus_per_node_single_node (354-415)
  • test_noncolocated_inference_requires_explicit_gpus_per_node_multi_node (418-479)
nemo_rl/algorithms/grpo.py (1)
  • setup (144-462)
🪛 Ruff (0.13.3)
tests/unit/algorithms/test_distillation.py

405-405: Local variable mock_logger is assigned to but never used

Remove assignment to unused variable mock_logger

(F841)


410-410: Pattern passed to match= contains metacharacters but is neither escaped nor raw

(RUF043)


469-469: Local variable mock_logger is assigned to but never used

Remove assignment to unused variable mock_logger

(F841)


474-474: Pattern passed to match= contains metacharacters but is neither escaped nor raw

(RUF043)

tests/unit/algorithms/test_grpo.py

258-258: Local variable mock_logger is assigned to but never used

Remove assignment to unused variable mock_logger

(F841)


263-263: Pattern passed to match= contains metacharacters but is neither escaped nor raw

(RUF043)


314-314: Local variable mock_logger is assigned to but never used

Remove assignment to unused variable mock_logger

(F841)


319-319: Pattern passed to match= contains metacharacters but is neither escaped nor raw

(RUF043)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Lint check
  • GitHub Check: Lint check
  • GitHub Check: Post automodel integration comment / Comment on PR
  • GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (5)
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml (1)

45-45: LGTM!

The addition of gpus_per_node: 8 correctly aligns with the new explicit configuration requirement for non-colocated setups and matches the cluster-level gpus_per_node value.

nemo_rl/algorithms/grpo.py (2)

308-312: Validation logic correctly enforces explicit configuration.

The assertion now requires inference_gpus_per_node to be explicitly set and greater than 0 for single-node non-colocated setups, preventing implicit defaults and making configuration intent clear.


341-345: Validation logic correctly enforces explicit configuration and exact equality.

The assertion now requires inference_gpus_per_node to be explicitly set and exactly equal to cluster.gpus_per_node for multi-node non-colocated setups. The error message helpfully includes both actual and expected values for debugging.

nemo_rl/algorithms/distillation.py (2)

307-311: Validation logic correctly enforces explicit configuration.

The assertion now requires inference_gpus_per_node to be explicitly set and greater than 0 for single-node non-colocated distillation setups, aligning with the GRPO implementation and preventing implicit defaults.


325-329: Validation logic correctly enforces explicit configuration and exact equality.

The assertion now requires inference_gpus_per_node to be explicitly set and exactly equal to cluster.gpus_per_node for multi-node non-colocated distillation setups. The error message helpfully includes both actual and expected values for debugging, consistent with the GRPO implementation.

Signed-off-by: Terry Kong <terryk@nvidia.com>
@terrykong terrykong added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Oct 5, 2025
@terrykong terrykong enabled auto-merge (squash) October 5, 2025 02:28
@terrykong terrykong merged commit 7aabb81 into main Oct 5, 2025
40 of 41 checks passed
@terrykong terrykong deleted the tk/noncolocated-grpo-test-typo branch October 5, 2025 07:26
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
…setups (NVIDIA-NeMo#1273)

Signed-off-by: Terry Kong <terryk@nvidia.com>
yuanhangsu1986 pushed a commit to yuanhangsu1986/RL-Nemontron-Edge-Omni that referenced this pull request Feb 21, 2026
…setups (NVIDIA-NeMo#1273)

Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:L1 Run doctests, unit tests, and functional tests r0.4.0

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants