fix: colocated.resources.gpus_per_node is now required for colocated setups by terrykong · Pull Request #1273 · NVIDIA-NeMo/RL

terrykong · 2025-10-04T07:09:43Z

I initially set out to fix the failing config (examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml), which was missing the now-mandatory colocated.resources.gpus_per_node.

I also cleaned up the code so that it raises a proper error for this misconfiguration, and added tests for it.

Summary by CodeRabbit

Bug Fixes
- Enforce explicit gpus_per_node for non-colocated inference. Single-node requires >0; multi-node must equal cluster gpus_per_node. Improved, clearer error messages on misconfiguration.
Documentation
- Updated example recipe to include gpus_per_node under colocated resources to clarify GPU allocation per node.
Tests
- Added unit tests for single- and multi-node scenarios verifying the explicit gpus_per_node requirement and corresponding error messages.

Signed-off-by: Terry Kong <terryk@nvidia.com>

coderabbitai · 2025-10-04T07:13:27Z

📝 Walkthrough

The PR tightens validation for non-colocated inference GPU allocation by requiring an explicit gpus_per_node value matching cluster gpus_per_node in multi-node cases and >0 in single-node cases. It updates error messages, removes implicit defaults, adds corresponding unit tests, and adds gpus_per_node to a vLLM colocated config example.
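
The rule described above can be sketched as a small standalone check. This is a minimal illustration under stated assumptions, not the code in `nemo_rl/algorithms/grpo.py` or `distillation.py`: the function name, its parameters, and the exact message wording are hypothetical; only the key name, the `AssertionError`, and the two conditions come from the PR discussion.

```python
from typing import Optional

# Key name as it appears in the error messages quoted in this review.
KEY = "policy.generation.colocated.resources.gpus_per_node"


def check_inference_gpus_per_node(
    num_nodes: int,
    cluster_gpus_per_node: int,
    inference_gpus_per_node: Optional[int],
) -> int:
    """Hypothetical helper: validate the non-colocated inference GPU count."""
    if num_nodes == 1:
        # Single node: the value must be given explicitly and be positive.
        assert inference_gpus_per_node is not None and inference_gpus_per_node > 0, (
            f"{KEY} must be explicitly set to a value > 0, "
            f"got {inference_gpus_per_node}"
        )
    else:
        # Multi-node: the value must be given and match the cluster setting.
        assert inference_gpus_per_node == cluster_gpus_per_node, (
            f"{KEY} must be explicitly set to {cluster_gpus_per_node} "
            f"(cluster.gpus_per_node), got {inference_gpus_per_node}"
        )
    return inference_gpus_per_node
```

With this sketch, `check_inference_gpus_per_node(2, 8, None)` raises an `AssertionError` that reports both the expected and the actual value, rather than silently falling back to a default.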

Changes

| Cohort / File(s) | Summary |
| --- | --- |
| Algorithms: validation tightening (`nemo_rl/algorithms/distillation.py`, `nemo_rl/algorithms/grpo.py`) | Enforce explicit inference_gpus_per_node: for single-node, must be >0; for multi-node, must equal cluster gpus_per_node. Remove implicit defaults and update error messages with actual vs expected values. |
| Unit tests: non-colocated inference requirements (`tests/unit/algorithms/test_distillation.py`, `tests/unit/algorithms/test_grpo.py`) | Add tests asserting failures when gpus_per_node is None for non-colocated inference in single-node and multi-node setups, checking specific error messages. |
| Config example: colocated resources (`examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml`) | Add policy.generation.colocated.resources.gpus_per_node: 8 to the example config. |

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant Trainer as Trainer.setup()
  participant Algo as GRPO/Distillation.setup
  participant Cfg as master_config

  User->>Trainer: start()
  Trainer->>Algo: setup(master_config)
  Alt non-colocated inference
    Algo->>Cfg: read cluster.num_nodes, cluster.gpus_per_node, policy.generation.colocated.resources.gpus_per_node
    Alt cluster.num_nodes == 1
      Note over Algo: Require explicit gpus_per_node > 0
      Alt gpus_per_node valid
        Algo-->>Trainer: proceed
      Else invalid/missing
        Algo-->>User: AssertionError (must be explicitly set > 0)
      End
    Else cluster.num_nodes > 1
      Note over Algo: Require explicit gpus_per_node == cluster.gpus_per_node
      Alt equals
        Algo-->>Trainer: proceed
      Else mismatch/missing
        Algo-->>User: AssertionError (actual vs expected)
      End
    End
  Else colocated inference
    Algo-->>Trainer: proceed (unchanged path)
  End
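
The three config reads in the diagram can be sketched as a dotted-path lookup over a nested dict. This is only an illustration: `get_dotted` is a hypothetical helper, and the dict below contains just the keys named in the diagram, with values taken from the 2-node, 8-GPU example recipe. It says nothing about how `master_config` is actually represented in NeMo RL.

```python
from functools import reduce
from typing import Any


def get_dotted(config: dict, path: str) -> Any:
    """Hypothetical helper: walk a nested dict along a dotted key path.

    Returns None if any key along the path is missing.
    """
    def step(node: Any, key: str) -> Any:
        return node.get(key) if isinstance(node, dict) else None

    return reduce(step, path.split("."), config)


# Illustrative config holding only the keys read in the diagram.
master_config = {
    "cluster": {"num_nodes": 2, "gpus_per_node": 8},
    "policy": {
        "generation": {"colocated": {"resources": {"gpus_per_node": 8}}}
    },
}

num_nodes = get_dotted(master_config, "cluster.num_nodes")
cluster_gpus = get_dotted(master_config, "cluster.gpus_per_node")
inference_gpus = get_dotted(
    master_config, "policy.generation.colocated.resources.gpus_per_node"
)

# A key that was never set comes back as None, which is the case the
# new validation is meant to reject instead of defaulting.
missing = get_dotted({"cluster": {"num_nodes": 2}}, "policy.generation.colocated")
```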

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

CI:L0

Suggested reviewers

chtruong814

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Test Results For Major Changes | ⚠️ Warning | After reviewing PR #1273’s description via `gh pr view`, the summary notes the configuration fix, code cleanup, and newly added tests but does not report any executed test results or evidence of testing. Given that the change tightens configuration validation and enforces a previously optional parameter (a notable behavioral change), the absence of documented test execution means the requirement for major changes is not satisfied. | Please update the PR description to include the relevant test results or other verification evidence demonstrating the change has been validated, then rerun this check. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title Check | ✅ Passed | The title succinctly highlights the core change (making the colocated.resources.gpus_per_node field mandatory) and directly reflects the adjustments in both configuration and code enforcement, using clear and specific language without extraneous detail. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 87.50% which is sufficient. The required threshold is 80.00%. |

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (4)

tests/unit/algorithms/test_distillation.py (2)
354-416: Test correctly validates explicit gpus_per_node requirement.

The test properly verifies that non-colocated inference requires explicit gpus_per_node configuration when cluster.num_nodes=1.

Address static analysis warnings by removing unused mock assignment and escaping the regex pattern:
     with (
-        patch("nemo_rl.algorithms.distillation.Logger") as mock_logger,
+        patch("nemo_rl.algorithms.distillation.Logger"),
         patch("nemo_rl.algorithms.distillation.CheckpointManager") as mock_checkpointer,
         patch("nemo_rl.algorithms.distillation.StatefulDataLoader"),
         pytest.raises(
             AssertionError,
-            match="policy.generation.colocated.resources.gpus_per_node must be explicitly set",
+            match=r"policy\.generation\.colocated\.resources\.gpus_per_node must be explicitly set",
         ),
     ):
418-479: Test correctly validates explicit gpus_per_node requirement.

The test properly verifies that non-colocated inference requires explicit gpus_per_node configuration when cluster.num_nodes>1.

Address static analysis warnings by removing unused mock assignment and escaping the regex pattern:
     with (
-        patch("nemo_rl.algorithms.distillation.Logger") as mock_logger,
+        patch("nemo_rl.algorithms.distillation.Logger"),
         patch("nemo_rl.algorithms.distillation.CheckpointManager") as mock_checkpointer,
         patch("nemo_rl.algorithms.distillation.StatefulDataLoader"),
         pytest.raises(
             AssertionError,
-            match="policy.generation.colocated.resources.gpus_per_node must be explicitly set",
+            match=r"policy\.generation\.colocated\.resources\.gpus_per_node must be explicitly set",
         ),
     ):
tests/unit/algorithms/test_grpo.py (2)
215-269: Test correctly validates explicit gpus_per_node requirement.

The test properly verifies that non-colocated inference requires explicit gpus_per_node configuration when policy_nodes=1.

Address static analysis warnings by removing unused mock assignment and escaping the regex pattern:
     with (
-        patch("nemo_rl.algorithms.grpo.Logger") as mock_logger,
+        patch("nemo_rl.algorithms.grpo.Logger"),
         patch("nemo_rl.algorithms.grpo.CheckpointManager") as mock_checkpointer,
         patch("nemo_rl.algorithms.grpo.StatefulDataLoader"),
         pytest.raises(
             AssertionError,
-            match="policy.generation.colocated.resources.gpus_per_node must be explicitly set",
+            match=r"policy\.generation\.colocated\.resources\.gpus_per_node must be explicitly set",
         ),
     ):
271-324: Test correctly validates explicit gpus_per_node requirement.

The test properly verifies that non-colocated inference requires explicit gpus_per_node configuration when policy_nodes>1.

Address static analysis warnings by removing unused mock assignment and escaping the regex pattern:
     with (
-        patch("nemo_rl.algorithms.grpo.Logger") as mock_logger,
+        patch("nemo_rl.algorithms.grpo.Logger"),
         patch("nemo_rl.algorithms.grpo.CheckpointManager") as mock_checkpointer,
         patch("nemo_rl.algorithms.grpo.StatefulDataLoader"),
         pytest.raises(
             AssertionError,
-            match="policy.generation.colocated.resources.gpus_per_node must be explicitly set",
+            match=r"policy\.generation\.colocated\.resources\.gpus_per_node must be explicitly set",
         ),
     ):
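
The escaping suggestion matters because `pytest.raises(match=...)` applies the pattern with `re.search`, so an unescaped `.` matches any character. A minimal sketch using only the standard `re` module; the `lookalike` string is a made-up message for illustration:

```python
import re

message = "policy.generation.colocated.resources.gpus_per_node must be explicitly set"

# Pattern as originally written in the tests: every "." is a wildcard.
unescaped = "policy.generation.colocated.resources.gpus_per_node must be explicitly set"
# Pattern after the suggested fix: every "." must match a literal dot.
escaped = r"policy\.generation\.colocated\.resources\.gpus_per_node must be explicitly set"

# Made-up message with the dots replaced; a strict test should reject it.
lookalike = message.replace(".", "_")

matches_real_unescaped = re.search(unescaped, message) is not None
matches_fake_unescaped = re.search(unescaped, lookalike) is not None  # false positive
matches_real_escaped = re.search(escaped, message) is not None
matches_fake_escaped = re.search(escaped, lookalike) is not None

# re.escape builds an equivalent literal pattern without hand-written backslashes.
matches_real_auto = re.search(re.escape(message), message) is not None
```

Here the unescaped pattern accepts both strings, while the escaped pattern accepts only the real message. For the current tests the difference is unlikely to change the outcome, but the escaped form states the intent precisely and silences the static analysis warning.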

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between d0e203c and 2b16d3a.

📒 Files selected for processing (5)

examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml (1 hunks)
nemo_rl/algorithms/distillation.py (2 hunks)
nemo_rl/algorithms/grpo.py (2 hunks)
tests/unit/algorithms/test_distillation.py (1 hunks)
tests/unit/algorithms/test_grpo.py (1 hunks)

🧰 Additional context used

📓 Path-based instructions (6)

examples/configs/recipes/**/*.yaml