fix: colocated.resources.gpus_per_node is now required for colocated setups#1273
fix: colocated.resources.gpus_per_node is now required for colocated setups#1273
Conversation
Signed-off-by: Terry Kong <terryk@nvidia.com>
Signed-off-by: Terry Kong <terryk@nvidia.com>
📝 WalkthroughWalkthroughThe PR tightens validation for non-colocated inference GPU allocation by requiring an explicit gpus_per_node value matching cluster gpus_per_node in multi-node cases and >0 in single-node cases. It updates error messages, removes implicit defaults, adds corresponding unit tests, and adds gpus_per_node to a vLLM colocated config example. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant User
participant Trainer as Trainer.setup()
participant Algo as GRPO/Distillation.setup
participant Cfg as master_config
User->>Trainer: start()
Trainer->>Algo: setup(master_config)
Alt non-colocated inference
Algo->>Cfg: read cluster.num_nodes, cluster.gpus_per_node, policy.generation.colocated.resources.gpus_per_node
Alt cluster.num_nodes == 1
Note over Algo: Require explicit gpus_per_node > 0
Alt gpus_per_node valid
Algo-->>Trainer: proceed
Else invalid/missing
Algo-->>User: AssertionError (must be explicitly set > 0)
End
Else cluster.num_nodes > 1
Note over Algo: Require explicit gpus_per_node == cluster.gpus_per_node
Alt equals
Algo-->>Trainer: proceed
Else mismatch/missing
Algo-->>User: AssertionError (actual vs expected)
End
End
Else colocated inference
Algo-->>Trainer: proceed (unchanged path)
End
Estimated code review effort🎯 2 (Simple) | ⏱️ ~10 minutes Suggested labels
Suggested reviewers
Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 0
🧹 Nitpick comments (4)
tests/unit/algorithms/test_distillation.py (2)
354-416: Test correctly validates explicit gpus_per_node requirement.The test properly verifies that non-colocated inference requires explicit
gpus_per_nodeconfiguration whencluster.num_nodes=1.Address static analysis warnings by removing unused mock assignment and escaping the regex pattern:
with ( - patch("nemo_rl.algorithms.distillation.Logger") as mock_logger, + patch("nemo_rl.algorithms.distillation.Logger"), patch("nemo_rl.algorithms.distillation.CheckpointManager") as mock_checkpointer, patch("nemo_rl.algorithms.distillation.StatefulDataLoader"), pytest.raises( AssertionError, - match="policy.generation.colocated.resources.gpus_per_node must be explicitly set", + match=r"policy\.generation\.colocated\.resources\.gpus_per_node must be explicitly set", ), ):
418-479: Test correctly validates explicit gpus_per_node requirement.The test properly verifies that non-colocated inference requires explicit
gpus_per_nodeconfiguration whencluster.num_nodes>1.Address static analysis warnings by removing unused mock assignment and escaping the regex pattern:
with ( - patch("nemo_rl.algorithms.distillation.Logger") as mock_logger, + patch("nemo_rl.algorithms.distillation.Logger"), patch("nemo_rl.algorithms.distillation.CheckpointManager") as mock_checkpointer, patch("nemo_rl.algorithms.distillation.StatefulDataLoader"), pytest.raises( AssertionError, - match="policy.generation.colocated.resources.gpus_per_node must be explicitly set", + match=r"policy\.generation\.colocated\.resources\.gpus_per_node must be explicitly set", ), ):tests/unit/algorithms/test_grpo.py (2)
215-269: Test correctly validates explicit gpus_per_node requirement.The test properly verifies that non-colocated inference requires explicit
gpus_per_nodeconfiguration whenpolicy_nodes=1.Address static analysis warnings by removing unused mock assignment and escaping the regex pattern:
with ( - patch("nemo_rl.algorithms.grpo.Logger") as mock_logger, + patch("nemo_rl.algorithms.grpo.Logger"), patch("nemo_rl.algorithms.grpo.CheckpointManager") as mock_checkpointer, patch("nemo_rl.algorithms.grpo.StatefulDataLoader"), pytest.raises( AssertionError, - match="policy.generation.colocated.resources.gpus_per_node must be explicitly set", + match=r"policy\.generation\.colocated\.resources\.gpus_per_node must be explicitly set", ), ):
271-324: Test correctly validates explicit gpus_per_node requirement.The test properly verifies that non-colocated inference requires explicit
gpus_per_nodeconfiguration whenpolicy_nodes>1.Address static analysis warnings by removing unused mock assignment and escaping the regex pattern:
with ( - patch("nemo_rl.algorithms.grpo.Logger") as mock_logger, + patch("nemo_rl.algorithms.grpo.Logger"), patch("nemo_rl.algorithms.grpo.CheckpointManager") as mock_checkpointer, patch("nemo_rl.algorithms.grpo.StatefulDataLoader"), pytest.raises( AssertionError, - match="policy.generation.colocated.resources.gpus_per_node must be explicitly set", + match=r"policy\.generation\.colocated\.resources\.gpus_per_node must be explicitly set", ), ):
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (5)
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml(1 hunks)nemo_rl/algorithms/distillation.py(2 hunks)nemo_rl/algorithms/grpo.py(2 hunks)tests/unit/algorithms/test_distillation.py(1 hunks)tests/unit/algorithms/test_grpo.py(1 hunks)
🧰 Additional context used
📓 Path-based instructions (6)
examples/configs/recipes/**/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
examples/configs/recipes/**/*.yaml: Recipe YAMLs under examples/configs/recipes/** are runnable snapshots and may omit documentation
When adding support for a new model, add a recipe YAML under examples/configs/recipes/ in the appropriate domain (llm/ or vlm/) with the correct name
Files:
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml
examples/configs/recipes/llm/*.yaml
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
LLM recipe YAML filenames must follow: --ng-[-modifiers][-long][.vN].yaml
Files:
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml
examples/configs/recipes/**/*.{yaml,sh}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Known exception: Deepscaler recipes may encode context length in place of the cluster tuple (e.g., grpo-deepscaler-1.5b-8K.*); allowed but document intended hardware in the script
Files:
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml
examples/configs/recipes/**
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
Place recipe YAMLs under examples/configs/recipes//
Files:
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Follow the Google Python Style Guide for all Python code
Target Python 3.12+ for all Python code in NeMo-RL
Indent Python code with 4 spaces; do not use tabs
Python filenames should be snake_case (e.g., some_file.py)
Class names should be PascalCase
Function and method names should be snake_case
Local variable names should be snake_case; if starting with a number, prefix with k (e.g., k_99th_percentile)
Global variables should be UPPER_SNAKE_CASE and prefixed with G_ (e.g., G_MY_GLOBAL)
Constants should be UPPER_SNAKE_CASE
Avoid shadowing variables declared in an outer scope
Initialize all externally visible members of a class in the constructor
For public interfaces used outside a file, prefer docstrings over comments
Use comments mainly for code within a function or interfaces local to a file
Commented-out code must include a nearby comment explaining usage and why it is commented out; otherwise remove before merging
Use Google-style docstrings for classes and functions (Sphinx-parseable)
Avoid using reflection when functionality can be easily achieved without it
Limit except clauses to the smallest specific set of exceptions possible
For duck-typing via try/except, keep the try body minimal and use else for main logic
Add the NVIDIA copyright header (with current year) at the top of all Python files, excluding tests/ and test-only scripts
Files:
tests/unit/algorithms/test_distillation.pynemo_rl/algorithms/distillation.pytests/unit/algorithms/test_grpo.pynemo_rl/algorithms/grpo.py
nemo_rl/**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
nemo_rl/**/*.py: Do not set non-None configuration defaults in code; YAML is the single source of truth for defaults
Access required config attributes directly (e.g., policy_cfg["precision"]) and assume presence; do not introduce hidden defaults
Express configuration optionality via TypedDict using typing.NotRequired
When adding a new config key to a TypedDict subclass, document the key’s purpose, valid values/types, and recommended default in code
For any class or function decorated with @ray.remote, add '# pragma: no cover' on the class/def line (and on remote functions)
Files:
nemo_rl/algorithms/distillation.pynemo_rl/algorithms/grpo.py
🧬 Code graph analysis (2)
tests/unit/algorithms/test_distillation.py (2)
tests/unit/algorithms/test_grpo.py (2)
test_noncolocated_inference_requires_explicit_gpus_per_node_single_node(215-268)test_noncolocated_inference_requires_explicit_gpus_per_node_multi_node(271-324)nemo_rl/algorithms/distillation.py (1)
setup(150-450)
tests/unit/algorithms/test_grpo.py (2)
tests/unit/algorithms/test_distillation.py (2)
test_noncolocated_inference_requires_explicit_gpus_per_node_single_node(354-415)test_noncolocated_inference_requires_explicit_gpus_per_node_multi_node(418-479)nemo_rl/algorithms/grpo.py (1)
setup(144-462)
🪛 Ruff (0.13.3)
tests/unit/algorithms/test_distillation.py
405-405: Local variable mock_logger is assigned to but never used
Remove assignment to unused variable mock_logger
(F841)
410-410: Pattern passed to match= contains metacharacters but is neither escaped nor raw
(RUF043)
469-469: Local variable mock_logger is assigned to but never used
Remove assignment to unused variable mock_logger
(F841)
474-474: Pattern passed to match= contains metacharacters but is neither escaped nor raw
(RUF043)
tests/unit/algorithms/test_grpo.py
258-258: Local variable mock_logger is assigned to but never used
Remove assignment to unused variable mock_logger
(F841)
263-263: Pattern passed to match= contains metacharacters but is neither escaped nor raw
(RUF043)
314-314: Local variable mock_logger is assigned to but never used
Remove assignment to unused variable mock_logger
(F841)
319-319: Pattern passed to match= contains metacharacters but is neither escaped nor raw
(RUF043)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Lint check
- GitHub Check: Lint check
- GitHub Check: Post automodel integration comment / Comment on PR
- GitHub Check: Post submodule check comment / Comment on PR
🔇 Additional comments (5)
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml (1)
45-45: LGTM!The addition of
gpus_per_node: 8correctly aligns with the new explicit configuration requirement for non-colocated setups and matches the cluster-levelgpus_per_nodevalue.nemo_rl/algorithms/grpo.py (2)
308-312: Validation logic correctly enforces explicit configuration.The assertion now requires
inference_gpus_per_nodeto be explicitly set and greater than 0 for single-node non-colocated setups, preventing implicit defaults and making configuration intent clear.
341-345: Validation logic correctly enforces explicit configuration and exact equality.The assertion now requires
inference_gpus_per_nodeto be explicitly set and exactly equal tocluster.gpus_per_nodefor multi-node non-colocated setups. The error message helpfully includes both actual and expected values for debugging.nemo_rl/algorithms/distillation.py (2)
307-311: Validation logic correctly enforces explicit configuration.The assertion now requires
inference_gpus_per_nodeto be explicitly set and greater than 0 for single-node non-colocated distillation setups, aligning with the GRPO implementation and preventing implicit defaults.
325-329: Validation logic correctly enforces explicit configuration and exact equality.The assertion now requires
inference_gpus_per_nodeto be explicitly set and exactly equal tocluster.gpus_per_nodefor multi-node non-colocated distillation setups. The error message helpfully includes both actual and expected values for debugging, consistent with the GRPO implementation.
…setups (NVIDIA-NeMo#1273) Signed-off-by: Terry Kong <terryk@nvidia.com>
…setups (NVIDIA-NeMo#1273) Signed-off-by: Terry Kong <terryk@nvidia.com> Signed-off-by: yuanhangs <yuanhangs@nvidia.com>
Initially I set out to fix the failing config (
examples/configs/recipes/llm/grpo-llama3.1-8b-instruct-2n8g-fsdp2tp1-noncolocated.yaml) since it was now missingcolocated.resources.gpus_per_node, which is now mandatory.I also cleaned up the code to properly give an error for this improper configuration and added tests for it
Summary by CodeRabbit
Bug Fixes
Documentation
Tests