Conversation
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
|
/ok to test c9ca7db |
📝 WalkthroughWalkthroughThis PR refactors GitHub Actions workflows to remove Azure-specific authentication and introduce dynamic registry/test-data-path configuration. It adds conditional test skipping for GB200 hardware, disables specific GPU functional tests, updates CI/CD orchestration with preflight jobs, and replaces the decord dependency with decord2. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
tests/unit/models/generation/test_vllm_generation.py (1)
919-922: Deduplicate the GB200 FP8 skip guard into one helper.The same FP8 capability + GB200 skip block is repeated in multiple tests. Centralizing it avoids drift and keeps future policy changes in one place.
♻️ Refactor sketch
+def _skip_if_fp8_unsupported(vllm_precision: str) -> None: + if vllm_precision != "fp8": + return + major_capability, _ = torch.cuda.get_device_capability() + if major_capability < 9: + pytest.skip( + f"Skipping FP8 test. GPU compute capability {major_capability}.0 is < 9.0 (H100 required)." + ) + if "GB200" in torch.cuda.get_device_name(0): + pytest.skip("Skipping FP8 test on GB200 until fixed.")Then call
_skip_if_fp8_unsupported(vllm_precision)in each test.Also applies to: 991-994, 1634-1637, 2052-2055, 2226-2229
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/models/generation/test_vllm_generation.py` around lines 919 - 922, Create a single helper function (e.g. _skip_if_fp8_unsupported(vllm_precision)) in tests/unit/models/generation/test_vllm_generation.py that encapsulates the repeated FP8 capability check and GB200 bypass: it should inspect vllm_precision (or whatever symbol indicates FP8 support) and call pytest.skip("Skipping FP8 test on GB200 until fixed.") when the device name contains "GB200" and FP8 is not supported; then replace the duplicated blocks like the one using torch.cuda.get_device_name(0) and the "GB200" check in each test (seen around lines referenced in the comment) with a call to _skip_if_fp8_unsupported(vllm_precision) so all tests share the same logic.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In @.github/actions/test-template/action.yml:
- Around line 61-63: The action declares the input name registry as optional but
the workflow always builds image refs from inputs.registry, causing invalid
names when empty; update the action input registry (inputs.registry) to be
required: true or provide a safe default (e.g., default: "ghcr.io/OWNER") and
adjust its description accordingly, and ensure the places that build image refs
from inputs.registry (the image ref construction code referencing
inputs.registry at the two spots that concatenate it with image names) will
receive a non-empty value.
- Around line 86-87: The workflow uses plain "apt-get update" and "apt-get
install -y uuid-runtime" which can fail if the runner step is not root; update
those commands to run with elevated privileges (e.g., prepend sudo) and include
safe flags—replace with "sudo apt-get update && sudo apt-get install -y
--no-install-recommends uuid-runtime" so the step works for non-root runners and
avoids extra recommends.
In @.github/workflows/cicd-main.yml:
- Around line 128-139: The new job org-member-pre-flight isn't included in the
final QA gate, so failures there can be missed; update the CI_QA_Gate gating
logic to depend on/org-include the org-member-pre-flight job result (e.g., add
"org-member-pre-flight" to the list/needs/if checks that CI_QA_Gate uses) so the
QA gate explicitly waits for and fails on org-member-pre-flight failures; locate
the CI_QA_Gate definition and add the job name "org-member-pre-flight" to its
required jobs/dependencies or gate criteria.
- Line 20: The branch pattern "pull-request/[0-9]+" is using a regex quantifier
that GitHub Actions globs don't support; replace that string with a proper glob
such as "pull-request/[0-9][0-9]*" (to require one or more digits) or
"pull-request/**" (to match any suffix) so CI triggers on PR-merge branches;
update the pattern literal in the workflow where "pull-request/[0-9]+" is
defined.
In `@tests/functional/L1_Functional_Tests_GPU.sh`:
- Around line 46-47: The functional test invocations for eval.sh and
eval_async.sh were commented out globally (the "run_test" lines invoking
./tests/functional/eval.sh and ./tests/functional/eval_async.sh), which removes
coverage for non-GB200 runners; restore these lines and gate their execution on
a hardware/flag check (e.g., an environment variable like GPU_TYPE or a CI flag
such as ENABLE_GPU_FUNCTIONAL_TESTS) so only appropriate GPU lanes skip or run
them; update the script to conditionally call the run_test entries for eval.sh
and eval_async.sh based on that check (refer to the commented run_test lines for
./tests/functional/eval.sh and ./tests/functional/eval_async.sh).
---
Nitpick comments:
In `@tests/unit/models/generation/test_vllm_generation.py`:
- Around line 919-922: Create a single helper function (e.g.
_skip_if_fp8_unsupported(vllm_precision)) in
tests/unit/models/generation/test_vllm_generation.py that encapsulates the
repeated FP8 capability check and GB200 bypass: it should inspect vllm_precision
(or whatever symbol indicates FP8 support) and call pytest.skip("Skipping FP8
test on GB200 until fixed.") when the device name contains "GB200" and FP8 is
not supported; then replace the duplicated blocks like the one using
torch.cuda.get_device_name(0) and the "GB200" check in each test (seen around
lines referenced in the comment) with a call to
_skip_if_fp8_unsupported(vllm_precision) so all tests share the same logic.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7d245685-5975-40d2-8537-06e270c8c9c1
⛔ Files ignored due to path filters (1)
uv.lockis excluded by!**/*.lock
📒 Files selected for processing (6)
.github/actions/test-template/action.yml.github/copy-pr-bot.yml.github/workflows/cicd-main.ymlpyproject.tomltests/functional/L1_Functional_Tests_GPU.shtests/unit/models/generation/test_vllm_generation.py
| registry: | ||
| description: "Registry to use for test" | ||
| required: false |
There was a problem hiding this comment.
registry is effectively required but declared optional.
Line 97 and Line 136 always build image refs from inputs.registry, so an empty value produces invalid image names. Make the input required (or provide a safe default).
🔧 Suggested change
registry:
description: "Registry to use for test"
- required: false
+ required: trueAlso applies to: 97-98, 136-136
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/actions/test-template/action.yml around lines 61 - 63, The action
declares the input name registry as optional but the workflow always builds
image refs from inputs.registry, causing invalid names when empty; update the
action input registry (inputs.registry) to be required: true or provide a safe
default (e.g., default: "ghcr.io/OWNER") and adjust its description accordingly,
and ensure the places that build image refs from inputs.registry (the image ref
construction code referencing inputs.registry at the two spots that concatenate
it with image names) will receive a non-empty value.
| apt-get update | ||
| apt-get install -y uuid-runtime |
There was a problem hiding this comment.
Use elevated apt commands for runner compatibility.
Line 86 and Line 87 use apt-get directly; this can fail on runners where the step user is not root.
🔧 Suggested change
- apt-get update
- apt-get install -y uuid-runtime
+ sudo apt-get update
+ sudo apt-get install -y uuid-runtime📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| apt-get update | |
| apt-get install -y uuid-runtime | |
| sudo apt-get update | |
| sudo apt-get install -y uuid-runtime |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/actions/test-template/action.yml around lines 86 - 87, The workflow
uses plain "apt-get update" and "apt-get install -y uuid-runtime" which can fail
if the runner step is not root; update those commands to run with elevated
privileges (e.g., prepend sudo) and include safe flags—replace with "sudo
apt-get update && sudo apt-get install -y --no-install-recommends uuid-runtime"
so the step works for non-root runners and avoids extra recommends.
| merge_group: | ||
| types: [checks_requested] | ||
| - main | ||
| - "pull-request/[0-9]+" |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
find . -name "cicd-main.yml" -type fRepository: NVIDIA-NeMo/RL
Length of output: 91
🏁 Script executed:
cat -n .github/workflows/cicd-main.yml | head -30Repository: NVIDIA-NeMo/RL
Length of output: 1155
🌐 Web query:
GitHub Actions on.push.branches pattern syntax glob regex documentation
💡 Result:
on.push.branches (and branches-ignore, tags, tags-ignore) uses glob patterns (not regex), matched against the Git ref name (for branches: refs/heads/<branch>). It supports wildcards/special glob chars like *, **, +, ?, !, etc.; if you need a literal match for those characters, escape them with \. You also can’t use branches and branches-ignore together for the same event; to both include and exclude in one list, use branches with negative patterns prefixed by ! (order matters). [1]
Example:
on:
push:
branches:
- "releases/**"
- "!releases/**-alpha"Source: GitHub Docs — Workflow syntax for GitHub Actions (see on.push.<branches|tags|branches-ignore|tags-ignore> and branch filter sections). [1]
References
[1] https://docs.github.com/en/actions/reference/workflows-and-actions/workflow-syntax
Fix push branch pattern: glob patterns don't interpret + as a regex quantifier.
Line 20 uses pull-request/[0-9]+, which in GitHub Actions glob patterns matches pull-request/0+, pull-request/1+, etc., but NOT pull-request/123 (the + is treated as a literal character). This prevents PR-merge-branch pushes from triggering CI.
🔧 Suggested change
- - "pull-request/[0-9]+"
+ - "pull-request/**"📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| - "pull-request/[0-9]+" | |
| - "pull-request/**" |
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/cicd-main.yml at line 20, The branch pattern
"pull-request/[0-9]+" is using a regex quantifier that GitHub Actions globs
don't support; replace that string with a proper glob such as
"pull-request/[0-9][0-9]*" (to require one or more digits) or "pull-request/**"
(to match any suffix) so CI triggers on PR-merge branches; update the pattern
literal in the workflow where "pull-request/[0-9]+" is defined.
| org-member-pre-flight: | ||
| uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_cicd_preflight.yml@fd82c6b23b5987d226f00d0719560f6e91210021 | ||
| with: | ||
| default_runner_prefix: ${{ vars.DEFAULT_RUNNER_PREFIX }} | ||
| non_nvidia_runner_prefix: ${{ vars.NON_NVIDIA_RUNNER_PREFIX }} | ||
| default_test_data_path: ${{ vars.DEFAULT_TEST_DATA_PATH }} | ||
| non_nvidia_test_data_path: ${{ vars.NON_NVIDIA_TEST_DATA_PATH }} | ||
| default_registry: ${{ vars.DEFAULT_CONTAINER_REGISTRY }} | ||
| non_nvidia_registry: ${{ vars.NON_NVIDIA_CONTAINER_REGISTRY }} | ||
| secrets: | ||
| NVIDIA_MANAGEMENT_ORG_PAT: ${{ secrets.NVIDIA_MANAGEMENT_ORG_PAT }} | ||
|
|
There was a problem hiding this comment.
Include org-member-pre-flight in the final QA gate criteria.
This new job now provides core runner/registry/data-path orchestration, but CI_QA_Gate does not explicitly gate on its result. A failure here can be hidden behind downstream skipped jobs.
🔧 Suggested change
CI_QA_Gate:
@@
needs:
- pre-flight
+ - org-member-pre-flight
- pr-branch-up-to-date-check
- lint-check
@@
ALL_SUCCESS: >-
${{
+ needs.org-member-pre-flight.result == 'success' &&
needs.lint-check.result == 'success' &&
(needs.pr-branch-up-to-date-check.result == 'success' || needs.pr-branch-up-to-date-check.result == 'skipped') &&🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In @.github/workflows/cicd-main.yml around lines 128 - 139, The new job
org-member-pre-flight isn't included in the final QA gate, so failures there can
be missed; update the CI_QA_Gate gating logic to depend on/org-include the
org-member-pre-flight job result (e.g., add "org-member-pre-flight" to the
list/needs/if checks that CI_QA_Gate uses) so the QA gate explicitly waits for
and fails on org-member-pre-flight failures; locate the CI_QA_Gate definition
and add the job name "org-member-pre-flight" to its required jobs/dependencies
or gate criteria.
| # run_test uv run --no-sync bash ./tests/functional/eval.sh | ||
| # run_test uv run --no-sync bash ./tests/functional/eval_async.sh |
There was a problem hiding this comment.
Avoid globally disabling these functional tests for all GPU runners.
Line 46, Line 47, Line 55, and Line 56 are commented out unconditionally, so non-GB200 lanes also lose coverage and regressions can slip through. Gate these tests by hardware/flag instead of removing them globally.
🔧 Suggested change
-# run_test uv run --no-sync bash ./tests/functional/eval.sh
-# run_test uv run --no-sync bash ./tests/functional/eval_async.sh
+GPU_NAME="$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n1 || true)"
+if [[ "$GPU_NAME" == *"GB200"* ]]; then
+ echo "Skipping eval tests on GB200 until known issue is fixed."
+else
+ run_test uv run --no-sync bash ./tests/functional/eval.sh
+ run_test uv run --no-sync bash ./tests/functional/eval_async.sh
+fi
@@
-# run_test uv run --no-sync bash ./tests/functional/grpo_megatron_lora.sh
-# run_test uv run --no-sync bash ./tests/functional/grpo_megatron_lora_async.sh
+if [[ "$GPU_NAME" == *"GB200"* ]]; then
+ echo "Skipping Megatron LoRA GRPO tests on GB200 until known issue is fixed."
+else
+ run_test uv run --no-sync bash ./tests/functional/grpo_megatron_lora.sh
+ run_test uv run --no-sync bash ./tests/functional/grpo_megatron_lora_async.sh
+fiAlso applies to: 55-56
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/functional/L1_Functional_Tests_GPU.sh` around lines 46 - 47, The
functional test invocations for eval.sh and eval_async.sh were commented out
globally (the "run_test" lines invoking ./tests/functional/eval.sh and
./tests/functional/eval_async.sh), which removes coverage for non-GB200 runners;
restore these lines and gate their execution on a hardware/flag check (e.g., an
environment variable like GPU_TYPE or a CI flag such as
ENABLE_GPU_FUNCTIONAL_TESTS) so only appropriate GPU lanes skip or run them;
update the script to conditionally call the run_test entries for eval.sh and
eval_async.sh based on that check (refer to the commented run_test lines for
./tests/functional/eval.sh and ./tests/functional/eval_async.sh).
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
|
Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
/ok to test 73e70e8 |
Signed-off-by: Charlie Truong <chtruong@nvidia.com>
|
/ok to test 836c8cb |
What does this PR do ?
/ok to test <commit_sha>. The CI will still respect the labels applied. This aligns with how we are kicking off CI for other repos.Issues
List issues that this PR closes (syntax):
Usage
# Add a code snippet demonstrating how to use thisBefore your PR is "Ready for review"
Pre checks:
Additional Information
Summary by CodeRabbit
Chores
Bug Fixes
Tests
Dependencies