ci: Enable GB200 runners by chtruong814 · Pull Request #2017 · NVIDIA-NeMo/RL

chtruong814 · 2026-02-24T04:04:54Z

What does this PR do ?

Enable GB200 runners. Fallback to A100 runners if external contributor. Currently, we are unable to run CI for external contributors in the GB200 cluster. We have additional work to enable this.
fp8 vllm generation tests are failing on gb200. Skipping this for now. Will open an issue
The eval functional tests are failing on gb200 because the resulting score is different than expected. Updating the expected score to a wider range
The image name is updated by a Github Action variable. Currently it's "megatron-bridge" because we did not create the "rl" docker repo in the cluster environment yet. We will update that before we merge this.
Going forward, the CI needs to be kicked off by commenting with /ok to test <commit_sha>. The CI will still respect the labels applied. This aligns with how we are kicking off CI for other repos.

Issues

List issues that this PR closes (syntax):

Usage

You can potentially add a usage example below

# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

...

Summary by CodeRabbit

Chores
- Updated CI/CD infrastructure with configurable registry and test data path parameters; simplified test setup by removing cloud-specific dependencies
- Modified PR bot auto-sync configuration
Bug Fixes
- Added hardware compatibility checks to skip FP8 tests on unsupported devices
Tests
- Disabled four GPU functional tests (eval and grpo variants)
Dependencies
- Replaced decord with decord2

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-03-04T20:17:18Z

/ok to test c9ca7db

coderabbitai · 2026-03-04T20:27:30Z

📝 Walkthrough

Walkthrough

This PR refactors GitHub Actions workflows to remove Azure-specific authentication and introduce dynamic registry/test-data-path configuration. It adds conditional test skipping for GB200 hardware, disables specific GPU functional tests, updates CI/CD orchestration with preflight jobs, and replaces the decord dependency with decord2.

Changes

Cohort / File(s)	Summary
GitHub Actions Configuration `.github/actions/test-template/action.yml`	Added `registry` and `test_data_path` inputs; removed Azure authentication steps (login, ACR, fileshare mount); replaced with simplified local setup using uuidgen; updated docker pull and run commands to use dynamic registry and test data path inputs.
CI/CD Workflow Orchestration `.github/workflows/cicd-main.yml`	Changed triggering from PR-based to push-based with branch matching; introduced org-member-pre-flight job; updated build and test jobs to depend on pre-flight outputs; added dynamic runner, registry, and test_data_path parameter propagation throughout pipeline.
CI Bot Configuration `.github/copy-pr-bot.yml`	Changed `auto_sync_ready` from `true` to `false`, altering default synchronization behavior.
Dependency Management `pyproject.toml`	Replaced conditional dependency `decord; platform_machine == 'x86_64'` with unconditional `decord2`.
GPU Functional Test Configuration `tests/functional/L1_Functional_Tests_GPU.sh`	Disabled four GPU functional tests (eval.sh, eval_async.sh, grpo_megatron_lora.sh, grpo_megatron_lora_async.sh) by commenting out their run_test invocations.
vLLM Generation Tests `tests/unit/models/generation/test_vllm_generation.py`	Added GB200 device detection and conditional test skipping in FP8-related tests; retrieves current CUDA device name and skips tests when device contains "GB200".

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

ci: add a fast test suite #2031: Modifies the same test-template action file to make docker image selection configurable with additional image-tag logic.
test: fix bug in deselection and make fast tests even faster #2038: Adjusts run_test invocations in the same L1_Functional_Tests_GPU.sh file regarding test execution configuration.
feat: Necessary changes for Gym GRPO tutorial #1630: Modifies the same test_vllm_generation.py file with additional mock tokenizer decode logic.

Suggested reviewers

yuki-97
terrykong

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Test Results For Major Changes	⚠️ Warning	PR contains major infrastructure and dependency changes (GB200 runners, GitHub Actions workflows, decord to decord2 swap, disabled functional tests) without documented test results, verification data, or evidence addressing five unresolved review comments.	Update PR description with documented test results, decord2 regression verification, root cause analysis for disabled tests, CI workflow validation, and evidence addressing five unresolved review issues.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'ci: Enable GB200 runners' accurately reflects the main objective of the PR, which is to enable GB200 runners in the CI/CD pipeline. The changes across multiple files (action configs, workflows, test guards) all work toward this goal.
Docstring Coverage	✅ Passed	Docstring coverage is 80.00% which is sufficient. The required threshold is 80.00%.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment
Commit unit tests in branch chtruong/gb200

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 5

🧹 Nitpick comments (1)

tests/unit/models/generation/test_vllm_generation.py (1)

919-922: Deduplicate the GB200 FP8 skip guard into one helper.

The same FP8 capability + GB200 skip block is repeated in multiple tests. Centralizing it avoids drift and keeps future policy changes in one place.

♻️ Refactor sketch

+def _skip_if_fp8_unsupported(vllm_precision: str) -> None:
+    if vllm_precision != "fp8":
+        return
+    major_capability, _ = torch.cuda.get_device_capability()
+    if major_capability < 9:
+        pytest.skip(
+            f"Skipping FP8 test. GPU compute capability {major_capability}.0 is < 9.0 (H100 required)."
+        )
+    if "GB200" in torch.cuda.get_device_name(0):
+        pytest.skip("Skipping FP8 test on GB200 until fixed.")

Then call _skip_if_fp8_unsupported(vllm_precision) in each test.

Also applies to: 991-994, 1634-1637, 2052-2055, 2226-2229

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/unit/models/generation/test_vllm_generation.py` around lines 919 - 922,
Create a single helper function (e.g. _skip_if_fp8_unsupported(vllm_precision))
in tests/unit/models/generation/test_vllm_generation.py that encapsulates the
repeated FP8 capability check and GB200 bypass: it should inspect vllm_precision
(or whatever symbol indicates FP8 support) and call pytest.skip("Skipping FP8
test on GB200 until fixed.") when the device name contains "GB200" and FP8 is
not supported; then replace the duplicated blocks like the one using
torch.cuda.get_device_name(0) and the "GB200" check in each test (seen around
lines referenced in the comment) with a call to
_skip_if_fp8_unsupported(vllm_precision) so all tests share the same logic.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In @.github/actions/test-template/action.yml:
- Around line 61-63: The action declares the input name registry as optional but
the workflow always builds image refs from inputs.registry, causing invalid
names when empty; update the action input registry (inputs.registry) to be
required: true or provide a safe default (e.g., default: "ghcr.io/OWNER") and
adjust its description accordingly, and ensure the places that build image refs
from inputs.registry (the image ref construction code referencing
inputs.registry at the two spots that concatenate it with image names) will
receive a non-empty value.
- Around line 86-87: The workflow uses plain "apt-get update" and "apt-get
install -y uuid-runtime" which can fail if the runner step is not root; update
those commands to run with elevated privileges (e.g., prepend sudo) and include
safe flags—replace with "sudo apt-get update && sudo apt-get install -y
--no-install-recommends uuid-runtime" so the step works for non-root runners and
avoids extra recommends.

In @.github/workflows/cicd-main.yml:
- Around line 128-139: The new job org-member-pre-flight isn't included in the
final QA gate, so failures there can be missed; update the CI_QA_Gate gating
logic to depend on/org-include the org-member-pre-flight job result (e.g., add
"org-member-pre-flight" to the list/needs/if checks that CI_QA_Gate uses) so the
QA gate explicitly waits for and fails on org-member-pre-flight failures; locate
the CI_QA_Gate definition and add the job name "org-member-pre-flight" to its
required jobs/dependencies or gate criteria.
- Line 20: The branch pattern "pull-request/[0-9]+" is using a regex quantifier
that GitHub Actions globs don't support; replace that string with a proper glob
such as "pull-request/[0-9][0-9]*" (to require one or more digits) or
"pull-request/**" (to match any suffix) so CI triggers on PR-merge branches;
update the pattern literal in the workflow where "pull-request/[0-9]+" is
defined.

In `@tests/functional/L1_Functional_Tests_GPU.sh`:
- Around line 46-47: The functional test invocations for eval.sh and
eval_async.sh were commented out globally (the "run_test" lines invoking
./tests/functional/eval.sh and ./tests/functional/eval_async.sh), which removes
coverage for non-GB200 runners; restore these lines and gate their execution on
a hardware/flag check (e.g., an environment variable like GPU_TYPE or a CI flag
such as ENABLE_GPU_FUNCTIONAL_TESTS) so only appropriate GPU lanes skip or run
them; update the script to conditionally call the run_test entries for eval.sh
and eval_async.sh based on that check (refer to the commented run_test lines for
./tests/functional/eval.sh and ./tests/functional/eval_async.sh).

---

Nitpick comments:
In `@tests/unit/models/generation/test_vllm_generation.py`:
- Around line 919-922: Create a single helper function (e.g.
_skip_if_fp8_unsupported(vllm_precision)) in
tests/unit/models/generation/test_vllm_generation.py that encapsulates the
repeated FP8 capability check and GB200 bypass: it should inspect vllm_precision
(or whatever symbol indicates FP8 support) and call pytest.skip("Skipping FP8
test on GB200 until fixed.") when the device name contains "GB200" and FP8 is
not supported; then replace the duplicated blocks like the one using
torch.cuda.get_device_name(0) and the "GB200" check in each test (seen around
lines referenced in the comment) with a call to
_skip_if_fp8_unsupported(vllm_precision) so all tests share the same logic.

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7d245685-5975-40d2-8537-06e270c8c9c1

📥 Commits

Reviewing files that changed from the base of the PR and between bd7f236 and c9ca7db.

⛔ Files ignored due to path filters (1)

uv.lock is excluded by !**/*.lock

📒 Files selected for processing (6)

.github/actions/test-template/action.yml
.github/copy-pr-bot.yml
.github/workflows/cicd-main.yml
pyproject.toml
tests/functional/L1_Functional_Tests_GPU.sh
tests/unit/models/generation/test_vllm_generation.py

coderabbitai · 2026-03-04T20:27:33Z

.github/actions/test-template/action.yml

+  registry:
+    description: "Registry to use for test"
+    required: false


⚠️ Potential issue | 🟠 Major

registry is effectively required but declared optional.

Line 97 and Line 136 always build image refs from inputs.registry, so an empty value produces invalid image names. Make the input required (or provide a safe default).

🔧 Suggested change

registry: description: "Registry to use for test" - required: false + required: true

Also applies to: 97-98, 136-136

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.github/actions/test-template/action.yml around lines 61 - 63, The action declares the input name registry as optional but the workflow always builds image refs from inputs.registry, causing invalid names when empty; update the action input registry (inputs.registry) to be required: true or provide a safe default (e.g., default: "ghcr.io/OWNER") and adjust its description accordingly, and ensure the places that build image refs from inputs.registry (the image ref construction code referencing inputs.registry at the two spots that concatenate it with image names) will receive a non-empty value.

coderabbitai · 2026-03-04T20:27:33Z

.github/actions/test-template/action.yml

+        apt-get update
+        apt-get install -y uuid-runtime


⚠️ Potential issue | 🟠 Major

Use elevated apt commands for runner compatibility.

Line 86 and Line 87 use apt-get directly; this can fail on runners where the step user is not root.

🔧 Suggested change

- apt-get update - apt-get install -y uuid-runtime + sudo apt-get update + sudo apt-get install -y uuid-runtime

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

apt-get update

apt-get install -y uuid-runtime

sudo apt-get update

sudo apt-get install -y uuid-runtime

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.github/actions/test-template/action.yml around lines 86 - 87, The workflow uses plain "apt-get update" and "apt-get install -y uuid-runtime" which can fail if the runner step is not root; update those commands to run with elevated privileges (e.g., prepend sudo) and include safe flags—replace with "sudo apt-get update && sudo apt-get install -y --no-install-recommends uuid-runtime" so the step works for non-root runners and avoids extra recommends.

coderabbitai · 2026-03-04T20:27:33Z

.github/workflows/cicd-main.yml

-  merge_group:
-    types: [checks_requested]
+      - main
+      - "pull-request/[0-9]+"


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

find . -name "cicd-main.yml" -type f

Repository: NVIDIA-NeMo/RL

Length of output: 91

🏁 Script executed:

cat -n .github/workflows/cicd-main.yml | head -30

Repository: NVIDIA-NeMo/RL

Length of output: 1155

🌐 Web query:

GitHub Actions on.push.branches pattern syntax glob regex documentation

💡 Result:

on.push.branches (and branches-ignore, tags, tags-ignore) uses glob patterns (not regex), matched against the Git ref name (for branches: refs/heads/<branch>). It supports wildcards/special glob chars like *, **, +, ?, !, etc.; if you need a literal match for those characters, escape them with \. You also can’t use branches and branches-ignore together for the same event; to both include and exclude in one list, use branches with negative patterns prefixed by ! (order matters). [1]

Example:

on: push: branches: - "releases/**" - "!releases/**-alpha"

Source: GitHub Docs — Workflow syntax for GitHub Actions (see on.push.<branches|tags|branches-ignore|tags-ignore> and branch filter sections). [1]

References
[1] https://docs.github.com/en/actions/reference/workflows-and-actions/workflow-syntax

Fix push branch pattern: glob patterns don't interpret + as a regex quantifier.

Line 20 uses pull-request/[0-9]+, which in GitHub Actions glob patterns matches pull-request/0+, pull-request/1+, etc., but NOT pull-request/123 (the + is treated as a literal character). This prevents PR-merge-branch pushes from triggering CI.

🔧 Suggested change

- - "pull-request/[0-9]+" + - "pull-request/**"

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

- "pull-request/[0-9]+"

- "pull-request/**"

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.github/workflows/cicd-main.yml at line 20, The branch pattern "pull-request/[0-9]+" is using a regex quantifier that GitHub Actions globs don't support; replace that string with a proper glob such as "pull-request/[0-9][0-9]*" (to require one or more digits) or "pull-request/**" (to match any suffix) so CI triggers on PR-merge branches; update the pattern literal in the workflow where "pull-request/[0-9]+" is defined.

coderabbitai · 2026-03-04T20:27:33Z

.github/workflows/cicd-main.yml

+  org-member-pre-flight:
+    uses: NVIDIA-NeMo/FW-CI-templates/.github/workflows/_cicd_preflight.yml@fd82c6b23b5987d226f00d0719560f6e91210021
+    with:
+      default_runner_prefix: ${{ vars.DEFAULT_RUNNER_PREFIX }}
+      non_nvidia_runner_prefix: ${{ vars.NON_NVIDIA_RUNNER_PREFIX }}
+      default_test_data_path: ${{ vars.DEFAULT_TEST_DATA_PATH }}
+      non_nvidia_test_data_path: ${{ vars.NON_NVIDIA_TEST_DATA_PATH }}
+      default_registry: ${{ vars.DEFAULT_CONTAINER_REGISTRY }}
+      non_nvidia_registry: ${{ vars.NON_NVIDIA_CONTAINER_REGISTRY }}
+    secrets:
+      NVIDIA_MANAGEMENT_ORG_PAT: ${{ secrets.NVIDIA_MANAGEMENT_ORG_PAT }}
+


⚠️ Potential issue | 🟠 Major

Include org-member-pre-flight in the final QA gate criteria.

This new job now provides core runner/registry/data-path orchestration, but CI_QA_Gate does not explicitly gate on its result. A failure here can be hidden behind downstream skipped jobs.

🔧 Suggested change

CI_QA_Gate: @@ needs: - pre-flight + - org-member-pre-flight - pr-branch-up-to-date-check - lint-check @@ ALL_SUCCESS: >- ${{ + needs.org-member-pre-flight.result == 'success' && needs.lint-check.result == 'success' && (needs.pr-branch-up-to-date-check.result == 'success' || needs.pr-branch-up-to-date-check.result == 'skipped') &&

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In @.github/workflows/cicd-main.yml around lines 128 - 139, The new job org-member-pre-flight isn't included in the final QA gate, so failures there can be missed; update the CI_QA_Gate gating logic to depend on/org-include the org-member-pre-flight job result (e.g., add "org-member-pre-flight" to the list/needs/if checks that CI_QA_Gate uses) so the QA gate explicitly waits for and fails on org-member-pre-flight failures; locate the CI_QA_Gate definition and add the job name "org-member-pre-flight" to its required jobs/dependencies or gate criteria.

coderabbitai · 2026-03-04T20:27:33Z

tests/functional/L1_Functional_Tests_GPU.sh

+# run_test      uv run --no-sync bash ./tests/functional/eval.sh
+# run_test      uv run --no-sync bash ./tests/functional/eval_async.sh


⚠️ Potential issue | 🟠 Major

Avoid globally disabling these functional tests for all GPU runners.

Line 46, Line 47, Line 55, and Line 56 are commented out unconditionally, so non-GB200 lanes also lose coverage and regressions can slip through. Gate these tests by hardware/flag instead of removing them globally.

🔧 Suggested change

-# run_test uv run --no-sync bash ./tests/functional/eval.sh -# run_test uv run --no-sync bash ./tests/functional/eval_async.sh +GPU_NAME="$(nvidia-smi --query-gpu=name --format=csv,noheader | head -n1 || true)" +if [[ "$GPU_NAME" == *"GB200"* ]]; then + echo "Skipping eval tests on GB200 until known issue is fixed." +else + run_test uv run --no-sync bash ./tests/functional/eval.sh + run_test uv run --no-sync bash ./tests/functional/eval_async.sh +fi @@ -# run_test uv run --no-sync bash ./tests/functional/grpo_megatron_lora.sh -# run_test uv run --no-sync bash ./tests/functional/grpo_megatron_lora_async.sh +if [[ "$GPU_NAME" == *"GB200"* ]]; then + echo "Skipping Megatron LoRA GRPO tests on GB200 until known issue is fixed." +else + run_test uv run --no-sync bash ./tests/functional/grpo_megatron_lora.sh + run_test uv run --no-sync bash ./tests/functional/grpo_megatron_lora_async.sh +fi

Also applies to: 55-56

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed. In `@tests/functional/L1_Functional_Tests_GPU.sh` around lines 46 - 47, The functional test invocations for eval.sh and eval_async.sh were commented out globally (the "run_test" lines invoking ./tests/functional/eval.sh and ./tests/functional/eval_async.sh), which removes coverage for non-GB200 runners; restore these lines and gate their execution on a hardware/flag check (e.g., an environment variable like GPU_TYPE or a CI flag such as ENABLE_GPU_FUNCTIONAL_TESTS) so only appropriate GPU lanes skip or run them; update the script to conditionally call the run_test entries for eval.sh and eval_async.sh based on that check (refer to the commented run_test lines for ./tests/functional/eval.sh and ./tests/functional/eval_async.sh).

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

copy-pr-bot · 2026-03-04T23:23:29Z

Auto-sync is disabled for ready for review pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

chtruong814 · 2026-03-04T23:23:51Z

/ok to test 73e70e8

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 · 2026-03-04T23:48:36Z

/ok to test 836c8cb

Test GB200 runner

c060437

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 added the CI:docs Run doctest label Feb 24, 2026

github-actions bot added CI Relating to CI and removed CI:docs Run doctest labels Feb 24, 2026

chtruong814 had a problem deploying to nemo-ci February 24, 2026 04:05 — with GitHub Actions Failure

chtruong814 had a problem deploying to nemo-ci February 24, 2026 17:53 — with GitHub Actions Failure

Fix gb200 container build

02084e3

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 added the CI:docs Run doctest label Feb 24, 2026

chtruong814 had a problem deploying to nemo-ci February 24, 2026 19:32 — with GitHub Actions Failure

chtruong814 had a problem deploying to nemo-ci February 25, 2026 16:11 — with GitHub Actions Failure

chtruong814 had a problem deploying to nemo-ci February 25, 2026 18:49 — with GitHub Actions Failure

chtruong814 had a problem deploying to nemo-ci February 25, 2026 20:16 — with GitHub Actions Failure

chtruong814 had a problem deploying to nemo-ci February 25, 2026 21:53 — with GitHub Actions Failure

Test updated registry

bcf8f81

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 added CI:docs Run doctest and removed CI:docs Run doctest labels Feb 26, 2026

chtruong814 had a problem deploying to nemo-ci February 26, 2026 00:55 — with GitHub Actions Failure

chtruong814 added CI:docs Run doctest and removed CI:docs Run doctest labels Feb 26, 2026

chtruong814 temporarily deployed to nemo-ci February 26, 2026 05:14 — with GitHub Actions Inactive

Test gb200

e87f2e2

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:docs Run doctest labels Feb 28, 2026

Merge remote-tracking branch 'origin/main' into chtruong/gb200

2435ca5

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 added CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) and removed CI:L1 Run doctests, unit tests, and functional tests labels Feb 28, 2026

Force gb200 build

f517e6a

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 added CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) and removed CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) labels Feb 28, 2026

chtruong814 temporarily deployed to nemo-ci February 28, 2026 03:09 — with GitHub Actions Inactive

chtruong814 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Mar 3, 2026

chtruong814 had a problem deploying to nemo-ci March 3, 2026 05:55 — with GitHub Actions Failure

Remove Azure login

2bbe325

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 added CI:L1 Run doctests, unit tests, and functional tests and removed CI:L1 Run doctests, unit tests, and functional tests labels Mar 3, 2026

chtruong814 added 14 commits March 3, 2026 00:03

Fix registry

570d4f5

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Fix image nmae

e4f293a

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Fix doc test image ref

21e5d84

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Skip broken megatron lora tests

a10a3e4

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Skip test_vllm_generation_with_hf_training_colocated

66707ac

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Fix test skip

9866c4d

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Skip test

5d6eb10

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Skip fp8 generation for gb200 for now

60d4b5c

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Skip fp8 vllm generation tests

08d62fd

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Use variable for runner

31613ca

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Fix lint error in test_vllm_generation

3f623a1

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Use container name variable

6b541f4

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Use copy-pr-bot

4417675

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

Revert changes

c9ca7db

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

coderabbitai bot reviewed Mar 4, 2026

View reviewed changes

chtruong814 added 2 commits March 4, 2026 17:20

Merge remote-tracking branch 'origin/main' into chtruong/gb200

bb72598

Update expected eval metrics

73e70e8

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

chtruong814 mentioned this pull request Mar 4, 2026

feat: Allow custom registry in the build containers job NVIDIA-NeMo/FW-CI-templates#401

Open

Ensure functional tests wait for unit tests

836c8cb

Signed-off-by: Charlie Truong <chtruong@nvidia.com>

		# run_test uv run --no-sync bash ./tests/functional/eval.sh
		# run_test uv run --no-sync bash ./tests/functional/eval_async.sh

Conversation

chtruong814 commented Feb 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What does this PR do ?

Issues

Usage

Before your PR is "Ready for review"

Additional Information

Summary by CodeRabbit

Uh oh!

chtruong814 commented Mar 4, 2026

Uh oh!

coderabbitai bot commented Mar 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (1 warning)

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

coderabbitai bot Mar 4, 2026

Choose a reason for hiding this comment

Uh oh!

copy-pr-bot bot commented Mar 4, 2026

Uh oh!

chtruong814 commented Mar 4, 2026

Uh oh!

chtruong814 commented Mar 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

chtruong814 commented Feb 24, 2026 •

edited

Loading

coderabbitai bot commented Mar 4, 2026 •

edited

Loading