
[TRTINFRA-7215][infra] Add support for enroot SLURM clusters#8770

Merged
mlefeb01 merged 5 commits into NVIDIA:main from mlefeb01:slurm-enroot
Oct 31, 2025
Conversation

@mlefeb01 (Collaborator) commented Oct 29, 2025

Summary by CodeRabbit

  • New Features
    • Added support for ENROOT as an alternative container runtime alongside Docker
    • Enhanced Slurm cluster integration with improved mount configuration for OCI container environments
    • Implemented explicit error handling for unsupported container runtimes
    • Added automated cleanup of container and agent artifacts on completion

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break top of tree.

@mlefeb01 mlefeb01 requested review from a team as code owners October 29, 2025 23:44
coderabbitai bot (Contributor) commented Oct 29, 2025

📝 Walkthrough

Walkthrough

Added support for ENROOT container runtime alongside existing Docker support in L0_Test.groovy. The change introduces conditional logic to route execution based on containerRuntime type, with a new public function runInEnrootOnNode(label) and dynamic entrypoint resolution. Updated agent setup and cleanup logic to accommodate both runtimes.

Changes

Cohort / File(s) Summary
Container runtime abstraction
jenkins/L0_Test.groovy
Imported ContainerRuntime; added conditional runner selection (Docker vs ENROOT); introduced runInEnrootOnNode(label) public function; added exception for unsupported runtimes; implemented dynamic entrypoint lookup via containerRuntimeToEntrypoint mapping.
Agent setup and resource handling
jenkins/L0_Test.groovy
Modified agent setup flow to use dynamically resolved entrypoint instead of hard-coded Docker paths; updated library resource copying to use entrypoint-derived paths.
Slurm integration and cleanup
jenkins/L0_Test.groovy
Added mounts variable for OCI/Open container environment support in Slurm submission; refactored cleanup logic to execute gathered cleanupCommands list for removing agent and container artifacts.
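The runtime routing and entrypoint lookup described above can be sketched as follows. This is an illustrative Python rendering, not the actual Groovy in jenkins/L0_Test.groovy; the mapping contents and the runner names mirror the walkthrough, but the entrypoint paths are placeholders.

```python
# Hypothetical sketch of the containerRuntime routing: map each supported
# runtime to its setup entrypoint and fail fast on anything unsupported.
CONTAINER_RUNTIME_TO_ENTRYPOINT = {
    "docker": "scripts/slurm_agent_docker_setup.sh",  # placeholder path
    "enroot": "scripts/slurm_agent_enroot_setup.sh",  # placeholder path
}

def select_runner(container_runtime: str):
    """Return a (runner_name, entrypoint) pair for the configured runtime."""
    entrypoint = CONTAINER_RUNTIME_TO_ENTRYPOINT.get(container_runtime)
    if entrypoint is None:
        # Mirrors the explicit exception for unsupported runtimes.
        raise ValueError(f"Unsupported container runtime: {container_runtime}")
    if container_runtime == "docker":
        return "runInDockerOnNodeMultiStage", entrypoint
    return "runInEnrootOnNode", entrypoint
```

Keeping the mapping as the single source of truth means adding a new runtime only requires a new entry plus a runner branch, and any typo in the configured runtime surfaces immediately as an exception rather than a silent fallback.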

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant L0Test
    participant ContainerRuntime
    participant DockerRunner
    participant EnrootRunner
    
    User->>L0Test: Trigger pipeline with containerRuntime config
    L0Test->>ContainerRuntime: Read containerRuntime value
    
    alt containerRuntime is DOCKER
        L0Test->>DockerRunner: Route to runInDockerOnNodeMultiStage
        DockerRunner->>DockerRunner: Setup using Docker entrypoint
        DockerRunner-->>L0Test: Execution complete
    else containerRuntime is ENROOT
        L0Test->>EnrootRunner: Route to runInEnrootOnNode(label)
        EnrootRunner->>EnrootRunner: Apply timeout wrapper
        EnrootRunner->>EnrootRunner: Setup using ENROOT entrypoint
        EnrootRunner-->>L0Test: Execution complete
    else unsupported runtime
        L0Test-->>L0Test: Throw exception
    end
    
    L0Test->>L0Test: Execute cleanup commands
    L0Test-->>User: Pipeline complete

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Conditional routing logic: Review the DOCKER vs ENROOT vs exception branching to ensure all runtime paths are handled correctly and no edge cases are missed.
  • Entrypoint mapping: Verify that containerRuntimeToEntrypoint correctly maps each runtime to its corresponding entrypoint, and that the mapping is exhaustive.
  • Slurm mounts addition: Validate that the new mounts variable is properly constructed and integrated into Slurm command generation without breaking existing Docker workflows.
  • Cleanup commands: Ensure cleanup logic correctly gathers and executes removal commands for both container runtimes, preventing artifact leakage.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Description Check (⚠️ Warning): The PR description is largely incomplete and does not meet the repository's template requirements. While the PR title appears properly formatted as "[TRTINFRA-7215][infra] Add support for enroot SLURM clusters" based on the objectives, the critical sections of the template are unfilled. The Description section contains only the template comment prompt without any explanation of what changes were made or why, and the Test Coverage section is similarly empty with no test cases listed. The PR Checklist box is marked as completed, but this does not substitute for filling out the substantive sections that explain the implementation and its testing. The raw summary indicates significant changes were made to jenkins/L0_Test.groovy to add ENROOT workflow support, but none of this context appears in the PR description itself. The author must complete the PR description with a clear explanation of the ENROOT support changes, their rationale, and how they improve SLURM cluster handling. Additionally, the Test Coverage section must list specific tests that validate the new ENROOT workflow functionality, including any relevant L0 tests or pipeline verification steps.
✅ Passed checks (1 passed)
Title Check (✅ Passed): The PR title "[TRTINFRA-7215][infra] Add support for enroot SLURM clusters" follows the specified template format with a valid JIRA ticket identifier and type classification. The title is concise, specific, and directly reflects the core change: introducing ENROOT as an additional container runtime option for SLURM clusters alongside the existing Docker support. A developer scanning the repository history would clearly understand that this change extends the infrastructure to support ENROOT-based execution.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)

498-501: Add conditional check for slurmJobID before cleanup.

The cleanup command at line 500 uses ${slurmJobID}, which could be null if the SLURM job submission failed. While the || true prevents a hard failure, it still attempts to remove a malformed path.

Consider adding a conditional check:

 def cleanupCommands = [
     "rm -rf /home/svc_tensorrt/bloom/scripts/agent-${nodeName}.jar /home/svc_tensorrt/bloom/scripts/${nodeName}-slurm_jenkins_agent_setup.sh || true",
-    "rm /lustre/fs1/portfolios/coreai/projects/coreai_tensorrt_ci/users/svc_tensorrt/containers/container-${slurmJobID}.sqsh || true",
-].join(" && ")
+]
+if (slurmJobID) {
+    cleanupCommands.add("rm /lustre/fs1/portfolios/coreai/projects/coreai_tensorrt_ci/users/svc_tensorrt/containers/container-${slurmJobID}.sqsh || true")
+}
+def cleanupCmd = cleanupCommands.join(" && ")
 Utils.exec(
     pipeline,
     script: Utils.sshUserCmd(
         remote,
-        cleanupCommands
+        cleanupCmd
     )
 )
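The suggested fix can be rendered in Python for clarity. This is an illustrative sketch, not the Groovy code itself; the paths mirror those in the diff above, and the function name is hypothetical.

```python
from typing import Optional

def build_cleanup_command(node_name: str, slurm_job_id: Optional[str]) -> str:
    """Join cleanup steps, adding the .sqsh removal only when a job ID exists."""
    commands = [
        f"rm -rf /home/svc_tensorrt/bloom/scripts/agent-{node_name}.jar "
        f"/home/svc_tensorrt/bloom/scripts/{node_name}-slurm_jenkins_agent_setup.sh || true",
    ]
    if slurm_job_id:
        # Skipped entirely when submission failed, so no malformed path is built.
        commands.append(
            "rm /lustre/fs1/portfolios/coreai/projects/coreai_tensorrt_ci/users/"
            f"svc_tensorrt/containers/container-{slurm_job_id}.sqsh || true"
        )
    return " && ".join(commands)
```

The key point of the fix is that the conditional guards command construction, not just execution, so a null job ID never reaches the remote shell.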
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 834a780 and ef61960.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
jenkins/L0_Test.groovy (5)

14-14: LGTM - Import statement is correct.

The import follows the established pattern for shared library components and is properly used in the conditional logic at lines 672-674.


671-678: Good error handling for container runtime selection.

The conditional logic properly routes execution based on container runtime type with clear error handling for unsupported runtimes. The explicit exception provides good visibility if an unsupported runtime is configured.


2194-2195: Good defensive coding practice.

The comment explaining why || true is needed for dmesg -C helps maintainability. This prevents cleanup failures on clusters with restricted permissions.


2535-2544: Verify timeout value difference with the shared library definition.

DEFAULT_TIMEOUT_SHORT is defined in the external shared library com.nvidia.bloom.SlurmConfig (imported at line 10), not in this repository. Without access to that library's definition, I cannot confirm whether the shorter timeout is intentional or an oversight.

  • Line 2519 uses DEFAULT_TIMEOUT (300 minutes per comment)
  • Line 2539 uses DEFAULT_TIMEOUT_SHORT (value unknown)

Additionally, per the prior learning about Slurm job monitoring: jobs with built-in timeouts configured typically don't need separate monitoring loop timeouts, as Slurm naturally terminates and removes timed-out jobs from the queue.

Please verify: (1) the actual value of DEFAULT_TIMEOUT_SHORT in the shared library, (2) whether ENROOT execution is expected to complete significantly faster than Docker execution, and (3) whether this timeout wrapper is necessary given Slurm's built-in timeout mechanism.
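The trade-off raised here can be sketched generically. This is a minimal, hypothetical polling helper, not code from the pipeline; the poll callable stands in for checking whether the Slurm job has left the queue (for example, by parsing squeue output), which keeps the loop logic testable without a cluster.

```python
import time

def wait_for_job(poll, timeout_s: float, interval_s: float = 0.01) -> bool:
    """Return True if poll() reports completion before the deadline.

    A wrapper like this is redundant when the job itself carries a Slurm
    time limit, since Slurm terminates and dequeues timed-out jobs anyway.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if poll():
            return True
        time.sleep(interval_s)
    return False
```

If the monitoring loop is kept, its timeout should be at least as long as the job's Slurm time limit, otherwise the wrapper can abort a job that Slurm would still have allowed to finish.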


557-561: Verify if these mounts should be conditional based on cluster type.

The comment indicates these mounts are "Specific for OCI machines," but they're applied unconditionally for all clusters. While the concern is valid, the suggested fix is incorrect—cluster.requiresOciMounts doesn't exist in the codebase.

If these mounts are indeed OCI-specific, you'll need to add a conditional check. The codebase shows the pattern at line 662 uses partition.clusterName for cluster identification (e.g., partition.clusterName == "dlcluster"). Verify whether partition.clusterName or another available cluster property can distinguish OCI clusters, and if so, gate the mounts accordingly. Otherwise, if these mounts are required for all clusters, update the comment to reflect that.
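A conditional gate along the lines the reviewer suggests could look like the following. This is a hypothetical Python sketch: the cluster names and mount path are placeholders, and the real code would key off partition.clusterName in Groovy rather than a function argument.

```python
# Placeholder set of clusters that need the OCI-specific mounts.
OCI_CLUSTERS = {"oci-cluster-a", "oci-cluster-b"}

def build_mounts(cluster_name: str, base_mounts: list) -> list:
    """Extend the base mount list with OCI-specific mounts only where needed."""
    mounts = list(base_mounts)  # copy so callers' lists are not mutated
    if cluster_name in OCI_CLUSTERS:
        mounts.append("/oci/host/path:/oci/container/path")  # placeholder mount
    return mounts
```

Gating on a known cluster property keeps Docker and non-OCI Slurm workflows byte-for-byte identical to their pre-change behavior, which is the safest way to add the new mounts.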

@chzblych (Collaborator)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #23064 [ run ] triggered by Bot. Commit: 36bded3

@yuanjingx87 (Collaborator)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #23079 [ run ] triggered by Bot. Commit: 5f4d4f4

@tensorrt-cicd (Collaborator)

PR_Github #23064 [ run ] completed with state ABORTED. Commit: 36bded3
LLM/main/L0_MergeRequest_PR #17391 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd (Collaborator)

PR_Github #23079 [ run ] completed with state FAILURE. Commit: 5f4d4f4
/LLM/main/L0_MergeRequest_PR pipeline #17405 completed with status: 'FAILURE'

mlefeb01 and others added 5 commits October 30, 2025 15:24
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
@mlefeb01 (Collaborator, Author)

/bot run

@yuanjingx87 (Collaborator)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #23087 [ run ] triggered by Bot. Commit: 542c450

@tensorrt-cicd (Collaborator)

PR_Github #23087 [ run ] completed with state SUCCESS. Commit: 542c450
/LLM/main/L0_MergeRequest_PR pipeline #17411 completed with status: 'FAILURE'

@mlefeb01 (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #23111 [ run ] triggered by Bot. Commit: 542c450

@chzblych (Collaborator)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #23178 [ run ] triggered by Bot. Commit: 542c450

@tensorrt-cicd (Collaborator)

PR_Github #23111 [ run ] completed with state ABORTED. Commit: 542c450
LLM/main/L0_MergeRequest_PR #17430 (Blue Ocean) completed with status: ABORTED

@chzblych (Collaborator)

chzblych commented Oct 31, 2025

@mlefeb01 I triggered a new pipeline run because some new multi-GPU test failures are waived at TOT main. The new pipeline run will merge the test waivers from TOT main on the fly.

@tensorrt-cicd (Collaborator)

PR_Github #23178 [ run ] completed with state SUCCESS. Commit: 542c450
/LLM/main/L0_MergeRequest_PR pipeline #17473 completed with status: 'SUCCESS'

@mlefeb01 mlefeb01 merged commit da2dca5 into NVIDIA:main Oct 31, 2025
5 checks passed
@mlefeb01 mlefeb01 deleted the slurm-enroot branch October 31, 2025 19:22
fredricz-20070104 pushed a commit to fredricz-20070104/TensorRT-LLM that referenced this pull request Nov 5, 2025
…8770)

Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
