
[TRTINFRA-7215][infra] Add support for enroot SLURM clusters#8770

Merged
mlefeb01 merged 5 commits into NVIDIA:main from mlefeb01:slurm-enroot
Oct 31, 2025
Conversation

@mlefeb01 (Collaborator) commented Oct 29, 2025

Summary by CodeRabbit

  • New Features
    • Added support for ENROOT as an alternative container runtime alongside Docker
    • Enhanced Slurm cluster integration with improved mount configuration for OCI container environments
    • Implemented explicit error handling for unsupported container runtimes
    • Added automated cleanup of container and agent artifacts on completion

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with the Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break top of tree.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can break top of tree.

@mlefeb01 mlefeb01 requested review from a team as code owners October 29, 2025 23:44
coderabbitai bot (Contributor) commented Oct 29, 2025

📝 Walkthrough

Walkthrough

Added support for ENROOT container runtime alongside existing Docker support in L0_Test.groovy. The change introduces conditional logic to route execution based on containerRuntime type, with a new public function runInEnrootOnNode(label) and dynamic entrypoint resolution. Updated agent setup and cleanup logic to accommodate both runtimes.

Changes

Cohort / File(s) Summary
Container runtime abstraction
jenkins/L0_Test.groovy
Imported ContainerRuntime; added conditional runner selection (Docker vs ENROOT); introduced runInEnrootOnNode(label) public function; added exception for unsupported runtimes; implemented dynamic entrypoint lookup via containerRuntimeToEntrypoint mapping.
Agent setup and resource handling
jenkins/L0_Test.groovy
Modified agent setup flow to use dynamically resolved entrypoint instead of hard-coded Docker paths; updated library resource copying to use entrypoint-derived paths.
Slurm integration and cleanup
jenkins/L0_Test.groovy
Added mounts variable for OCI/Open container environment support in Slurm submission; refactored cleanup logic to execute gathered cleanupCommands list for removing agent and container artifacts.
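The runtime routing and entrypoint lookup described above can be sketched as follows. This is an illustrative Python rendering, not the actual Groovy in jenkins/L0_Test.groovy; the mapping contents and the runner names mirror the walkthrough, but the entrypoint paths are placeholders.

```python
# Hypothetical sketch of the containerRuntime routing: map each supported
# runtime to its setup entrypoint and fail fast on anything unsupported.
CONTAINER_RUNTIME_TO_ENTRYPOINT = {
    "docker": "scripts/slurm_agent_docker_setup.sh",  # placeholder path
    "enroot": "scripts/slurm_agent_enroot_setup.sh",  # placeholder path
}

def select_runner(container_runtime: str):
    """Return a (runner_name, entrypoint) pair for the configured runtime."""
    entrypoint = CONTAINER_RUNTIME_TO_ENTRYPOINT.get(container_runtime)
    if entrypoint is None:
        # Mirrors the explicit exception for unsupported runtimes.
        raise ValueError(f"Unsupported container runtime: {container_runtime}")
    if container_runtime == "docker":
        return "runInDockerOnNodeMultiStage", entrypoint
    return "runInEnrootOnNode", entrypoint
```

Keeping the mapping as the single source of truth means adding a new runtime only requires a new entry plus a runner branch, and any typo in the configured runtime surfaces immediately as an exception rather than a silent fallback.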

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant L0Test
    participant ContainerRuntime
    participant DockerRunner
    participant EnrootRunner
    
    User->>L0Test: Trigger pipeline with containerRuntime config
    L0Test->>ContainerRuntime: Read containerRuntime value
    
    alt containerRuntime is DOCKER
        L0Test->>DockerRunner: Route to runInDockerOnNodeMultiStage
        DockerRunner->>DockerRunner: Setup using Docker entrypoint
        DockerRunner-->>L0Test: Execution complete
    else containerRuntime is ENROOT
        L0Test->>EnrootRunner: Route to runInEnrootOnNode(label)
        EnrootRunner->>EnrootRunner: Apply timeout wrapper
        EnrootRunner->>EnrootRunner: Setup using ENROOT entrypoint
        EnrootRunner-->>L0Test: Execution complete
    else unsupported runtime
        L0Test-->>L0Test: Throw exception
    end
    
    L0Test->>L0Test: Execute cleanup commands
    L0Test-->>User: Pipeline complete

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

  • Conditional routing logic: Review the DOCKER vs ENROOT vs exception branching to ensure all runtime paths are handled correctly and no edge cases are missed.
  • Entrypoint mapping: Verify that containerRuntimeToEntrypoint correctly maps each runtime to its corresponding entrypoint, and that the mapping is exhaustive.
  • Slurm mounts addition: Validate that the new mounts variable is properly constructed and integrated into Slurm command generation without breaking existing Docker workflows.
  • Cleanup commands: Ensure cleanup logic correctly gathers and executes removal commands for both container runtimes, preventing artifact leakage.

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
Description Check (⚠️ Warning): The PR description is largely incomplete and does not meet the repository's template requirements. While the PR title appears properly formatted as "[TRTINFRA-7215][infra] Add support for enroot SLURM clusters" based on the objectives, the critical sections of the template are unfilled. The Description section contains only the template comment prompt without any explanation of what changes were made or why, and the Test Coverage section is similarly empty with no test cases listed. The PR Checklist box is marked as completed, but this does not substitute for filling out the substantive sections that explain the implementation and its testing. The raw summary indicates significant changes were made to jenkins/L0_Test.groovy to add ENROOT workflow support, but none of this context appears in the PR description itself. The author must complete the PR description with a clear explanation of the ENROOT support changes, their rationale, and how they improve SLURM cluster handling. Additionally, the Test Coverage section must list specific tests that validate the new ENROOT workflow functionality, including any relevant L0 tests or pipeline verification steps.
✅ Passed checks (1 passed)
Title Check (✅ Passed): The PR title "[TRTINFRA-7215][infra] Add support for enroot SLURM clusters" follows the specified template format with a valid JIRA ticket identifier and type classification. The title is concise, specific, and directly reflects the core change: introducing ENROOT as an additional container runtime option for SLURM clusters alongside the existing Docker support. A developer scanning the repository history would clearly understand that this change extends the infrastructure to support ENROOT-based execution.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)

498-501: Add conditional check for slurmJobID before cleanup.

The cleanup command at line 500 uses ${slurmJobID}, which could be null if the SLURM job submission failed. While the || true prevents a hard failure, it still attempts to remove a malformed path.

Consider adding a conditional check:

 def cleanupCommands = [
     "rm -rf /home/svc_tensorrt/bloom/scripts/agent-${nodeName}.jar /home/svc_tensorrt/bloom/scripts/${nodeName}-slurm_jenkins_agent_setup.sh || true",
-    "rm /lustre/fs1/portfolios/coreai/projects/coreai_tensorrt_ci/users/svc_tensorrt/containers/container-${slurmJobID}.sqsh || true",
-].join(" && ")
+]
+if (slurmJobID) {
+    cleanupCommands.add("rm /lustre/fs1/portfolios/coreai/projects/coreai_tensorrt_ci/users/svc_tensorrt/containers/container-${slurmJobID}.sqsh || true")
+}
+def cleanupCmd = cleanupCommands.join(" && ")
 Utils.exec(
     pipeline,
     script: Utils.sshUserCmd(
         remote,
-        cleanupCommands
+        cleanupCmd
     )
 )
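The suggested fix can be rendered in Python for clarity. This is an illustrative sketch, not the Groovy code itself; the paths mirror those in the diff above, and the function name is hypothetical.

```python
from typing import Optional

def build_cleanup_command(node_name: str, slurm_job_id: Optional[str]) -> str:
    """Join cleanup steps, adding the .sqsh removal only when a job ID exists."""
    commands = [
        f"rm -rf /home/svc_tensorrt/bloom/scripts/agent-{node_name}.jar "
        f"/home/svc_tensorrt/bloom/scripts/{node_name}-slurm_jenkins_agent_setup.sh || true",
    ]
    if slurm_job_id:
        # Skipped entirely when submission failed, so no malformed path is built.
        commands.append(
            "rm /lustre/fs1/portfolios/coreai/projects/coreai_tensorrt_ci/users/"
            f"svc_tensorrt/containers/container-{slurm_job_id}.sqsh || true"
        )
    return " && ".join(commands)
```

The key point of the fix is that the conditional guards command construction, not just execution, so a null job ID never reaches the remote shell.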
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 834a780 and ef61960.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy (7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (5)
jenkins/L0_Test.groovy (5)

14-14: LGTM - Import statement is correct.

The import follows the established pattern for shared library components and is properly used in the conditional logic at lines 672-674.


671-678: Good error handling for container runtime selection.

The conditional logic properly routes execution based on container runtime type with clear error handling for unsupported runtimes. The explicit exception provides good visibility if an unsupported runtime is configured.


2194-2195: Good defensive coding practice.

The comment explaining why || true is needed for dmesg -C helps maintainability. This prevents cleanup failures on clusters with restricted permissions.


2535-2544: Verify timeout value difference with the shared library definition.

DEFAULT_TIMEOUT_SHORT is defined in the external shared library com.nvidia.bloom.SlurmConfig (imported at line 10), not in this repository. Without access to that library's definition, I cannot confirm whether the shorter timeout is intentional or an oversight.

  • Line 2519 uses DEFAULT_TIMEOUT (300 minutes per comment)
  • Line 2539 uses DEFAULT_TIMEOUT_SHORT (value unknown)

Additionally, per the prior learning about Slurm job monitoring: jobs with built-in timeouts configured typically don't need separate monitoring loop timeouts, as Slurm naturally terminates and removes timed-out jobs from the queue.

Please verify: (1) the actual value of DEFAULT_TIMEOUT_SHORT in the shared library, (2) whether ENROOT execution is expected to complete significantly faster than Docker execution, and (3) whether this timeout wrapper is necessary given Slurm's built-in timeout mechanism.
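The trade-off raised here can be sketched generically. This is a minimal, hypothetical polling helper, not code from the pipeline; the poll callable stands in for checking whether the Slurm job has left the queue (for example, by parsing squeue output), which keeps the loop logic testable without a cluster.

```python
import time

def wait_for_job(poll, timeout_s: float, interval_s: float = 0.01) -> bool:
    """Return True if poll() reports completion before the deadline.

    A wrapper like this is redundant when the job itself carries a Slurm
    time limit, since Slurm terminates and dequeues timed-out jobs anyway.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if poll():
            return True
        time.sleep(interval_s)
    return False
```

If the monitoring loop is kept, its timeout should be at least as long as the job's Slurm time limit, otherwise the wrapper can abort a job that Slurm would still have allowed to finish.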


557-561: Verify if these mounts should be conditional based on cluster type.

The comment indicates these mounts are "Specific for OCI machines," but they're applied unconditionally for all clusters. While the concern is valid, the suggested fix is incorrect—cluster.requiresOciMounts doesn't exist in the codebase.

If these mounts are indeed OCI-specific, you'll need to add a conditional check. The codebase shows the pattern at line 662 uses partition.clusterName for cluster identification (e.g., partition.clusterName == "dlcluster"). Verify whether partition.clusterName or another available cluster property can distinguish OCI clusters, and if so, gate the mounts accordingly. Otherwise, if these mounts are required for all clusters, update the comment to reflect that.
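A conditional gate along the lines the reviewer suggests could look like the following. This is a hypothetical Python sketch: the cluster names and mount path are placeholders, and the real code would key off partition.clusterName in Groovy rather than a function argument.

```python
# Placeholder set of clusters that need the OCI-specific mounts.
OCI_CLUSTERS = {"oci-cluster-a", "oci-cluster-b"}

def build_mounts(cluster_name: str, base_mounts: list) -> list:
    """Extend the base mount list with OCI-specific mounts only where needed."""
    mounts = list(base_mounts)  # copy so callers' lists are not mutated
    if cluster_name in OCI_CLUSTERS:
        mounts.append("/oci/host/path:/oci/container/path")  # placeholder mount
    return mounts
```

Gating on a known cluster property keeps Docker and non-OCI Slurm workflows byte-for-byte identical to their pre-change behavior, which is the safest way to add the new mounts.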

@chzblych (Collaborator)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #23064 [ run ] triggered by Bot. Commit: 36bded3

@yuanjingx87 (Collaborator)

/bot run

@tensorrt-cicd (Collaborator)

PR_Github #23079 [ run ] triggered by Bot. Commit: 5f4d4f4

@tensorrt-cicd (Collaborator)

PR_Github #23064 [ run ] completed with state ABORTED. Commit: 36bded3
LLM/main/L0_MergeRequest_PR #17391 (Blue Ocean) completed with status: ABORTED

@tensorrt-cicd (Collaborator)

PR_Github #23079 [ run ] completed with state FAILURE. Commit: 5f4d4f4
/LLM/main/L0_MergeRequest_PR pipeline #17405 completed with status: 'FAILURE'

mlefeb01 and others added 5 commits October 30, 2025 15:24
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
@mlefeb01 (Collaborator, Author)

/bot run

@yuanjingx87 (Collaborator)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #23087 [ run ] triggered by Bot. Commit: 542c450

@tensorrt-cicd (Collaborator)

PR_Github #23087 [ run ] completed with state SUCCESS. Commit: 542c450
/LLM/main/L0_MergeRequest_PR pipeline #17411 completed with status: 'FAILURE'

@mlefeb01 (Collaborator, Author)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #23111 [ run ] triggered by Bot. Commit: 542c450

@chzblych (Collaborator)

/bot run --disable-fail-fast

@tensorrt-cicd (Collaborator)

PR_Github #23178 [ run ] triggered by Bot. Commit: 542c450

@tensorrt-cicd (Collaborator)

PR_Github #23111 [ run ] completed with state ABORTED. Commit: 542c450
LLM/main/L0_MergeRequest_PR #17430 (Blue Ocean) completed with status: ABORTED

@chzblych (Collaborator)

chzblych commented Oct 31, 2025

@mlefeb01 I triggered a new pipeline run because some new multi-GPU test failures are waived at TOT main. The new pipeline run will merge the test waivers from TOT main on the fly.

@tensorrt-cicd (Collaborator)

PR_Github #23178 [ run ] completed with state SUCCESS. Commit: 542c450
/LLM/main/L0_MergeRequest_PR pipeline #17473 completed with status: 'SUCCESS'

@mlefeb01 mlefeb01 merged commit da2dca5 into NVIDIA:main Oct 31, 2025
5 checks passed
@mlefeb01 mlefeb01 deleted the slurm-enroot branch October 31, 2025 19:22
fredricz-20070104 pushed a commit to fredricz-20070104/TensorRT-LLM that referenced this pull request Nov 5, 2025
…8770)

Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Co-authored-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
