[TRTINFRA-7215][infra] Add support for enroot SLURM clusters#8770
[TRTINFRA-7215][infra] Add support for enroot SLURM clusters#8770mlefeb01 merged 5 commits intoNVIDIA:mainfrom
Conversation
📝 WalkthroughWalkthroughAdded support for ENROOT container runtime alongside existing Docker support in L0_Test.groovy. The change introduces conditional logic to route execution based on containerRuntime type, with a new public function runInEnrootOnNode(label) and dynamic entrypoint resolution. Updated agent setup and cleanup logic to accommodate both runtimes. Changes
Sequence Diagram(s)sequenceDiagram
participant User
participant L0Test
participant ContainerRuntime
participant DockerRunner
participant EnrootRunner
User->>L0Test: Trigger pipeline with containerRuntime config
L0Test->>ContainerRuntime: Read containerRuntime value
alt containerRuntime is DOCKER
L0Test->>DockerRunner: Route to runInDockerOnNodeMultiStage
DockerRunner->>DockerRunner: Setup using Docker entrypoint
DockerRunner-->>L0Test: Execution complete
else containerRuntime is ENROOT
L0Test->>EnrootRunner: Route to runInEnrootOnNode(label)
EnrootRunner->>EnrootRunner: Apply timeout wrapper
EnrootRunner->>EnrootRunner: Setup using ENROOT entrypoint
EnrootRunner-->>L0Test: Execution complete
else unsupported runtime
L0Test-->>L0Test: Throw exception
end
L0Test->>L0Test: Execute cleanup commands
L0Test-->>User: Pipeline complete
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes
Pre-merge checks and finishing touches❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
jenkins/L0_Test.groovy (1)
498-501: Add conditional check for slurmJobID before cleanup.The cleanup command at line 500 uses
${slurmJobID}which could be null if the SLURM job submission failed. While the|| trueprevents failure, it results in attempting to remove a malformed path.Consider adding a conditional check:
def cleanupCommands = [ "rm -rf /home/svc_tensorrt/bloom/scripts/agent-${nodeName}.jar /home/svc_tensorrt/bloom/scripts/${nodeName}-slurm_jenkins_agent_setup.sh || true", - "rm /lustre/fs1/portfolios/coreai/projects/coreai_tensorrt_ci/users/svc_tensorrt/containers/container-${slurmJobID}.sqsh || true", -].join(" && ") +] +if (slurmJobID) { + cleanupCommands.add("rm /lustre/fs1/portfolios/coreai/projects/coreai_tensorrt_ci/users/svc_tensorrt/containers/container-${slurmJobID}.sqsh || true") +} +def cleanupCmd = cleanupCommands.join(" && ") Utils.exec( pipeline, script: Utils.sshUserCmd( remote, - cleanupCommands + cleanupCmd ) )
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
jenkins/L0_Test.groovy(7 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (5)
jenkins/L0_Test.groovy (5)
14-14: LGTM - Import statement is correct.The import follows the established pattern for shared library components and is properly used in the conditional logic at lines 672-674.
671-678: Good error handling for container runtime selection.The conditional logic properly routes execution based on container runtime type with clear error handling for unsupported runtimes. The explicit exception provides good visibility if an unsupported runtime is configured.
2194-2195: Good defensive coding practice.The comment explaining why
|| trueis needed fordmesg -Chelps maintainability. This prevents cleanup failures on clusters with restricted permissions.
2535-2544: Verify timeout value difference with the shared library definition.
DEFAULT_TIMEOUT_SHORTis defined in the external shared librarycom.nvidia.bloom.SlurmConfig(imported at line 10), not in this repository. Without access to that library's definition, I cannot confirm whether the shorter timeout is intentional or an oversight.
- Line 2519 uses
DEFAULT_TIMEOUT(300 minutes per comment)- Line 2539 uses
DEFAULT_TIMEOUT_SHORT(value unknown)Additionally, per the prior learning about Slurm job monitoring: jobs with built-in timeouts configured typically don't need separate monitoring loop timeouts, as Slurm naturally terminates and removes timed-out jobs from the queue.
Please verify: (1) the actual value of
DEFAULT_TIMEOUT_SHORTin the shared library, (2) whether ENROOT execution is expected to complete significantly faster than Docker execution, and (3) whether this timeout wrapper is necessary given Slurm's built-in timeout mechanism.
557-561: Verify if these mounts should be conditional based on cluster type.The comment indicates these mounts are "Specific for OCI machines," but they're applied unconditionally for all clusters. While the concern is valid, the suggested fix is incorrect—
cluster.requiresOciMountsdoesn't exist in the codebase.If these mounts are indeed OCI-specific, you'll need to add a conditional check. The codebase shows the pattern at line 662 uses
partition.clusterNamefor cluster identification (e.g.,partition.clusterName == "dlcluster"). Verify whetherpartition.clusterNameor another available cluster property can distinguish OCI clusters, and if so, gate the mounts accordingly. Otherwise, if these mounts are required for all clusters, update the comment to reflect that.
|
/bot run --disable-fail-fast |
|
PR_Github #23064 [ run ] triggered by Bot. Commit: |
eaaa4b2 to
5f4d4f4
Compare
|
/bot run |
|
PR_Github #23079 [ run ] triggered by Bot. Commit: |
|
PR_Github #23064 [ run ] completed with state |
|
PR_Github #23079 [ run ] completed with state |
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Yanchao Lu <yanchaol@nvidia.com>
Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com>
5f4d4f4 to
542c450
Compare
|
/bot run |
|
/bot run --disable-fail-fast |
|
PR_Github #23087 [ run ] triggered by Bot. Commit: |
|
PR_Github #23087 [ run ] completed with state |
|
/bot run --disable-fail-fast |
|
PR_Github #23111 [ run ] triggered by Bot. Commit: |
|
/bot run --disable-fail-fast |
|
PR_Github #23178 [ run ] triggered by Bot. Commit: |
|
PR_Github #23111 [ run ] completed with state |
|
@mlefeb01 I triggered a new pipeline run because that some new multi-GPU test failures are waived at TOT main. The new pipeline run will merge the test waives from TOT main on the fly. |
|
PR_Github #23178 [ run ] completed with state |
…8770) Signed-off-by: Matt Lefebvre <mlefebvre@nvidia.com> Signed-off-by: Yanchao Lu <yanchaol@nvidia.com> Co-authored-by: Yanchao Lu <yanchaol@nvidia.com> Signed-off-by: FredricZ-2007 <226039983+fredricz-20070104@users.noreply.github.com>
Summary by CodeRabbit
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...Provide a user friendly way for developers to interact with a Jenkins server.
Run
/bot [-h|--help]to print this help message.See details below for each supported subcommand.
Details
run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]Launch build/test pipelines. All previously running jobs will be killed.
--reuse-test (optional)pipeline-id(OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.--disable-reuse-test(OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.--disable-fail-fast(OPTIONAL) : Disable fail fast on build/tests/infra failures.--skip-test(OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.--stage-list "A10-PyTorch-1, xxx"(OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.--gpu-type "A30, H100_PCIe"(OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.--test-backend "pytorch, cpp"(OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.--only-multi-gpu-test(OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.--disable-multi-gpu-test(OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.--add-multi-gpu-test(OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.--post-merge(OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"(OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".--detailed-log(OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.--debug(OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in thestage-listparameter to access the appropriate container environment. Note: Does NOT update GitHub check status.For guidance on mapping tests to stage names, see
docs/source/reference/ci-overview.mdand the
scripts/test_to_stage_mapping.pyhelper.kill
killKill all running builds associated with pull request.
skip
skip --comment COMMENTSkip testing for latest commit on pull request.
--comment "Reason for skipping build/test"is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.reuse-pipeline
reuse-pipelineReuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.