[None][infra] Some improvements for Slurm execution path in the CI #10316

chzblych · 2025-12-28T15:23:47Z

Summary by CodeRabbit

Release Notes

New Features
- Added automatic retry logic for transient failures during test setup and job operations.
- Introduced persistent job ID tracking for better monitoring and debugging.
Bug Fixes
- Improved handling of delayed job status updates with retry mechanism and fallback defaults.
- Enhanced error messages for job submission failures and status retrieval issues.
- Better aggregation of exit codes across test execution phases for accurate reporting.

_{✏️ Tip: You can customize this high-level summary in your review settings.}

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

coderabbitai · 2025-12-28T15:30:10Z

📝 Walkthrough

Walkthrough

The pull request upgrades Slurm-based test infrastructure reliability by introducing a retry utility function, applying retry logic to installation and status polling operations, converting enum comparisons to string-based checks, and refactoring exit code tracking to aggregate failures across multiple test phases.

Changes

Cohort / File(s)	Summary
Slurm Job Submission & Status Tracking `jenkins/L0_Test.groovy`	Converted containerRuntime enum comparisons to string checks; added bash utilities script transfer to remote nodes; introduced Slurm job ID persistence to `slurm_job_id.txt`; replaced simple sleep with retry-based polling for sacct status queries with backoff and default fallbacks (EXIT_CODE=1, STATUS="UNKNOWN"); enhanced logging clarity ("Starting Slurm job…", "Submitted Slurm job…")
Bash Utilities `jenkins/scripts/bash_utils.sh`	Added new `retry_command()` function supporting configurable `max_retries` (default 3) and `interval` (default 60) parameters; executes commands with retry logic and backoff before failing
Installation with Retries `jenkins/scripts/slurm_install.sh`	Integrated `bash_utils` sourcing; wrapped wget, tar, apt-get, and pip install commands with `retry_command` for resilience; converted sequential cd/pip install chains to subshell invocations with retry support
Exit Code Tracking & Aggregation `jenkins/scripts/slurm_run.sh`	Refactored to disable immediate exit-on-error (`set +e`); introduced per-phase exit code variables (pytest_exit_code, perf_check_exit_code, perf_sanity_check_exit_code); captures and logs individual phase exit codes; computes and exits with aggregated final_exit_code; added non-fatal coverage config wait for non-root processes

Sequence Diagrams

sequenceDiagram
    participant Jenkins as Jenkins (Groovy)
    participant Slurm as Slurm Controller
    participant Node as Remote Node
    participant Sacct as sacct
    
    rect rgb(240, 248, 255)
    note over Jenkins,Node: Job Submission Phase
    Jenkins->>Slurm: sbatch (submit job)
    Slurm-->>Jenkins: jobid
    Jenkins->>Node: write jobid to slurm_job_id.txt
    Jenkins->>Node: copy bash_utils.sh
    end
    
    rect rgb(245, 245, 220)
    note over Jenkins,Sacct: Status Polling Phase (with Retries)
    Jenkins->>Sacct: sacct --job=jobid (attempt 1)
    alt Sacct delayed/unavailable
        Sacct-->>Jenkins: no status
        Jenkins->>Jenkins: wait 10s
        Jenkins->>Sacct: sacct --job=jobid (attempt 2)
    else Status available
        Sacct-->>Jenkins: STATUS, EXIT_CODE
    end
    end
    
    rect rgb(240, 255, 240)
    note over Jenkins,Node: Exit Code Aggregation Phase
    Node->>Node: pytest phase (capture exit_code_1)
    Node->>Node: perf check phase (capture exit_code_2)
    Node->>Node: perf sanity phase (capture exit_code_3)
    Node->>Node: final_exit_code = sum(exit_codes)
    Node->>Node: exit(final_exit_code)
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description is entirely composed of the template boilerplate with no actual content filled in for Description, Test Coverage, or concrete implementation details.	Fill in the Description section with what improvements were made and why, and provide Test Coverage details listing relevant tests that validate the Slurm execution changes.
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly describes the main change: improvements to the Slurm execution path in CI infrastructure, which matches the actual changes made to Slurm-related scripts and logging.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)

jenkins/L0_Test.groovy (1)
697-703: Simplify enum comparison by removing unnecessary .toString() calls.

The code uses cluster.containerRuntime directly as a map key in SlurmConfig.containerRuntimeToEntrypoint[cluster.containerRuntime] but then applies .toString() for string comparisons elsewhere (lines 697, 699, 863, 867, 1053). This inconsistency suggests the .toString() calls are unnecessary. For enum comparisons, directly compare against the enum constant:
if (cluster.containerRuntime == ContainerRuntime.DOCKER) {
    // ...
} else if (cluster.containerRuntime == ContainerRuntime.ENROOT) {
    // ...
}
This is more idiomatic, type-safe, and aligns with how the field is already used elsewhere in the code.

🧹 Nitpick comments (2)

jenkins/scripts/slurm_run.sh (2)
66-68: Consider adding a timeout or verification for the coverage config file.

The 10-second sleep assumes the coverage config file will be ready within that time. Consider adding verification to ensure the file exists before proceeding:
🔎 Suggested improvement with file verification
 else
-    # Sleep 10 seconds to wait for the coverage config file to be saved
-    sleep 10
+    # Wait for the coverage config file to be saved
+    for i in {1..30}; do
+        if [ -f "$coverageConfigFile" ]; then
+            echo "Coverage config file found after ${i} seconds"
+            break
+        fi
+        sleep 1
+    done
+    if [ ! -f "$coverageConfigFile" ]; then
+        echo "Warning: Coverage config file not found after 30 seconds"
+    fi
 fi
99-141: Exit code aggregation logic is sound but could be more robust.

The approach of turning off set -e and aggregating exit codes is correct for tracking failures across multiple phases. The use of addition for aggregation is practical, though non-zero exit codes could theoretically add up in unexpected ways.

If you want more robust exit code handling, consider using bitwise OR or taking the maximum:
# Alternative 1: Use bitwise OR
final_exit_code=$((pytest_exit_code | perf_check_exit_code | perf_sanity_check_exit_code))

# Alternative 2: Take the maximum (preserves specific error codes)
final_exit_code=$pytest_exit_code
[ $perf_check_exit_code -gt $final_exit_code ] && final_exit_code=$perf_check_exit_code
[ $perf_sanity_check_exit_code -gt $final_exit_code ] && final_exit_code=$perf_sanity_check_exit_code
However, the current approach is clear and sufficient for detecting any failures (non-zero exit code).

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 270be80 and 58620d5.

📒 Files selected for processing (4)

jenkins/L0_Test.groovy
jenkins/scripts/bash_utils.sh
jenkins/scripts/slurm_install.sh
jenkins/scripts/slurm_run.sh

🧰 Additional context used

🧠 Learnings (2)

📓 Common learnings

Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.

📚 Learning: 2025-08-22T19:08:10.822Z

Learnt from: yuanjingx87
Repo: NVIDIA/TensorRT-LLM PR: 7176
File: jenkins/L0_Test.groovy:361-389
Timestamp: 2025-08-22T19:08:10.822Z
Learning: In Slurm job monitoring scripts, when jobs have built-in timeouts configured (via --time parameter or partition/system timeouts), an additional timeout mechanism in the monitoring loop is typically unnecessary. When a Slurm job times out, it gets terminated and removed from the active queue, causing `squeue -j $jobId` to return non-zero and break monitoring loops naturally. The job's final status can then be checked via `sacct` to determine if it failed due to timeout.

Applied to files:

jenkins/L0_Test.groovy

🧬 Code graph analysis (1)

jenkins/scripts/slurm_install.sh (1)

jenkins/scripts/bash_utils.sh (1)

retry_command (11-45)

🪛 Shellcheck (0.11.0)

jenkins/scripts/slurm_install.sh

[warning] 9-9: ShellCheck can't follow non-constant source. Use a directive to specify location.

(SC1090)

[warning] 12-12: resourcePathNode is referenced but not assigned.

(SC2154)

[warning] 16-16: llmTarfile is referenced but not assigned.

(SC2154)

[warning] 16-16: tarName is referenced but not assigned.

(SC2154)

[warning] 21-21: pytestCommand is referenced but not assigned.

(SC2154)

jenkins/scripts/slurm_run.sh

[warning] 134-134: jobWorkspace is referenced but not assigned.

(SC2154)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Pre-commit Check

🔇 Additional comments (6)

jenkins/scripts/bash_utils.sh (1)

11-45: LGTM! Well-implemented retry utility function.

The retry_command function is cleanly implemented with:

Clear documentation and usage examples

Flexible parameter handling (optional max_retries and interval)

Proper exit code propagation

Informative error messages

This will significantly improve the reliability of Slurm installation and setup operations.
jenkins/scripts/slurm_install.sh (2)
7-9: Verify the dynamic path construction for bash_utils.sh.

The path computation using sed to replace slurm_install.sh with bash_utils.sh is clever, but ensure that:

Both scripts are always in the same directory

The script name exactly matches slurm_install.sh

Consider documenting this path dependency or using a simpler approach:
# Alternative: Use dirname to compute the path
bashUtilsPath="$(dirname "${BASH_SOURCE[0]}")/bash_utils.sh"
source "$bashUtilsPath"
24-25: No action needed—exit code propagation and variable handling are working correctly.

The use of bash -c with retry_command here is properly implemented. Exit codes are correctly propagated through the subshell to retry_command (verified: false && echo returns exit code 1), and variables like $llmSrcNode and $resourcePathNode are expanded by the parent shell before being passed to bash -c, ensuring they resolve correctly without requiring environment variable inheritance.
jenkins/L0_Test.groovy (3)

961-968: Good addition of bash_utils.sh transfer to remote node.

The explicit copying of bash_utils.sh to the remote Slurm node is properly implemented with:

Clear logging of the script contents

Proper path variables (scriptBashUtilsLocalPath, scriptBashUtilsPathNode)

Transfer with execute permissions (true parameter)

This ensures the retry functionality is available during Slurm job execution.

1244-1265: Excellent retry logic for handling delayed sacct updates.

The retry loop for fetching STATUS and EXIT_CODE is well-implemented:

Retries up to 3 times with 10-second intervals

Provides clear feedback on retry attempts

Falls back to sensible defaults (EXIT_CODE=1, STATUS="UNKNOWN") on failure

Handles the common issue of delayed sacct database updates

This significantly improves reliability when querying Slurm job results.

1230-1234: Improved error handling and job ID persistence.

The enhancements include:

Line 1230: More specific error message when job ID is missing

Line 1233: Clearer success message with "Slurm job" terminology

Line 1234: Persistent job ID storage in slurm_job_id.txt

These changes improve debugging and traceability of Slurm jobs.

chzblych · 2025-12-28T15:30:22Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-12-28T15:36:02Z

PR_Github #30054 [ run ] triggered by Bot. Commit: de2ea25

tensorrt-cicd · 2025-12-28T19:47:26Z

PR_Github #30054 [ run ] completed with state SUCCESS. Commit: de2ea25
/LLM/main/L0_MergeRequest_PR pipeline #23130 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

jenkins/scripts/slurm_install.sh

chzblych · 2025-12-29T01:41:17Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-12-29T01:47:35Z

PR_Github #30063 [ run ] triggered by Bot. Commit: de2ea25

tensorrt-cicd · 2025-12-29T06:09:50Z

PR_Github #30063 [ run ] completed with state SUCCESS. Commit: de2ea25
/LLM/main/L0_MergeRequest_PR pipeline #23139 completed with status: 'FAILURE'

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

chzblych · 2025-12-29T06:14:42Z

/bot run --disable-fail-fast

tensorrt-cicd · 2025-12-29T06:20:28Z

PR_Github #30074 [ run ] triggered by Bot. Commit: de2ea25

Signed-off-by: Yanchao Lu <[email protected]>

tensorrt-cicd · 2025-12-29T07:59:16Z

PR_Github #30074 [ run ] completed with state SUCCESS. Commit: de2ea25
/LLM/main/L0_MergeRequest_PR pipeline #23142 completed with status: 'SUCCESS'

chzblych · 2025-12-29T08:05:01Z

/bot run --stage-list "GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-1, DGX_B200-4_GPUs-PyTorch-1" --disable-fail-fast

tensorrt-cicd · 2025-12-29T08:10:52Z

PR_Github #30078 [ run ] triggered by Bot. Commit: 772af16

tensorrt-cicd · 2025-12-29T10:36:24Z

PR_Github #30078 [ run ] completed with state SUCCESS. Commit: 772af16
/LLM/main/L0_MergeRequest_PR pipeline #23143 (Partly Tested) completed with status: 'SUCCESS'

chzblych · 2025-12-29T11:25:54Z

/bot reuse-pipeline

tensorrt-cicd · 2025-12-29T11:31:33Z

PR_Github #30081 [ reuse-pipeline ] triggered by Bot. Commit: 772af16

tensorrt-cicd · 2025-12-29T11:49:42Z

PR_Github #30081 [ reuse-pipeline ] completed with state SUCCESS. Commit: 772af16
Reusing PR_Github #30078 (Partly Tested) for commit 772af16

chzblych requested review from a team as code owners December 28, 2025 15:23

chzblych requested review from mlefeb01 and zeroepoch December 28, 2025 15:23

chzblych force-pushed the yanchaol-ci branch from 58620d5 to de2ea25 Compare December 28, 2025 15:29

coderabbitai bot reviewed Dec 28, 2025

View reviewed changes

zeroepoch approved these changes Dec 28, 2025

View reviewed changes

jenkins/scripts/slurm_install.sh Show resolved Hide resolved

chzblych enabled auto-merge (squash) December 29, 2025 07:05

chzblych disabled auto-merge December 29, 2025 07:10

[None][infra] Some improvements for Slurm execution path in the CI

772af16

Signed-off-by: Yanchao Lu <[email protected]>

chzblych force-pushed the yanchaol-ci branch from de2ea25 to 772af16 Compare December 29, 2025 08:03

chzblych enabled auto-merge (squash) December 29, 2025 11:28

chzblych merged commit 965578c into NVIDIA:main Dec 29, 2025
5 checks passed

chzblych deleted the yanchaol-ci branch December 29, 2025 12:15

[None][infra] Some improvements for Slurm execution path in the CI #10316

[None][infra] Some improvements for Slurm execution path in the CI #10316

Conversation

chzblych commented Dec 28, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Release Notes

Description

Test Coverage

PR Checklist

GitHub Bot Help

kill

skip

reuse-pipeline

Uh oh!

coderabbitai bot commented Dec 28, 2025

Walkthrough

Changes

Sequence Diagrams

Estimated code review effort

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

chzblych commented Dec 28, 2025

Uh oh!

tensorrt-cicd commented Dec 28, 2025

Uh oh!

tensorrt-cicd commented Dec 28, 2025

Uh oh!

Uh oh!

chzblych commented Dec 29, 2025

Uh oh!

tensorrt-cicd commented Dec 29, 2025

Uh oh!

tensorrt-cicd commented Dec 29, 2025

Uh oh!

chzblych commented Dec 29, 2025

Uh oh!

tensorrt-cicd commented Dec 29, 2025

Uh oh!

tensorrt-cicd commented Dec 29, 2025

Uh oh!

chzblych commented Dec 29, 2025

Uh oh!

tensorrt-cicd commented Dec 29, 2025

Uh oh!

tensorrt-cicd commented Dec 29, 2025

Uh oh!

chzblych commented Dec 29, 2025

Uh oh!

tensorrt-cicd commented Dec 29, 2025

Uh oh!

tensorrt-cicd commented Dec 29, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

chzblych commented Dec 28, 2025 •

edited

Loading