Skip to content

Conversation

@chzblych
Copy link
Collaborator

@chzblych chzblych commented Dec 28, 2025

Summary by CodeRabbit

Release Notes

  • New Features

    • Added automatic retry logic for transient failures during test setup and job operations.
    • Introduced persistent job ID tracking for better monitoring and debugging.
  • Bug Fixes

    • Improved handling of delayed job status updates with retry mechanism and fallback defaults.
    • Enhanced error messages for job submission failures and status retrieval issues.
    • Better aggregation of exit codes across test execution phases for accurate reporting.

✏️ Tip: You can customize this high-level summary in your review settings.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 28, 2025

📝 Walkthrough

Walkthrough

The pull request upgrades Slurm-based test infrastructure reliability by introducing a retry utility function, applying retry logic to installation and status polling operations, converting enum comparisons to string-based checks, and refactoring exit code tracking to aggregate failures across multiple test phases.

Changes

Cohort / File(s) Summary
Slurm Job Submission & Status Tracking
jenkins/L0_Test.groovy
Converted containerRuntime enum comparisons to string checks; added bash utilities script transfer to remote nodes; introduced Slurm job ID persistence to slurm_job_id.txt; replaced simple sleep with retry-based polling for sacct status queries with backoff and default fallbacks (EXIT_CODE=1, STATUS="UNKNOWN"); enhanced logging clarity ("Starting Slurm job…", "Submitted Slurm job…")
Bash Utilities
jenkins/scripts/bash_utils.sh
Added new retry_command() function supporting configurable max_retries (default 3) and interval (default 60) parameters; executes commands with retry logic and backoff before failing
Installation with Retries
jenkins/scripts/slurm_install.sh
Integrated bash_utils sourcing; wrapped wget, tar, apt-get, and pip install commands with retry_command for resilience; converted sequential cd/pip install chains to subshell invocations with retry support
Exit Code Tracking & Aggregation
jenkins/scripts/slurm_run.sh
Refactored to disable immediate exit-on-error (set +e); introduced per-phase exit code variables (pytest_exit_code, perf_check_exit_code, perf_sanity_check_exit_code); captures and logs individual phase exit codes; computes and exits with aggregated final_exit_code; added non-fatal coverage config wait for non-root processes

Sequence Diagrams

sequenceDiagram
    participant Jenkins as Jenkins (Groovy)
    participant Slurm as Slurm Controller
    participant Node as Remote Node
    participant Sacct as sacct
    
    rect rgb(240, 248, 255)
    note over Jenkins,Node: Job Submission Phase
    Jenkins->>Slurm: sbatch (submit job)
    Slurm-->>Jenkins: jobid
    Jenkins->>Node: write jobid to slurm_job_id.txt
    Jenkins->>Node: copy bash_utils.sh
    end
    
    rect rgb(245, 245, 220)
    note over Jenkins,Sacct: Status Polling Phase (with Retries)
    Jenkins->>Sacct: sacct --job=jobid (attempt 1)
    alt Sacct delayed/unavailable
        Sacct-->>Jenkins: no status
        Jenkins->>Jenkins: wait 10s
        Jenkins->>Sacct: sacct --job=jobid (attempt 2)
    else Status available
        Sacct-->>Jenkins: STATUS, EXIT_CODE
    end
    end
    
    rect rgb(240, 255, 240)
    note over Jenkins,Node: Exit Code Aggregation Phase
    Node->>Node: pytest phase (capture exit_code_1)
    Node->>Node: perf check phase (capture exit_code_2)
    Node->>Node: perf sanity phase (capture exit_code_3)
    Node->>Node: final_exit_code = sum(exit_codes)
    Node->>Node: exit(final_exit_code)
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is entirely composed of the template boilerplate with no actual content filled in for Description, Test Coverage, or concrete implementation details. Fill in the Description section with what improvements were made and why, and provide Test Coverage details listing relevant tests that validate the Slurm execution changes.
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly describes the main change: improvements to the Slurm execution path in CI infrastructure, which matches the actual changes made to Slurm-related scripts and logging.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
jenkins/L0_Test.groovy (1)

697-703: Simplify enum comparison by removing unnecessary .toString() calls.

The code uses cluster.containerRuntime directly as a map key in SlurmConfig.containerRuntimeToEntrypoint[cluster.containerRuntime] but then applies .toString() for string comparisons elsewhere (lines 697, 699, 863, 867, 1053). This inconsistency suggests the .toString() calls are unnecessary. For enum comparisons, directly compare against the enum constant:

if (cluster.containerRuntime == ContainerRuntime.DOCKER) {
    // ...
} else if (cluster.containerRuntime == ContainerRuntime.ENROOT) {
    // ...
}

This is more idiomatic, type-safe, and aligns with how the field is already used elsewhere in the code.

🧹 Nitpick comments (2)
jenkins/scripts/slurm_run.sh (2)

66-68: Consider adding a timeout or verification for the coverage config file.

The 10-second sleep assumes the coverage config file will be ready within that time. Consider adding verification to ensure the file exists before proceeding:

🔎 Suggested improvement with file verification
 else
-    # Sleep 10 seconds to wait for the coverage config file to be saved
-    sleep 10
+    # Wait for the coverage config file to be saved
+    for i in {1..30}; do
+        if [ -f "$coverageConfigFile" ]; then
+            echo "Coverage config file found after ${i} seconds"
+            break
+        fi
+        sleep 1
+    done
+    if [ ! -f "$coverageConfigFile" ]; then
+        echo "Warning: Coverage config file not found after 30 seconds"
+    fi
 fi

99-141: Exit code aggregation logic is sound but could be more robust.

The approach of turning off set -e and aggregating exit codes is correct for tracking failures across multiple phases. The use of addition for aggregation is practical, though non-zero exit codes could theoretically add up in unexpected ways.

If you want more robust exit code handling, consider using bitwise OR or taking the maximum:

# Alternative 1: Use bitwise OR
final_exit_code=$((pytest_exit_code | perf_check_exit_code | perf_sanity_check_exit_code))

# Alternative 2: Take the maximum (preserves specific error codes)
final_exit_code=$pytest_exit_code
[ $perf_check_exit_code -gt $final_exit_code ] && final_exit_code=$perf_check_exit_code
[ $perf_sanity_check_exit_code -gt $final_exit_code ] && final_exit_code=$perf_sanity_check_exit_code

However, the current approach is clear and sufficient for detecting any failures (non-zero exit code).

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 270be80 and 58620d5.

📒 Files selected for processing (4)
  • jenkins/L0_Test.groovy
  • jenkins/scripts/bash_utils.sh
  • jenkins/scripts/slurm_install.sh
  • jenkins/scripts/slurm_run.sh
🧰 Additional context used
🧠 Learnings (2)
📓 Common learnings
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
📚 Learning: 2025-08-22T19:08:10.822Z
Learnt from: yuanjingx87
Repo: NVIDIA/TensorRT-LLM PR: 7176
File: jenkins/L0_Test.groovy:361-389
Timestamp: 2025-08-22T19:08:10.822Z
Learning: In Slurm job monitoring scripts, when jobs have built-in timeouts configured (via --time parameter or partition/system timeouts), an additional timeout mechanism in the monitoring loop is typically unnecessary. When a Slurm job times out, it gets terminated and removed from the active queue, causing `squeue -j $jobId` to return non-zero and break monitoring loops naturally. The job's final status can then be checked via `sacct` to determine if it failed due to timeout.

Applied to files:

  • jenkins/L0_Test.groovy
🧬 Code graph analysis (1)
jenkins/scripts/slurm_install.sh (1)
jenkins/scripts/bash_utils.sh (1)
  • retry_command (11-45)
🪛 Shellcheck (0.11.0)
jenkins/scripts/slurm_install.sh

[warning] 9-9: ShellCheck can't follow non-constant source. Use a directive to specify location.

(SC1090)


[warning] 12-12: resourcePathNode is referenced but not assigned.

(SC2154)


[warning] 16-16: llmTarfile is referenced but not assigned.

(SC2154)


[warning] 16-16: tarName is referenced but not assigned.

(SC2154)


[warning] 21-21: pytestCommand is referenced but not assigned.

(SC2154)

jenkins/scripts/slurm_run.sh

[warning] 134-134: jobWorkspace is referenced but not assigned.

(SC2154)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (6)
jenkins/scripts/bash_utils.sh (1)

11-45: LGTM! Well-implemented retry utility function.

The retry_command function is cleanly implemented with:

  • Clear documentation and usage examples
  • Flexible parameter handling (optional max_retries and interval)
  • Proper exit code propagation
  • Informative error messages

This will significantly improve the reliability of Slurm installation and setup operations.

jenkins/scripts/slurm_install.sh (2)

7-9: Verify the dynamic path construction for bash_utils.sh.

The path computation using sed to replace slurm_install.sh with bash_utils.sh is clever, but ensure that:

  1. Both scripts are always in the same directory
  2. The script name exactly matches slurm_install.sh

Consider documenting this path dependency or using a simpler approach:

# Alternative: Use dirname to compute the path
bashUtilsPath="$(dirname "${BASH_SOURCE[0]}")/bash_utils.sh"
source "$bashUtilsPath"

24-25: No action needed—exit code propagation and variable handling are working correctly.

The use of bash -c with retry_command here is properly implemented. Exit codes are correctly propagated through the subshell to retry_command (verified: false && echo returns exit code 1), and variables like $llmSrcNode and $resourcePathNode are expanded by the parent shell before being passed to bash -c, ensuring they resolve correctly without requiring environment variable inheritance.

jenkins/L0_Test.groovy (3)

961-968: Good addition of bash_utils.sh transfer to remote node.

The explicit copying of bash_utils.sh to the remote Slurm node is properly implemented with:

  • Clear logging of the script contents
  • Proper path variables (scriptBashUtilsLocalPath, scriptBashUtilsPathNode)
  • Transfer with execute permissions (true parameter)

This ensures the retry functionality is available during Slurm job execution.


1244-1265: Excellent retry logic for handling delayed sacct updates.

The retry loop for fetching STATUS and EXIT_CODE is well-implemented:

  • Retries up to 3 times with 10-second intervals
  • Provides clear feedback on retry attempts
  • Falls back to sensible defaults (EXIT_CODE=1, STATUS="UNKNOWN") on failure
  • Handles the common issue of delayed sacct database updates

This significantly improves reliability when querying Slurm job results.


1230-1234: Improved error handling and job ID persistence.

The enhancements include:

  • Line 1230: More specific error message when job ID is missing
  • Line 1233: Clearer success message with "Slurm job" terminology
  • Line 1234: Persistent job ID storage in slurm_job_id.txt

These changes improve debugging and traceability of Slurm jobs.

@chzblych
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30054 [ run ] triggered by Bot. Commit: de2ea25

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30054 [ run ] completed with state SUCCESS. Commit: de2ea25
/LLM/main/L0_MergeRequest_PR pipeline #23130 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@chzblych
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30063 [ run ] triggered by Bot. Commit: de2ea25

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30063 [ run ] completed with state SUCCESS. Commit: de2ea25
/LLM/main/L0_MergeRequest_PR pipeline #23139 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@chzblych
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30074 [ run ] triggered by Bot. Commit: de2ea25

@chzblych chzblych enabled auto-merge (squash) December 29, 2025 07:05
@chzblych chzblych disabled auto-merge December 29, 2025 07:10
@tensorrt-cicd
Copy link
Collaborator

PR_Github #30074 [ run ] completed with state SUCCESS. Commit: de2ea25
/LLM/main/L0_MergeRequest_PR pipeline #23142 completed with status: 'SUCCESS'

@chzblych
Copy link
Collaborator Author

/bot run --stage-list "GB200-4_GPUs-PyTorch-1, GB200-8_GPUs-2_Nodes-PyTorch-1, DGX_B200-4_GPUs-PyTorch-1" --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30078 [ run ] triggered by Bot. Commit: 772af16

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30078 [ run ] completed with state SUCCESS. Commit: 772af16
/LLM/main/L0_MergeRequest_PR pipeline #23143 (Partly Tested) completed with status: 'SUCCESS'

@chzblych
Copy link
Collaborator Author

/bot reuse-pipeline

@chzblych chzblych enabled auto-merge (squash) December 29, 2025 11:28
@tensorrt-cicd
Copy link
Collaborator

PR_Github #30081 [ reuse-pipeline ] triggered by Bot. Commit: 772af16

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30081 [ reuse-pipeline ] completed with state SUCCESS. Commit: 772af16
Reusing PR_Github #30078 (Partly Tested) for commit 772af16

@chzblych chzblych merged commit 965578c into NVIDIA:main Dec 29, 2025
5 checks passed
@chzblych chzblych deleted the yanchaol-ci branch December 29, 2025 12:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants