Skip to content

Conversation

@chenfeiz0326
Copy link
Collaborator

@chenfeiz0326 chenfeiz0326 commented Jan 7, 2026

Summary by CodeRabbit

Release Notes

  • New Features
    • Added automated Jenkins pipeline to send Perf Regression Info to Slack Channel
    • Fail the pytest if trtllm-serve or benchmark failed
    • Fix and unwaive the disagg test case

✏️ Tip: You can customize this high-level summary in your review settings.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with pull request.

skip

skip --comment COMMENT

Skip testing for latest commit on pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@chenfeiz0326 chenfeiz0326 self-assigned this Jan 7, 2026
@chenfeiz0326 chenfeiz0326 requested review from a team as code owners January 7, 2026 07:44
@chenfeiz0326 chenfeiz0326 requested a review from yiqingy0 January 7, 2026 07:44
@chenfeiz0326
Copy link
Collaborator Author

/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-PerfSanity-Disagg-Post-Merge-1"

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 7, 2026

📝 Walkthrough

Walkthrough

Introduces a new Jenkins pipeline for performance sanity triage that executes within a Kubernetes pod, queries OpenSearch for regression data, and sends Slack notifications. Updates test utilities to refactor regression reporting, simplify output formatting, and remove fatal error handling for post-merge regressions. Modifies integration tests to improve metric parsing and aggregation logic, introducing a new failure detection method.

Changes

Cohort / File(s) Summary
Jenkins Pipeline for Perf Sanity Triage
jenkins/runPerfSanityTriage.groovy
New Kubernetes-based pipeline that defines Docker image, LLM repository resolution, and a pod configuration builder. Orchestrates checkout, dependency installation (Slack SDK), and execution of the perf sanity triage Python script with OpenSearch and Slack integration parameters.
OpenSearch Query Instrumentation
jenkins/scripts/open_search_db.py
Adds stdout print statements alongside existing logs to the queryFromOpenSearchDB function—mirrors messages for missing BASE_URL, access check failures, retries, and final failures.
Perf Sanity Triage Script
jenkins/scripts/perf/perf_sanity_triage.py
New Python script with CLI interface and five core functions: query_regression_data, get_regression_data_by_job_id, send_regression_message, send_message, and main. Queries OpenSearch for regressions, groups by job ID, formats results, and posts to Slack with error handling.
Perf Test Utilities Refactoring
tests/integration/defs/perf/open_search_db_utils.py
Removes _print_perf_data function; simplifies _print_regression_data output (splits on commas, removes detailed per-metric block); adjusts check_perf_regression to log post-merge regressions without raising fatal errors; formats threshold display as percentage.
Integration Test Metric Aggregation
tests/integration/defs/perf/test_perf_sanity.py
Introduces internal parse_metrics_from_output helper for single-pass metric extraction; replaces per-line aggregation with per-server metric dictionaries; removes global flattening of results; skips uploads for servers with incomplete metrics; adds public check_test_failure method to detect server failures; invokes failure check after upload in test_e2e.
Test Waiver List
tests/integration/test_lists/waives.txt
Removes single skip entry for perf/test_perf_sanity.py::test_e2e[disagg_upload-deepseek-r1-fp4_1k1k_ctx1_gen1_dep8_bs768_eplb0_mtp0_ccb-UCX].

Sequence Diagram(s)

sequenceDiagram
    participant Jenkins
    participant K8sPod as Kubernetes Pod
    participant OS as OpenSearch
    participant Slack
    
    Jenkins->>K8sPod: Trigger pipeline with parameters
    K8sPod->>K8sPod: Checkout source & install slack_sdk
    K8sPod->>K8sPod: Launch perf_sanity_triage.py
    
    rect rgb(220, 240, 255)
        Note over K8sPod,OS: Query Phase
        K8sPod->>OS: query_regression_data(project_name)
        OS-->>K8sPod: Return regression data
        K8sPod->>K8sPod: get_regression_data_by_job_id()
    end
    
    rect rgb(240, 220, 255)
        Note over K8sPod: Format Phase
        K8sPod->>K8sPod: send_regression_message()
    end
    
    rect rgb(220, 255, 220)
        Note over K8sPod,Slack: Notify Phase
        K8sPod->>Slack: send_message() to channel
        Slack-->>K8sPod: Message posted
    end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description check ⚠️ Warning The pull request description is incomplete, containing only the template structure without providing actual implementation details, test coverage information, or explanations. Fill in the Description section with what the feature does and why, and provide specific test coverage details in the Test Coverage section. Remove empty template comments.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and concisely summarizes the main feature: supporting a bot to send performance regression messages to a Slack channel, which aligns with the implementation adding a Jenkins pipeline, Python scripts, and Slack integration for perf regression reporting.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Fix all issues with AI agents
In @jenkins/runPerfSanityTriage.groovy:
- Around line 25-26: Remove the duplicate qosClass field in the pod spec: there
are two occurrences of "qosClass: Guaranteed" — delete the earlier one that sits
next to "nodeSelector" and retain the correctly indented "qosClass: Guaranteed"
that appears with the main pod spec (the one suggested to keep); ensure only a
single qosClass entry remains in the pod spec to avoid ambiguity.
- Around line 86-88: SLACK_BOT_TOKEN is currently declared as a plain string
parameter (SLACK_BOT_TOKEN) which exposes secrets; change to use Jenkins
credentials and update usage to call withCredentials([string(credentialsId:
params.SLACK_BOT_TOKEN, variable: 'SLACK_TOKEN')]) around the stage/step that
needs the token and substitute the direct param usage with the environment
variable (e.g., reference SLACK_TOKEN) so the secret is injected at runtime and
not stored or printed in job config/logs.

In @jenkins/scripts/open_search_db.py:
- Around line 288-293: The failure messages use the decremented local variable
retry_time which is 0 when logged; change them to report the original retry
count (use DEFAULT_RETRY_COUNT or capture the initial value into a new variable
like initial_retry_count) so logs show the real attempted retries; update the
OpenSearchDB.logger.info and the print call inside queryFromOpenSearchDB to
reference DEFAULT_RETRY_COUNT or the captured initial_retry_count instead of
retry_time and ensure any message formatting uses that symbol.

In @tests/integration/defs/perf/test_perf_sanity.py:
- Around line 1178-1196: The return type annotation on parse_metrics_from_output
uses Python 3.10 union syntax; change "Dict[str, float] | None" to
"Optional[Dict[str, float]]" and add Optional to the typing imports (alongside
Dict) so the function signature and imports are Python 3.8-compatible; ensure
the rest of the function and reference to PERF_METRIC_LOG_QUERIES remain
unchanged.
🧹 Nitpick comments (7)
jenkins/scripts/open_search_db.py (2)

254-254: Remove unnecessary f-string prefix.

The static analysis tool correctly identifies this f-string has no placeholders.

Suggested fix
-            print(f"OPEN_SEARCH_DB_BASE_URL is not set")
+            print("OPEN_SEARCH_DB_BASE_URL is not set")

254-293: Duplicate logging to both logger and stdout.

The added print() statements duplicate the existing logger.info() calls. This may be intentional for Jenkins console visibility, but consider using a helper that does both to reduce duplication and ensure messages stay in sync.

jenkins/scripts/perf/perf_sanity_triage.py (2)

157-161: Avoid using assert for runtime validation.

assert statements can be disabled with python -O and should not be used to validate external API responses. Use an explicit conditional with exception handling instead.

Suggested fix
         result = client.chat_postMessage(
             channel=channel_id,
             attachments=attachments,
         )
-        assert result["ok"] is True, json.dumps(result.data)
+        if not result.get("ok"):
+            raise SlackApiError(f"Failed to send message: {result.data}", result)
         print(f"Message sent successfully to channel {channel_id}")

72-104: Consider validating query_job_number parameter.

If query_job_number is 0 or negative, the function will return an empty dict, which might be unexpected. Consider adding validation or documenting this behavior.

Suggested fix
 def get_regression_data_by_job_id(data_list, query_job_number):
     """Returns a dict with job_id as key and list of regression data as value.

     Only returns the latest query_job_number jobs.
     """
-    if data_list is None or len(data_list) == 0:
+    if data_list is None or len(data_list) == 0 or query_job_number <= 0:
         return {}
tests/integration/defs/perf/open_search_db_utils.py (2)

1-1: Update copyright year.

The copyright header shows 2022-2024, but this file has meaningful modifications in 2025.

Suggested fix
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.

707-709: Remove commented-out code.

The commented-out RuntimeError should be removed rather than left in the codebase. The comment on line 707 documents the intent clearly.

Suggested fix
         _print_regression_data(data, print_func=print_error)
-        # Regression will not fail the test.
-        # raise RuntimeError(
-        #     f"Found {len(post_merge_regressions)} post-merge regression data")
+        # Note: Post-merge regressions are logged but do not fail the test.
+        # Regression alerts are now sent via Slack bot instead.
tests/integration/defs/perf/test_perf_sanity.py (1)

1385-1404: Consider using a custom exception class.

Using a bare Exception makes it harder to catch specific failures. Consider defining a custom exception or using an existing one like RuntimeError for test failures.

Suggested fix
+class PerfSanityTestFailure(Exception):
+    """Raised when perf sanity tests fail to produce expected results."""
+    pass
+
+
     def check_test_failure(self):
         """Check if any server failed based on perf results."""
         failed_servers = []
         for server_idx, client_configs in self.server_client_configs.items():
             server_perf_results = self._perf_results.get(server_idx, [])
             if len(server_perf_results) != len(client_configs) or any(
                 metrics is None for metrics in server_perf_results
             ):
                 failed_servers.append(server_idx)

         if failed_servers:
             for server_idx in failed_servers:
                 print_error(
                     f"Server {server_idx} failed: perf results count mismatch or incomplete metrics"
                 )
-            raise Exception(
+            raise PerfSanityTestFailure(
                 f"Test failed: servers {failed_servers} did not produce expected results"
             )

         print_info("All servers passed")
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between dc32bac and 5dea124.

📒 Files selected for processing (6)
  • jenkins/runPerfSanityTriage.groovy
  • jenkins/scripts/open_search_db.py
  • jenkins/scripts/perf/perf_sanity_triage.py
  • tests/integration/defs/perf/open_search_db_utils.py
  • tests/integration/defs/perf/test_perf_sanity.py
  • tests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
  • tests/integration/test_lists/waives.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing Python modules, even if only one class or function from a module is used
Python filenames should use snake_case (e.g., some_file.py)
Python classes should use PascalCase (e.g., class SomeClass)
Python functions and methods should use snake_case (e.g., def my_awesome_function():)
Python local variables should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL)
Python constants should use upper snake_case (e.g., MY_CONSTANT)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Use comments in Python for code within a function, or interfaces that are local to a file
Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with the format """<type>: Description"""
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block for the main logic

Files:

  • jenkins/scripts/perf/perf_sanity_triage.py
  • tests/integration/defs/perf/test_perf_sanity.py
  • jenkins/scripts/open_search_db.py
  • tests/integration/defs/perf/open_search_db_utils.py
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification

Files:

  • jenkins/scripts/perf/perf_sanity_triage.py
  • tests/integration/defs/perf/test_perf_sanity.py
  • jenkins/scripts/open_search_db.py
  • tests/integration/defs/perf/open_search_db_utils.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
🧬 Code graph analysis (3)
jenkins/scripts/perf/perf_sanity_triage.py (1)
jenkins/scripts/open_search_db.py (2)
  • OpenSearchDB (83-387)
  • queryFromOpenSearchDB (243-294)
tests/integration/defs/perf/test_perf_sanity.py (1)
tests/integration/defs/trt_test_alternative.py (2)
  • print_error (318-324)
  • print_info (300-306)
tests/integration/defs/perf/open_search_db_utils.py (1)
tests/integration/defs/trt_test_alternative.py (2)
  • print_info (300-306)
  • print_error (318-324)
🪛 Ruff (0.14.10)
jenkins/scripts/perf/perf_sanity_triage.py

67-67: Do not catch blind exception: Exception

(BLE001)


165-165: Do not catch blind exception: Exception

(BLE001)

tests/integration/defs/perf/test_perf_sanity.py

1400-1402: Create your own exception

(TRY002)


1400-1402: Avoid specifying long messages outside the exception class

(TRY003)

jenkins/scripts/open_search_db.py

254-254: f-string without any placeholders

Remove extraneous f prefix

(F541)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Pre-commit Check
🔇 Additional comments (7)
jenkins/scripts/perf/perf_sanity_triage.py (2)

169-206: LGTM with minor suggestion.

The CLI interface is well-structured. The bot token masking in log output (line 189) is good security practice. Consider setting a non-zero exit code when the operation fails to enable proper CI/CD flow control.


1-16: Missing NVIDIA copyright header.

As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest modification.

Suggested fix - add at the top of the file
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
⛔ Skipped due to learnings
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Learnt from: xinhe-nv
Repo: NVIDIA/TensorRT-LLM PR: 8534
File: scripts/format_test_list.py:1-6
Timestamp: 2025-10-22T06:53:47.017Z
Learning: The file `scripts/format_test_list.py` in the TensorRT-LLM repository does not require the NVIDIA Apache-2.0 copyright header.
tests/integration/defs/perf/open_search_db_utils.py (1)

646-666: Simplified regression data printing looks good.

The refactored _print_regression_data function now prints each comma-separated item from s_regression_info on its own line, improving readability.

tests/integration/defs/perf/test_perf_sanity.py (2)

1198-1206: LGTM - Improved per-server result handling.

The refactored logic properly handles cases where server outputs don't match expected client configurations, skipping servers with incomplete data rather than failing silently.


1509-1511: Good addition of explicit failure checking.

Calling check_test_failure() after uploading results ensures test failures are properly surfaced while still allowing metrics to be uploaded for analysis.

jenkins/runPerfSanityTriage.groovy (2)

8-10: withCredentials block runs at script load time.

This withCredentials block is outside the pipeline block and runs during script compilation/load, not during build execution. If the credential lookup fails, it will fail at Jenkins job configuration time. This pattern is common but worth noting - ensure the credential default-llm-repo exists in Jenkins.


91-110: Pipeline stage structure looks good.

The stage properly:

  • Runs in the correct container
  • Checks out source before running the script
  • Installs dependencies
  • Passes all required parameters to the Python script

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30864 [ run ] triggered by Bot. Commit: 5dea124

@chenfeiz0326
Copy link
Collaborator Author

/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-PerfSanity-Disagg-Post-Merge-1"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30870 [ run ] triggered by Bot. Commit: 5f3dc08

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30864 [ run ] completed with state ABORTED. Commit: 5dea124

Signed-off-by: Chenfei Zhang <[email protected]>
@chenfeiz0326 chenfeiz0326 force-pushed the user/chenfeiz/add_traige_script_for_perf_db branch from 5f3dc08 to 872fc57 Compare January 7, 2026 14:20
@chenfeiz0326
Copy link
Collaborator Author

/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-PerfSanity-Disagg-Post-Merge-1"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30905 [ run ] triggered by Bot. Commit: 872fc57

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30905 [ run ] completed with state SUCCESS. Commit: 872fc57
/LLM/main/L0_MergeRequest_PR pipeline #23868 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@chenfeiz0326
Copy link
Collaborator Author

/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-PerfSanity-Disagg-Post-Merge-1"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30977 [ run ] triggered by Bot. Commit: 872fc57

Signed-off-by: Chenfei Zhang <[email protected]>
@chenfeiz0326
Copy link
Collaborator Author

/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-PerfSanity-Disagg-Post-Merge-1"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #31024 [ run ] triggered by Bot. Commit: 837c25d

@tensorrt-cicd
Copy link
Collaborator

PR_Github #31024 [ run ] completed with state SUCCESS. Commit: 837c25d
/LLM/main/L0_MergeRequest_PR pipeline #23970 (Partly Tested) completed with status: 'SUCCESS'

@chenfeiz0326
Copy link
Collaborator Author

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Post-Merge-2"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #31065 [ run ] triggered by Bot. Commit: 837c25d

@chenfeiz0326
Copy link
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd
Copy link
Collaborator

PR_Github #31069 [ run ] triggered by Bot. Commit: 837c25d

@tensorrt-cicd
Copy link
Collaborator

PR_Github #31069 [ run ] completed with state DISABLED
CI server is currently disabled for scheduled maintenance. Estimated completion time: 6 AM PST on 12/29.

@tensorrt-cicd
Copy link
Collaborator

PR_Github #31065 [ run ] completed with state SUCCESS. Commit: 837c25d
/LLM/main/L0_MergeRequest_PR pipeline #24003 (Partly Tested) completed with status: 'SUCCESS'

Signed-off-by: Chenfei Zhang <[email protected]>
@chenfeiz0326
Copy link
Collaborator Author

/bot run --stage-list "GB200-4_GPUs-PyTorch-PerfSanity-1"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #31170 [ run ] triggered by Bot. Commit: b04f643

@tensorrt-cicd
Copy link
Collaborator

PR_Github #31170 [ run ] completed with state SUCCESS. Commit: b04f643
/LLM/main/L0_MergeRequest_PR pipeline #24084 (Partly Tested) completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Signed-off-by: Chenfei Zhang <[email protected]>
@chenfeiz0326
Copy link
Collaborator Author

/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-PerfSanity-Disagg-Post-Merge-1"

Signed-off-by: Chenfei Zhang <[email protected]>
Signed-off-by: Chenfei Zhang <[email protected]>
@chenfeiz0326
Copy link
Collaborator Author

/bot run --stage-list "GB200-4_GPUs-PyTorch-PerfSanity-1,GB200-4_GPUs-PyTorch-PerfSanity-2"

@tensorrt-cicd
Copy link
Collaborator

PR_Github #31255 [ run ] triggered by Bot. Commit: a9734a6

@tensorrt-cicd
Copy link
Collaborator

PR_Github #31256 [ run ] triggered by Bot. Commit: a9734a6

@tensorrt-cicd
Copy link
Collaborator

PR_Github #31255 [ run ] completed with state ABORTED. Commit: a9734a6

@tensorrt-cicd
Copy link
Collaborator

PR_Github #31256 [ run ] completed with state SUCCESS. Commit: a9734a6
/LLM/main/L0_MergeRequest_PR pipeline #24155 (Partly Tested) completed with status: 'SUCCESS'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants