[TRTLLM-10248][feat] Support Bot to Send Perf Regression Msg to Slack Channel #10489
base: main
Conversation
/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-PerfSanity-Disagg-Post-Merge-1" |
📝 Walkthrough

Introduces a new Jenkins pipeline for performance sanity triage that executes within a Kubernetes pod, queries OpenSearch for regression data, and sends Slack notifications. Updates test utilities to refactor regression reporting, simplify output formatting, and remove fatal error handling for post-merge regressions. Modifies integration tests to improve metric parsing and aggregation logic, introducing a new failure detection method.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Jenkins
    participant K8sPod as Kubernetes Pod
    participant OS as OpenSearch
    participant Slack
    Jenkins->>K8sPod: Trigger pipeline with parameters
    K8sPod->>K8sPod: Checkout source & install slack_sdk
    K8sPod->>K8sPod: Launch perf_sanity_triage.py
    rect rgb(220, 240, 255)
        Note over K8sPod,OS: Query Phase
        K8sPod->>OS: query_regression_data(project_name)
        OS-->>K8sPod: Return regression data
        K8sPod->>K8sPod: get_regression_data_by_job_id()
    end
    rect rgb(240, 220, 255)
        Note over K8sPod: Format Phase
        K8sPod->>K8sPod: send_regression_message()
    end
    rect rgb(220, 255, 220)
        Note over K8sPod,Slack: Notify Phase
        K8sPod->>Slack: send_message() to channel
        Slack-->>K8sPod: Message posted
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Actionable comments posted: 4
🤖 Fix all issues with AI agents
In @jenkins/runPerfSanityTriage.groovy:
- Around line 25-26: Remove the duplicate qosClass field in the pod spec: there
are two occurrences of "qosClass: Guaranteed" — delete the earlier one that sits
next to "nodeSelector" and retain the correctly indented "qosClass: Guaranteed"
that appears with the main pod spec (the one suggested to keep); ensure only a
single qosClass entry remains in the pod spec to avoid ambiguity.
- Around line 86-88: SLACK_BOT_TOKEN is currently declared as a plain string
parameter (SLACK_BOT_TOKEN) which exposes secrets; change to use Jenkins
credentials and update usage to call withCredentials([string(credentialsId:
params.SLACK_BOT_TOKEN, variable: 'SLACK_TOKEN')]) around the stage/step that
needs the token and substitute the direct param usage with the environment
variable (e.g., reference SLACK_TOKEN) so the secret is injected at runtime and
not stored or printed in job config/logs.
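The duplicate `qosClass` issue above is easy to miss by eye. As a rough sketch (not part of the PR), a naive line-based check can flag keys repeated at the same indentation level in a pod spec; the `POD_SPEC` fragment below is illustrative, not the actual pipeline's spec:

```python
def find_duplicate_yaml_keys(yaml_text):
    """Flag keys repeated at the same indentation level.

    Naive line-based scan: good enough to catch a doubled field like
    qosClass in a flat pod spec, but it can false-positive across sibling
    mappings that share an indent level. Not a general YAML parser.
    """
    seen_at_indent = {}
    duplicates = []
    for line in yaml_text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or ":" not in stripped:
            continue
        indent = len(line) - len(line.lstrip(" "))
        key = stripped.split(":", 1)[0]
        if key in seen_at_indent.setdefault(indent, set()):
            duplicates.append(key)
        seen_at_indent[indent].add(key)
    return duplicates


# Illustrative pod-spec fragment with the doubled field.
POD_SPEC = """\
spec:
  qosClass: Guaranteed
  nodeSelector:
    kubernetes.io/arch: arm64
  qosClass: Guaranteed
"""
print(find_duplicate_yaml_keys(POD_SPEC))  # ['qosClass']
```

A check like this could run in CI before the pod spec ever reaches Kubernetes, where a silently-last-one-wins duplicate key is much harder to diagnose.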
In @jenkins/scripts/open_search_db.py:
- Around line 288-293: The failure messages use the decremented local variable
retry_time which is 0 when logged; change them to report the original retry
count (use DEFAULT_RETRY_COUNT or capture the initial value into a new variable
like initial_retry_count) so logs show the real attempted retries; update the
OpenSearchDB.logger.info and the print call inside queryFromOpenSearchDB to
reference DEFAULT_RETRY_COUNT or the captured initial_retry_count instead of
retry_time and ensure any message formatting uses that symbol.
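The retry-count fix described above can be sketched as follows. This is a minimal stand-in, not the actual `queryFromOpenSearchDB` implementation; it shows how capturing the initial retry count keeps the failure log accurate instead of reporting the exhausted counter:

```python
import logging

logger = logging.getLogger("open_search_db")
DEFAULT_RETRY_COUNT = 3  # assumed name, mirroring the review comment


def query_with_retries(query_fn):
    """Retry query_fn, reporting the real attempt count on failure.

    The bug pattern: logging the decremented loop counter, which is always
    0 by the time the failure message is emitted. The fix: capture the
    initial value before the loop and log that instead.
    """
    initial_retry_count = DEFAULT_RETRY_COUNT
    retry_time = initial_retry_count
    while retry_time > 0:
        try:
            return query_fn()
        except ConnectionError:
            retry_time -= 1
    # Use the captured initial value, not the exhausted counter.
    message = f"Query failed after {initial_retry_count} retries"
    logger.info(message)
    print(message)
    return None
```

Calling this with a query that always raises prints "Query failed after 3 retries" rather than the misleading "after 0 retries".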
In @tests/integration/defs/perf/test_perf_sanity.py:
- Around line 1178-1196: The return type annotation on parse_metrics_from_output
uses Python 3.10 union syntax; change "Dict[str, float] | None" to
"Optional[Dict[str, float]]" and add Optional to the typing imports (alongside
Dict) so the function signature and imports are Python 3.8-compatible; ensure
the rest of the function and reference to PERF_METRIC_LOG_QUERIES remain
unchanged.
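For the typing fix, the Python 3.8-compatible form looks like the sketch below. The parsing body is a toy stand-in (the real function matches against `PERF_METRIC_LOG_QUERIES`); only the annotation change mirrors the suggestion:

```python
from typing import Dict, Optional


def parse_metrics_from_output(output: str) -> Optional[Dict[str, float]]:
    """Parse "name=value" metric lines; return None when nothing parses.

    Note the return annotation: Optional[Dict[str, float]] works on
    Python 3.8+, whereas "Dict[str, float] | None" requires Python 3.10.
    """
    metrics: Dict[str, float] = {}
    for line in output.splitlines():
        name, sep, value = line.partition("=")
        if not sep:
            continue
        try:
            metrics[name.strip()] = float(value)
        except ValueError:
            continue
    return metrics or None
```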
🧹 Nitpick comments (7)
jenkins/scripts/open_search_db.py (2)
254-254: Remove unnecessary f-string prefix. The static analysis tool correctly identifies that this f-string has no placeholders.

Suggested fix:

```diff
-    print(f"OPEN_SEARCH_DB_BASE_URL is not set")
+    print("OPEN_SEARCH_DB_BASE_URL is not set")
```
254-293: Duplicate logging to both logger and stdout. The added `print()` statements duplicate the existing `logger.info()` calls. This may be intentional for Jenkins console visibility, but consider using a helper that does both, to reduce duplication and ensure messages stay in sync.

jenkins/scripts/perf/perf_sanity_triage.py (2)
157-161: Avoid using `assert` for runtime validation. `assert` statements can be disabled with `python -O` and should not be used to validate external API responses. Use an explicit conditional with exception handling instead.

Suggested fix:

```diff
     result = client.chat_postMessage(
         channel=channel_id,
         attachments=attachments,
     )
-    assert result["ok"] is True, json.dumps(result.data)
+    if not result.get("ok"):
+        raise SlackApiError(f"Failed to send message: {result.data}", result)
     print(f"Message sent successfully to channel {channel_id}")
```
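A runnable version of the suggested pattern is sketched below. `SlackApiError` here is a local stand-in for `slack_sdk.errors.SlackApiError`, and the `result` dict mimics the shape of slack_sdk's `SlackResponse`:

```python
class SlackApiError(Exception):
    """Local stand-in for slack_sdk.errors.SlackApiError."""

    def __init__(self, message, response):
        super().__init__(message)
        self.response = response


def check_post_result(result):
    """Validate a chat_postMessage-style response explicitly.

    Unlike an assert, this check survives `python -O` and raises a typed
    error that callers can catch and handle.
    """
    if not result.get("ok"):
        raise SlackApiError(f"Failed to send message: {result}", result)
    return result
```

Callers can then `except SlackApiError` around the send, instead of relying on an `AssertionError` that may never fire under optimized bytecode.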
72-104: Consider validating the `query_job_number` parameter. If `query_job_number` is 0 or negative, the function will return an empty dict, which might be unexpected. Consider adding validation or documenting this behavior.

Suggested fix:

```diff
 def get_regression_data_by_job_id(data_list, query_job_number):
     """Returns a dict with job_id as key and list of regression data as value.

     Only returns the latest query_job_number jobs.
     """
-    if data_list is None or len(data_list) == 0:
+    if data_list is None or len(data_list) == 0 or query_job_number <= 0:
         return {}
```

tests/integration/defs/perf/open_search_db_utils.py (2)
1-1: Update copyright year. The copyright header shows 2022-2024, but this file has meaningful modifications in 2025.

Suggested fix:

```diff
-# SPDX-FileCopyrightText: Copyright (c) 2022-2024 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
707-709: Remove commented-out code. The commented-out `RuntimeError` should be removed rather than left in the codebase. The comment on line 707 documents the intent clearly.

Suggested fix:

```diff
     _print_regression_data(data, print_func=print_error)
-    # Regression will not fail the test.
-    # raise RuntimeError(
-    #     f"Found {len(post_merge_regressions)} post-merge regression data")
+    # Note: Post-merge regressions are logged but do not fail the test.
+    # Regression alerts are now sent via Slack bot instead.
```

tests/integration/defs/perf/test_perf_sanity.py (1)
1385-1404: Consider using a custom exception class. Using a bare `Exception` makes it harder to catch specific failures. Consider defining a custom exception, or using an existing one like `RuntimeError`, for test failures.

Suggested fix:

```diff
+class PerfSanityTestFailure(Exception):
+    """Raised when perf sanity tests fail to produce expected results."""
+    pass
+
+
 def check_test_failure(self):
     """Check if any server failed based on perf results."""
     failed_servers = []
     for server_idx, client_configs in self.server_client_configs.items():
         server_perf_results = self._perf_results.get(server_idx, [])
         if len(server_perf_results) != len(client_configs) or any(
                 metrics is None for metrics in server_perf_results):
             failed_servers.append(server_idx)
     if failed_servers:
         for server_idx in failed_servers:
             print_error(
                 f"Server {server_idx} failed: perf results count mismatch or incomplete metrics"
             )
-        raise Exception(
+        raise PerfSanityTestFailure(
             f"Test failed: servers {failed_servers} did not produce expected results"
         )
     print_info("All servers passed")
```
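The suggestion above can be exercised standalone. This sketch extracts the check into a free function (the real code is a method reading `self._perf_results` and `self.server_client_configs`) so the custom exception's behavior is visible in isolation:

```python
class PerfSanityTestFailure(Exception):
    """Raised when perf sanity tests fail to produce expected results."""


def check_test_failure(perf_results, server_client_configs):
    """Raise PerfSanityTestFailure if any server's results are incomplete.

    A server fails when its result count does not match its client configs,
    or when any of its metrics is None.
    """
    failed_servers = []
    for server_idx, client_configs in server_client_configs.items():
        server_perf_results = perf_results.get(server_idx, [])
        if len(server_perf_results) != len(client_configs) or any(
                metrics is None for metrics in server_perf_results):
            failed_servers.append(server_idx)
    if failed_servers:
        raise PerfSanityTestFailure(
            f"Test failed: servers {failed_servers} did not produce expected results")
```

With a typed exception, callers can `except PerfSanityTestFailure` specifically rather than catching a blanket `Exception` and masking unrelated errors.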
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- jenkins/runPerfSanityTriage.groovy
- jenkins/scripts/open_search_db.py
- jenkins/scripts/perf/perf_sanity_triage.py
- tests/integration/defs/perf/open_search_db_utils.py
- tests/integration/defs/perf/test_perf_sanity.py
- tests/integration/test_lists/waives.txt
💤 Files with no reviewable changes (1)
- tests/integration/test_lists/waives.txt
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing Python modules, even if only one class or function from a module is used
Python filenames should use snake_case (e.g., some_file.py)
Python classes should use PascalCase (e.g., class SomeClass)
Python functions and methods should use snake_case (e.g., def my_awesome_function():)
Python local variables should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL)
Python constants should use upper snake_case (e.g., MY_CONSTANT)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Use comments in Python for code within a function, or interfaces that are local to a file
Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with the format """<type>: Description"""
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block for the main logic
Files:
- jenkins/scripts/perf/perf_sanity_triage.py
- tests/integration/defs/perf/test_perf_sanity.py
- jenkins/scripts/open_search_db.py
- tests/integration/defs/perf/open_search_db_utils.py
**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification
Files:
- jenkins/scripts/perf/perf_sanity_triage.py
- tests/integration/defs/perf/test_perf_sanity.py
- jenkins/scripts/open_search_db.py
- tests/integration/defs/perf/open_search_db_utils.py
🧠 Learnings (1)
📓 Common learnings
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
🧬 Code graph analysis (3)
jenkins/scripts/perf/perf_sanity_triage.py (1)
jenkins/scripts/open_search_db.py (2)
OpenSearchDB (83-387), queryFromOpenSearchDB (243-294)
tests/integration/defs/perf/test_perf_sanity.py (1)
tests/integration/defs/trt_test_alternative.py (2)
print_error (318-324), print_info (300-306)
tests/integration/defs/perf/open_search_db_utils.py (1)
tests/integration/defs/trt_test_alternative.py (2)
print_info (300-306), print_error (318-324)
🪛 Ruff (0.14.10)
jenkins/scripts/perf/perf_sanity_triage.py
67-67: Do not catch blind exception: Exception
(BLE001)
165-165: Do not catch blind exception: Exception
(BLE001)
tests/integration/defs/perf/test_perf_sanity.py
1400-1402: Create your own exception
(TRY002)
1400-1402: Avoid specifying long messages outside the exception class
(TRY003)
jenkins/scripts/open_search_db.py
254-254: f-string without any placeholders
Remove extraneous f prefix
(F541)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Pre-commit Check
🔇 Additional comments (7)
jenkins/scripts/perf/perf_sanity_triage.py (2)
169-206: LGTM with minor suggestion. The CLI interface is well-structured. The bot token masking in log output (line 189) is good security practice. Consider setting a non-zero exit code when the operation fails, to enable proper CI/CD flow control.
1-16: Missing NVIDIA copyright header. As per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header with the year of latest modification.
Suggested fix (add at the top of the file):

```python
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
# SPDX-License-Identifier: Apache-2.0
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```

⛔ Skipped due to learnings
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.

Learnt from: xinhe-nv
Repo: NVIDIA/TensorRT-LLM PR: 8534
File: scripts/format_test_list.py:1-6
Timestamp: 2025-10-22T06:53:47.017Z
Learning: The file scripts/format_test_list.py in the TensorRT-LLM repository does not require the NVIDIA Apache-2.0 copyright header.

tests/integration/defs/perf/open_search_db_utils.py (1)
646-666: Simplified regression data printing looks good. The refactored `_print_regression_data` function now prints each comma-separated item from `s_regression_info` on its own line, improving readability.

tests/integration/defs/perf/test_perf_sanity.py (2)
1198-1206: LGTM - Improved per-server result handling.The refactored logic properly handles cases where server outputs don't match expected client configurations, skipping servers with incomplete data rather than failing silently.
1509-1511: Good addition of explicit failure checking. Calling `check_test_failure()` after uploading results ensures test failures are properly surfaced while still allowing metrics to be uploaded for analysis.

jenkins/runPerfSanityTriage.groovy (2)
8-10: `withCredentials` block runs at script load time. This `withCredentials` block is outside the `pipeline` block and runs during script compilation/load, not during build execution. If the credential lookup fails, it will fail at Jenkins job configuration time. This pattern is common but worth noting; ensure the credential `default-llm-repo` exists in Jenkins.
91-110: Pipeline stage structure looks good. The stage properly:
- Runs in the correct container
- Checks out source before running the script
- Installs dependencies
- Passes all required parameters to the Python script
PR_Github #30864 [ run ] triggered by Bot. Commit:

/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-PerfSanity-Disagg-Post-Merge-1"

PR_Github #30870 [ run ] triggered by Bot. Commit:

PR_Github #30864 [ run ] completed with state
Signed-off-by: Chenfei Zhang <[email protected]>
Force-pushed 5f3dc08 to 872fc57
/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-PerfSanity-Disagg-Post-Merge-1"

PR_Github #30905 [ run ] triggered by Bot. Commit:

PR_Github #30905 [ run ] completed with state

/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-PerfSanity-Disagg-Post-Merge-1"

PR_Github #30977 [ run ] triggered by Bot. Commit:
Signed-off-by: Chenfei Zhang <[email protected]>
/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-PerfSanity-Disagg-Post-Merge-1"

PR_Github #31024 [ run ] triggered by Bot. Commit:

PR_Github #31024 [ run ] completed with state

/bot run --stage-list "GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Post-Merge-1,GB200-8_GPUs-2_Nodes-PyTorch-PerfSanity-Post-Merge-2"

PR_Github #31065 [ run ] triggered by Bot. Commit:

/bot run --disable-fail-fast

PR_Github #31069 [ run ] triggered by Bot. Commit:

PR_Github #31069 [ run ] completed with state

PR_Github #31065 [ run ] completed with state
Signed-off-by: Chenfei Zhang <[email protected]>
/bot run --stage-list "GB200-4_GPUs-PyTorch-PerfSanity-1"

PR_Github #31170 [ run ] triggered by Bot. Commit:

PR_Github #31170 [ run ] completed with state

Signed-off-by: Chenfei Zhang <[email protected]>

/bot run --stage-list "GB200-12_GPUs-3_Nodes-PyTorch-PerfSanity-Disagg-Post-Merge-1"

Signed-off-by: Chenfei Zhang <[email protected]>
Signed-off-by: Chenfei Zhang <[email protected]>

/bot run --stage-list "GB200-4_GPUs-PyTorch-PerfSanity-1,GB200-4_GPUs-PyTorch-PerfSanity-2"

PR_Github #31255 [ run ] triggered by Bot. Commit:

PR_Github #31256 [ run ] triggered by Bot. Commit:

PR_Github #31255 [ run ] completed with state

PR_Github #31256 [ run ] completed with state
Summary by CodeRabbit
Release Notes
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user-friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
Details
run

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL): Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL): Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL): Disable fail-fast on build/test/infra failures.
- `--skip-test` (OPTIONAL): Skip all test stages, but still run build stages, package stages, and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL): Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL): Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL): Skip test stages which don't match the specified backends. Only supports [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with the tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL): Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL): Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL): Force-run the multi-GPU tests in addition to running the L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL): Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL): Run the ordinary L0 pre-merge pipeline and the specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL): Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL): Experimental feature. Enable access to the CI container for debugging purposes. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

`kill`

Kill all running builds associated with the pull request.

skip

`skip --comment COMMENT`

Skip testing for the latest commit on the pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause top of tree to break.