
[None][chore] Revert "[TRTLLM-8263][feat] Add ctx-only and gen-only Disagg Perf Tes…" #11602

Closed
pcastonguay wants to merge 1 commit into NVIDIA:main from pcastonguay:revert_80dbb64

Conversation

Collaborator

@pcastonguay pcastonguay commented Feb 20, 2026

…ts (#11361)"

This reverts commit 80dbb64.

Summary by CodeRabbit

  • New Features

    • Enhanced performance sanity test infrastructure with improved error detection and logging capabilities.
    • Added disaggregated performance sanity test configurations for multi-node GPU setups.
  • Bug Fixes

    • Updated MOE backend from CUTEDSL to TRTLLM across disaggregated perf-sanity test configurations for improved performance.
  • Chores

    • Reorganized and renamed performance sanity test entries for clearer identification.
    • Simplified test configuration structure by removing legacy local SLURM setup scripts.
    • Removed deprecated test configuration files and cleaned up outdated documentation.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR follows the TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update the tava architecture diagram if there is a significant design change in the PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensures that all builds and tests run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate the current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

…ts (NVIDIA#11361)"

This reverts commit 80dbb64.

Signed-off-by: Patrice Castonguay <55748270+pcastonguay@users.noreply.github.com>
@pcastonguay pcastonguay requested review from a team as code owners February 20, 2026 18:14
@pcastonguay
Collaborator Author

/bot run --post-merge

@tensorrt-cicd
Collaborator

PR_Github #36348 [ run ] triggered by Bot. Commit: b385edd

@coderabbitai
Contributor

coderabbitai bot commented Feb 20, 2026

📝 Walkthrough

This PR restructures performance sanity test infrastructure by renaming test configurations with aggr/disagg prefixes, updating the MOE backend from CUTEDSL to TRTLLM across 60+ disaggregated test configs, refactoring performance submission scripts with new environment variable naming conventions, removing legacy local SLURM orchestration code, and significantly reworking the test harness for improved command execution and result collection.

Changes

Cohort / File(s) Summary
Jenkins Test Configuration
jenkins/L0_Test.groovy
Renamed and reduced PerfSanity test entries with new aggr/disagg naming; updated shard parameters and destination tags for DGX B200, GB200 4-GPU, and multi-node configurations (e.g., Node2-GPU8-Post-Merge → Aggr-PerfSanity-Node2-GPU8-Post-Merge).
Perf Script Removal (Local)
jenkins/scripts/perf/local/submit.py, jenkins/scripts/perf/local/slurm_run.sh, jenkins/scripts/perf/local/slurm_install.sh, jenkins/scripts/perf/local/README.md
Removed entire local SLURM submission and orchestration workflow, including wheel building, installation setup, and pytest execution coordination; eliminated command-line interface and launch script generation for local deployments.
Disaggregated Perf Script Updates
jenkins/scripts/perf/disaggregated/slurm_launch_draft.sh, jenkins/scripts/perf/disaggregated/submit.py
Refactored config path resolution with new get_config_yaml function; renamed environment variables from gpusPerNodePerCtxServer/gpusPerNodePerGenServer to gpusPerfNodePerfCtxServer/gpusPerfNodePerfGenServer; updated mode handling from gen_only_no_context to gen_only; adjusted SLURM task-per-node parameters accordingly.
Perf Infrastructure & Documentation
jenkins/scripts/perf/README.md, jenkins/scripts/perf/aggregated/slurm_launch_draft.sh, tests/scripts/perf-sanity/README.md
Documentation reorganized toward config-centric guide; removed aggregated Slurm orchestration block including cleanup handlers and job workspace setup; shifted from detailed architecture to simplified configuration-type coverage (Aggregated/Disaggregated modes).
MOE Backend Configuration Updates
tests/integration/defs/perf/disagg/test_configs/disagg/perf-sanity/b200-deepseek-*.yaml, tests/integration/defs/perf/disagg/test_configs/disagg/perf-sanity/gb200-*.yaml (60+ files)
Systematically updated worker_config.moe_config.backend from CUTEDSL to TRTLLM across all disaggregated perf-sanity test configurations; one file removed MOE config block entirely due to unsupported configuration.
Test Configuration Removal (GPT-OSS)
tests/integration/defs/perf/disagg/test_configs/disagg/perf-sanity/gb200_gpt-oss-120b-fp4_*.yaml (5 files)
Removed complete performance sanity test configurations for GPT-OSS-120B variants, eliminating all metadata, SLURM, benchmark, hardware, environment, profiling, accuracy, and worker configurations.
Multi-Node Disaggregated Test Lists
tests/integration/test_lists/test-db/l0_dgx_b200_multi_nodes_disagg_perf_sanity_*.yml, tests/integration/test_lists/test-db/l0_gb200_multi_nodes_disagg_perf_sanity_*.yml (9 new files)
Added new disaggregated multi-node test configurations with fixed GPU count constraints and UCX-based perf test parameterizations; introduced test groups for various node/GPU combinations (1 ctx node to 8 gen nodes).
Aggregated Test List Updates
tests/integration/test_lists/test-db/l0_dgx_b200_perf_sanity.yml, tests/integration/test_lists/test-db/l0_gb200_multi_gpus_perf_sanity.yml, tests/integration/test_lists/test-db/l0_gb200_multi_nodes_aggr_perf_sanity_node2_gpu8.yml
Renamed test group labels (e.g., deepseek-r1-fp4-v2 → deepseek_r1_fp4_v2_blackwell/grace_blackwell); removed deprecated test entries and ctx_only variants; adjusted timeouts for updated configurations.
Removed Aggregated Test Lists
tests/integration/test_lists/test-db/l0_b200_multi_nodes_perf_sanity_*.yml, tests/integration/test_lists/test-db/l0_gb200_multi_nodes_perf_sanity_*.yml (11 deleted files)
Removed non-disaggregated multi-node test configurations, consolidating test matrix into newly added disaggregated variants with explicit GPU/node partitioning.
Test Harness Core Refactoring
tests/integration/defs/perf/test_perf_sanity.py
Significant refactoring: split config parsing into _parse_aggr/disagg_config_file; redesigned command generation with _get_aggr/disagg_commands methods; updated AggrTestCmds/DisaggTestCmds signatures removing test_output_dir; added run_ex method for command execution and output aggregation; introduced per-server log/error file handling; refactored result parsing with centralized metric regex registry and enhanced error detection; updated test case naming format to f"{type}-{config_yml}" pattern.
Test Utilities Updates
tests/test_common/http_utils.py, tests/test_common/error_utils.py
Removed entire error_utils.py module with error keyword scanning and log context extraction; migrated ERROR_KEYWORDS and error detection logic inline into http_utils.py; updated wait_for_endpoint_ready to scan check_files for error keywords every 300 iterations instead of relying on external error module.
Performance Test Metadata
tests/scripts/perf-sanity/deepseek_r1_fp4_v2_*.yaml, tests/scripts/perf-sanity/deepseek_v32_fp4_*.yaml, tests/scripts/perf-sanity/k2_thinking_fp4_*.yaml, tests/scripts/perf-sanity/gpt_oss_120b_fp4_*.yaml (10 files)
Removed hardware.gpus_per_node specifications from blackwell variants and grace_blackwell variants; removed entire gpt_oss_120b_fp4_blackwell.yaml; updated gpt_oss_120b_fp4_grace_blackwell.yaml with new tensor_parallel and GPU support (B200/B300), removed dataset_file references, and purged multiple legacy GPT-OSS configurations.
Test Integration Updates
tests/integration/perf/perf_sanity_triage.py, tests/integration/test_lists/waives.txt
Removed timestamp conversion utilities and shifted baseline/regression date handling to raw string values instead of formatted dates; expanded waives list with new disaggregated and deepseek-related perf_sanity test variants, removing one deprecated aggr_upload context-only entry.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes


Suggested reviewers

  • chzblych
  • yuanjingx87
  • kxdc

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Description check — ⚠️ Warning. The PR description is incomplete: it only states the revert of commit 80dbb64 and includes the template placeholders for Description, Test Coverage, and PR Checklist without filling in required details. Resolution: add a proper Description section explaining why the revert is necessary, identify relevant Test Coverage, and complete the PR Checklist items appropriately.
  • Docstring Coverage — ⚠️ Warning. Docstring coverage is 70.97%, which is below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
  • Title check — ✅ Passed. The title clearly describes the main change: reverting a previous commit. It is specific and follows the PR naming template with ticket/type format.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 7

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (7)
tests/integration/defs/perf/disagg/test_configs/disagg/perf-sanity/gb200-deepseek-r1-fp4_128k8k_ctx1_pp8_gen1_dep32_eplb0_mtp3_ccb-UCX.yaml (1)

65-68: ⚠️ Potential issue | 🟠 Major

The gen worker's moe_config.backend should also be reverted to TRTLLM.

Commit 80dbb64 changed both workers' MOE backends:

  • worker_config.ctx.moe_config.backend: CUTEDSL → CUTEDSL (unchanged by 80dbb64)
  • worker_config.gen.moe_config.backend: TRTLLM → CUTEDSL

This PR reverts only the ctx backend (CUTEDSL → TRTLLM), leaving gen at CUTEDSL. For a complete revert, gen should also be restored to TRTLLM, eliminating the asymmetry between the two worker roles.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/integration/defs/perf/disagg/test_configs/disagg/perf-sanity/gb200-deepseek-r1-fp4_128k8k_ctx1_pp8_gen1_dep32_eplb0_mtp3_ccb-UCX.yaml`
around lines 65 - 68, The gen worker's MOE backend was left as CUTEDSL while ctx
was reverted to TRTLLM; update the worker_config.gen.moe_config.backend entry
(the gen worker's moe_config.backend) to TRTLLM to match worker_config.ctx and
complete the revert so both workers use TRTLLM.
tests/integration/defs/perf/disagg/test_configs/disagg/perf-sanity/gb200-deepseek-r1-fp4_8k1k_ctx1_dep4_gen1_dep16_eplb0_mtp1_ccb-UCX.yaml (1)

64-92: ⚠️ Potential issue | 🟡 Minor

Confirm whether the asymmetry in moe_config backends (gen: CUTEDSL, ctx: TRTLLM) in the dep16 variant is intentional.

The asymmetry is confirmed. This config uses CUTEDSL in gen but TRTLLM in ctx, while comparable configs (e.g., the tep8 variants of the same 8k1k config and similar 1k1k patterns) use TRTLLM in both sections. If the dep16 variant requires CUTEDSL specifically for generation on this hardware, this is correct — but given the divergence from the pattern across other comparable test configs, it merits explicit clarification.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/integration/defs/perf/disagg/test_configs/disagg/perf-sanity/gb200-deepseek-r1-fp4_8k1k_ctx1_dep4_gen1_dep16_eplb0_mtp1_ccb-UCX.yaml`
around lines 64 - 92, The review points out asymmetrical moe_config backends:
the generation section uses moe_config.backend = CUTEDSL while the ctx section
uses moe_config.backend = TRTLLM; confirm whether this is intentional for the
dep16 variant and, if not, make them consistent (either set
ctx.moe_config.backend to CUTEDSL to match gen.moe_config.backend or change
gen.moe_config.backend to TRTLLM to match ctx), updating the YAML entries for
the moe_config under the gen and ctx sections accordingly and adding a brief
inline comment indicating why the chosen backend is required for dep16 if the
asymmetry must remain.
tests/integration/defs/perf/test_perf_sanity.py (2)

1-2: ⚠️ Potential issue | 🟡 Minor

Update the copyright year to 2026.

This file was modified in 2026 but the header still ends at 2025.

✏️ Suggested header update
-# SPDX-FileCopyrightText: Copyright (c) 2022-2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
+# SPDX-FileCopyrightText: Copyright (c) 2022-2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
As per coding guidelines: “All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. … update year on modified files.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/perf/test_perf_sanity.py` around lines 1 - 2, Update
the SPDX header years to include 2026 by changing the trailing year in both
header lines (the copyright line and the SPDX-License-Identifier line) so the
header reflects the latest meaningful modification year; locate the top-of-file
header in test_perf_sanity.py and replace "2022-2025" with "2022-2026" (or
otherwise ensure the final year is 2026).

17-47: ⚠️ Potential issue | 🟡 Minor

Use namespace‑preserving module imports throughout the import block.

Multiple symbol-level imports violate the namespace guidelines. All of the following should import the module and qualify usages instead:

  • from test_common.http_utils import wait_for_endpoint_ready
  • from defs.trt_test_alternative import print_error, print_info
  • from tensorrt_llm._utils import get_free_port
  • from ..conftest import get_llm_root, llm_models_root
  • from .open_search_db_utils import (...) with all named symbols
  • from .utils import collect_and_clean_myelin_time

As per coding guidelines: "Always maintain the namespace when importing. Use from package.subpackage import foo instead of from package.subpackage.foo import SomeClass" and "Python imports must use from package.subpackage import module style; never use from module import Class."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/integration/defs/perf/test_perf_sanity.py` around lines 17 - 47, The
import block uses symbol-level imports that violate the namespace rule; change
them to module-level imports and update all call sites to qualify names. Replace
"from test_common.http_utils import wait_for_endpoint_ready" with "import
test_common.http_utils as http_utils" and use
http_utils.wait_for_endpoint_ready; replace "from defs.trt_test_alternative
import print_error, print_info" with "import defs.trt_test_alternative as
trt_test_alternative" and use trt_test_alternative.print_error/print_info;
replace "from tensorrt_llm._utils import get_free_port" with "import
tensorrt_llm._utils as trt_utils" and use trt_utils.get_free_port; replace "from
..conftest import get_llm_root, llm_models_root" with "from .. import conftest"
and use conftest.get_llm_root / conftest.llm_models_root; replace the
multi-symbol "from .open_search_db_utils import (...)" with "from . import
open_search_db_utils" and qualify SCENARIO_MATCH_FIELDS, add_id,
check_perf_regression, get_common_values, get_history_data, get_job_info,
post_new_perf_data, prepare_baseline_data, prepare_regressive_test_cases as
open_search_db_utils.SYMBOL; and replace "from .utils import
collect_and_clean_myelin_time" with "from . import utils" and use
utils.collect_and_clean_myelin_time throughout.
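The rule quoted in this comment is easy to demonstrate in isolation. The sketch below uses the stdlib json module as a stand-in for the project modules named above, since those are not importable outside the repo:

```python
# Illustration (not project code) of the namespace-preserving import rule,
# with json standing in for the project modules named in the review.

# Discouraged: a symbol-level import drops the defining namespace, so call
# sites no longer show where dumps comes from.
from json import dumps
payload = dumps({"backend": "TRTLLM"})

# Preferred: import the module and qualify each usage.
import json
qualified = json.dumps({"backend": "TRTLLM"})

# Both forms produce the same value; only call-site readability differs.
assert payload == qualified
print(qualified)
```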
jenkins/scripts/perf/disaggregated/submit.py (3)

1-5: ⚠️ Potential issue | 🟡 Minor

Add the NVIDIA Apache 2.0 header to this Python file.

This file is modified but lacks the required NVIDIA copyright header with the latest year of modification (2026).

As per coding guidelines: “All source files must contain an NVIDIA copyright header with the year of latest meaningful modification. Use the Apache License 2.0 format. … update year on modified files.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jenkins/scripts/perf/disaggregated/submit.py` around lines 1 - 5, Insert the
NVIDIA Apache 2.0 license header with the latest modification year 2026 at the
top of the file (keeping the existing shebang and imports intact); specifically
add the standard multi-line Apache 2.0 comment block that includes "Copyright
(c) 2026 NVIDIA CORPORATION" and the Apache License, Version 2.0 notice before
or immediately after the "#!/usr/bin/env python3" line so the header is present
for the module containing the current imports (argparse, os, yaml).

223-258: ⚠️ Potential issue | 🟡 Minor

Add a Google-style docstring for get_config_yaml.

This public helper lacks a docstring. Per coding guidelines, all functions must include Google-style docstrings that can be parsed by Sphinx.

📄 Example docstring
 def get_config_yaml(test_list_path, llm_src):
+    """Resolve the perf-sanity config YAML path from a test list entry.
+
+    Args:
+        test_list_path: Path to the rendered test list file.
+        llm_src: Path to the TensorRT-LLM source root.
+
+    Returns:
+        Absolute path to the config YAML file.
+
+    Raises:
+        ValueError: If the test list format is invalid.
+        FileNotFoundError: If the config YAML does not exist.
+    """
     with open(test_list_path, "r") as f:
         first_line = f.readline().strip()
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jenkins/scripts/perf/disaggregated/submit.py` around lines 223 - 258, Add a
Google-style docstring to the public helper function get_config_yaml describing
its purpose and behavior; include an initial one-line summary, a short
description, and sections for Args (test_list_path: str, llm_src: str), Returns
(str path to the resolved YAML), and Raises (ValueError for malformed test
names, FileNotFoundError when the config file is missing), and optionally an
example usage. Place this docstring immediately under the def
get_config_yaml(...) signature so Sphinx can parse it.

223-258: ⚠️ Potential issue | 🟠 Major

Local mode requires --test-list due to unconditional get_config_yaml() call even when --config-yaml is supplied.

Line 296 unconditionally calls get_config_yaml(args.test_list, args.llm_src) even in local mode. With both arguments defaulting to empty strings, this fails when a user provides only --config-yaml for local execution.

Proposed fix
-    config_yaml = get_config_yaml(args.test_list, args.llm_src)
+    if args.run_ci:
+        config_yaml = get_config_yaml(args.test_list, args.llm_src)
+    elif args.config_yaml:
+        config_yaml = args.config_yaml
+    else:
+        parser.error(
+            "Provide --run-ci with --test-list/--llm-src, or supply --config-yaml for local mode."
+        )

Also applies to: line 296

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jenkins/scripts/perf/disaggregated/submit.py` around lines 223 - 258,
Unconditionally calling get_config_yaml(args.test_list, args.llm_src) forces
--test-list even when the user supplied --config-yaml (especially in local
mode); change the call site so that you use args.config_yaml if provided, and
only call get_config_yaml(...) when args.config_yaml is empty and args.test_list
is present. Update the logic where get_config_yaml is invoked (the code that
sets config_yaml from args) to: if args.config_yaml: config_yaml =
args.config_yaml elif args.test_list: config_yaml =
get_config_yaml(args.test_list, args.llm_src) else: raise a clear error asking
for one of --config-yaml or --test-list; keep using the existing get_config_yaml
function unchanged.
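The precedence the fix proposes can be sketched as follows. Only the name get_config_yaml comes from the PR; the stub body, the resolve_config_yaml wrapper, and the argument wiring are illustrative:

```python
import argparse


def get_config_yaml(test_list, llm_src):
    """Stub standing in for the real resolver in submit.py."""
    return f"{llm_src}/resolved-from-{test_list}.yaml"


def resolve_config_yaml(args, parser):
    # Explicit --config-yaml (local mode) takes precedence; otherwise
    # derive the path from --test-list, and fail loudly if neither is set.
    if args.config_yaml:
        return args.config_yaml
    if args.test_list:
        return get_config_yaml(args.test_list, args.llm_src)
    parser.error("Provide --config-yaml or --test-list/--llm-src.")


parser = argparse.ArgumentParser()
parser.add_argument("--config-yaml", default="")
parser.add_argument("--test-list", default="")
parser.add_argument("--llm-src", default="")

local = parser.parse_args(["--config-yaml", "local.yaml"])
print(resolve_config_yaml(local, parser))  # --config-yaml wins in local mode
```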
🧹 Nitpick comments (1)
tests/test_common/http_utils.py (1)

40-41: except Exception is too broad; narrow to expected I/O exceptions.

Per the coding guidelines, exceptions should be limited to the smallest applicable set. For file reading, OSError (covers IOError, FileNotFoundError) and UnicodeDecodeError are the concrete failure modes; catching bare Exception also suppresses unrelated programming errors.

🛠️ Proposed fix
-                    except Exception as err:
+                    except (OSError, UnicodeDecodeError) as err:
                         print(f"Failed to read server file {check_file}: {err}")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_common/http_utils.py` around lines 40 - 41, Replace the broad
"except Exception as err" in tests/test_common/http_utils.py (the block that
prints "Failed to read server file {check_file}: {err}") with a narrower catch
for the I/O-related errors expected when reading files—specifically catch
OSError and UnicodeDecodeError (e.g., "except (OSError, UnicodeDecodeError) as
err") so unrelated exceptions aren't swallowed; keep the same error message
using check_file and err.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@jenkins/scripts/perf/perf_sanity_triage.py`:
- Line 241: The regression_date currently takes the raw millisecond epoch from
data.get("ts_created", "N/A") and will show an unreadable integer in Slack;
replace that assignment so it parses the value (if present and numeric), divides
by 1000 to convert ms → seconds, converts to a human-readable date/time string
(e.g., via datetime.fromtimestamp(...).strftime("%Y-%m-%d %H:%M:%S") or
similar), and falls back to "N/A" on missing/invalid values; update the place
where regression_date is set (the line assigning regression_date from data.get)
so downstream Slack text uses the formatted string.
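A hedged sketch of that conversion; the helper name and the UTC choice are assumptions, since the real assignment sits inline in the triage script:

```python
from datetime import datetime, timezone


def format_regression_date(data):
    """Sketch: convert a millisecond epoch from "ts_created" into a
    human-readable UTC timestamp, falling back to "N/A" on missing or
    malformed values. (Helper name and UTC are illustrative choices.)"""
    ts = data.get("ts_created", "N/A")
    try:
        seconds = int(ts) / 1000  # stored value is milliseconds
        return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(
            "%Y-%m-%d %H:%M:%S")
    except (TypeError, ValueError, OSError, OverflowError):
        return "N/A"
```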

In
`@tests/integration/test_lists/test-db/l0_dgx_b200_multi_nodes_disagg_perf_sanity_ctx1_node1_gpu4_gen1_node1_gpu8.yml`:
- Around line 5-8: The comment currently describes total hardware ("2 nodes with
each node has 8 GPUs") but should describe per-worker allocations to match the
filename and other GB200 files; update the comment above the system_gpu_count
block to state the per-worker allocation (e.g., "ctx worker: 4 GPUs, gen worker:
8 GPUs; total system_gpu_count: 16") so that it aligns with the filename
ctx1_node1_gpu4_gen1_node1_gpu8 and the system_gpu_count key.

In
`@tests/integration/test_lists/test-db/l0_gb200_multi_nodes_disagg_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu4.yml`:
- Around line 5-9: The system_gpu_count is fixed to 8 but the config comments
and group name imply 1 ctx node with 1 GPU + 1 gen node with 4 GPUs (5 active
GPUs); clarify that the 8 GPU requirement is intentional to force whole-node
allocation on GB200 NVL4 (4 GPUs/node) — add an inline comment next to
system_gpu_count explaining that the scheduler must allocate whole NVL4 nodes (2
nodes × 4 GPUs = 8) so the ctx node will have unused GPUs, or adjust the range
if whole-node allocation is not intended; reference the system_gpu_count key and
the ctx/gen worker comments when making the change.

In `@tests/scripts/perf-sanity/gpt_oss_120b_fp4_grace_blackwell.yaml`:
- Around line 3-5: The YAML filename incorrectly uses the "_grace_blackwell"
suffix while declaring standalone Blackwell GPUs (supported_gpus: [B200, B300]);
rename the file to gpt_oss_120b_fp4_blackwell.yaml and update all references to
it (notably the five test entries in the
tests/integration/test_lists/test-db/l0_gb200_multi_gpus_perf_sanity.yml) so
they point to the new filename; ensure the file content still lists
supported_gpus: [B200, B300] and run a quick grep across the repo to catch any
other references to the old gpt_oss_120b_fp4_grace_blackwell.yaml name.

In `@tests/scripts/perf-sanity/README.md`:
- Line 31: Replace the four fenced code blocks that currently start with plain
``` (the blocks containing lines like
"perf/test_perf_sanity.py::test_e2e[aggr_upload-{config yaml file base name}]"
and the two variants with "-{server_config_name}") so they begin with ```text
instead of ``` to satisfy MD040; update all four occurrences (the three
additional blocks that contain the other test identifier variants) so each code
fence explicitly uses the text language specifier.

In `@tests/test_common/http_utils.py`:
- Line 14: The type annotation for the parameter/variable server_proc in
tests/test_common/http_utils.py currently uses subprocess.Popen without
explicitly allowing None, triggering Ruff RUF013; update the annotation to use
the Python 3.10 union form subprocess.Popen | None (i.e., change server_proc:
subprocess.Popen to server_proc: subprocess.Popen | None) so the variable is
explicitly optional and conforms to the project's typing style.
- Around line 26-41: The periodic file-scan using "if check_files and iteration
% 300 == 0" is unreliable; change it to a time-based check using a timestamp
(e.g., start_time = time.time()) and a small interval (e.g., 1–5s) derived from
the existing timeout so the scan runs repeatedly during long request timeouts.
Update the loop that references check_files, ERROR_KEYWORDS and iteration to
instead compare time.time() - last_check_time >= check_interval, update
last_check_time when scanning, and keep the existing behavior of reading files
and raising RuntimeError when keywords are found.
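The suggested time-based gate can be sketched as follows; the function name, intervals, and the scans counter are illustrative, not the project's actual loop:

```python
import time


def wait_with_scans(timeout=0.5, check_interval=0.1, scan=None):
    """Sketch of a time-based periodic scan replacing an
    iteration-count gate like `iteration % 300 == 0`."""
    start = time.monotonic()
    last_check = start
    scans = 0
    while time.monotonic() - start < timeout:
        now = time.monotonic()
        # Scan whenever the interval has elapsed, regardless of how fast
        # or slow individual loop iterations are.
        if now - last_check >= check_interval:
            if scan is not None:
                scan()  # e.g. read check_files and raise on ERROR_KEYWORDS
            scans += 1
            last_check = now
        time.sleep(0.01)
    return scans


print(wait_with_scans())  # roughly timeout / check_interval scans
```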

---

Outside diff comments:
In `@jenkins/scripts/perf/disaggregated/submit.py`:
- Around line 1-5: Insert the NVIDIA Apache 2.0 license header with the latest
modification year 2026 at the top of the file (keeping the existing shebang and
imports intact); specifically add the standard multi-line Apache 2.0 comment
block that includes "Copyright (c) 2026 NVIDIA CORPORATION" and the Apache
License, Version 2.0 notice before or immediately after the "#!/usr/bin/env
python3" line so the header is present for the module containing the current
imports (argparse, os, yaml).
- Around line 223-258: Add a Google-style docstring to the public helper
function get_config_yaml describing its purpose and behavior; include an initial
one-line summary, a short description, and sections for Args (test_list_path:
str, llm_src: str), Returns (str path to the resolved YAML), and Raises
(ValueError for malformed test names, FileNotFoundError when the config file is
missing), and optionally an example usage. Place this docstring immediately
under the def get_config_yaml(...) signature so Sphinx can parse it.
- Around line 223-258: Unconditionally calling get_config_yaml(args.test_list,
args.llm_src) forces --test-list even when the user supplied --config-yaml
(especially in local mode); change the call site so that you use
args.config_yaml if provided, and only call get_config_yaml(...) when
args.config_yaml is empty and args.test_list is present. Update the logic where
get_config_yaml is invoked (the code that sets config_yaml from args) to: if
args.config_yaml: config_yaml = args.config_yaml elif args.test_list:
config_yaml = get_config_yaml(args.test_list, args.llm_src) else: raise a clear
error asking for one of --config-yaml or --test-list; keep using the existing
get_config_yaml function unchanged.
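A sketch of that selection logic; `resolve_config_yaml` is a hypothetical wrapper name, and the `get_config_yaml` stub only mimics the real helper's signature:

```python
def get_config_yaml(test_list_path, llm_src):
    # Stub with the real helper's signature; the actual implementation in
    # submit.py parses the test list and resolves the config YAML path.
    return f"{llm_src}/test_configs/{test_list_path}.yaml"


def resolve_config_yaml(args):
    """Prefer an explicit --config-yaml; fall back to --test-list."""
    if args.config_yaml:
        return args.config_yaml
    if args.test_list:
        return get_config_yaml(args.test_list, args.llm_src)
    raise ValueError("One of --config-yaml or --test-list is required.")
```

With this shape, local-mode users who supply `--config-yaml` never touch the test-list machinery, and a missing pair of flags fails fast with a clear message.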

In
`@tests/integration/defs/perf/disagg/test_configs/disagg/perf-sanity/gb200-deepseek-r1-fp4_128k8k_ctx1_pp8_gen1_dep32_eplb0_mtp3_ccb-UCX.yaml`:
- Around line 65-68: The gen worker's MOE backend was left as CUTEDSL while ctx
was reverted to TRTLLM; update the worker_config.gen.moe_config.backend entry
(the gen worker's moe_config.backend) to TRTLLM to match worker_config.ctx and
complete the revert so both workers use TRTLLM.

In
`@tests/integration/defs/perf/disagg/test_configs/disagg/perf-sanity/gb200-deepseek-r1-fp4_8k1k_ctx1_dep4_gen1_dep16_eplb0_mtp1_ccb-UCX.yaml`:
- Around line 64-92: The review points out asymmetrical moe_config backends: the
generation section uses moe_config.backend = CUTEDSL while the ctx section uses
moe_config.backend = TRTLLM; confirm whether this is intentional for the dep16
variant and, if not, make them consistent (either set ctx.moe_config.backend to
CUTEDSL to match gen.moe_config.backend or change gen.moe_config.backend to
TRTLLM to match ctx), updating the YAML entries for the moe_config under the gen
and ctx sections accordingly and adding a brief inline comment indicating why
the chosen backend is required for dep16 if the asymmetry must remain.

In `@tests/integration/defs/perf/test_perf_sanity.py`:
- Around line 1-2: Update the SPDX header years to include 2026 by changing the
trailing year in both header lines (the copyright line and the
SPDX-License-Identifier line) so the header reflects the latest meaningful
modification year; locate the top-of-file header in test_perf_sanity.py and
replace "2022-2025" with "2022-2026" (or otherwise ensure the final year is
2026).
- Around line 17-47: The import block uses symbol-level imports that violate the
namespace rule; change them to module-level imports and update all call sites to
qualify names. Replace "from test_common.http_utils import
wait_for_endpoint_ready" with "import test_common.http_utils as http_utils" and
use http_utils.wait_for_endpoint_ready; replace "from defs.trt_test_alternative
import print_error, print_info" with "import defs.trt_test_alternative as
trt_test_alternative" and use trt_test_alternative.print_error/print_info;
replace "from tensorrt_llm._utils import get_free_port" with "import
tensorrt_llm._utils as trt_utils" and use trt_utils.get_free_port; replace "from
..conftest import get_llm_root, llm_models_root" with "from .. import conftest"
(relative modules cannot be bound with "import ... as") and use
conftest.get_llm_root / conftest.llm_models_root; replace the multi-symbol "from
.open_search_db_utils import (...)" with "from . import open_search_db_utils"
and qualify SCENARIO_MATCH_FIELDS, add_id, check_perf_regression,
get_common_values, get_history_data, get_job_info, post_new_perf_data,
prepare_baseline_data, prepare_regressive_test_cases as
open_search_db_utils.SYMBOL; and replace "from .utils import
collect_and_clean_myelin_time" with "from . import utils" and use
utils.collect_and_clean_myelin_time throughout.
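For illustration only, here is the before/after shape of that rule using a stdlib module (the project-specific module names above are the ones actually in play):

```python
# Symbol-level import, which the namespace rule flags:
#   from os.path import join
#   log_path = join("logs", "server.log")

# Module-level import; every call site qualifies the name instead:
import os.path as osp

log_path = osp.join("logs", "server.log")

# Note: package-relative modules cannot be bound with "import .mod as mod";
# the equivalent module-level form is "from . import mod" (or
# "from .. import conftest" for a parent package).
```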

---

Duplicate comments:
In
`@tests/integration/defs/perf/disagg/test_configs/disagg/perf-sanity/b200-deepseek-r1-fp4_1k1k_ctx1_dep4_gen1_dep8_eplb0_mtp1_ccb-UCX.yaml`:
- Around line 91-93: The config shows asymmetric backends between generation and
context (moe_config.backend: TRTLLM while cache_transceiver_config is
different/unspecified), duplicating the earlier issue; update the YAML so
generation and context use the same backend—e.g., explicitly set
cache_transceiver_config to use TRTLLM or change moe_config.backend to match the
cache transceiver—so both components (moe_config.backend and
cache_transceiver_config) are consistent with the matching config used in the
referenced gb200 example.

In
`@tests/integration/defs/perf/disagg/test_configs/disagg/perf-sanity/gb200-deepseek-v32-fp4_8k1k_ctx1_dep4_gen1_dep32_eplb256_mtp0_ccb-UCX.yaml`:
- Around line 99-101: The config shows asymmetric backends between ctx and gen:
moe_config.backend is TRTLLM while gen still uses CUTEDSL (see gen's
load_balancer.num_slots: 256 and nvfp4_gemm_config.allowed_backends containing
cutedsl). Make the backends consistent by updating the gen or ctx section so
both use the same backend (adjust moe_config.backend in the gen/ctx block
accordingly) and ensure associated fields (e.g., load_balancer.num_slots and
nvfp4_gemm_config.allowed_backends) are updated to match the chosen backend.

In
`@tests/integration/defs/perf/disagg/test_configs/disagg/perf-sanity/gb200-qwen3-235b-fp4_8k1k_ctx1_tp1_gen1_dep8_eplb0_mtp0_ccb-UCX.yaml`:
- Around line 87-89: The YAML shows asymmetric backends between generation and
context paths (moe_config.backend: TRTLLM vs cache_transceiver_config
unspecified), duplicate of earlier flag; make the backends consistent by
explicitly setting the same backend for both paths (e.g., set
cache_transceiver_config.backend to TRTLLM or change moe_config.backend to match
the intended cache transceiver backend), ensuring the keys referenced are
moe_config.backend and cache_transceiver_config.backend so both gen and ctx use
the same backend implementation.

---

Nitpick comments:
In `@tests/test_common/http_utils.py`:
- Around line 40-41: Replace the broad "except Exception as err" in
tests/test_common/http_utils.py (the block that prints "Failed to read server
file {check_file}: {err}") with a narrower catch for the I/O-related errors
expected when reading files—specifically catch OSError and UnicodeDecodeError
(e.g., "except (OSError, UnicodeDecodeError) as err") so unrelated exceptions
aren't swallowed; keep the same error message using check_file and err.
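A minimal sketch of the narrowed handler (the function name and return convention here are illustrative, not the actual `http_utils` code):

```python
def read_server_file(check_file):
    """Read a server log, tolerating only the expected I/O failures."""
    try:
        with open(check_file, "r") as f:
            return f.read()
    except (OSError, UnicodeDecodeError) as err:
        # Missing, unreadable, or mis-encoded files are logged and skipped;
        # any other exception propagates instead of being swallowed.
        print(f"Failed to read server file {check_file}: {err}")
        return None
```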


```diff
             # Get regression branch and commit from data
-            regression_date = _timestamp_to_date(data.get("ts_created", "N/A"))
+            regression_date = data.get("ts_created", "N/A")
```

⚠️ Potential issue | 🟡 Minor

ts_created will display a raw epoch-millisecond integer in the Slack message.

The query at lines 354–362 builds the ts_created range filter in epoch milliseconds (… * 1000), meaning OpenSearch stores this field as a numeric ms-epoch value. After the removal of the previous _timestamp_to_date() conversion, data.get("ts_created", "N/A") will return a value like 1708444800000, which will appear verbatim in the Slack notification under "Regression date, branch and commit:", making it unreadable.

🛠️ Proposed fix
```diff
-            regression_date = data.get("ts_created", "N/A")
+            ts_created = data.get("ts_created")
+            if isinstance(ts_created, (int, float)):
+                import datetime
+                regression_date = datetime.datetime.utcfromtimestamp(ts_created / 1000).strftime(
+                    "%Y-%m-%d"
+                )
+            else:
+                regression_date = ts_created if ts_created is not None else "N/A"
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@jenkins/scripts/perf/perf_sanity_triage.py` at line 241, The regression_date
currently takes the raw millisecond epoch from data.get("ts_created", "N/A") and
will show an unreadable integer in Slack; replace that assignment so it parses
the value (if present and numeric), divides by 1000 to convert ms → seconds,
converts to a human-readable date/time string (e.g., via
datetime.fromtimestamp(...).strftime("%Y-%m-%d %H:%M:%S") or similar), and falls
back to "N/A" on missing/invalid values; update the place where regression_date
is set (the line assigning regression_date from data.get) so downstream Slack
text uses the formatted string.
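Sketched standalone, with UTC chosen as an assumption (the prompt leaves the timezone unspecified):

```python
from datetime import datetime, timezone


def format_ts_created(value):
    """Render an epoch-millisecond timestamp as a readable UTC string,
    falling back to "N/A" for missing or non-numeric values."""
    if isinstance(value, (int, float)):
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc).strftime(
            "%Y-%m-%d %H:%M:%S"
        )
    return value if value is not None else "N/A"
```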

Comment on lines +5 to +8
```yaml
      # 2 nodes with each node has 8 GPUs
      system_gpu_count:
        gte: 16
        lte: 16
```

⚠️ Potential issue | 🟡 Minor

Nit: Comment describes system hardware, not per-worker allocation — inconsistent with other files in this PR.

The comment "2 nodes with each node has 8 GPUs" describes the total hardware (2 × 8 = 16 GPUs), which is consistent with system_gpu_count: 16. However, the filename convention (ctx1_node1_gpu4_gen1_node1_gpu8) and the per-worker descriptions in related GB200 files (e.g., "1 ctx worker with each 2 nodes and 8 GPUs") suggest the comment should instead describe the per-worker allocation (ctx: 4 GPUs, gen: 8 GPUs).

📝 Suggested comment update
```diff
-      # 2 nodes with each node has 8 GPUs
-      system_gpu_count:
+      # 1 ctx worker with 1 node and 4 GPUs
+      # 1 gen worker with 1 node and 8 GPUs
+      system_gpu_count:
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/integration/test_lists/test-db/l0_dgx_b200_multi_nodes_disagg_perf_sanity_ctx1_node1_gpu4_gen1_node1_gpu8.yml`
around lines 5 - 8, The comment currently describes total hardware ("2 nodes
with each node has 8 GPUs") but should describe per-worker allocations to match
the filename and other GB200 files; update the comment above the
system_gpu_count block to state the per-worker allocation (e.g., "ctx worker: 4
GPUs, gen worker: 8 GPUs; total system_gpu_count: 16") so that it aligns with
the filename ctx1_node1_gpu4_gen1_node1_gpu8 and the system_gpu_count key.

Comment on lines +5 to +9
```yaml
      # 1 ctx worker with each 1 node and 1 GPU
      # 1 gen worker with each 1 node and 4 GPUs
      system_gpu_count:
        gte: 8
        lte: 8
```

⚠️ Potential issue | 🟡 Minor

Confirm system_gpu_count: 8 accounts for whole-node allocation on GB200 NVL4.

The group name and comments describe 1+4 = 5 GPUs in active use, but the range is pinned to exactly 8. This is likely intentional if GB200 nodes are NVL4 (4 GPUs/node) and the scheduler must allocate whole nodes (2 nodes × 4 GPUs = 8), leaving 3 GPUs on the ctx node idle. If that assumption is correct, a short inline comment would make the intent self-documenting; if it is wrong, the range may exclude valid systems or cause unexpected scheduling behavior.

📝 Suggested clarifying comment
```diff
     system_gpu_count:
-        gte: 8
-        lte: 8
+        # 2 nodes × 4 GPUs/node (GB200 NVL4); ctx uses 1 of 4 GPUs on its node
+        gte: 8
+        lte: 8
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In
`@tests/integration/test_lists/test-db/l0_gb200_multi_nodes_disagg_perf_sanity_ctx1_node1_gpu1_gen1_node1_gpu4.yml`
around lines 5 - 9, The system_gpu_count is fixed to 8 but the config comments
and group name imply 1 ctx node with 1 GPU + 1 gen node with 4 GPUs (5 active
GPUs); clarify that the 8 GPU requirement is intentional to force whole-node
allocation on GB200 NVL4 (4 GPUs/node) — add an inline comment next to
system_gpu_count explaining that the scheduler must allocate whole NVL4 nodes (2
nodes × 4 GPUs = 8) so the ctx node will have unused GPUs, or adjust the range
if whole-node allocation is not intended; reference the system_gpu_count key and
the ctx/gen worker comments when making the change.

Comment on lines 3 to +5
```diff
 supported_gpus:
-- GB200
-hardware:
-  gpus_per_node: 4
+- B200
+- B300
```

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

```bash
#!/bin/bash
# Find all references to this filename and similar patterns in the codebase
echo "=== Direct references to gpt_oss_120b_fp4_grace_blackwell ==="
rg -n "gpt_oss_120b_fp4_grace_blackwell" --include="*.yml" --include="*.yaml" --include="*.groovy" --include="*.py" --include="*.sh" --include="*.md"

echo ""
echo "=== Check file exists and list nearby test configs ==="
fd "gpt_oss_120b_fp4" --type f

echo ""
echo "=== Look for GB200 vs B200/B300 references in test configs ==="
rg -n "GB200|B200|B300" tests/scripts/perf-sanity/ --include="*.yaml" -A 2 -B 2
```

Repository: NVIDIA/TensorRT-LLM

Length of output: 1209


🏁 Script executed:

```bash
#!/bin/bash
# Fix rg syntax - use -t instead of --include
echo "=== Direct references to gpt_oss_120b_fp4_grace_blackwell ==="
rg "gpt_oss_120b_fp4_grace_blackwell" -t yaml -t py -t groovy -t sh -t md

echo ""
echo "=== Search for references in test lists and configs ==="
rg "grace_blackwell" -t yaml -t py -t groovy

echo ""
echo "=== Read the actual file ==="
cat tests/scripts/perf-sanity/gpt_oss_120b_fp4_grace_blackwell.yaml | head -40

echo ""
echo "=== List all gpt_oss files in perf-sanity ==="
fd "gpt_oss" tests/scripts/perf-sanity/

echo ""
echo "=== Check if B200/B300 references in test infrastructure ==="
rg "supported_gpus" tests/scripts/perf-sanity/ -A 3 | head -50
```

Repository: NVIDIA/TensorRT-LLM

Length of output: 11321


File should be renamed to match hardware designation: gpt_oss_120b_fp4_blackwell.yaml (and references updated).

The filename gpt_oss_120b_fp4_grace_blackwell.yaml violates the project's naming convention. Files named *_grace_blackwell.yaml consistently declare supported_gpus: [GB200, GB300] (Grace CPU + Blackwell GPU), while this file specifies [B200, B300] (standalone Blackwell GPUs). All similar standalone Blackwell configs use the *_blackwell.yaml naming pattern instead.

Update required in tests/integration/test_lists/test-db/l0_gb200_multi_gpus_perf_sanity.yml where this config is referenced (5 test entries).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/scripts/perf-sanity/gpt_oss_120b_fp4_grace_blackwell.yaml` around lines
3 - 5, The YAML filename incorrectly uses the "_grace_blackwell" suffix while
declaring standalone Blackwell GPUs (supported_gpus: [B200, B300]); rename the
file to gpt_oss_120b_fp4_blackwell.yaml and update all references to it (notably
the five test entries in the
tests/integration/test_lists/test-db/l0_gb200_multi_gpus_perf_sanity.yml) so
they point to the new filename; ensure the file content still lists
supported_gpus: [B200, B300] and run a quick grep across the repo to catch any
other references to the old gpt_oss_120b_fp4_grace_blackwell.yaml name.


````text
**Example**:
**Test Case Names**:
```
````

⚠️ Potential issue | 🟡 Minor

Add language specifiers to fenced code blocks to satisfy MD040.

All four blocks contain plain-text templates or identifiers. Using ```text resolves the markdownlint warning without implying executable content.

📝 Proposed fix (all four blocks)
````diff
-```
+```text
 perf/test_perf_sanity.py::test_e2e[aggr_upload-{config yaml file base name}]
 perf/test_perf_sanity.py::test_e2e[aggr_upload-{config yaml file base name}-{server_config_name}]
````
Apply the same ` ```text ` opener at lines 40, 57, and 62.




Also applies to: 40-40, 57-57, 62-62

🧰 Tools
🪛 markdownlint-cli2 (0.21.0)

[warning] 31-31: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@tests/scripts/perf-sanity/README.md` at line 31, replace the four fenced
code blocks that currently open with a bare ``` fence (the blocks containing
lines like "perf/test_perf_sanity.py::test_e2e[aggr_upload-{config yaml file
base name}]" and the two variants with "-{server_config_name}") so they open
with a ```text fence instead, satisfying MD040; update all four occurrences
(the three additional blocks contain the other test identifier variants) so
each code fence explicitly uses the text language specifier.


<!-- This is an auto-generated comment by CodeRabbit -->

```diff
     timeout: int = 300,
     check_files: list[str] | None = None,
-    server_proc: subprocess.Popen | None = None,
+    server_proc: subprocess.Popen = None,
```

⚠️ Potential issue | 🟡 Minor

server_proc annotation is missing | None.

With Python 3.10+ (per project requirements), the idiomatic form for an optional parameter is T | None. Ruff RUF013 flags this as an implicit Optional.

🛠️ Proposed fix
```diff
-    server_proc: subprocess.Popen = None,
+    server_proc: subprocess.Popen | None = None,
```
🧰 Tools
🪛 Ruff (0.15.1)

[warning] 14-14: PEP 484 prohibits implicit Optional

Convert to T | None

(RUF013)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_common/http_utils.py` at line 14, The type annotation for the
parameter/variable server_proc in tests/test_common/http_utils.py currently uses
subprocess.Popen without explicitly allowing None, triggering Ruff RUF013;
update the annotation to use the Python 3.10 union form subprocess.Popen | None
(i.e., change server_proc: subprocess.Popen to server_proc: subprocess.Popen |
None) so the variable is explicitly optional and conforms to the project's
typing style.

Comment on lines 26 to +41
```diff
         if check_files and iteration % 300 == 0:
             for check_file in check_files:
-                error_lines = check_error(check_file)
-                if error_lines:
-                    error_lines_str = ", ".join(
-                        [f"line {line_idx}: {line_str}" for line_idx, line_str in error_lines]
-                    )
-                    raise RuntimeError(
-                        f"Found error in server file {check_file}: {error_lines_str}"
-                    )
+                if os.path.exists(check_file):
+                    try:
+                        with open(check_file, "r") as f:
+                            content = f.read()
+                        for line in content.splitlines():
+                            for keyword in ERROR_KEYWORDS:
+                                if keyword in line:
+                                    raise RuntimeError(
+                                        f"Found '{keyword}' in server file {check_file}: {line}"
+                                    )
+                    except RuntimeError:
+                        raise
+                    except Exception as err:
+                        print(f"Failed to read server file {check_file}: {err}")
```

⚠️ Potential issue | 🟠 Major

File-check interval is effectively unreachable in realistic failure scenarios.

check_files is scanned only when iteration % 300 == 0. With the default timeout=300 each iteration takes:

  • Best case (server rejects connections immediately): ~1 s/iteration → 300 iterations → first check fires at the very end of the timeout window (once, if at all).
  • Worst case (requests hit the 5 s timeout): ~6 s/iteration → only ~50 iterations before timeout expires → iteration 300 is never reached → error keywords in check_files are never detected.

The failure scenario where check_files is most needed (server is unresponsive/timing out) is exactly the scenario where the check never fires.

Replace the iteration counter with a time-based threshold:

🛠️ Proposed fix
```diff
     start = time.monotonic()
-    iteration = 0
+    last_file_check = start
     while time.monotonic() - start < timeout:
         if server_proc is not None:
             exit_code = server_proc.poll()
             if exit_code is not None:
                 raise RuntimeError(
                     f"Server process exited with code {exit_code} before becoming ready."
                 )
-        iteration += 1
-        if check_files and iteration % 300 == 0:
+        if check_files and time.monotonic() - last_file_check >= 30:
+            last_file_check = time.monotonic()
             for check_file in check_files:
```
🧰 Tools
🪛 Ruff (0.15.1)

[warning] 35-37: Abstract raise to an inner function

(TRY301)


[warning] 35-37: Avoid specifying long messages outside the exception class

(TRY003)


[warning] 40-40: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_common/http_utils.py` around lines 26 - 41, The periodic file-scan
using "if check_files and iteration % 300 == 0" is unreliable; change it to a
time-based check using a timestamp (e.g., start_time = time.time()) and a small
interval (e.g., 1–5s) derived from the existing timeout so the scan runs
repeatedly during long request timeouts. Update the loop that references
check_files, ERROR_KEYWORDS and iteration to instead compare time.time() -
last_check_time >= check_interval, update last_check_time when scanning, and
keep the existing behavior of reading files and raising RuntimeError when
keywords are found.

@tensorrt-cicd
Collaborator

PR_Github #36348 [ run ] completed with state SUCCESS. Commit: b385edd
/LLM/main/L0_MergeRequest_PR pipeline #28115 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

Link to invocation

@chzblych
Collaborator

Closed in favor of e4277e4.

@chzblych chzblych closed this Feb 22, 2026