Skip to content

Conversation

@yingguo-trt
Copy link
Collaborator

@yingguo-trt yingguo-trt commented Jan 7, 2026

Summary by CodeRabbit

  • New Features

    • Batch submission support for SLURM jobs with configurable batch size via DISAGG_BATCH_SIZE environment variable (default: 5).
    • CLI option --disagg-batch-size for custom batch configuration.
  • Bug Fixes

    • Improved error handling to allow job submission despite Kerberos warnings.
  • Documentation

    • Comprehensive batch submission guide with quick start examples, workflow details, benefits, timeout information, and troubleshooting.
  • Tests

    • Extended test execution timeouts to 10 hours for batch-based submissions.

✏️ Tip: You can customize this high-level summary in your review settings.

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

Signed-off-by: FredricZ-2007 <[email protected]>
… backend result for comparison

Signed-off-by: FredricZ-2007 <[email protected]>
Signed-off-by: FredricZ-2007 <[email protected]>
Signed-off-by: FredricZ-2007 <[email protected]>
Signed-off-by: FredricZ-2007 <[email protected]>
Signed-off-by: FredricZ-2007 <[email protected]>
Signed-off-by: FredricZ-2007 <[email protected]>
Signed-off-by: FredricZ-2007 <[email protected]>
Signed-off-by: FredricZ-2007 <[email protected]>
Signed-off-by: FredricZ-2007 <[email protected]>
Signed-off-by: FredricZ-2007 <[email protected]>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Jan 7, 2026

📝 Walkthrough

Walkthrough

The changes introduce batch-based job submission for disaggregation tests by adding a BatchManager class to partition test configurations into batches, submitting each batch on-demand to SLURM. A new pytest fixture and CLI option --disagg-batch-size control batch size (default 5). Test methods are updated to retrieve job IDs from the batch manager instead of direct submission. Subprocess utilities are enhanced to support conditional error handling, and test timeouts are increased to 36000 seconds.

Changes

Cohort / File(s) Summary
Documentation
tests/integration/defs/perf/disagg/README.md
Comprehensive batch job submission documentation added, covering SLURM batch submission workflow, batch size configuration, quick start examples, and troubleshooting; references new DISAGG_BATCH_SIZE environment variable and BatchManager architecture component.
Batch Management Infrastructure
tests/integration/defs/perf/disagg/conftest.py
Introduces BatchManager class with methods to accumulate test configs, partition into batches, and submit batches on-demand via SLURM; adds session-scoped batch_manager fixture that reads batch size from CLI option or DISAGG_BATCH_SIZE env var (default 5); adds --disagg-batch-size pytest CLI option.
Subprocess & Execution Changes
tests/integration/defs/perf/disagg/execution/subprocess_utils.py,
tests/integration/defs/perf/disagg/execution/executor.py
subprocess_utils.py: adds optional check parameter (default True) to exec_cmd_with_output for conditional error raising and adjusts stderr logging behavior based on check value. executor.py: calls exec_cmd_with_output with check=False in submit_shell_job to tolerate Kerberos warnings.
Test Method Updates
tests/integration/defs/perf/disagg/test_disagg.py
Updates test_benchmark, test_accuracy, and test_stress methods to accept batch_manager fixture parameter; replaces direct JobManager submission with batch_manager.get_job_id(test_config) for job ID retrieval; removes explicit success flag assertions; increases wait_for_completion timeouts uniformly to 36000 seconds.

Sequence Diagram(s)

sequenceDiagram
    actor Test as Test Method<br/>(pytest)
    participant BM as BatchManager<br/>(fixture)
    participant SU as Subprocess Utils
    participant SLURM as SLURM<br/>(scheduler)

    rect rgb(240, 248, 255)
    Note over Test,SLURM: Session Setup Phase
    Test->>BM: batch_manager fixture initialized
    BM->>BM: Read batch size from CLI/env (default 5)
    BM->>BM: Collect all test configs into all_configs
    end

    rect rgb(255, 245, 238)
    Note over Test,SLURM: Test Execution Phase
    Test->>Test: test_benchmark/accuracy/stress executes
    Test->>BM: get_job_id(test_config)
    BM->>BM: Determine batch number for config
    alt Batch not yet submitted
        BM->>BM: _submit_batch(batch_num)
        BM->>SU: exec_cmd_with_output(sbatch cmd, check=False)
        SU->>SLURM: Submit batch of jobs
        SLURM-->>SU: Return job IDs + warnings
        SU-->>BM: Parse output, extract job_ids
        BM->>BM: Record job_ids in job_mapping
    end
    BM->>BM: Lookup job_id for test_config
    BM-->>Test: Return job_id
    Test->>Test: wait_for_completion(job_id, timeout=36000s)
    end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Description check ⚠️ Warning PR description is incomplete and lacks critical sections from the template. Missing detailed issue explanation, test coverage documentation, and several PR checklist items are unchecked. Complete the 'Description' and 'Test Coverage' sections with clear explanations. Verify and update all PR checklist items, especially documentation updates and architecture diagram changes if applicable.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately reflects the main change: introducing batch-based job submission functionality to support multiple simultaneous job submissions.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing touches
  • 📝 Generate docstrings

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tests/integration/defs/perf/disagg/execution/subprocess_utils.py (1)

1-16: Missing NVIDIA copyright header.

Per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.

🤖 Fix all issues with AI agents
In @tests/integration/defs/perf/disagg/conftest.py:
- Around line 1-9: This file (conftest.py) is missing the required NVIDIA
copyright header; add the standard multi-line NVIDIA copyright header at the
very top of conftest.py (above the docstring and imports), include the correct
year of the latest meaningful modification, and ensure the header format matches
other TensorRT-LLM source files so linting and license checks pass.
🧹 Nitpick comments (6)
tests/integration/defs/perf/disagg/conftest.py (2)

186-206: Linear search on every get_job_id call may be inefficient for large test suites.

The get_job_id method performs a linear search through all_configs using a generator expression each time it's called. For test suites with hundreds of configurations, consider building a reverse lookup dict during add_config.

♻️ Suggested optimization
 def __init__(self, batch_size=5):
     # ...
     self.submitted_batches = set()
     self.job_mapping = {}
     self.all_configs = []
+    self._config_index = {}  # test_id -> index for O(1) lookup
     # ...

 def add_config(self, test_config):
     """Add a test configuration to the manager."""
+    self._config_index[test_config.test_id] = len(self.all_configs)
     self.all_configs.append(test_config)

 def get_job_id(self, test_config):
     # Find the index of this config in the ordered list
-    try:
-        idx = next(i for i, c in enumerate(self.all_configs) 
-                  if c.test_id == test_config.test_id)
-    except StopIteration:
+    idx = self._config_index.get(test_config.test_id)
+    if idx is None:
         logger.error(f"Config not found in manager: {test_config.test_id}")
         return None
     # ...

245-247: Consider catching more specific exceptions.

The broad Exception catch here (flagged by Ruff BLE001) could mask unexpected errors. Since JobManager.submit_test_job returns a tuple on failure rather than raising, exceptions here likely indicate programming errors or severe issues that may warrant different handling.

♻️ Proposed fix
-            except Exception as e:
+            except (OSError, IOError, ValueError) as e:
                 self.job_mapping[config.test_id] = None
                 logger.error(f"  [{i:3d}/{len(batch_configs)}] Error: {e}")

Alternatively, if you want to keep the broad catch for resilience, at least log the traceback for debugging:

             except Exception as e:
                 self.job_mapping[config.test_id] = None
-                logger.error(f"  [{i:3d}/{len(batch_configs)}] Error: {e}")
+                import traceback
+                logger.error(f"  [{i:3d}/{len(batch_configs)}] Error: {e}")
+                logger.debug(traceback.format_exc())
tests/integration/defs/perf/disagg/README.md (3)

163-183: Add language specifier to fenced code block.

The workflow diagram code block (lines 165-183) lacks a language specifier. Consider adding text or plaintext for documentation clarity and to satisfy markdown linting (MD040).

♻️ Proposed fix
-```
+```text
 Pytest Collection Phase:
   - Collects all test cases (e.g., 100 tests)

218-238: Add language specifiers to performance comparison code blocks.

The "Before" and "After" examples lack language specifiers (MD040). Also, "Speedup: 50x" uses emphasis instead of a proper heading (MD036).

♻️ Proposed fix
 **Before (Sequential Submission):**
-```
+```text
 Case 1: submit + wait (1.5h) = 1.5h
 ...

After (Batch Submission, batch_size=50):
- +text
Batch 0 (50 jobs): submitted in parallel
...


-**Speedup: 50x**
+### Speedup: 50x

240-268: Add language specifiers to troubleshooting code blocks.

The troubleshooting output examples at lines 243 and 254 lack language specifiers.

♻️ Proposed fix
 **Check BatchManager initialization:**
-```
+```text
 ======================================================================
 Batch Manager Initialized
 ...

Monitor batch submission:
- +text

Submitting Batch 0
...

</details>

</blockquote></details>
<details>
<summary>tests/integration/defs/perf/disagg/test_disagg.py (1)</summary><blockquote>

`104-108`: **Consider adding a descriptive message to the job_id assertion.**

The assertion `assert job_id, f"Failed to get job_id for {test_config.test_id}"` is good, but for debugging batch submission failures, it may help to include which batch the test belonged to.



This is a minor enhancement that could aid debugging when batch submissions fail silently.


Also applies to: 181-185, 266-270

</blockquote></details>

</blockquote></details>

<details>
<summary>📜 Review details</summary>

**Configuration used**: Path: .coderabbit.yaml

**Review profile**: CHILL

**Plan**: Pro

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 7e88212d2416f05fee6ac82c15a5831d361388f9 and 06686af570c1158f9d81560645cb14013247321b.

</details>

<details>
<summary>📒 Files selected for processing (5)</summary>

* `tests/integration/defs/perf/disagg/README.md`
* `tests/integration/defs/perf/disagg/conftest.py`
* `tests/integration/defs/perf/disagg/execution/executor.py`
* `tests/integration/defs/perf/disagg/execution/subprocess_utils.py`
* `tests/integration/defs/perf/disagg/test_disagg.py`

</details>

<details>
<summary>🧰 Additional context used</summary>

<details>
<summary>📓 Path-based instructions (3)</summary>

<details>
<summary>**/*.py</summary>


**📄 CodeRabbit inference engine (CODING_GUIDELINES.md)**

> `**/*.py`: The code developed for TensorRT-LLM should conform to Python 3.8+
> Indent Python code with 4 spaces. Do not use tabs
> Always maintain the namespace when importing Python modules, even if only one class or function from a module is used
> Python filenames should use snake_case (e.g., `some_file.py`)
> Python classes should use PascalCase (e.g., `class SomeClass`)
> Python functions and methods should use snake_case (e.g., `def my_awesome_function():`)
> Python local variables should use snake_case, with prefix `k` for variable names that start with a number (e.g., `k_99th_percentile`)
> Python global variables should use upper snake_case with prefix `G` (e.g., `G_MY_GLOBAL`)
> Python constants should use upper snake_case (e.g., `MY_CONSTANT`)
> Avoid shadowing variables declared in an outer scope in Python
> Initialize all externally visible members of a Python class in the constructor
> For Python interfaces that may be used outside a file, prefer docstrings over comments
> Use comments in Python for code within a function, or interfaces that are local to a file
> Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx
> Python attributes and variables can be documented inline with the format `"""<type>: Description"""`
> Avoid using reflection in Python when functionality can be easily achieved without reflection
> When using try-except blocks in Python, limit the except clause to the smallest set of errors possible
> When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block for the main logic

Files:
- `tests/integration/defs/perf/disagg/execution/executor.py`
- `tests/integration/defs/perf/disagg/test_disagg.py`
- `tests/integration/defs/perf/disagg/conftest.py`
- `tests/integration/defs/perf/disagg/execution/subprocess_utils.py`

</details>
<details>
<summary>**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}</summary>


**📄 CodeRabbit inference engine (CODING_GUIDELINES.md)**

> All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification

Files:
- `tests/integration/defs/perf/disagg/execution/executor.py`
- `tests/integration/defs/perf/disagg/test_disagg.py`
- `tests/integration/defs/perf/disagg/conftest.py`
- `tests/integration/defs/perf/disagg/execution/subprocess_utils.py`

</details>
<details>
<summary>**/*.{md,rst}</summary>


**📄 CodeRabbit inference engine (CODING_GUIDELINES.md)**

> When documenting CLI commands for TensorRT-LLM tools like `trtllm-serve`, `trtllm-bench`, or `trtllm-eval`, prefer using `--config` over `--extra_llm_api_options` for specifying configuration files

Files:
- `tests/integration/defs/perf/disagg/README.md`

</details>

</details><details>
<summary>🧬 Code graph analysis (2)</summary>

<details>
<summary>tests/integration/defs/perf/disagg/execution/executor.py (1)</summary><blockquote>

<details>
<summary>tests/integration/defs/perf/disagg/execution/subprocess_utils.py (1)</summary>

* `exec_cmd_with_output` (36-68)

</details>

</blockquote></details>
<details>
<summary>tests/integration/defs/perf/disagg/conftest.py (2)</summary><blockquote>

<details>
<summary>tests/integration/defs/perf/disagg/utils/logger.py (4)</summary>

* `info` (81-83)
* `error` (89-91)
* `success` (97-99)
* `warning` (85-87)

</details>
<details>
<summary>tests/integration/defs/perf/disagg/execution/executor.py (2)</summary>

* `JobManager` (21-860)
* `submit_test_job` (200-270)

</details>

</blockquote></details>

</details><details>
<summary>🪛 GitHub Actions: Release Checks</summary>

<details>
<summary>tests/integration/defs/perf/disagg/conftest.py</summary>

[error] 1-1: Trailing whitespace detected and fixed by pre-commit hook.

</details>
<details>
<summary>tests/integration/defs/perf/disagg/execution/subprocess_utils.py</summary>

[error] 1-1: ruff-format reformatted this file during pre-commit; changes applied.

---

[warning] 1-1: pre-commit hooks updated formatting in multiple files; CI may fail until changes are committed.

</details>

</details>
<details>
<summary>🪛 markdownlint-cli2 (0.18.1)</summary>

<details>
<summary>tests/integration/defs/perf/disagg/README.md</summary>

165-165: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

221-221: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

230-230: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

238-238: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)

---

243-243: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

---

254-254: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>
<details>
<summary>🪛 Ruff (0.14.10)</summary>

<details>
<summary>tests/integration/defs/perf/disagg/conftest.py</summary>

245-245: Do not catch blind exception: `Exception`

(BLE001)

</details>
<details>
<summary>tests/integration/defs/perf/disagg/execution/subprocess_utils.py</summary>

52-52: `subprocess` call: check for execution of untrusted input

(S603)

</details>

</details>

</details>

<details>
<summary>⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)</summary>

* GitHub Check: Pre-commit Check

</details>

<details>
<summary>🔇 Additional comments (9)</summary><blockquote>

<details>
<summary>tests/integration/defs/perf/disagg/conftest.py (3)</summary><blockquote>

`27-35`: **LGTM!**

The new `--disagg-batch-size` CLI option is well-documented with clear help text explaining the default behavior, environment variable fallback, and the unlimited batch option (0).

---

`126-161`: **Well-designed BatchManager class with clear documentation.**

The class structure is clean with good docstrings explaining the lazy batch submission approach. The batch_size normalization handling None, 0, and negative values uniformly is a good defensive practice.

---

`257-294`: **LGTM!**

The `batch_manager` fixture correctly implements the priority chain (CLI → env var → default) and gracefully handles invalid environment variable values. The extraction of test configs from collected items is appropriately guarded with `hasattr` checks.

</blockquote></details>
<details>
<summary>tests/integration/defs/perf/disagg/execution/executor.py (1)</summary><blockquote>

`117-119`: **LGTM!**

Using `check=False` to tolerate Kerberos warnings aligns with the documented behavior of `submit.py`. The comment clearly explains the rationale, and the subsequent job_id parsing logic handles both successful and failed submissions appropriately.

</blockquote></details>
<details>
<summary>tests/integration/defs/perf/disagg/README.md (1)</summary><blockquote>

`345-356`: **Architecture section update is accurate.**

The addition of BatchManager to the architecture list correctly reflects the new component and its role in managing batch job submission for parallel execution.

</blockquote></details>
<details>
<summary>tests/integration/defs/perf/disagg/execution/subprocess_utils.py (1)</summary><blockquote>

`36-68`: **LGTM! Well-implemented conditional error handling.**

The addition of the `check` parameter with appropriate default (`True`) maintains backward compatibility while enabling callers like `submit_shell_job` to tolerate non-critical warnings. The conditional stderr logging (error vs warning based on `check` flag) provides good observability.

The static analysis hint about untrusted input (S603) is a false positive in this context since commands originate from internal code paths, not user input.

</blockquote></details>
<details>
<summary>tests/integration/defs/perf/disagg/test_disagg.py (3)</summary><blockquote>

`65-111`: **LGTM! Clean integration with BatchManager.**

The test method correctly obtains `job_id` via `batch_manager.get_job_id()` and appropriately asserts on the result. The 10-hour timeout (36000 seconds) aligns with the documented behavior in README.md and accounts for SLURM queue delays.

---

`138-188`: **LGTM! Consistent pattern with test_benchmark.**

The accuracy test method follows the same batch_manager integration pattern, maintaining consistency across test types.

---

`217-273`: **LGTM! Stress test follows established pattern.**

The stress test method correctly integrates with batch_manager using the same pattern as perf and accuracy tests.

</blockquote></details>

</blockquote></details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

@yingguo-trt yingguo-trt enabled auto-merge (squash) January 7, 2026 08:43
Signed-off-by: yingguo-trt <[email protected]>
Copy link
Collaborator

@fredricz-20070104 fredricz-20070104 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Support batch submit, default job size = 5.

@yingguo-trt
Copy link
Collaborator Author

/bot run

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30882 [ run ] triggered by Bot. Commit: 2259c58

@yingguo-trt
Copy link
Collaborator Author

/bot run --skip-test

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30886 [ run ] triggered by Bot. Commit: aa71925

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30882 [ run ] completed with state ABORTED. Commit: 2259c58

@yingguo-trt
Copy link
Collaborator Author

/bot run --skip-test

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30889 [ run ] triggered by Bot. Commit: aa71925

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30886 [ run ] completed with state ABORTED. Commit: aa71925

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30889 [ run ] completed with state SUCCESS. Commit: aa71925
/LLM/main/L0_MergeRequest_PR pipeline #23854 (Partly Tested) completed with status: 'SUCCESS'

@yingguo-trt
Copy link
Collaborator Author

/bot reuse-pipeline

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30960 [ reuse-pipeline ] triggered by Bot. Commit: 3e910a9

@tensorrt-cicd
Copy link
Collaborator

PR_Github #30960 [ reuse-pipeline ] completed with state SUCCESS. Commit: 3e910a9
Reusing PR_Github #30889 (Partly Tested) for commit 3e910a9

@yingguo-trt yingguo-trt merged commit f8b2a8f into NVIDIA:main Jan 8, 2026
5 checks passed
@yingguo-trt yingguo-trt deleted the feature/submit_multi_jobs_clean branch January 8, 2026 03:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants