[None][chore] Support multiple job submission at the same time #10492
Conversation
Signed-off-by: FredricZ-2007 <[email protected]>
… backend result for comparison Signed-off-by: FredricZ-2007 <[email protected]>
📝 Walkthrough

The changes introduce batch-based job submission for disaggregation tests by adding a `BatchManager` class that partitions test configurations into batches and submits each batch on demand to SLURM. A new pytest fixture and CLI option control the batch size.
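The partitioning step described in the walkthrough can be sketched standalone (a minimal sketch; the helper name is hypothetical and not taken from the PR's actual code):

```python
def partition_into_batches(configs, batch_size):
    """Split an ordered list of test configs into fixed-size batches.

    A batch_size of 0, None, or a negative value means one unlimited
    batch, mirroring the normalization the review praises in
    BatchManager.__init__.
    """
    if not batch_size or batch_size <= 0:
        return [list(configs)] if configs else []
    return [list(configs[i:i + batch_size])
            for i in range(0, len(configs), batch_size)]
```

With the default batch size of 5, 12 collected tests would be split into batches of 5, 5, and 2.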
Sequence Diagram(s)

```mermaid
sequenceDiagram
    actor Test as Test Method<br/>(pytest)
    participant BM as BatchManager<br/>(fixture)
    participant SU as Subprocess Utils
    participant SLURM as SLURM<br/>(scheduler)
    rect rgb(240, 248, 255)
        Note over Test,SLURM: Session Setup Phase
        Test->>BM: batch_manager fixture initialized
        BM->>BM: Read batch size from CLI/env (default 5)
        BM->>BM: Collect all test configs into all_configs
    end
    rect rgb(255, 245, 238)
        Note over Test,SLURM: Test Execution Phase
        Test->>Test: test_benchmark/accuracy/stress executes
        Test->>BM: get_job_id(test_config)
        BM->>BM: Determine batch number for config
        alt Batch not yet submitted
            BM->>BM: _submit_batch(batch_num)
            BM->>SU: exec_cmd_with_output(sbatch cmd, check=False)
            SU->>SLURM: Submit batch of jobs
            SLURM-->>SU: Return job IDs + warnings
            SU-->>BM: Parse output, extract job_ids
            BM->>BM: Record job_ids in job_mapping
        end
        BM->>BM: Lookup job_id for test_config
        BM-->>Test: Return job_id
        Test->>Test: wait_for_completion(job_id, timeout=36000s)
    end
```
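The lazy submission flow in the diagram can be sketched as a minimal Python class. This is an illustrative sketch, not the PR's implementation: `submit_fn` stands in for the real `sbatch`-based submission path, configs are plain dicts, and `batch_size` is assumed to be at least 1.

```python
class BatchManager:
    """Sketch of lazy, on-demand batch submission.

    A batch is submitted only when a test first asks for a job id
    belonging to it; results are cached in job_mapping so later
    lookups in the same batch trigger no new submission.
    """

    def __init__(self, submit_fn, batch_size=5):
        self.submit_fn = submit_fn    # callable: list[config] -> list[job_id]
        self.batch_size = batch_size
        self.all_configs = []
        self.submitted_batches = set()
        self.job_mapping = {}         # test_id -> job_id

    def add_config(self, config):
        """Register a config during session setup (collection phase)."""
        self.all_configs.append(config)

    def get_job_id(self, config):
        """Return the job id for config, submitting its batch if needed."""
        idx = next(i for i, c in enumerate(self.all_configs)
                   if c["test_id"] == config["test_id"])
        batch_num = idx // self.batch_size
        if batch_num not in self.submitted_batches:
            start = batch_num * self.batch_size
            batch = self.all_configs[start:start + self.batch_size]
            job_ids = self.submit_fn(batch)
            for c, jid in zip(batch, job_ids):
                self.job_mapping[c["test_id"]] = jid
            self.submitted_batches.add(batch_num)
        return self.job_mapping[config["test_id"]]
```

The key property is visible in the `alt Batch not yet submitted` branch above: the first `get_job_id` call for any config in a batch submits the whole batch in one go, so up to `batch_size` SLURM jobs queue in parallel while the first test waits.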
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)

✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/integration/defs/perf/disagg/execution/subprocess_utils.py (1)
`1-16`: **Missing NVIDIA copyright header.** Per coding guidelines, all TensorRT-LLM source files should contain an NVIDIA copyright header.
🤖 Fix all issues with AI agents
In @tests/integration/defs/perf/disagg/conftest.py:
- Around line 1-9: This file (conftest.py) is missing the required NVIDIA
copyright header; add the standard multi-line NVIDIA copyright header at the
very top of conftest.py (above the docstring and imports), include the correct
year of the latest meaningful modification, and ensure the header format matches
other TensorRT-LLM source files so linting and license checks pass.
🧹 Nitpick comments (6)
tests/integration/defs/perf/disagg/conftest.py (2)
`186-206`: **Linear search on every `get_job_id` call may be inefficient for large test suites.**

The `get_job_id` method performs a linear search through `all_configs` using a generator expression each time it is called. For test suites with hundreds of configurations, consider building a reverse lookup dict during `add_config`.

♻️ Suggested optimization

```diff
 def __init__(self, batch_size=5):
     # ...
     self.submitted_batches = set()
     self.job_mapping = {}
     self.all_configs = []
+    self._config_index = {}  # test_id -> index for O(1) lookup
     # ...

 def add_config(self, test_config):
     """Add a test configuration to the manager."""
+    self._config_index[test_config.test_id] = len(self.all_configs)
     self.all_configs.append(test_config)

 def get_job_id(self, test_config):
     # Find the index of this config in the ordered list
-    try:
-        idx = next(i for i, c in enumerate(self.all_configs)
-                   if c.test_id == test_config.test_id)
-    except StopIteration:
+    idx = self._config_index.get(test_config.test_id)
+    if idx is None:
         logger.error(f"Config not found in manager: {test_config.test_id}")
         return None
     # ...
```
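The suggested reverse lookup can also be demonstrated in isolation (a sketch with hypothetical names, using plain dicts instead of the PR's config objects):

```python
class ConfigIndex:
    """Standalone illustration of the reverse-lookup suggestion above."""

    def __init__(self):
        self.all_configs = []
        self._config_index = {}  # test_id -> position, O(1) lookup

    def add_config(self, config):
        # Record the position before appending, so index == current length.
        self._config_index[config["test_id"]] = len(self.all_configs)
        self.all_configs.append(config)

    def index_of(self, config):
        # Returns None for unknown configs instead of raising StopIteration.
        return self._config_index.get(config["test_id"])
```

Each lookup is now a single dict access rather than a scan of `all_configs`.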
`245-247`: **Consider catching more specific exceptions.**

The broad `Exception` catch here (flagged by Ruff BLE001) could mask unexpected errors. Since `JobManager.submit_test_job` returns a tuple on failure rather than raising, exceptions here likely indicate programming errors or severe issues that may warrant different handling.

♻️ Proposed fix

```diff
-        except Exception as e:
+        except (OSError, IOError, ValueError) as e:
             self.job_mapping[config.test_id] = None
             logger.error(f"  [{i:3d}/{len(batch_configs)}] Error: {e}")
```

Alternatively, if you want to keep the broad catch for resilience, at least log the traceback for debugging:

```diff
         except Exception as e:
             self.job_mapping[config.test_id] = None
+            import traceback
             logger.error(f"  [{i:3d}/{len(batch_configs)}] Error: {e}")
+            logger.debug(traceback.format_exc())
```

tests/integration/defs/perf/disagg/README.md (3)
`163-183`: **Add language specifier to fenced code block.**

The workflow diagram code block (lines 165-183) lacks a language specifier. Consider adding `text` or `plaintext` for documentation clarity and to satisfy markdown linting (MD040).

♻️ Proposed fix

````diff
-```
+```text
 Pytest Collection Phase:
 - Collects all test cases (e.g., 100 tests)
````
`218-238`: **Add language specifiers to performance comparison code blocks.**

The "Before" and "After" examples lack language specifiers (MD040). Also, "Speedup: 50x" uses emphasis instead of a proper heading (MD036).

♻️ Proposed fix

**Before (Sequential Submission):**

````diff
-```
+```text
 Case 1: submit + wait (1.5h) = 1.5h
 ...
````

**After (Batch Submission, batch_size=50):**

````diff
-```
+```text
 Batch 0 (50 jobs): submitted in parallel
 ...
````

```diff
-**Speedup: 50x**
+### Speedup: 50x
```
`240-268`: **Add language specifiers to troubleshooting code blocks.**

The troubleshooting output examples at lines 243 and 254 lack language specifiers.

♻️ Proposed fix

**Check BatchManager initialization:**

````diff
-```
+```text
 ======================================================================
 Batch Manager Initialized
 ...
````

**Monitor batch submission:**

````diff
-```
+```text
 Submitting Batch 0
 ...
````
tests/integration/defs/perf/disagg/test_disagg.py (1)

`104-108`: **Consider adding a descriptive message to the job_id assertion.**

The assertion `assert job_id, f"Failed to get job_id for {test_config.test_id}"` is good, but for debugging batch submission failures, it may help to include which batch the test belonged to. This is a minor enhancement that could aid debugging when batch submissions fail silently.

Also applies to: 181-185, 266-270

📜 Review details

**Configuration used**: Path: .coderabbit.yaml | **Review profile**: CHILL | **Plan**: Pro

📥 Commits: Reviewing files that changed from the base of the PR and between 7e88212d2416f05fee6ac82c15a5831d361388f9 and 06686af570c1158f9d81560645cb14013247321b.

📒 Files selected for processing (5):

* `tests/integration/defs/perf/disagg/README.md`
* `tests/integration/defs/perf/disagg/conftest.py`
* `tests/integration/defs/perf/disagg/execution/executor.py`
* `tests/integration/defs/perf/disagg/execution/subprocess_utils.py`
* `tests/integration/defs/perf/disagg/test_disagg.py`

🧰 Additional context used

📓 Path-based instructions (3), from the CodeRabbit inference engine (CODING_GUIDELINES.md):

`**/*.py`:

> The code developed for TensorRT-LLM should conform to Python 3.8+
> Indent Python code with 4 spaces. Do not use tabs
> Always maintain the namespace when importing Python modules, even if only one class or function from a module is used
> Python filenames should use snake_case (e.g., `some_file.py`)
> Python classes should use PascalCase (e.g., `class SomeClass`)
> Python functions and methods should use snake_case (e.g., `def my_awesome_function():`)
> Python local variables should use snake_case, with prefix `k` for variable names that start with a number (e.g., `k_99th_percentile`)
> Python global variables should use upper snake_case with prefix `G` (e.g., `G_MY_GLOBAL`)
> Python constants should use upper snake_case (e.g., `MY_CONSTANT`)
> Avoid shadowing variables declared in an outer scope in Python
> Initialize all externally visible members of a Python class in the constructor
> For Python interfaces that may be used outside a file, prefer docstrings over comments
> Use comments in Python for code within a function, or interfaces that are local to a file
> Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx
> Python attributes and variables can be documented inline with the format `"""<type>: Description"""`
> Avoid using reflection in Python when functionality can be easily achieved without reflection
> When using try-except blocks in Python, limit the except clause to the smallest set of errors possible
> When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block for the main logic

`**/*.{cpp,cc,cxx,h,hpp,hxx,cu,cuh,py}`:

> All TensorRT-LLM source files (.cpp, .h, .cu, .py, and other source files) should contain an NVIDIA copyright header with the year of latest meaningful modification

`**/*.{md,rst}`:

> When documenting CLI commands for TensorRT-LLM tools like `trtllm-serve`, `trtllm-bench`, or `trtllm-eval`, prefer using `--config` over `--extra_llm_api_options` for specifying configuration files

These instructions apply to all five files listed above (the markdown instruction applies to `README.md`).

🧬 Code graph analysis (2):

* `execution/executor.py` uses `exec_cmd_with_output` (36-68) from `execution/subprocess_utils.py`
* `conftest.py` uses `info` (81-83), `error` (89-91), `success` (97-99), and `warning` (85-87) from `utils/logger.py`, plus `JobManager` (21-860) and `submit_test_job` (200-270) from `execution/executor.py`

🪛 GitHub Actions: Release Checks

* `conftest.py`: [error] Trailing whitespace detected and fixed by pre-commit hook.
* `subprocess_utils.py`: [error] ruff-format reformatted this file during pre-commit; changes applied. [warning] pre-commit hooks updated formatting in multiple files; CI may fail until changes are committed.

🪛 markdownlint-cli2 (0.18.1), `README.md`: MD040 (fenced-code-language) at lines 165, 221, 230, 243, and 254; MD036 (no-emphasis-as-heading) at line 238.

🪛 Ruff (0.14.10):

* `conftest.py` 245: Do not catch blind exception: `Exception` (BLE001)
* `subprocess_utils.py` 52: `subprocess` call: check for execution of untrusted input (S603)

⏰ Context from checks skipped due to timeout of 90000ms (1): GitHub Check: Pre-commit Check. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms).

🔇 Additional comments (9)

tests/integration/defs/perf/disagg/conftest.py (3)

* `27-35`: **LGTM!** The new `--disagg-batch-size` CLI option is well-documented with clear help text explaining the default behavior, environment variable fallback, and the unlimited batch option (0).
* `126-161`: **Well-designed BatchManager class with clear documentation.** The class structure is clean, with good docstrings explaining the lazy batch submission approach. The batch_size normalization handling None, 0, and negative values uniformly is a good defensive practice.
* `257-294`: **LGTM!** The `batch_manager` fixture correctly implements the priority chain (CLI → env var → default) and gracefully handles invalid environment variable values. The extraction of test configs from collected items is appropriately guarded with `hasattr` checks.

tests/integration/defs/perf/disagg/execution/executor.py (1)

* `117-119`: **LGTM!** Using `check=False` to tolerate Kerberos warnings aligns with the documented behavior of `submit.py`. The comment clearly explains the rationale, and the subsequent job_id parsing logic handles both successful and failed submissions appropriately.

tests/integration/defs/perf/disagg/README.md (1)

* `345-356`: **Architecture section update is accurate.** The addition of BatchManager to the architecture list correctly reflects the new component and its role in managing batch job submission for parallel execution.

tests/integration/defs/perf/disagg/execution/subprocess_utils.py (1)

* `36-68`: **LGTM! Well-implemented conditional error handling.** The addition of the `check` parameter with an appropriate default (`True`) maintains backward compatibility while enabling callers like `submit_shell_job` to tolerate non-critical warnings. The conditional stderr logging (error vs warning based on the `check` flag) provides good observability. The static analysis hint about untrusted input (S603) is a false positive in this context since commands originate from internal code paths, not user input.

tests/integration/defs/perf/disagg/test_disagg.py (3)

* `65-111`: **LGTM! Clean integration with BatchManager.** The test method correctly obtains `job_id` via `batch_manager.get_job_id()` and appropriately asserts on the result. The 10-hour timeout (36000 seconds) aligns with the documented behavior in README.md and accounts for SLURM queue delays.
* `138-188`: **LGTM! Consistent pattern with test_benchmark.** The accuracy test method follows the same batch_manager integration pattern, maintaining consistency across test types.
* `217-273`: **LGTM! Stress test follows established pattern.** The stress test method correctly integrates with batch_manager using the same pattern as the perf and accuracy tests.
Signed-off-by: yingguo-trt <[email protected]>
fredricz-20070104
left a comment
Supports batch submission; the default batch size is 5.
/bot run

PR_Github #30882 [ run ] triggered by Bot. Commit:

/bot run --skip-test

PR_Github #30886 [ run ] triggered by Bot. Commit:

PR_Github #30882 [ run ] completed with state

/bot run --skip-test

PR_Github #30889 [ run ] triggered by Bot. Commit:

PR_Github #30886 [ run ] completed with state

PR_Github #30889 [ run ] completed with state

/bot reuse-pipeline

PR_Github #30960 [ reuse-pipeline ] triggered by Bot. Commit:

PR_Github #30960 [ reuse-pipeline ] completed with state
Summary by CodeRabbit
New Features
* Batch job submission controlled by the `DISAGG_BATCH_SIZE` environment variable (default: 5).
* New CLI option `--disagg-batch-size` for custom batch configuration.

Bug Fixes
Documentation
Tests
✏️ Tip: You can customize this high-level summary in your review settings.
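The resolution order noted in the review (CLI option → `DISAGG_BATCH_SIZE` env var → default of 5) can be sketched as a small helper. The function name is hypothetical; only the option name, env var name, and default come from this PR.

```python
import os

def resolve_batch_size(cli_value=None, env=None, default=5):
    """Resolve the batch size using the chain CLI > env var > default.

    An unparsable DISAGG_BATCH_SIZE value falls through to the default,
    matching the graceful handling the review describes.
    """
    if env is None:
        env = os.environ
    if cli_value is not None:
        return int(cli_value)
    raw = env.get("DISAGG_BATCH_SIZE")
    if raw is not None:
        try:
            return int(raw)
        except ValueError:
            pass  # invalid env value: fall back to the default
    return default
```

In a conftest this would typically be fed from `request.config.getoption("--disagg-batch-size")`, with 0 meaning a single unlimited batch.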
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update the tava architecture diagram if there is a significant design change in the PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.