[TRTLLM-9736][feat] AsyncLLM and verl integ #9353
66e39dd to 6e9b967
📝 Walkthrough
Introduces AsyncLLM class extending LLM with async lifecycle management, GPU memory release/resume operations, and collective RPC for multi-worker coordination. Extends Ray executor with async initialization, placement group support, per-worker GPU sharing, and asynchronous collective RPC. Adds OpenAI API endpoints for memory and weight updates, supporting RL/agentic use cases.
Sequence Diagram(s)
sequenceDiagram
participant User
participant AsyncLLM
participant RayExecutor
participant Workers as Ray Workers
User->>AsyncLLM: setup_async()
AsyncLLM->>RayExecutor: init_workers_async()
RayExecutor->>RayExecutor: has_event_loop check
alt Event Loop Exists
RayExecutor->>RayExecutor: defer init
else No Event Loop
RayExecutor->>Workers: initialize workers
end
RayExecutor-->>AsyncLLM: ready
User->>AsyncLLM: release(tags)
AsyncLLM->>RayExecutor: collective_rpc_async("sleep", tags)
RayExecutor->>Workers: collective RPC call
par Async RPC
Workers->>Workers: gc.collect()
Workers->>Workers: torch.cuda.empty_cache()
end
Workers-->>RayExecutor: results
RayExecutor-->>AsyncLLM: complete
User->>AsyncLLM: resume(tags)
AsyncLLM->>RayExecutor: collective_rpc_async("wakeup", tags)
RayExecutor->>Workers: collective RPC call
Workers-->>RayExecutor: results
RayExecutor-->>AsyncLLM: complete
sequenceDiagram
participant Client
participant OpenAIServer
participant AsyncLLM
participant RayExecutor
Client->>OpenAIServer: POST /release_memory
OpenAIServer->>OpenAIServer: parse MemoryUpdateRequest(tags)
OpenAIServer->>AsyncLLM: release(tags)
AsyncLLM->>RayExecutor: collective_rpc_async("sleep", tags)
RayExecutor-->>AsyncLLM: success
AsyncLLM-->>OpenAIServer: complete
OpenAIServer-->>Client: JSONResponse(success)
Client->>OpenAIServer: POST /update_weights
OpenAIServer->>OpenAIServer: parse UpdateWeightsRequest(weights)
OpenAIServer->>AsyncLLM: update_weights(weights)
AsyncLLM->>RayExecutor: collective_rpc_async("update_weights", weights)
RayExecutor->>RayExecutor: await all worker updates
RayExecutor-->>AsyncLLM: results
AsyncLLM-->>OpenAIServer: complete
OpenAIServer-->>Client: JSONResponse(success)
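The release/resume flow in the diagrams can be mimicked with a small asyncio model. This is only a sketch under stated assumptions: `ToyWorker`, `ToyExecutor`, and `ToyAsyncLLM` are hypothetical stand-ins, no Ray or GPU is involved, and the real `collective_rpc_async` signature may differ.

```python
import asyncio
from typing import List, Set


class ToyWorker:
    """Hypothetical stand-in for one Ray worker; records which tags are released."""

    def __init__(self, rank: int) -> None:
        self.rank = rank
        self.released: Set[str] = set()

    async def sleep(self, tags: List[str]) -> int:
        self.released.update(tags)
        return self.rank

    async def wakeup(self, tags: List[str]) -> int:
        self.released.difference_update(tags)
        return self.rank


class ToyExecutor:
    """Hypothetical executor that fans one method call out to every worker."""

    def __init__(self, num_workers: int) -> None:
        self.workers = [ToyWorker(rank) for rank in range(num_workers)]

    async def collective_rpc_async(self, method: str, tags: List[str]) -> List[int]:
        # Start the same call on all workers, then wait for every result.
        calls = [getattr(worker, method)(tags) for worker in self.workers]
        return await asyncio.gather(*calls)


class ToyAsyncLLM:
    """Mirrors the release/resume arrows in the diagrams, nothing more."""

    def __init__(self, num_workers: int = 2) -> None:
        self.executor = ToyExecutor(num_workers)

    async def release(self, tags: List[str]) -> List[int]:
        return await self.executor.collective_rpc_async("sleep", tags)

    async def resume(self, tags: List[str]) -> List[int]:
        return await self.executor.collective_rpc_async("wakeup", tags)


async def demo() -> None:
    llm = ToyAsyncLLM(num_workers=2)
    await llm.release(["kv_cache"])
    print([sorted(w.released) for w in llm.executor.workers])
    await llm.resume(["kv_cache"])
    print([sorted(w.released) for w in llm.executor.workers])


if __name__ == "__main__":
    asyncio.run(demo())
```

The point of the model is the fan-out: one `release` or `resume` call becomes the same RPC on every worker, and the caller resumes only after all workers have answered.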
Estimated code review effort
🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/executor/ray_gpu_worker.py (1)
1-9: Import `gc` to avoid `NameError` in `sleep()`
`gc.collect()` is used in `RayGPUWorker.sleep()` (Line 223) but `gc` is never imported, so this path will raise `NameError` at runtime.
Apply a fix like:
-import importlib
-import os
+import importlib
+import os
+import gc
Also applies to: 223-224
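Why this class of bug survives import time can be shown in isolation. The sketch below is not the worker code: it compiles two hypothetical module sources, one without and one with `import gc`, and shows that the missing name is only noticed when `sleep()` actually runs.

```python
BUGGY_SOURCE = """
def sleep():
    return gc.collect()
"""

# Same function, but the module now imports the name it uses.
FIXED_SOURCE = "import gc\n" + BUGGY_SOURCE


def load_sleep(source: str):
    """Execute module-like source in a fresh namespace and return its sleep()."""
    namespace = {}
    exec(source, namespace)
    return namespace["sleep"]


def call_and_report(source: str) -> str:
    sleep = load_sleep(source)  # Defining the function never fails.
    try:
        sleep()
    except NameError as exc:
        return f"NameError: {exc}"
    return "ok"


if __name__ == "__main__":
    print(call_and_report(BUGGY_SOURCE))
    print(call_and_report(FIXED_SOURCE))
```

Because the lookup happens at call time, a test that exercises the `sleep` path is what catches this, in addition to a linter rule such as Ruff's F821.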
🧹 Nitpick comments (8)
tensorrt_llm/_torch/virtual_memory.py (1)
70-79: Update stale TODO around `MODEL_EXTRA` tag
The comment for `MODEL_EXTRA` still refers to the old `"_no_capture_model_extra"` workaround and a TODO about a Torch `torch.cuda.empty_cache()` crash, but the live value is now `"model_extra"`. This makes the comment misleading.
Consider either removing the commented-out line and TODO or updating the comment to reflect the current behavior and supported Torch versions.
tensorrt_llm/serve/openai_protocol.py (1)
934-939: Document semantics of `tags` and `weights` for new OpenAI protocol types
The new request models are structurally fine, but external clients may not know what to pass:
- `MemoryUpdateRequest.tags`: implicitly expected to match `ExecutorMemoryType` tag values (e.g., `"model"`, `"kv_cache"`).
- `UpdateWeightsRequest.weights`: appears to be a mapping from device UUIDs to serialized IPC handles for weights (base64-encoded pickle blobs per the RLHF utils path).
Consider adding short docstrings or `Field(..., description="...")` metadata on these fields to spell out the expected values and formats so OpenAI-style clients can integrate without reading internal implementation.
tests/unittest/llmapi/test_llm_async.py (1)
13-36: Verify that `generate_async` is actually awaitable in `AsyncLLM`
This test does:
output_before = await llm.generate_async(prompt, sampling_params)
...
output_after = await llm.generate_async(prompt, sampling_params)
In the current `LLM` implementation, `generate_async` is a regular (synchronous) method returning a `RequestOutput`, not a coroutine. The provided `AsyncLLM` snippet only adds async lifecycle/RPC methods and does not show an async override of `generate_async`.
If `AsyncLLM` does not override `generate_async` as `async def`, these `await` expressions will raise a `TypeError` at runtime. Please double-check:
- If `generate_async` remains synchronous, drop the `await` and call it directly inside the async test.
- If you intend an async variant, ensure `AsyncLLM` (or its executor) exposes an awaitable API and that this test is aligned with that contract.
Also applies to: 58-65
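The failure mode described here is easy to reproduce without the library. In this sketch `Result`, `generate_sync`, and `generate_coro` are hypothetical; whether the real `RequestOutput` is awaitable has to be checked against the actual class, since an object that defines `__await__` can be awaited even when the method returning it is a plain `def`.

```python
import asyncio
import inspect


class Result:
    """Hypothetical plain result object with no __await__ method."""

    def __init__(self, text: str) -> None:
        self.text = text


def generate_sync(prompt: str) -> Result:
    return Result(prompt.upper())


async def generate_coro(prompt: str) -> Result:
    return Result(prompt.upper())


async def await_plain_object():
    """Return the error message produced by awaiting a non-awaitable value."""
    try:
        await generate_sync("hi")
    except TypeError as exc:
        return str(exc)
    return None


async def call_flexibly(fn, prompt: str) -> Result:
    """Await the return value only when it is actually awaitable."""
    out = fn(prompt)
    if inspect.isawaitable(out):
        out = await out
    return out


if __name__ == "__main__":
    print(asyncio.run(await_plain_object()))
    print(asyncio.run(call_flexibly(generate_sync, "hi")).text)
    print(asyncio.run(call_flexibly(generate_coro, "hi")).text)
```

A quick `inspect.isawaitable(llm.generate_async(...))` check in the test would settle which of the two cases applies.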
tests/unittest/_torch/ray_orchestrator/multi_gpu/test_executor.py (1)
79-88: Commented-out test cases and markers reduce clarity and coverage.
The commented `@pytest.mark.gpu4` and the `(4, [2, 3])` test case suggest incomplete work. Either enable these or remove them to avoid confusion. Leaving commented code in tests can mislead reviewers about intended coverage.
tensorrt_llm/_torch/async_llm.py (2)
4-4: Unused import: `ExecutorMemoryType`.
`ExecutorMemoryType` is imported but not referenced in the code. Consider removing it or using it to document/validate the `tags` parameter.
 from ..llmapi.llm import LLM
-from .virtual_memory import ExecutorMemoryType
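If the import is kept rather than removed, one way to use it is to validate `tags` before issuing the RPC. The sketch below assumes a string-valued enum; `MemoryTag` is a hypothetical stand-in for `ExecutorMemoryType`, populated with the tag strings quoted elsewhere in this review, and `validate_tags` is not an existing helper.

```python
from enum import Enum
from typing import List


class MemoryTag(str, Enum):
    """Hypothetical stand-in for ExecutorMemoryType (values taken from this review)."""

    SAMPLER = "sampler"
    DRAFTER = "drafter"
    GUIDED_DECODER = "guided_decoder"
    SPEC_RESOURCE_MANAGER = "spec_resource_manager"
    MODEL_EXTRA = "model_extra"
    EXECUTOR_EXTRA = "executor_extra"
    KV_CACHE = "kv_cache"
    MODEL = "model"
    DRAFT_MODEL = "draft_model"


def validate_tags(tags: List[str]) -> List[MemoryTag]:
    """Reject unknown tag strings early, with a message listing the valid ones."""
    valid = {tag.value for tag in MemoryTag}
    unknown = [tag for tag in tags if tag not in valid]
    if unknown:
        raise ValueError(
            f"Unknown memory tags {unknown}; expected a subset of {sorted(valid)}")
    return [MemoryTag(tag) for tag in tags]


if __name__ == "__main__":
    print(validate_tags(["model", "kv_cache"]))
```

Failing in the caller keeps a typo such as `"kvcache"` from turning into a silent no-op or a remote error on every worker.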
23-37: Document valid tag values for `release` and `resume`.
The `tags` parameter accepts a list of strings but valid values aren't documented. Consider referencing `ExecutorMemoryType` enum values in the docstring to help users.
 async def release(self, tags: list[str]):
     """Release the GPU memory used by the LLM asynchronously.

     Args:
-        tags: List of memory tag strings to release (e.g., ["model", "kv_cache"]).
+        tags: List of memory tag strings to release. Valid values include:
+            "sampler", "drafter", "guided_decoder", "spec_resource_manager",
+            "model_extra", "executor_extra", "kv_cache", "model", "draft_model".
     """
tensorrt_llm/executor/ray_executor.py (2)
356-359: Use explicit `Optional` type annotation.
The parameter `worker_kwargs: Dict = None` should use `Optional[Dict]` for clarity and to satisfy type checkers.
 def _get_placement_group(
         self,
         tp_size: int,
-        worker_kwargs: Dict = None) -> Tuple[Any, List[int]]:
+        worker_kwargs: Optional[Dict] = None) -> Tuple[Any, List[int]]:
388-396: Add `strict=True` to `zip()` for safety.
Using `zip()` without `strict=True` can silently truncate if the iterables have different lengths, though the validator should prevent this. Adding `strict=True` provides defense in depth.
 flat_pgs = []
 flat_indices = []
 for pg, indices in zip(llm_args.placement_groups,
-                       llm_args.placement_bundle_indices):
+                       llm_args.placement_bundle_indices,
+                       strict=True):
     for idx in indices:
         flat_pgs.append(pg)
         flat_indices.append(idx)
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (13)
- tensorrt_llm/__init__.py (2 hunks)
- tensorrt_llm/_torch/async_llm.py (1 hunks)
- tensorrt_llm/_torch/virtual_memory.py (1 hunks)
- tensorrt_llm/executor/ray_executor.py (8 hunks)
- tensorrt_llm/executor/ray_gpu_worker.py (1 hunks)
- tensorrt_llm/llmapi/__init__.py (2 hunks)
- tensorrt_llm/llmapi/llm.py (2 hunks)
- tensorrt_llm/llmapi/llm_args.py (5 hunks)
- tensorrt_llm/llmapi/rlhf_utils.py (2 hunks)
- tensorrt_llm/serve/openai_protocol.py (2 hunks)
- tensorrt_llm/serve/openai_server.py (4 hunks)
- tests/unittest/_torch/ray_orchestrator/multi_gpu/test_executor.py (2 hunks)
- tests/unittest/llmapi/test_llm_async.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use `from package.subpackage import foo` and then `foo.SomeClass()` instead of `from package.subpackage.foo import SomeClass`)
Python filenames should use snake_case (e.g., `some_file.py`)
Python class names should use PascalCase (e.g., `class SomeClass`)
Python function and method names should use snake_case (e.g., `def my_awesome_function():`)
Python local variable names should use snake_case, with prefix `k` for variable names that start with a number (e.g., `k_99th_percentile = ...`)
Python global variables should use upper snake_case with prefix `G` (e.g., `G_MY_GLOBAL = ...`)
Python constants should use upper snake_case (e.g., `MY_CONSTANT = ...`)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., `self.x = 5` followed by `"""<type>: Description of 'x'"""`)
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic
Files:
- tensorrt_llm/llmapi/__init__.py
- tensorrt_llm/serve/openai_protocol.py
- tensorrt_llm/__init__.py
- tensorrt_llm/_torch/virtual_memory.py
- tensorrt_llm/llmapi/rlhf_utils.py
- tensorrt_llm/_torch/async_llm.py
- tensorrt_llm/executor/ray_executor.py
- tensorrt_llm/executor/ray_gpu_worker.py
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/serve/openai_server.py
- tensorrt_llm/llmapi/llm_args.py
- tests/unittest/_torch/ray_orchestrator/multi_gpu/test_executor.py
- tests/unittest/llmapi/test_llm_async.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top
Files:
- tensorrt_llm/llmapi/__init__.py
- tensorrt_llm/serve/openai_protocol.py
- tensorrt_llm/__init__.py
- tensorrt_llm/_torch/virtual_memory.py
- tensorrt_llm/llmapi/rlhf_utils.py
- tensorrt_llm/_torch/async_llm.py
- tensorrt_llm/executor/ray_executor.py
- tensorrt_llm/executor/ray_gpu_worker.py
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/serve/openai_server.py
- tensorrt_llm/llmapi/llm_args.py
- tests/unittest/_torch/ray_orchestrator/multi_gpu/test_executor.py
- tests/unittest/llmapi/test_llm_async.py
🧠 Learnings (7)
📓 Common learnings
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Learnt from: venkywonka
Repo: NVIDIA/TensorRT-LLM PR: 6029
File: .github/pull_request_template.md:45-53
Timestamp: 2025-08-27T17:50:13.264Z
Learning: For PR templates in TensorRT-LLM, avoid suggesting changes that would increase developer overhead, such as converting plain bullets to mandatory checkboxes. The team prefers guidance-style bullets that don't require explicit interaction to reduce friction in the PR creation process.
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7520
File: tensorrt_llm/_torch/pyexecutor/resource_manager.py:605-613
Timestamp: 2025-09-24T03:31:28.908Z
Learning: In TensorRT-LLM Ray orchestrator mode, ProcessGroups are initialized with both Gloo and NCCL backends (e.g., "cuda:nccl,cpu:gloo"), allowing PyTorch distributed to automatically route CPU tensors through Gloo and GPU tensors through NCCL. This eliminates the need for manual device placement when performing allreduce operations on base types.
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Weight-Decomposed Low-Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
tensorrt_llm/_torch/virtual_memory.py
📚 Learning: 2025-09-24T03:31:28.908Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 7520
File: tensorrt_llm/_torch/pyexecutor/resource_manager.py:605-613
Timestamp: 2025-09-24T03:31:28.908Z
Learning: In TensorRT-LLM Ray orchestrator mode, ProcessGroups are initialized with both Gloo and NCCL backends (e.g., "cuda:nccl,cpu:gloo"), allowing PyTorch distributed to automatically route CPU tensors through Gloo and GPU tensors through NCCL. This eliminates the need for manual device placement when performing allreduce operations on base types.
Applied to files:
- tensorrt_llm/executor/ray_executor.py
- tensorrt_llm/llmapi/llm.py
- tensorrt_llm/llmapi/llm_args.py
📚 Learning: 2025-07-17T09:01:27.402Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 5616
File: tensorrt_llm/executor/worker.py:375-384
Timestamp: 2025-07-17T09:01:27.402Z
Learning: In tensorrt_llm/executor/worker.py, the LoRA adapter cache optimization logic that checks `is_adapter_in_cpu_cache()` and conditionally passes None for weights/config has a known race condition issue that cannot be solved with simple error handling or verification checks. This is a known limitation that requires a more comprehensive solution.
Applied to files:
tensorrt_llm/executor/ray_gpu_worker.py
📚 Learning: 2025-08-06T13:58:07.506Z
Learnt from: galagam
Repo: NVIDIA/TensorRT-LLM PR: 6487
File: tests/unittest/_torch/auto_deploy/unit/singlegpu/test_ad_trtllm_bench.py:1-12
Timestamp: 2025-08-06T13:58:07.506Z
Learning: In TensorRT-LLM, test files (files under tests/ directories) do not require NVIDIA copyright headers, unlike production source code files. Test files typically start directly with imports, docstrings, or code.
Applied to files:
tensorrt_llm/llmapi/llm.py
📚 Learning: 2025-09-23T15:12:38.312Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: cpp/tensorrt_llm/thop/allreduceOp.cpp:352-446
Timestamp: 2025-09-23T15:12:38.312Z
Learning: In TensorRT-LLM NCCL device implementation, NCCL version 2.28+ requirements are handled at runtime in the nccl_device/config layer rather than with compile-time guards. This allows the allreduceOp to remain version-agnostic and delegates version compatibility validation to the appropriate lower-level components that can gracefully handle unsupported configurations.
Applied to files:
tensorrt_llm/llmapi/llm.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which can contain default `cuda_graph_config` values, so `llm_args` may already have this config before the extra options processing.
Applied to files:
tensorrt_llm/llmapi/llm_args.py
🧬 Code graph analysis (9)
tensorrt_llm/llmapi/__init__.py (1)
- tensorrt_llm/_torch/async_llm.py (1): AsyncLLM (7-68)
tensorrt_llm/__init__.py (2)
- tensorrt_llm/llmapi/llm.py (1): LLM (1104-1120)
- tensorrt_llm/_torch/async_llm.py (1): AsyncLLM (7-68)
tensorrt_llm/llmapi/rlhf_utils.py (1)
- tensorrt_llm/serialization.py (1): loads (168-184)
tensorrt_llm/_torch/async_llm.py (3)
- tensorrt_llm/llmapi/llm.py (1): LLM (1104-1120)
- tensorrt_llm/_torch/virtual_memory.py (1): ExecutorMemoryType (70-82)
- tensorrt_llm/executor/ray_executor.py (3): init_workers_async (167-172), collective_rpc (190-210), collective_rpc_async (213-224)
tensorrt_llm/executor/ray_executor.py (1)
- tensorrt_llm/executor/utils.py (1): has_event_loop (68-73)
tensorrt_llm/llmapi/llm.py (2)
- tests/integration/defs/conftest.py (1): get_device_count (1988-1990)
- tensorrt_llm/llmapi/utils.py (1): get_device_count (133-134)
tensorrt_llm/serve/openai_server.py (3)
- tensorrt_llm/serve/openai_protocol.py (3): to_llm_disaggregated_params (967-977), MemoryUpdateRequest (934-935), UpdateWeightsRequest (938-939)
- tensorrt_llm/llmapi/rlhf_utils.py (1): update_weights (33-72)
- tensorrt_llm/llmapi/llm.py (1): _collective_rpc (1017-1042)
tests/unittest/_torch/ray_orchestrator/multi_gpu/test_executor.py (1)
- tensorrt_llm/llmapi/llm.py (1): _collective_rpc (1017-1042)
tests/unittest/llmapi/test_llm_async.py (3)
- tensorrt_llm/_torch/async_llm.py (4): AsyncLLM (7-68), setup_async (19-21), release (23-29), resume (31-37)
- tensorrt_llm/_torch/virtual_memory.py (1): ExecutorMemoryType (70-82)
- tests/unittest/utils/util.py (1): get_current_process_gpu_memory (543-567)
🪛 Ruff (0.14.5)
tensorrt_llm/llmapi/rlhf_utils.py
57-57: pickle and modules that wrap it can be unsafe when used to deserialize untrusted data, possible security issue
(S301)
tensorrt_llm/executor/ray_executor.py
119-119: Use raise without specifying exception name
Remove exception name
(TRY201)
165-165: Avoid specifying long messages outside the exception class
(TRY003)
172-172: Avoid specifying long messages outside the exception class
(TRY003)
359-359: PEP 484 prohibits implicit Optional
Convert to T | None
(RUF013)
380-382: Avoid specifying long messages outside the exception class
(TRY003)
390-391: zip() without an explicit strict= parameter
Add explicit value for parameter strict=
(B905)
tensorrt_llm/executor/ray_gpu_worker.py
223-223: Undefined name gc
(F821)
tensorrt_llm/serve/openai_server.py
998-998: Unused method argument: request
(ARG002)
1014-1014: Unused method argument: request
(ARG002)
1031-1031: f-string without any placeholders
Remove extraneous f prefix
(F541)
tensorrt_llm/llmapi/llm_args.py
3035-3037: Avoid specifying long messages outside the exception class
(TRY003)
3040-3042: Avoid specifying long messages outside the exception class
(TRY003)
3046-3049: Avoid specifying long messages outside the exception class
(TRY003)
3053-3055: Avoid specifying long messages outside the exception class
(TRY003)
3061-3063: Avoid specifying long messages outside the exception class
(TRY003)
🔇 Additional comments (19)
tensorrt_llm/llmapi/__init__.py (1)
6-7: AsyncLLM export wiring looks consistent
Importing `AsyncLLM` here and adding it to `__all__` correctly exposes it as `tensorrt_llm.llmapi.AsyncLLM`, matching the top-level re-export in `tensorrt_llm/__init__.py`. No issues.
Also applies to: 25-28
tensorrt_llm/llmapi/llm.py (1)
192-199: Conditional GPU count check with `RAY_LOCAL_WORLD_SIZE` looks reasonable
Gating the per-node GPU availability check on `os.getenv("RAY_LOCAL_WORLD_SIZE") is None` avoids rejecting Ray orchestrator setups that intentionally subdivide physical GPUs while still protecting non-Ray multi-GPU runs. The error message retains the total `world_size`, which is acceptable for now.
tensorrt_llm/__init__.py (1)
87-88: Top-level `AsyncLLM` exposure is consistent with llmapi
Re-exporting `AsyncLLM` alongside `LLM` and updating `__all__` makes the new async interface discoverable via `tensorrt_llm.AsyncLLM` without affecting existing imports. Looks good.
Also applies to: 99-140
tensorrt_llm/llmapi/rlhf_utils.py (1)
6-7: I'm unable to access the repository due to a persistent clone failure. To verify this review comment, I need to examine the actual code. Could you provide:
- The current docstring for the `update_weights` method (around lines 41-47 in `tensorrt_llm/llmapi/rlhf_utils.py`)
- The current implementation that uses `pickle.loads` and `base64.b64decode` (around lines 56-65)
- Confirmation of whether `tensorrt_llm.serialization.loads` or similar safer deserialization utilities exist in your codebase
Alternatively, if you can confirm repository access is available in your environment, I can proceed with verification once the clone succeeds.
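For context on what a safer deserialization path can look like, the Python `pickle` documentation describes restricting globals by overriding `Unpickler.find_class`. The sketch below applies that idea to a base64-encoded payload. It is an illustration only: the allow-list, `encode_payload`, and `decode_payload` are assumptions, it does not describe what `tensorrt_llm.serialization.loads` does, and real IPC handles would likely need a larger allow-list.

```python
import base64
import builtins
import io
import pickle

# Assumed allow-list: only plain container and scalar types may be resolved.
SAFE_BUILTINS = {"dict", "list", "tuple", "set", "frozenset",
                 "str", "bytes", "int", "float", "bool"}


class RestrictedUnpickler(pickle.Unpickler):
    """Unpickler that refuses to resolve any global outside the allow-list."""

    def find_class(self, module, name):
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def encode_payload(obj) -> str:
    """Pickle an object and wrap it in base64 text, as a client might."""
    return base64.b64encode(pickle.dumps(obj)).decode("ascii")


def decode_payload(blob_b64: str):
    """Decode base64 text and unpickle it with the restricted unpickler."""
    raw = base64.b64decode(blob_b64)
    return RestrictedUnpickler(io.BytesIO(raw)).load()


if __name__ == "__main__":
    payload = {"GPU-0000": [b"handle-a", b"handle-b"]}
    print(decode_payload(encode_payload(payload)) == payload)
    try:
        decode_payload(encode_payload(eval))  # A pickled reference to builtins.eval.
    except pickle.UnpicklingError as exc:
        print(f"rejected: {exc}")
```

An allow-list narrows what a crafted payload can reach, but it is not a substitute for authenticating whoever is allowed to call a weight-update endpoint.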
tests/unittest/_torch/ray_orchestrator/multi_gpu/test_executor.py (3)
93-93: Verify `tp_size` calculation for edge cases.
`tp_size = n_gpus // 2` yields `tp_size=1` when `n_gpus=2`. If a test case with `n_gpus=4` and `bundle_indices=[2,3]` is later enabled, `tp_size` becomes 2, which should match `len(bundle_indices)`. Ensure this invariant holds for all intended configurations.
29-76: LGTM for the env vars-based placement test.
The test properly initializes Ray, creates a placement group, wraps LLM as a Ray actor with appropriate scheduling, and verifies device UUIDs. The try/finally cleanup is correct.
101-118: LGTM for the API-based placement test.
The test exercises the new `placement_groups` and `placement_bundle_indices` parameters directly, verifying correct GPU placement via `_collective_rpc`. Cleanup is handled properly.
tensorrt_llm/serve/openai_server.py (1)
260-268: Route registrations look correct.
The new endpoints are properly registered for both standard and MM-encoder configurations.
Also applies to: 305-313
tensorrt_llm/_torch/async_llm.py (2)
12-17: LGTM for AsyncLLM initialization.
The constructor correctly enforces Ray orchestrator and sets a sensible default for `ray_worker_extension_cls`.
48-67: LGTM for async collective_rpc implementation.
Clean delegation to `_executor.collective_rpc_async` with proper async/await pattern.
tensorrt_llm/llmapi/llm_args.py (4)
2704-2722: LGTM for new placement configuration fields.
The new fields are well-documented with appropriate prototype status. The `exclude_from_json=True` on `placement_groups` correctly prevents serialization of Ray-specific objects.
3029-3065: LGTM for placement configuration validation.
The validator comprehensively checks:
- Ray orchestrator requirement
- Co-presence of `placement_groups` and `placement_bundle_indices`
- Length matching
- `per_worker_gpu_share` bounds
- `PlacementGroup` type validation when Ray is available
The static analysis warnings about long exception messages (TRY003) are acceptable here as they provide clear, actionable error messages.
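The first four checks in that list can be sketched as a standalone function. Everything concrete here is an assumption made for illustration: the function name, the `"ray"` orchestrator string, and the `(0, 1]` range for `per_worker_gpu_share` are not taken from the PR, and the `PlacementGroup` type check is left out.

```python
from typing import Optional, Sequence


def validate_placement(
        orchestrator_type: Optional[str],
        placement_groups: Optional[Sequence[object]],
        placement_bundle_indices: Optional[Sequence[Sequence[int]]],
        per_worker_gpu_share: Optional[float]) -> None:
    """Raise ValueError on an inconsistent placement configuration."""
    uses_placement = (placement_groups is not None
                      or placement_bundle_indices is not None)
    # Check 1: placement options only make sense with the Ray orchestrator.
    if uses_placement and orchestrator_type != "ray":
        raise ValueError("placement options require the Ray orchestrator")
    # Check 2: both fields must be given together or not at all.
    if (placement_groups is None) != (placement_bundle_indices is None):
        raise ValueError("placement_groups and placement_bundle_indices "
                         "must be provided together")
    # Check 3: one list of bundle indices per placement group.
    if placement_groups is not None and len(placement_groups) != len(
            placement_bundle_indices):
        raise ValueError("placement_groups and placement_bundle_indices "
                         "must have the same length")
    # Check 4: assumed bounds, a share is a fraction of one GPU.
    if per_worker_gpu_share is not None and not (
            0.0 < per_worker_gpu_share <= 1.0):
        raise ValueError("per_worker_gpu_share must be in (0, 1]")


if __name__ == "__main__":
    validate_placement("ray", ["pg0"], [[0, 1]], 0.5)
    print("valid configuration accepted")
```

Keeping each check as its own `raise` with a specific message is what triggers TRY003, and also what makes the errors actionable for users.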
23-26: LGTM for guarded PlacementGroup import.
Properly handles the case when Ray is not installed by setting `PlacementGroup = None`.
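The guarded-import pattern, and how later code has to account for the `None` fallback, looks roughly like this. `check_placement_groups` is a hypothetical helper; the import path `ray.util.placement_group.PlacementGroup` is Ray's public location for the class.

```python
try:
    from ray.util.placement_group import PlacementGroup
except ImportError:
    # Ray is treated as optional: remember that the type is unavailable.
    PlacementGroup = None


def check_placement_groups(groups) -> bool:
    """Type-check groups when Ray is installed; report whether a check ran."""
    if PlacementGroup is None:
        return False  # Nothing to check against without Ray.
    for group in groups:
        if not isinstance(group, PlacementGroup):
            raise TypeError(
                f"expected PlacementGroup, got {type(group).__name__}")
    return True


if __name__ == "__main__":
    print("Ray available:", PlacementGroup is not None)
    print("type check ran:", check_placement_groups([]))
```

Every use of the name must test for `None` first, since `isinstance(x, None)` raises `TypeError` rather than returning `False`.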
1998-1999: The semantic concern about conflating `tensor_parallel_size` with `gpus_per_node` is valid and warrants further investigation.
From available documentation and first-principles analysis:
- `RAY_LOCAL_WORLD_SIZE` indicates the number of worker processes on a single node
- `gpus_per_node` should represent actual physical GPU count per node
- `tensor_parallel_size` is a parallelism strategy configuration, not a hardware specification
These are distinct concepts and should not be automatically mapped. However, without access to the full function context, comments explaining intent, and related test cases, I cannot definitively determine if this is intentional (e.g., a workaround for specific Ray scheduling patterns) or a bug.
The code's behavior should be validated by checking:
- Why this override exists—whether it's a known Ray integration pattern
- Whether tests verify the GPU allocation is correct for multi-node setups
- Comments or documentation explaining the design decision
tensorrt_llm/executor/ray_executor.py (5)
160-172: LGTM for sync/async worker initialization split.
Clean separation of sync and async initialization paths with proper error handling.
212-224: LGTM for `collective_rpc_async` implementation.
Properly delegates to `collective_rpc` with `non_block=True` and awaits the gathered results.
82-95: LGTM for deferred worker initialization.
Storing `worker_kwargs` and conditionally initializing workers based on `has_event_loop()` enables async initialization when running within an event loop (e.g., from `AsyncLLM.setup_async`).
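The deferral logic can be illustrated with the standard library alone. In this sketch `has_running_loop` is a guess at the kind of check `has_event_loop()` performs (the actual helper in `tensorrt_llm/executor/utils.py` is not shown in this review), and `LazyExecutor` is a hypothetical class that only flips a flag instead of creating workers.

```python
import asyncio


def has_running_loop() -> bool:
    """Return True when called from code running inside an event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError:
        return False
    return True


class LazyExecutor:
    """Hypothetical executor: eager init in sync code, deferred init in async code."""

    def __init__(self) -> None:
        self.workers_ready = False
        if not has_running_loop():
            self._init_workers()  # Safe to do blocking setup here.

    def _init_workers(self) -> None:
        self.workers_ready = True

    async def init_workers_async(self) -> None:
        # Called later from the running loop to finish the deferred setup.
        if not self.workers_ready:
            self._init_workers()


async def construct_inside_loop():
    executor = LazyExecutor()
    deferred = not executor.workers_ready
    await executor.init_workers_async()
    return deferred, executor.workers_ready


if __name__ == "__main__":
    print(LazyExecutor().workers_ready)
    print(asyncio.run(construct_inside_loop()))
```

Deferring matters because blocking setup performed in a constructor would stall every other task on a loop that is already running.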
139-158: LGTM for enhanced worker creation with placement group support.
The logic correctly handles both external placement groups (as a list) and internally created ones (single PG). Per-rank assignment to the appropriate PG and bundle index is correct.
372-396: LGTM for external placement group handling.
Properly validates total workers against world_size and flattens the placement groups and bundle indices for per-worker assignment.
Signed-off-by: Yuan Tong <[email protected]>
Signed-off-by: Erin Ho <[email protected]>
Signed-off-by: Erin Ho <[email protected]>
This reverts commit b51aac0 Signed-off-by: Yuan Tong <[email protected]>
Signed-off-by: Erin Ho <[email protected]> update test list
Signed-off-by: Erin Ho <[email protected]>
PR_Github #27571 [ run ] completed with state
/bot run --disable-fail-fast
PR_Github #27674 [ run ] triggered by Bot. Commit:
PR_Github #27674 [ run ] completed with state
/bot run
PR_Github #27728 [ run ] triggered by Bot. Commit:
PR_Github #27728 [ run ] completed with state
Signed-off-by: Liwei Ma <[email protected]> Signed-off-by: Yuan Tong <[email protected]> Signed-off-by: Superjomn <[email protected]> Signed-off-by: Erin Ho <[email protected]> Co-authored-by: Liwei Ma <[email protected]> Co-authored-by: Yuan Tong <[email protected]> Co-authored-by: Superjomn <[email protected]>
…e fix has been merged via #9353) (#9655) Signed-off-by: Yuan Tong <[email protected]>
…e fix has been merged via NVIDIA#9353) (NVIDIA#9655) Signed-off-by: Yuan Tong <[email protected]>
…e fix has been merged via NVIDIA#9353) (NVIDIA#9655) Signed-off-by: Yuan Tong <[email protected]> Signed-off-by: lkomali <[email protected]>
Summary by CodeRabbit
Release Notes
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`
Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message.
See details below for each supported subcommand.
Details
run
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`
Launch build/test pipelines. All previously running jobs will be killed.
- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.
For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.
kill
`kill`
Kill all running builds associated with pull request.
skip
`skip --comment COMMENT`
Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
reuse-pipeline
`reuse-pipeline`
Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.