remove spurious cpu->gpu and gpu->cpu transfers #123
base: feat/ad-2025-07-22
Conversation
Signed-off-by: Suyog Gupta <[email protected]>
Pull Request Overview
This PR removes spurious CPU-to-GPU and GPU-to-CPU tensor transfers to improve performance. The changes implement optimizations by allocating tensors on the GPU device from the start, using pinned memory for efficient transfers, and avoiding unnecessary device copies.
- Pre-allocates GPU tensors in SequenceInfo to avoid repeated tensor creation overhead
- Replaces CPU-GPU round trips with direct GPU operations and pinned-memory copies (see the sketch after this list)
- Updates executor interfaces to pass tensors instead of converting to lists
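
A minimal sketch of the pre-allocation and pinned-memory pattern described in the bullets above, under stated assumptions: the class, field, and method names (`SeqInfoSketch`, `input_ids_host`, `nest_sequences`) are placeholders and not the actual `SequenceInfo` API in attention_interface.py.

```python
import torch


class SeqInfoSketch:
    """Placeholder illustrating pre-allocated device tensors + pinned host staging."""

    def __init__(self, max_batch_size: int, max_num_tokens: int, device: str = "cuda"):
        self.device = torch.device(device)
        # Allocate device tensors once, up front, instead of re-creating them
        # (and paying an implicit CPU->GPU transfer) on every scheduling step.
        self.input_ids = torch.zeros(max_num_tokens, dtype=torch.int32, device=self.device)
        self.input_pos = torch.zeros(max_batch_size, dtype=torch.int32, device=self.device)
        # Pinned (page-locked) host staging buffer: required for the
        # host-to-device copy below to be truly asynchronous.
        self.input_ids_host = torch.zeros(max_num_tokens, dtype=torch.int32, pin_memory=True)

    def nest_sequences(self, flat_ids: torch.Tensor) -> None:
        """Write a new flattened batch of token ids into the persistent GPU buffer."""
        n = flat_ids.numel()
        # Stage on the pinned host buffer, then copy into the persistent
        # device tensor without blocking the host.
        self.input_ids_host[:n].copy_(flat_ids)
        self.input_ids[:n].copy_(self.input_ids_host[:n], non_blocking=True)
```

Pinned host memory is what lets `copy_(..., non_blocking=True)` overlap the transfer with other host work instead of serializing on it.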
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tensorrt_llm/_torch/pyexecutor/py_executor.py | Minor logging update for pipeline executor tracing |
| tensorrt_llm/_torch/pyexecutor/config.py | Changes default attention backend from TRTLLM to FLASHINFER |
| tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py | Comment clarification for example sequence setting |
| tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py | Major refactoring to avoid GPU-CPU transfers and improve tensor handling |
| tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py | Extensive optimization of SequenceInfo class with GPU tensor pre-allocation and pinned memory |
| tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py | Uses non-blocking copies for input buffer transfers (sketched below) |
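
For the torch_cudagraph.py row above, here is a rough, hypothetical sketch of copying new inputs into a static CUDA-graph input buffer with `non_blocking=True`. The names `static_input`, `static_output`, and `replay_step` are illustrative only; the real backend captures the compiled model rather than this toy multiply, and production code typically runs warm-up iterations before capture.

```python
import torch

# Static buffers captured by the graph must be overwritten in place each step.
static_input = torch.zeros(8, 128, device="cuda")

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = static_input * 2  # stand-in for the captured model forward


def replay_step(new_input: torch.Tensor) -> torch.Tensor:
    # Overwrite the captured buffer in place; with a pinned-CPU or GPU source,
    # non_blocking avoids stalling the host before the replay is queued.
    static_input.copy_(new_input, non_blocking=True)
    graph.replay()
    return static_output
```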
```python
    max_num_tokens=max_num_tokens,
    device=device,
)
print(" in seq_info for device: ", torch.cuda.current_device())
```
Copilot AI commented on Aug 1, 2025:
Debug print statement should be removed or replaced with proper logging for production code.
Suggested change:
```diff
-print(" in seq_info for device: ", torch.cuda.current_device())
+ad_logger.info(f"In seq_info for device: {torch.cuda.current_device()}")
```
```python
_num_pages: int = 1

def __post_init__(self):
    print("in __post_init__ device: ", self.device)
```
Copilot AI commented on Aug 1, 2025:
Debug print statement should be removed or replaced with proper logging for production code.
```python
    self.input_pos_host[:bs].copy_(seq_len, non_blocking=True)
else:
    self.input_pos[:bs] += seq_len.to(self.device)
    self.input_pos_host[:bs] += seq_len.to(self.device)
```
Copilot AI commented on Aug 1, 2025:
This operation moves `seq_len` to the device before adding it to the host tensor, which defeats the purpose of keeping the calculation on the host. Consider converting `seq_len` to CPU first: `self.input_pos_host[:bs] += seq_len.cpu()`
Suggested change:
```diff
-self.input_pos_host[:bs] += seq_len.to(self.device)
+self.input_pos_host[:bs] += seq_len.cpu()
```
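
One possible shape of the fixed branch, combining the diff above with the suggestion, is sketched below. The standalone function and argument names stand in for `SequenceInfo` fields and the real method signature likely differs; this is an illustration, not the PR's actual code.

```python
import torch


def update_pos(
    input_pos: torch.Tensor,       # persistent GPU tensor
    input_pos_host: torch.Tensor,  # pinned CPU mirror of input_pos
    seq_len: torch.Tensor,         # per-sequence lengths (CPU tensor assumed)
    bs: int,                       # active batch size
    reset: bool = False,
) -> None:
    if reset:
        # Overwrite both copies; with a pinned-memory (or GPU) source, the
        # device write can proceed without blocking the host.
        input_pos[:bs].copy_(seq_len[:bs], non_blocking=True)
        input_pos_host[:bs].copy_(seq_len[:bs])
    else:
        # Increment the device tensor on the device, as in the diff ...
        input_pos[:bs] += seq_len[:bs].to(input_pos.device)
        # ... but keep the host mirror's arithmetic on the host: .cpu() is a
        # no-op for CPU tensors, so no extra GPU round trip is introduced.
        input_pos_host[:bs] += seq_len[:bs].cpu()
```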
Signed-off-by: Suyog Gupta <[email protected]>
…st comprehension perf improvement Signed-off-by: Gal Hubara Agam <[email protected]>
```diff
 si = self.cache_seq_interface.info
-si.update_pos(input_pos, reset=True)
+# skip calling _update_position_ids() here, as it will be called in nest_sequences
+si.update_pos(input_pos, reset=True, update_position_ids=False)
```
Maybe it's better not to call `update_pos` here at all and instead introduce a different method that does what `update_pos(update_position_ids=False)` does? As-is, it is a bit confusing to call `update_pos` without updating the position ids.
@galagam
Updating the position ids requires both the input positions and the sequence lengths, so it makes sense to update it whenever either is updated, but it's a bit wasteful.
A possible alternative would be to require the user to call it explicitly.
That is:
```python
si.update_input_pos()   # rename update_pos
si.nest_sequences()
si.update_position_ids()
```
In any case, due to my recent changes the runtime of `update_position_ids` decreased by 30x, so this specific optimization is not as critical as I initially believed. I'll run a more exhaustive check and consider keeping this optimization out of this PR for code simplicity.
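
For reference, a rough sketch of the explicit-update alternative described above, under stated assumptions: the class name, method names, and the derived-position formula are placeholders, not the actual `SequenceInfo` implementation (which also has to handle prefill).

```python
import torch


class SequenceStateSketch:
    """Hypothetical state object where derived position ids are refreshed explicitly."""

    def __init__(self, max_batch_size: int, device: str = "cuda"):
        self.device = torch.device(device)
        self.input_pos = torch.zeros(max_batch_size, dtype=torch.int32, device=self.device)
        self.seq_len = torch.zeros(max_batch_size, dtype=torch.int32, device=self.device)
        self.position_ids = torch.zeros(max_batch_size, dtype=torch.int32, device=self.device)

    def update_input_pos(self, input_pos: torch.Tensor) -> None:
        # Only refreshes the input positions; no derived state is recomputed.
        bs = input_pos.numel()
        self.input_pos[:bs].copy_(input_pos, non_blocking=True)

    def nest_sequences(self, seq_len: torch.Tensor) -> None:
        # Only refreshes the sequence lengths; again, no derived state.
        bs = seq_len.numel()
        self.seq_len[:bs].copy_(seq_len, non_blocking=True)

    def update_position_ids(self) -> None:
        # Derived state is recomputed once, explicitly, after both inputs are
        # set (stand-in formula: position of the last token of each sequence).
        torch.add(self.input_pos, self.seq_len - 1, out=self.position_ids)


# Caller is responsible for the final explicit refresh:
si = SequenceStateSketch(max_batch_size=8)
si.update_input_pos(torch.tensor([0, 0], dtype=torch.int32))
si.nest_sequences(torch.tensor([5, 3], dtype=torch.int32))
si.update_position_ids()
```

This trades one extra call at every call site for never recomputing position ids more than once per step, which is exactly the trade-off discussed above.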
@suyoggupta
Signed-off-by: Gal Hubara Agam <[email protected]>
@coderabbitai summary
