Conversation

@suyoggupta suyoggupta commented Aug 1, 2025

@coderabbitai summary
[Screenshot attached: 2025-07-31 at 11:29 PM]

Signed-off-by: Suyog Gupta <[email protected]>
Copilot AI review requested due to automatic review settings August 1, 2025 06:11

Copilot AI left a comment

Pull Request Overview

This PR removes spurious CPU-to-GPU and GPU-to-CPU tensor transfers to improve performance. The changes allocate tensors on the GPU from the start, use pinned memory for efficient host-device transfers, and avoid unnecessary device copies.

  • Pre-allocates GPU tensors in SequenceInfo to avoid repeated tensor creation overhead
  • Replaces CPU-GPU round trips with direct GPU operations and pinned memory copies
  • Updates executor interfaces to pass tensors instead of converting to lists
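
The following is a minimal, illustrative sketch of that pattern, not the actual TensorRT-LLM code; the class and attribute names (`SequenceInfoSketch`, `input_pos`, `input_pos_host`, `max_batch_size`) are assumed for the example. The idea is to allocate the device tensor once, keep a pinned host mirror, and stage per-step updates through a non-blocking host-to-device copy.

```python
import torch


class SequenceInfoSketch:
    """Illustrative only: pre-allocated GPU tensor with a pinned host mirror."""

    def __init__(self, max_batch_size: int, device: str = "cuda"):
        self.device = torch.device(device)
        # Allocate the device tensor once instead of re-creating it every step.
        self.input_pos = torch.zeros(max_batch_size, dtype=torch.int32, device=self.device)
        # Pinned (page-locked) host memory enables fast, async H2D copies.
        self.input_pos_host = torch.zeros(max_batch_size, dtype=torch.int32, pin_memory=True)

    def advance(self, seq_len_host: torch.Tensor) -> None:
        bs = seq_len_host.numel()
        # Do the bookkeeping on the host mirror...
        self.input_pos_host[:bs] += seq_len_host
        # ...then push it to the GPU without forcing a synchronization.
        self.input_pos[:bs].copy_(self.input_pos_host[:bs], non_blocking=True)
```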

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.

Summary per file:

  • tensorrt_llm/_torch/pyexecutor/py_executor.py: minor logging update for pipeline executor tracing
  • tensorrt_llm/_torch/pyexecutor/config.py: changes the default attention backend from TRTLLM to FLASHINFER
  • tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py: comment clarification for example sequence setting
  • tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py: major refactoring to avoid GPU-CPU transfers and improve tensor handling
  • tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py: extensive optimization of the SequenceInfo class with GPU tensor pre-allocation and pinned memory
  • tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py: uses a non-blocking copy for input buffer transfers
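
For the last entry, here is a hedged sketch of the general CUDA-graph input-staging pattern (the buffer and function names below are hypothetical, not the actual torch_cudagraph.py code): a captured graph replays against fixed tensor addresses, so fresh inputs are copied into pre-allocated static buffers, and making that copy non-blocking avoids a host synchronization on every step when the source is in pinned host memory.

```python
import torch


def stage_graph_inputs(static_buffers: list, new_inputs: list) -> None:
    """Illustrative: copy fresh inputs into the fixed buffers a CUDA graph was captured with."""
    for buf, inp in zip(static_buffers, new_inputs):
        # The graph replays against buf's address, so overwrite its contents in
        # place; non_blocking=True lets a pinned-host -> device copy proceed
        # asynchronously on the current stream.
        buf.copy_(inp, non_blocking=True)
```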

        max_num_tokens=max_num_tokens,
        device=device,
    )
    print(" in seq_info for device: ", torch.cuda.current_device())

Copilot AI Aug 1, 2025

Debug print statement should be removed or replaced with proper logging for production code.

Suggested change:
- print(" in seq_info for device: ", torch.cuda.current_device())
+ ad_logger.info(f"In seq_info for device: {torch.cuda.current_device()}")

_num_pages: int = 1

def __post_init__(self):
    print("in __post_init__ device: ", self.device)

Copilot AI Aug 1, 2025

Debug print statement should be removed or replaced with proper logging for production code.

    self.input_pos_host[:bs].copy_(seq_len, non_blocking=True)
else:
    self.input_pos[:bs] += seq_len.to(self.device)
    self.input_pos_host[:bs] += seq_len.to(self.device)

Copilot AI Aug 1, 2025

This operation moves seq_len to device before adding to host tensor, which defeats the purpose of keeping calculations on host. Consider converting seq_len to CPU first: self.input_pos_host[:bs] += seq_len.cpu()

Suggested change:
- self.input_pos_host[:bs] += seq_len.to(self.device)
+ self.input_pos_host[:bs] += seq_len.cpu()
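
To make the point of the suggestion concrete, here is a small sketch (tensor names and dtypes are assumed from the diff context above, not verified against the full file): the device tensor is updated with a device-resident `seq_len`, while the pinned host mirror is updated with a CPU copy, so neither update triggers an extra cross-device transfer.

```python
import torch

# Assumed setup, mirroring the diff context above (illustrative only).
input_pos = torch.zeros(8, dtype=torch.int32, device="cuda")          # device tensor
input_pos_host = torch.zeros(8, dtype=torch.int32, pin_memory=True)   # pinned host mirror
seq_len = torch.ones(4, dtype=torch.int32)                            # per-request lengths on CPU

bs = seq_len.numel()
input_pos[:bs] += seq_len.to(input_pos.device)  # device-side update stays on the GPU
input_pos_host[:bs] += seq_len.cpu()            # host-side update stays on the CPU
```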

Signed-off-by: Suyog Gupta <[email protected]>
@suyoggupta suyoggupta requested a review from galagam August 1, 2025 06:29
galagam added 3 commits August 4, 2025 05:06
…st comprehension perf improvement

Signed-off-by: Gal Hubara Agam <[email protected]>
  si = self.cache_seq_interface.info
- si.update_pos(input_pos, reset=True)
+ # skip calling _update_position_ids() here, as it will be called in nest_sequences
+ si.update_pos(input_pos, reset=True, update_position_ids=False)
@suyoggupta (Author)

Maybe it's better to not call update_pos here at all and introduce a different method that does what update_pos(update_position_ids=False) does? As-is, it is a bit confusing to call update_pos without updating position ids.
@galagam

Updating the position ids requires both the input positions and the sequence lengths, so it makes sense to update them whenever either is updated, but it's a bit wasteful.
A possible alternative would be to require the user to call it explicitly. That is:

si.update_input_pos()  #  rename update_pos
si.nest_sequences()
si.update_position_ids()

In any case, due to my recent changes, the run time of update_position_ids decreased by 30x, so this specific optimization is not as critical as I initially believed. I'll run a more exhaustive check and consider keeping this optimization out of this PR for code simplicity.
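
As a hedged illustration of the explicit-call alternative above (only the call sequence comes from the comment; the helper below and its signature are hypothetical): the position ids for a flattened batch can be recomputed once from the input positions and the per-request sequence lengths.

```python
import torch


def compute_position_ids(input_pos: torch.Tensor, seq_lens: torch.Tensor) -> torch.Tensor:
    """Illustrative: position ids for a flattened batch of sequences.

    input_pos[i] is the position where request i resumes; seq_lens[i] is the
    number of new tokens it contributes in this step.
    """
    return torch.cat(
        [pos + torch.arange(n, device=input_pos.device)
         for pos, n in zip(input_pos.tolist(), seq_lens.tolist())]
    )

# Example: input_pos=[3, 0], seq_lens=[2, 3] -> positions [3, 4, 0, 1, 2]
```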