remove spurious cpu->gpu and gpu->cpu transfers #123
base: feat/ad-2025-07-22
Conversation
Signed-off-by: Suyog Gupta <[email protected]>
Pull Request Overview
This PR removes spurious CPU-to-GPU and GPU-to-CPU tensor transfers to improve performance. The changes implement optimizations by allocating tensors on the GPU device from the start, using pinned memory for efficient transfers, and avoiding unnecessary device copies.
- Pre-allocates GPU tensors in SequenceInfo to avoid repeated tensor creation overhead
- Replaces CPU-GPU round trips with direct GPU operations and pinned-memory copies (see the sketch after this list)
- Updates executor interfaces to pass tensors instead of converting to lists
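
A minimal sketch of the pre-allocation and pinned-memory pattern described in the bullets above, under stated assumptions: the class, field, and method names (`SeqInfoSketch`, `input_ids_host`, `nest_sequences`) are placeholders and not the actual `SequenceInfo` API in attention_interface.py.

```python
import torch


class SeqInfoSketch:
    """Placeholder illustrating pre-allocated device tensors + pinned host staging."""

    def __init__(self, max_batch_size: int, max_num_tokens: int, device: str = "cuda"):
        self.device = torch.device(device)
        # Allocate device tensors once, up front, instead of re-creating them
        # (and paying an implicit CPU->GPU transfer) on every scheduling step.
        self.input_ids = torch.zeros(max_num_tokens, dtype=torch.int32, device=self.device)
        self.input_pos = torch.zeros(max_batch_size, dtype=torch.int32, device=self.device)
        # Pinned (page-locked) host staging buffer: required for the
        # host-to-device copy below to be truly asynchronous.
        self.input_ids_host = torch.zeros(max_num_tokens, dtype=torch.int32, pin_memory=True)

    def nest_sequences(self, flat_ids: torch.Tensor) -> None:
        """Write a new flattened batch of token ids into the persistent GPU buffer."""
        n = flat_ids.numel()
        # Stage on the pinned host buffer, then copy into the persistent
        # device tensor without blocking the host.
        self.input_ids_host[:n].copy_(flat_ids)
        self.input_ids[:n].copy_(self.input_ids_host[:n], non_blocking=True)
```

Pinned host memory is what lets `copy_(..., non_blocking=True)` overlap the transfer with other host work instead of serializing on it.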
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| tensorrt_llm/_torch/pyexecutor/py_executor.py | Minor logging update for pipeline executor tracing |
| tensorrt_llm/_torch/pyexecutor/config.py | Changes default attention backend from TRTLLM to FLASHINFER |
| tensorrt_llm/_torch/auto_deploy/transform/library/export_to_gm.py | Comment clarification for example sequence setting |
| tensorrt_llm/_torch/auto_deploy/shim/ad_executor.py | Major refactoring to avoid GPU-CPU transfers and improve tensor handling |
| tensorrt_llm/_torch/auto_deploy/custom_ops/attention_interface.py | Extensive optimization of SequenceInfo class with GPU tensor pre-allocation and pinned memory |
| tensorrt_llm/_torch/auto_deploy/compile/backends/torch_cudagraph.py | Uses non-blocking copies for input buffer transfers (sketched below) |
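
For the torch_cudagraph.py row above, here is a rough, hypothetical sketch of copying new inputs into a static CUDA-graph input buffer with `non_blocking=True`. The names `static_input`, `static_output`, and `replay_step` are illustrative only; the real backend captures the compiled model rather than this toy multiply, and production code typically runs warm-up iterations before capture.

```python
import torch

# Static buffers captured by the graph must be overwritten in place each step.
static_input = torch.zeros(8, 128, device="cuda")

graph = torch.cuda.CUDAGraph()
with torch.cuda.graph(graph):
    static_output = static_input * 2  # stand-in for the captured model forward


def replay_step(new_input: torch.Tensor) -> torch.Tensor:
    # Overwrite the captured buffer in place; with a pinned-CPU or GPU source,
    # non_blocking avoids stalling the host before the replay is queued.
    static_input.copy_(new_input, non_blocking=True)
    graph.replay()
    return static_output
```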
```python
    max_num_tokens=max_num_tokens,
    device=device,
)
print(" in seq_info for device: ", torch.cuda.current_device())
```
Copilot AI commented on Aug 1, 2025:
Debug print statement should be removed or replaced with proper logging for production code.
Suggested change:
```diff
-print(" in seq_info for device: ", torch.cuda.current_device())
+ad_logger.info(f"In seq_info for device: {torch.cuda.current_device()}")
```
```python
_num_pages: int = 1

def __post_init__(self):
    print("in __post_init__ device: ", self.device)
```
Copilot AI commented on Aug 1, 2025:
Debug print statement should be removed or replaced with proper logging for production code.
```python
    self.input_pos_host[:bs].copy_(seq_len, non_blocking=True)
else:
    self.input_pos[:bs] += seq_len.to(self.device)
    self.input_pos_host[:bs] += seq_len.to(self.device)
```
Copilot AI commented on Aug 1, 2025:
This operation moves `seq_len` to the device before adding it to the host tensor, which defeats the purpose of keeping the calculation on the host. Consider converting `seq_len` to CPU first: `self.input_pos_host[:bs] += seq_len.cpu()`
Suggested change:
```diff
-self.input_pos_host[:bs] += seq_len.to(self.device)
+self.input_pos_host[:bs] += seq_len.cpu()
```
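
One possible shape of the fixed branch, combining the diff above with the suggestion, is sketched below. The standalone function and argument names stand in for `SequenceInfo` fields and the real method signature likely differs; this is an illustration, not the PR's actual code.

```python
import torch


def update_pos(
    input_pos: torch.Tensor,       # persistent GPU tensor
    input_pos_host: torch.Tensor,  # pinned CPU mirror of input_pos
    seq_len: torch.Tensor,         # per-sequence lengths (CPU tensor assumed)
    bs: int,                       # active batch size
    reset: bool = False,
) -> None:
    if reset:
        # Overwrite both copies; with a pinned-memory (or GPU) source, the
        # device write can proceed without blocking the host.
        input_pos[:bs].copy_(seq_len[:bs], non_blocking=True)
        input_pos_host[:bs].copy_(seq_len[:bs])
    else:
        # Increment the device tensor on the device, as in the diff ...
        input_pos[:bs] += seq_len[:bs].to(input_pos.device)
        # ... but keep the host mirror's arithmetic on the host: .cpu() is a
        # no-op for CPU tensors, so no extra GPU round trip is introduced.
        input_pos_host[:bs] += seq_len[:bs].cpu()
```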
Signed-off-by: Suyog Gupta <[email protected]>
…st comprehension perf improvement Signed-off-by: Gal Hubara Agam <[email protected]>
```diff
 si = self.cache_seq_interface.info
-si.update_pos(input_pos, reset=True)
+# skip calling _update_position_ids() here, as it will be called in nest_sequences
+si.update_pos(input_pos, reset=True, update_position_ids=False)
```
Maybe it's better not to call `update_pos` here at all and instead introduce a different method that does what `update_pos(update_position_ids=False)` does? As-is, it is a bit confusing to call `update_pos` without updating the position ids.
@galagam
Updating the position ids requires both the input positions and the sequence lengths, so it makes sense to update it whenever either is updated, but it's a bit wasteful.
A possible alternative would be to require the user to call it explicitly.
That is:
```python
si.update_input_pos()   # rename update_pos
si.nest_sequences()
si.update_position_ids()
```
In any case, due to my recent changes the runtime of `update_position_ids` decreased by 30x, so this specific optimization is not as critical as I initially believed. I'll run a more exhaustive check and consider keeping this optimization out of this PR for code simplicity.
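
For reference, a rough sketch of the explicit-update alternative described above, under stated assumptions: the class name, method names, and the derived-position formula are placeholders, not the actual `SequenceInfo` implementation (which also has to handle prefill).

```python
import torch


class SequenceStateSketch:
    """Hypothetical state object where derived position ids are refreshed explicitly."""

    def __init__(self, max_batch_size: int, device: str = "cuda"):
        self.device = torch.device(device)
        self.input_pos = torch.zeros(max_batch_size, dtype=torch.int32, device=self.device)
        self.seq_len = torch.zeros(max_batch_size, dtype=torch.int32, device=self.device)
        self.position_ids = torch.zeros(max_batch_size, dtype=torch.int32, device=self.device)

    def update_input_pos(self, input_pos: torch.Tensor) -> None:
        # Only refreshes the input positions; no derived state is recomputed.
        bs = input_pos.numel()
        self.input_pos[:bs].copy_(input_pos, non_blocking=True)

    def nest_sequences(self, seq_len: torch.Tensor) -> None:
        # Only refreshes the sequence lengths; again, no derived state.
        bs = seq_len.numel()
        self.seq_len[:bs].copy_(seq_len, non_blocking=True)

    def update_position_ids(self) -> None:
        # Derived state is recomputed once, explicitly, after both inputs are
        # set (stand-in formula: position of the last token of each sequence).
        torch.add(self.input_pos, self.seq_len - 1, out=self.position_ids)


# Caller is responsible for the final explicit refresh:
si = SequenceStateSketch(max_batch_size=8)
si.update_input_pos(torch.tensor([0, 0], dtype=torch.int32))
si.nest_sequences(torch.tensor([5, 3], dtype=torch.int32))
si.update_position_ids()
```

This trades one extra call at every call site for never recomputing position ids more than once per step, which is exactly the trade-off discussed above.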
@suyoggupta
Signed-off-by: Gal Hubara Agam <[email protected]>
@coderabbitai summary
