plan() huge overhead: tricks to reduce?
#1743
Unanswered
tanishqkumar
asked this question in
Q&A
Replies: 1 comment 3 replies
plan() can be fully executed on GPU. For Blackwell ops we have already started using GPU-based plan functions: https://github.com/flashinfer-ai/flashinfer/blob/main/include/flashinfer/attention/blackwell/plan.cuh. We will work on that for earlier architectures if necessary.
Hi,

I'm noticing that `.plan()` has huge overhead (~0.6 ms) in my runs of `BatchPrefillWithPagedKVCacheWrapper`. For reference, my Llama-3.2-1B forward pass takes 1.5 ms, so this is an enormous overhead. I am using a custom mask and doing multi-query (append) decoding for a new speculative decoding variant I am working on, with ~20 queries per forward pass and BS=1 sequences. I am passing in device tensors, but that shouldn't matter, since I would have to move them to host myself (incurring more overhead) if I chose to pass host tensors into the wrapper.

Anyway, since `plan()` mostly does CPU/h2d work, I was wondering whether two wrappers could be "double buffered" across a loop of prefill/append operations, with `plan()` for iteration N running on a separate CUDA stream from `.run()` for iteration N-1, since the former is CPU/h2d intensive and the latter is presumably compute intensive. I also know vLLM/SGLang use their custom `fast_plan_decode` functions to speed up planning, though I'm not sure I fully understand what those do.

Any advice on reducing the huge `plan()` overhead?
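To make the double-buffering idea concrete, here is a minimal sketch of the control flow being proposed. It uses a mock wrapper class rather than flashinfer's real `BatchPrefillWithPagedKVCacheWrapper` (so `FakeWrapper`, `plan`, and `run` here are placeholders, not the actual API), and it omits CUDA streams; in real code, `run()` for iteration N-1 would be launched asynchronously on its stream before the host calls `plan()` for iteration N, so the CPU-side planning overlaps with GPU compute.

```python
# Sketch of "double-buffered" planning: while iteration N-1's kernel runs
# on the GPU, the CPU prepares the plan for iteration N on a second wrapper.
# FakeWrapper stands in for an attention wrapper with a CPU-heavy plan()
# and a GPU-heavy run(); it only checks the plan/run pairing is correct.

class FakeWrapper:
    def __init__(self):
        self.planned_for = None  # which iteration this wrapper is planned for

    def plan(self, iteration):
        # In the real API this is the CPU/h2d-heavy scheduling work.
        self.planned_for = iteration

    def run(self, iteration):
        # In the real API this launches the attention kernel.
        assert self.planned_for == iteration, "run() before matching plan()"
        return f"output[{iteration}]"

def double_buffered_loop(num_iters):
    wrappers = [FakeWrapper(), FakeWrapper()]
    outputs = []
    wrappers[0].plan(0)  # prime the pipeline with the first plan
    for n in range(num_iters):
        cur = wrappers[n % 2]
        nxt = wrappers[(n + 1) % 2]
        if n + 1 < num_iters:
            # Plan the next iteration on the *other* wrapper. In real code,
            # launch cur.run() on its CUDA stream first, then do this on the
            # host so planning overlaps with the in-flight kernel.
            nxt.plan(n + 1)
        outputs.append(cur.run(n))
    return outputs
```

The key design point is that each wrapper owns its own workspace buffers, so planning for iteration N never clobbers the metadata the in-flight kernel for iteration N-1 is still reading.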