plan() huge overhead: tricks to reduce?
#1743
Unanswered
tanishqkumar
asked this question in
Q&A
Replies: 1 comment 3 replies
plan() can be fully executed on GPU. For Blackwell ops we have already started using GPU-based plan functions: https://github.com/flashinfer-ai/flashinfer/blob/main/include/flashinfer/attention/blackwell/plan.cuh. We will work on that for earlier architectures if necessary.
Hi,

I'm noticing that `.plan()` has huge overhead (~0.6 ms) in my runs of `BatchPrefillWithPagedKVCacheWrapper`. For reference, my Llama-3.2-1B forward pass takes 1.5 ms, so this is an enormous overhead. I am using a custom mask and doing multi-query (append) decoding for a new speculative decoding variant I am working on, with ~20 queries per forward pass and BS=1 sequences. I am passing in device tensors, but that shouldn't matter, since I would have to move them to host myself (incurring more overhead) if I chose to pass host tensors into the wrapper.

Anyway, since `plan()` mostly does CPU/h2d work, I was wondering whether two wrappers could be "double buffered" across a loop of prefill/append operations, with `plan()` for iteration N running on a separate CUDA stream from `.run()` for iteration N-1, since the former is CPU/h2d intensive and the latter is presumably compute intensive. I also know vLLM/SGLang use their custom `fast_plan_decode` functions to speed up planning, though I'm not sure I fully understand what those do.

Any advice on reducing the huge `plan()` overhead?
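To make the double-buffering idea concrete, here is a minimal sketch of the control flow being proposed. It uses a mock wrapper class rather than flashinfer's real `BatchPrefillWithPagedKVCacheWrapper` (so `FakeWrapper`, `plan`, and `run` here are placeholders, not the actual API), and it omits CUDA streams; in real code, `run()` for iteration N-1 would be launched asynchronously on its stream before the host calls `plan()` for iteration N, so the CPU-side planning overlaps with GPU compute.

```python
# Sketch of "double-buffered" planning: while iteration N-1's kernel runs
# on the GPU, the CPU prepares the plan for iteration N on a second wrapper.
# FakeWrapper stands in for an attention wrapper with a CPU-heavy plan()
# and a GPU-heavy run(); it only checks the plan/run pairing is correct.

class FakeWrapper:
    def __init__(self):
        self.planned_for = None  # which iteration this wrapper is planned for

    def plan(self, iteration):
        # In the real API this is the CPU/h2d-heavy scheduling work.
        self.planned_for = iteration

    def run(self, iteration):
        # In the real API this launches the attention kernel.
        assert self.planned_for == iteration, "run() before matching plan()"
        return f"output[{iteration}]"

def double_buffered_loop(num_iters):
    wrappers = [FakeWrapper(), FakeWrapper()]
    outputs = []
    wrappers[0].plan(0)  # prime the pipeline with the first plan
    for n in range(num_iters):
        cur = wrappers[n % 2]
        nxt = wrappers[(n + 1) % 2]
        if n + 1 < num_iters:
            # Plan the next iteration on the *other* wrapper. In real code,
            # launch cur.run() on its CUDA stream first, then do this on the
            # host so planning overlaps with the in-flight kernel.
            nxt.plan(n + 1)
        outputs.append(cur.run(n))
    return outputs
```

The key design point is that each wrapper owns its own workspace buffers, so planning for iteration N never clobbers the metadata the in-flight kernel for iteration N-1 is still reading.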