[Optimization] Optimize get_block_shape_and_split_kv_block #4317


Open

Sunny-bot1 wants to merge 2 commits into PaddlePaddle:develop from Sunny-bot1:attn-block

Collaborator

Sunny-bot1 commented Sep 29, 2025

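PR changes

Fuse get_max_len and get_kv_max_len into one kernel (3.6 µs + 1.8 µs -> 3.6 µs)
Fuse the device-to-CPU copies of max_len_tensor_cpu and max_len_kv_cpu into a single copy by storing the max_len_kv_cpu value in max_len_tensor_cpu[8] (36 µs -> 18 µs)
Optimize the split_q_block kernel (21 µs -> 3 µs)
Remove some redundant branches and memset calls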
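
The first two changes can be illustrated with a small host-side sketch. This is not the PR's code: it is plain C++ standing in for the GPU kernel, and the names `fused_max_lens`, `PackedMaxLens`, `kQueryMaxSlot`, and the input arrays are hypothetical. Only the use of index 8 for the KV maximum comes from the description above; which slot the query-side maximum occupies is an assumption.

```cpp
#include <algorithm>
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical packed layout. Only kKvMaxSlot == 8 is taken from the PR text;
// the buffer size of 9 and the query slot index are assumptions for illustration.
constexpr std::size_t kPackedSize = 9;
constexpr std::size_t kQueryMaxSlot = 0;  // assumed slot
constexpr std::size_t kKvMaxSlot = 8;     // max_len_tensor_cpu[8] in the PR

using PackedMaxLens = std::array<int, kPackedSize>;

// Single pass over both length arrays, standing in for one fused kernel
// instead of two separate max reductions.
PackedMaxLens fused_max_lens(const std::vector<int>& seq_lens_q,
                             const std::vector<int>& seq_lens_kv) {
  PackedMaxLens packed{};  // value-initialized: every slot starts at 0
  const std::size_t n = std::min(seq_lens_q.size(), seq_lens_kv.size());
  for (std::size_t i = 0; i < n; ++i) {
    packed[kQueryMaxSlot] = std::max(packed[kQueryMaxSlot], seq_lens_q[i]);
    packed[kKvMaxSlot] = std::max(packed[kKvMaxSlot], seq_lens_kv[i]);
  }
  return packed;
}
```

Because both results live in one buffer, a single device-to-host transfer of that buffer can replace two separate ones, which is the likely source of the copy-time reduction reported above.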

TODO

Make the op capturable in a CUDA graph
Further fuse the kernels and DtoH copies
Optimize the MLA preprocessing kernels


          optimize get_block_shape_and_split_kv_block

9d861a8

paddle-bot bot commented Sep 29, 2025

Thanks for your contribution!


          delete max_len_kv_cpu

07a9c52

