Sunny-bot1
Collaborator

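Changes in this PR

  1. Fuse get_max_len and get_kv_max_len (3.6 us + 1.8 us -> 3.6 us)
  2. Fuse the device-to-host copies of max_len_tensor_cpu and max_len_kv_cpu by storing max_len_kv_cpu in max_len_tensor_cpu[8] (36 us -> 18 us)
  3. Optimize the split_q_block kernel (21 us -> 3 us)
  4. Remove some redundant branches and memset calls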
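
For readers unfamiliar with the idea behind items 1 and 2, here is a minimal host-side sketch in Python, not the actual CUDA code. It assumes both kernels reduce per-request lengths to a maximum, and it uses plain lists as stand-ins for device and host buffers. The helpers `fused_max_len` and `copy_to_host`, and the choice of slot 0 for the query-side maximum, are hypothetical; only the use of index 8 for the KV maximum comes from the description above.

```python
# Hypothetical sketch: plain Python lists stand in for device and host buffers.
KV_SLOT = 8  # per the PR description, max_len_kv lives at max_len_tensor_cpu[8]
NUM_SLOTS = KV_SLOT + 1


def fused_max_len(seq_lens, kv_lens):
    """Compute both maxima in one pass and pack them into one buffer.

    Slot 0 holding the query-side maximum is an assumption made for this
    sketch; only slot 8 is stated in the PR description.
    """
    if len(seq_lens) != len(kv_lens):
        raise ValueError("expected one query length and one KV length per request")
    packed = [0] * NUM_SLOTS
    for q_len, kv_len in zip(seq_lens, kv_lens):
        packed[0] = max(packed[0], q_len)
        packed[KV_SLOT] = max(packed[KV_SLOT], kv_len)
    return packed


def copy_to_host(device_buffer, transfer_log):
    """Stand-in for a device-to-host copy; records each transfer."""
    transfer_log.append(len(device_buffer))
    return list(device_buffer)


transfers = []
max_len_tensor_cpu = copy_to_host(fused_max_len([3, 17, 5], [40, 12, 96]), transfers)
max_len_kv = max_len_tensor_cpu[KV_SLOT]
```

Because both results share one buffer, a single transfer replaces two, which is consistent with the copy time roughly halving (36 us -> 18 us) if the cost is dominated by per-copy overhead rather than payload size.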

TODO

  1. Bring this path into CUDA graph
  2. Further fuse the kernels and DtoH copies
  3. Optimize the MLA preprocessing kernels

paddle-bot bot commented Sep 29, 2025

Thanks for your contribution!
