[Cherry-Pick][Feature] support decode attention for mix (#7688) #7729
lizhenyun01 wants to merge 13 commits into PaddlePaddle:release/2.6
Conversation
Thanks for your contribution!
CI report generated from the code below (updated every 30 minutes):
1. Task overview
2. Task status summary
   2.1 Required tasks: 4/10 passed
   2.2 Optional tasks: 22/26 passed
3. Failure details (required only)
   Approval — workflow approval (confidence: high)
   Root-cause details:
   Key logs:
   Fix suggestions:
   Fix summary: contact the 4 categories of RDs to complete GitHub Review approval. Link: view logs
PaddlePaddle-bot left a comment
🤖 Paddle-CI-Agent | pr_review | 2026-05-11 12:04:45
📋 Review Summary
PR overview: adds C16 (FP16/BF16 KV cache) and static C8 (INT8 KV cache) decode attention support, enabled via the environment variable USE_DECODE_ATTENTION=1 (flash_attn must also be enabled).
Change scope: custom_ops/gpu_ops/append_attention/, layers/attention/, worker/gpu_model_runner.py, spec_decode/mtp.py
Impact tags: [OP] [Feature] [Speculative Decoding]
📝 PR Convention Check
The PR title follows the Cherry-Pick convention ([Cherry-Pick][Feature] ... (#7688)), but the ## Modifications, ## Usage or Command, and ## Accuracy Tests sections are all empty (placeholder comments only) and no Checklist item is checked, so the description is non-compliant.
Suggested PR description (ready to copy):
## Motivation
Adds C16 (FP16/BF16 KV cache) and static C8 (INT8 KV cache) decode attention support, enabled via `export USE_DECODE_ATTENTION=1` when flash_attn is on, to reduce memory-access pressure and improve decode-phase throughput.
## Modifications
- Add the `custom_ops/gpu_ops/append_attention/` directory with the following new CUDA implementations:
  - `decode_append_attention_c16_impl.cuh`: decode attention kernel for FP16/BF16 KV cache
  - `decode_append_attention_c8_impl.cuh`: decode attention kernel for static INT8 KV cache
  - `attention_func.cuh`, `mma_tensor_op.cuh`, `utils.cuh`, `mem_util.cuh`: shared CUDA utilities
  - `config_for_attention.cu`: new `config_for_attention` op (including `GetMaxLenKernel`; a hypothetical sketch of such a reduction follows this list)
- Update `custom_ops/setup_ops.py` to add the new CUDA source files to the build list
- Add the `USE_DECODE_ATTENTION` environment variable to `fastdeploy/envs.py`
- Update the Python bindings under `fastdeploy/model_executor/layers/attention/ops/` (`config_for_attention.py`, `decode_append_attention.py`, `decoder_write_cache_with_rope.py`)
- Update `append_attn_backend.py` and `flash_attn_backend.py` to integrate the new decode attention path
- Update `gpu_model_runner.py`, `metax_model_runner.py`, `input_batch.py`, and `spec_decode/mtp.py`
- Add tests: `tests/operators/attention/test_decode_append_attention_c16.py`, `benchmark_decode_attention.py`
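The PR ships the real `GetMaxLenKernel` in `config_for_attention.cu`; as a rough illustration of the idea, here is a minimal, hypothetical sketch (names and layout assumed, not taken from the PR) of a batch-max reduction over per-request sequence lengths:

```cuda
// Hypothetical sketch of a GetMaxLenKernel-style reduction: one block scans
// the per-request decoder sequence lengths and writes the batch maximum,
// which the decode attention launch can then use to size its work.
// This is NOT the PR's implementation; names and layout are assumed.
#include <cuda_runtime.h>

__global__ void get_max_len_kernel(const int* __restrict__ seq_lens,
                                   int* __restrict__ max_len,
                                   int bsz) {
  __shared__ int smem[256];  // assumes blockDim.x == 256
  int local_max = 0;
  // Grid-stride loop over the batch: each thread keeps a running maximum.
  for (int i = threadIdx.x; i < bsz; i += blockDim.x) {
    local_max = max(local_max, seq_lens[i]);
  }
  smem[threadIdx.x] = local_max;
  __syncthreads();
  // Tree reduction in shared memory down to smem[0].
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s) {
      smem[threadIdx.x] = max(smem[threadIdx.x], smem[threadIdx.x + s]);
    }
    __syncthreads();
  }
  if (threadIdx.x == 0) {
    *max_len = smem[0];
  }
}

// Launched as: get_max_len_kernel<<<1, 256>>>(d_seq_lens, d_max_len, bsz);
```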
## Usage or Command
```bash
# With flash_attn enabled, turn on decode attention via the environment variable
export USE_DECODE_ATTENTION=1
```
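For context, here is a minimal host-side sketch (the function name and dispatch structure are illustrative assumptions, not the PR's actual code) of how such an environment-variable gate is typically consulted when selecting the attention path:

```cuda
// Hypothetical sketch: read the USE_DECODE_ATTENTION gate on the host side
// and fall back to the existing path when it is unset. Illustrative only.
#include <cstdlib>
#include <cstring>

static bool use_decode_attention() {
  const char* v = std::getenv("USE_DECODE_ATTENTION");
  return v != nullptr && std::strcmp(v, "1") == 0;
}

// e.g. in the backend:
//   if (flash_attn_enabled && use_decode_attention())
//     launch_decode_append_attention(...);   // new decode path
//   else
//     launch_append_attention(...);          // existing path
```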
## Accuracy Tests
N/A
## Checklist
- [x] Add at least a tag in the PR title.
- Tag list: [`[FDConfig]`,`[APIServer]`,`[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- You can add new tags based on the PR content, but the semantics must be clear.
- [ ] Format your code, run `pre-commit` before commit.
- [x] Add unit tests. Please write the reason in this PR if no unit tests.
- [ ] Provide accuracy results.
- [x] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.

Issues
| Level | File | Summary |
|---|---|---|
| 📝 PR convention | — | Modifications / Usage or Command / Accuracy Tests sections are empty; no Checklist item is checked |
| ❓ Question | custom_ops/gpu_ops/append_attention/cu_tensor_map.cuh:22 | File is added to the repo but not referenced by any kernel (the corresponding include in the c8 impl is commented out, and the c16 impl does not include it); it contains SM90+-only TMA APIs |
| ❓ Question | custom_ops/gpu_ops/append_attention/decode_append_attention_c8_impl.cuh:15 | `// #include "cu_tensor_map.cuh"` is a commented-out line; please resolve it explicitly |
| 🟡 Suggestion | fastdeploy/worker/gpu_model_runner.py | gpu_model_runner.py is updated and metax_model_runner.py is synced, but dcu_model_runner.py and iluvatar_model_runner.py are not in the change list — please confirm whether other GPU-adjacent hardware needs this feature |
Overall assessment
The overall approach of the PR is clear, the C16/C8 decode attention kernels are fully implemented, and test files are included. Main points to confirm: whether cu_tensor_map.cuh is a leftover file (the TMA path is not yet wired in), and whether other GPU-adjacent hardware runners need matching updates.
```cuda
#include <cuda/barrier>
#include <stdexcept>

using barrier = cuda::barrier<cuda::thread_scope_block>;
```
❓ Question: cu_tensor_map.cuh has been added to the repo but is not referenced by any kernel — the corresponding `#include "cu_tensor_map.cuh"` on line 15 of decode_append_attention_c8_impl.cuh is commented out, and decode_append_attention_c16_impl.cuh does not include this header either.
The `cuda::device::experimental` namespace (Hopper TMA API) and `CUtensorMapDataType` in this file are SM90+-only capabilities. Please confirm:
- If the file is reserved for a later TMA optimization, mark it with a `TODO: SM90+` comment;
- If it is unused for now, remove it from this PR and merge it when it is actually needed, to avoid a stray include causing compile failures later.
```cuda
// See the License for the specific language governing permissions and
// limitations under the License.
#pragma once
#include "utils.cuh"
```
❓ Question: `// #include "cu_tensor_map.cuh"` is commented out. If the C8 kernel ultimately does not need TMA, delete this commented-out line to keep the code clean; if it will be used later, add an SM90 gate (`#if __CUDA_ARCH__ >= 900`), as in the sketch below.
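A minimal sketch of such a gate, assuming the include is wired in later (the macro and function names here are illustrative, not from the PR):

```cuda
// Hypothetical SM90 gate. Note that __CUDA_ARCH__ is defined only during the
// device compilation pass, so the guard must also tolerate the host pass,
// where the macro is undefined.
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 900)
// TODO: SM90+ — enable the Hopper TMA helpers once they are wired in.
// #include "cu_tensor_map.cuh"
#define DECODE_ATTN_HAS_TMA 1
#else
#define DECODE_ATTN_HAS_TMA 0
#endif

template <typename T>
__device__ void load_kv_tile(const T* __restrict__ src, T* dst_smem, int n) {
#if DECODE_ATTN_HAS_TMA
  // SM90+ path: a TMA bulk copy would replace the loop below.
#endif
  // Portable fallback: plain per-thread global->shared copy on all archs.
  for (int i = threadIdx.x; i < n; i += blockDim.x) {
    dst_smem[i] = src[i];
  }
}
```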
Codecov Report
❌ Patch coverage is …

Additional details and impacted files:

```
@@           Coverage Diff            @@
##           release/2.6    #7729   +/-   ##
==============================================
  Coverage             ?   71.82%
==============================================
  Files                ?      381
  Lines                ?    54014
  Branches             ?     8444
==============================================
  Hits                 ?    38795
  Misses               ?    12445
  Partials             ?     2774
```

Flags with carried forward coverage won't be shown.
☔ View full report in Codecov by Sentry.
Motivation
C16 / static C8 attention support. Usage: with flash_attn enabled, `export USE_DECODE_ATTENTION=1`.
Modifications
Usage or Command
Accuracy Tests
Checklist
- [ ] Add at least a tag in the PR title.
  - Tag list: [`[FDConfig]`, `[APIServer]`, `[Engine]`, `[Scheduler]`, `[PD Disaggregation]`, `[Executor]`, `[Graph Optimization]`, `[Speculative Decoding]`, `[RL]`, `[Models]`, `[Quantization]`, `[Loader]`, `[OP]`, `[KVCache]`, `[DataProcessor]`, `[BugFix]`, `[Docs]`, `[CI]`, `[Optimization]`, `[Feature]`, `[Benchmark]`, `[Others]`, `[XPU]`, `[HPU]`, `[GCU]`, `[DCU]`, `[Iluvatar]`, `[Metax]`]
- [ ] Format your code, run `pre-commit` before commit.
- [ ] If the current PR is submitting to the `release` branch, make sure the PR has been submitted to the `develop` branch, then cherry-pick it to the `release` branch with the `[Cherry-Pick]` PR tag.