
Commit 8a652aa

Merge pull request #214 from flash-algo:update-feature-decs

Add gradient computation for bias and token-level KV sparsity support

2 parents 25edcc1 + ab83408

File tree

2 files changed: +4 −2 lines changed

README.md

Lines changed: 2 additions & 1 deletion

@@ -35,7 +35,8 @@ Thus, a more effective approach is sparse attention: interacting each query with
 - Grouped Query Attention and Multi Query Attention
 - Flexible Mask and Bias
 - Skipping memory access and computation for masked regions
-- Gradient computation for bias
+- Gradient computation for bias to support learnable attention sink
+- Token-level KV sparsity for each Q

 ### Features We Aim to Support
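The first added feature pairs a learnable additive attention bias with its gradient, which is what a trainable attention sink needs. A minimal numpy sketch of that math (not the flash-algo kernel; the names `q`, `k`, `v`, and `bias` are illustrative assumptions) computes attention with a bias term, derives dL/dbias via the standard softmax backward rule, and checks it against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
Lq, Lk, d = 3, 4, 8
q = rng.standard_normal((Lq, d))
k = rng.standard_normal((Lk, d))
v = rng.standard_normal((Lk, d))
bias = rng.standard_normal((Lq, Lk))  # learnable bias, e.g. an attention-sink term

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attn(b):
    # S = QK^T / sqrt(d) + B, O = softmax(S) V
    s = q @ k.T / np.sqrt(d) + b
    return softmax(s) @ v

# Analytic gradient of L = sum(O) w.r.t. bias: with dO = 1,
# dP = dO @ V^T, dS = P * (dP - rowsum(dP * P)), and dB = dS
# because S depends on B additively.
p = softmax(q @ k.T / np.sqrt(d) + bias)
dp = np.ones((Lq, d)) @ v.T
ds = p * (dp - (dp * p).sum(axis=-1, keepdims=True))

# Finite-difference check of the analytic bias gradient.
eps = 1e-6
num = np.zeros_like(bias)
for i in range(Lq):
    for j in range(Lk):
        b_hi = bias.copy(); b_hi[i, j] += eps
        b_lo = bias.copy(); b_lo[i, j] -= eps
        num[i, j] = (attn(b_hi).sum() - attn(b_lo).sum()) / (2 * eps)

assert np.allclose(ds, num, atol=1e-4)
```

A fused kernel would produce `ds` directly in the backward pass rather than materializing `p`, but the gradient it must return for the bias is exactly this quantity.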

README_zh.md

Lines changed: 2 additions & 1 deletion

@@ -35,7 +35,8 @@ Flash-Sparse-Attention 是一个高性能的可训练稀疏注意力实现, 将
 - 分组查询注意力和多查询注意力
 - 灵活的掩码与偏置
 - 跳过掩码区域的访存与计算
-- 偏置的梯度计算
+- 偏置的梯度计算以支持可学习 attention sink
+- 对于每个 Q 有 token 级别的 KV 稀疏性

 ### 我们想要支持的功能
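The second added feature, token-level KV sparsity for each Q, means every query row attends only to its own subset of KV tokens, skipping memory access and compute for the rest. A hedged numpy sketch of the semantics (the per-query index layout `kv_idx` is an assumption for illustration, not the library's API) gathers only the selected tokens per query and verifies the result against a dense masked reference:

```python
import numpy as np

rng = np.random.default_rng(1)
Lq, Lk, d = 2, 6, 4
q = rng.standard_normal((Lq, d))
k = rng.standard_normal((Lk, d))
v = rng.standard_normal((Lk, d))

# Per-query KV index lists (variable length): query 0 sees 3 tokens, query 1 sees 2.
kv_idx = [np.array([0, 2, 5]), np.array([1, 4])]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Sparse path: gather only the selected KV tokens for each query.
out_sparse = np.zeros((Lq, d))
for i, idx in enumerate(kv_idx):
    s = q[i] @ k[idx].T / np.sqrt(d)
    out_sparse[i] = softmax(s) @ v[idx]

# Dense reference: full scores with -inf on non-selected tokens.
s_full = q @ k.T / np.sqrt(d)
mask = np.full((Lq, Lk), -np.inf)
for i, idx in enumerate(kv_idx):
    mask[i, idx] = 0.0
masked = s_full + mask
p = np.exp(masked - masked.max(axis=-1, keepdims=True))
p /= p.sum(axis=-1, keepdims=True)
out_dense = p @ v

assert np.allclose(out_sparse, out_dense)
```

The point of the gather formulation is that the sparse path never touches the masked KV rows at all, which is where the memory-access and compute savings come from.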
