Merged
3 changes: 2 additions & 1 deletion README.md
```diff
@@ -35,7 +35,8 @@ Thus, a more effective approach is sparse attention: interacting each query with
 - Grouped Query Attention and Multi Query Attention
 - Flexible Mask and Bias
 - Skipping memory access and computation for masked regions
-- Gradient computation for bias
+- Gradient computation for bias to support learnable attention sink
+- Token-level KV sparsity for each Q

 ### Features We Aim to Support
```

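For context, the two feature lines added by this hunk — gradient computation for a bias term so an attention sink can be learned, and token-level KV sparsity per query — can be sketched in plain NumPy. This is an illustrative sketch only, not the library's actual fused kernels; both function names below are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_sink(q, k, v, sink_logit):
    """Attention where a learnable scalar bias joins the softmax but
    contributes no value: probability mass absorbed by the 'sink'
    column shrinks all real attention weights. Because sink_logit
    enters the softmax, it receives a gradient like any other logit."""
    scores = (q @ k.T) / np.sqrt(q.shape[-1])         # (Lq, Lk)
    sink = np.full((scores.shape[0], 1), sink_logit)  # sink column
    p = softmax(np.concatenate([scores, sink], axis=1))
    return p[:, :-1] @ v                              # sink emits no value

def sparse_attention(q, k, v, kv_idx):
    """Token-level KV sparsity: query i attends only to the KV tokens
    listed in kv_idx[i]; all other tokens are skipped entirely."""
    out = np.empty((q.shape[0], v.shape[1]))
    for i, idx in enumerate(kv_idx):
        s = (q[i] @ k[idx].T) / np.sqrt(q.shape[-1])
        out[i] = softmax(s) @ v[idx]
    return out
```

With `sink_logit` driven toward negative infinity the sink weight vanishes and `attention_with_sink` reduces to standard softmax attention; with `kv_idx[i]` covering all tokens, `sparse_attention` matches the dense result.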
Comment on lines +38 to 42

Copilot AI Dec 12, 2025

The PR description claims "New tests added to validate the gradient computation and sparsity features," but only README files are being modified in this PR. No test files or actual implementation code changes are included. If these features were implemented and tested in a previous commit, the PR description should be updated to accurately reflect that this PR only updates documentation. If the implementation and tests are planned for the future, these features should be moved to the "Features We Aim to Support" section instead.

Suggested change

```diff
-- Gradient computation for bias to support learnable attention sink
-- Token-level KV sparsity for each Q
-### Features We Aim to Support
+### Features We Aim to Support
+- Gradient computation for bias to support learnable attention sink
+- Token-level KV sparsity for each Q
```

3 changes: 2 additions & 1 deletion README_zh.md
```diff
@@ -35,7 +35,8 @@ Flash-Sparse-Attention is a high-performance trainable sparse attention implementation that
 - Grouped Query Attention and Multi Query Attention
 - Flexible Mask and Bias
 - Skipping memory access and computation for masked regions
-- Gradient computation for bias
+- Gradient computation for bias to support a learnable attention sink
+- Token-level KV sparsity for each Q

 ### Features We Aim to Support
```
