Merged
3 changes: 2 additions & 1 deletion README.md
```diff
@@ -35,7 +35,8 @@ Thus, a more effective approach is sparse attention: interacting each query with
 - Grouped Query Attention and Multi Query Attention
 - Flexible Mask and Bias
 - Skipping memory access and computation for masked regions
-- Gradient computation for bias
+- Gradient computation for bias to support learnable attention sink
+- Token-level KV sparsity for each Q

 ### Features We Aim to Support
```

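For context, the two feature lines added by this hunk — gradient computation for a bias term so an attention sink can be learned, and token-level KV sparsity per query — can be sketched in plain NumPy. This is an illustrative sketch only, not the library's actual fused kernels; both function names below are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_with_sink(q, k, v, sink_logit):
    """Attention where a learnable scalar bias joins the softmax but
    contributes no value: probability mass absorbed by the 'sink'
    column shrinks all real attention weights. Because sink_logit
    enters the softmax, it receives a gradient like any other logit."""
    scores = (q @ k.T) / np.sqrt(q.shape[-1])         # (Lq, Lk)
    sink = np.full((scores.shape[0], 1), sink_logit)  # sink column
    p = softmax(np.concatenate([scores, sink], axis=1))
    return p[:, :-1] @ v                              # sink emits no value

def sparse_attention(q, k, v, kv_idx):
    """Token-level KV sparsity: query i attends only to the KV tokens
    listed in kv_idx[i]; all other tokens are skipped entirely."""
    out = np.empty((q.shape[0], v.shape[1]))
    for i, idx in enumerate(kv_idx):
        s = (q[i] @ k[idx].T) / np.sqrt(q.shape[-1])
        out[i] = softmax(s) @ v[idx]
    return out
```

With `sink_logit` driven toward negative infinity the sink weight vanishes and `attention_with_sink` reduces to standard softmax attention; with `kv_idx[i]` covering all tokens, `sparse_attention` matches the dense result.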
Comment on lines +38 to 42

Copilot AI Dec 12, 2025

The PR description claims "New tests added to validate the gradient computation and sparsity features," but only README files are being modified in this PR. No test files or actual implementation code changes are included. If these features were implemented and tested in a previous commit, the PR description should be updated to accurately reflect that this PR only updates documentation. If the implementation and tests are planned for the future, these features should be moved to the "Features We Aim to Support" section instead.

Suggested change

```diff
-- Gradient computation for bias to support learnable attention sink
-- Token-level KV sparsity for each Q
-### Features We Aim to Support
+### Features We Aim to Support
+- Gradient computation for bias to support learnable attention sink
+- Token-level KV sparsity for each Q
```

3 changes: 2 additions & 1 deletion README_zh.md
```diff
@@ -35,7 +35,8 @@ Flash-Sparse-Attention is a high-performance trainable sparse attention implementation that
 - Grouped Query Attention and Multi Query Attention
 - Flexible Mask and Bias
 - Skipping memory access and computation for masked regions
-- Gradient computation for bias
+- Gradient computation for bias to support a learnable attention sink
+- Token-level KV sparsity for each Q

 ### Features We Aim to Support
```
