Add gradient computation for bias and token-level KV sparsity support#214
LoserCheems merged 1 commit into `main`
Conversation
Pull request overview
This PR updates the documentation in both English and Chinese README files to provide more detailed descriptions of two existing features in the "Supported Features" section. The changes clarify that gradient computation for bias supports learnable attention sink mechanisms and add explicit mention of token-level KV sparsity for each query.
Key Changes
- Enhanced feature description for bias gradient computation to specify its use case (learnable attention sink)
- Added explicit documentation of token-level KV sparsity capability for each query
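The two documented features can be illustrated with a minimal NumPy sketch of attention with an additive, trainable score bias. This is not this repository's API; all names and shapes here are hypothetical. Because the bias enters the forward pass additively, training it requires a gradient with respect to the bias, which is the "gradient computation for bias" the feature list refers to; a finite-difference probe below confirms the output depends on it.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes: one head, 4 queries, 4 key/value tokens, dim 8.
rng = np.random.default_rng(0)
q = rng.normal(size=(4, 8))
k = rng.normal(size=(4, 8))
v = rng.normal(size=(4, 8))

# A learnable additive bias on the attention scores (an "attention sink"
# can be realized this way: a bias column that soaks up probability mass).
bias = rng.normal(size=(4, 4))

scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
probs = softmax(scores)
out = probs @ v

# Finite-difference probe: perturbing one bias entry changes the output,
# i.e. a nonzero gradient flows back to the bias during training.
eps = 1e-5
bias2 = bias.copy()
bias2[0, 0] += eps
out2 = softmax(q @ k.T / np.sqrt(q.shape[-1]) + bias2) @ v
```

In an autograd framework the same effect is obtained by declaring the bias a trainable parameter; the kernel-level work this PR's feature list describes is computing that bias gradient inside the fused attention backward pass.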
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| README.md | Updated lines 38-39 to clarify bias gradient computation purpose and add token-level KV sparsity feature |
| README_zh.md | Corresponding Chinese translation updates for the same feature descriptions |
```diff
+- Gradient computation for bias to support learnable attention sink
+- Token-level KV sparsity for each Q

 ### Features We Aim to Support
```
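"Token-level KV sparsity for each Q" means every query row keeps its own subset of key/value tokens rather than sharing one global mask. A minimal NumPy sketch of the idea (names and the `keep` mask are illustrative, not this repository's interface):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
q = rng.normal(size=(3, 8))   # 3 queries
k = rng.normal(size=(5, 8))   # 5 key/value tokens
v = rng.normal(size=(5, 8))

# Per-query KV sparsity: keep[i, j] == True means query i may attend to
# key/value token j. Each query gets a different subset.
keep = np.array([
    [True,  True,  False, False, False],
    [False, True,  True,  True,  False],
    [True,  False, False, True,  True ],
])

scores = q @ k.T / np.sqrt(q.shape[-1])
scores = np.where(keep, scores, -np.inf)  # dropped tokens get -inf
probs = softmax(scores)                   # -> exactly zero weight there
out = probs @ v
```

A fused kernel exploits this by skipping the masked KV tokens entirely instead of materializing `-inf` scores, which is where the performance benefit comes from.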
The PR description claims "New tests added to validate the gradient computation and sparsity features," but only README files are being modified in this PR. No test files or actual implementation code changes are included. If these features were implemented and tested in a previous commit, the PR description should be updated to accurately reflect that this PR only updates documentation. If the implementation and tests are planned for the future, these features should be moved to the "Features We Aim to Support" section instead.
Suggested change:

```diff
-- Gradient computation for bias to support learnable attention sink
-- Token-level KV sparsity for each Q
-### Features We Aim to Support
+### Features We Aim to Support
+- Gradient computation for bias to support learnable attention sink
+- Token-level KV sparsity for each Q
```