Extend attn in Shortfin #2518
base: main

Conversation
(One resolved review comment on shortfin/python/shortfin_apps/llm/components/batching/modes/extend_attention.py, marked outdated.)
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff            @@
##            main    #2518     +/-  ##
========================================
  Coverage       ?    77.55%
========================================
  Files          ?       264
  Lines          ?     25198
  Branches       ?         0
========================================
  Hits           ?     19543
  Misses         ?      5655
  Partials       ?         0
stbaione left a comment:
May want to add a case to accuracy_test and smoke tests for validation.
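A hypothetical sketch of such a case, in pytest. The fixture name, generation API, and batch-mode flag below are all assumptions for illustration; the thread only asks for coverage in the accuracy and smoke tests.

    import pytest

    # Hypothetical: parametrize an existing smoke test over batch modes so the
    # extend_attention path is exercised alongside the default one. The
    # llm_server_factory fixture and its generate() API are illustrative only.
    @pytest.mark.parametrize("batch_mode", ["default", "extend_attention"])
    def test_smoke_generate(llm_server_factory, batch_mode):
        server = llm_server_factory(batch_mode=batch_mode)
        completion = server.generate("The capital of France is", max_tokens=8)
        assert completion.strip(), "expected a non-empty completion"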
(Eight more resolved review comments on shortfin/python/shortfin_apps/llm/components/batching/modes/extend_attention.py, all marked outdated.)
I'll add the accuracy tests and smoke tests after the IREE issue with dynamic batch sizes is fixed.
stbaione left a comment:
Looks good, just a couple of questions.
Inline comment on the diff:

        chunk_block_size=None,
    )

    async def prepare_args(self, batch_size: int) -> List[sfnp.device_array]:

Given the recent changes, I think this prepare_args produces the same results as the existing PrefillTask.prepare_args, so we could just use the existing PrefillTask.
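A minimal sketch of that deduplication, assuming PrefillTask lives in a sibling prefill module with a compatible constructor. Only PrefillTask.prepare_args and the extend_attention.py path appear in the thread; the import path and class shape are assumptions.

    # Hypothetical sketch: inherit prepare_args instead of reimplementing it.
    # The import path and class layout are assumed, not taken from the PR.
    from .prefill import PrefillTask


    class ExtendAttentionTask(PrefillTask):
        """Batching task for extend-attention mode.

        prepare_args is inherited unchanged from PrefillTask, since the
        review notes both implementations now produce the same device
        arrays. Only behavior that genuinely differs from prefill (e.g.
        which exported entrypoint is invoked) would be overridden here.
        """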
| "Export from `sharktank` with `--has-prefill-position` for full trie prefix sharing benefits." | ||
| ) | ||
|
|
||
| batch_mode = server_params.batch_mode |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like chunk_block_size isn't taken into consideration when extend_attention is used.
Might be good to log a warning if both have a value, and just say that chunk_block_size is ignored when using extend_attention
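A hedged sketch of that warning, assuming server_params exposes chunk_block_size as an optional attribute and that extend-attention mode is identified by the batch_mode string (both names taken from the snippets above; the exact mode constant is an assumption):

    import logging

    logger = logging.getLogger(__name__)

    batch_mode = server_params.batch_mode
    # Hypothetical guard: chunk_block_size has no effect in extend-attention
    # mode, so warn instead of silently ignoring it.
    if batch_mode == "extend_attention" and server_params.chunk_block_size is not None:
        logger.warning(
            "chunk_block_size=%s is ignored when batch_mode=extend_attention",
            server_params.chunk_block_size,
        )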
Inline comment on the diff:

    seq_lens = torch.empty(bs_min, dtype=torch.int64)

    print(f"Exporting prefill_bs{bs}")
    # Use different naming for extend-attention mode to avoid confusion

What's the reasoning for adding this change?
Force-pushed from e7018d3 to 3fdfded.
Commits:
- … into update-extend-attn
- Implement Extend-Attention in Shortfin