Conversation

@zeeshanhaque21 (Member) commented Oct 15, 2025

Implement Extend-Attention in Shortfin

  • Dynamic chunk sizing: chunk sizes adapt to the number of active requests, maximizing GPU utilization by filling the token budget.
  • The chunk size is computed at scheduling time as `(token_budget // num_active) // block_seq_stride * block_seq_stride`.
  • Position tracking: full requests are tracked with their current positions instead of being pre-chunked.
  • Simplified flow: `make_task_inputs()` returns a single task; the scheduler chunks on demand.
  • Added tests.
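The chunk-size calculation above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the function name and parameters are assumptions based on the formula in the description.

```python
# Hypothetical sketch of the dynamic chunk-size calculation described above.
# Names (token_budget, num_active, block_seq_stride) follow the PR description.

def compute_chunk_size(token_budget: int, num_active: int, block_seq_stride: int) -> int:
    """Split the token budget evenly across active requests, rounded
    down to a multiple of the block sequence stride."""
    per_request = token_budget // num_active
    return per_request // block_seq_stride * block_seq_stride

# Example: a 2048-token budget shared by 3 active requests with a stride
# of 32 gives floor(682 / 32) * 32 = 672 tokens per chunk.
print(compute_chunk_size(2048, 3, 32))
```

Rounding down to a stride multiple keeps each chunk aligned to full KV-cache blocks, which is why the division happens in two steps rather than one.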

@codecov-commenter commented Oct 15, 2025

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@2cb50fc). Learn more about missing BASE report.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2518   +/-   ##
=======================================
  Coverage        ?   77.55%           
=======================================
  Files           ?      264           
  Lines           ?    25198           
  Branches        ?        0           
=======================================
  Hits            ?    19543           
  Misses          ?     5655           
  Partials        ?        0           

☔ View full report in Codecov by Sentry.

@zeeshanhaque21 zeeshanhaque21 marked this pull request as ready for review October 16, 2025 00:17
@stbaione stbaione self-requested a review October 16, 2025 13:26
@stbaione (Contributor) left a comment:

May want to add a case to accuracy_test and smoke tests for validation

@zeeshanhaque21 (Member, Author) commented:
I'll add the accuracy and smoke tests after the IREE issue with dynamic batch sizes is fixed.

@stbaione stbaione self-requested a review October 20, 2025 14:36
@stbaione (Contributor) left a comment:

Looks good, just a couple questions

chunk_block_size=None,
)

async def prepare_args(self, batch_size: int) -> List[sfnp.device_array]:

Given the recent changes, I think this prepare_args produces the same results as the existing PrefillTask.prepare_args function, so we can just use the existing PrefillTask.

"Export from `sharktank` with `--has-prefill-position` for full trie prefix sharing benefits."
)

batch_mode = server_params.batch_mode

It looks like chunk_block_size isn't taken into consideration when extend_attention is used.

It might be good to log a warning if both have a value, noting that chunk_block_size is ignored when using extend_attention.
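One way the suggested guard could look, as a rough sketch. The `server_params` attribute names here are assumptions based on the names discussed in this thread, not the actual Shortfin config surface.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical guard; `use_extend_attention` and `chunk_block_size` are
# assumed attribute names, not Shortfin's actual config fields.
def resolve_chunk_block_size(server_params):
    """Return the effective chunk_block_size, warning when it would be
    ignored because extend-attention computes chunk sizes dynamically."""
    if server_params.use_extend_attention and server_params.chunk_block_size is not None:
        logger.warning(
            "chunk_block_size is ignored when extend_attention is enabled; "
            "chunk sizes are computed dynamically at scheduling time."
        )
        return None
    return server_params.chunk_block_size
```

The warning fires once at configuration time, so users who set both options learn immediately which one takes effect.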

seq_lens = torch.empty(bs_min, dtype=torch.int64)

print(f"Exporting prefill_bs{bs}")
# Use different naming for extend-attention mode to avoid confusion

What's the reasoning for adding this change?
