Conversation

@zeeshanhaque21 (Member) commented Oct 15, 2025

Implement Extend-Attention in Shortfin

  • Dynamic chunk sizing: chunk sizes adapt to the number of active requests, maximizing GPU utilization by filling the token budget.
  • The chunk size is computed at scheduling time as `(token_budget // num_active) // block_seq_stride * block_seq_stride`.
  • Position tracking: full requests are tracked with their current positions instead of being pre-chunked.
  • Simplified flow: `make_task_inputs()` returns a single task; the scheduler chunks on demand.
  • Added tests.
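The chunk-size calculation above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the function name and parameters are assumptions based on the formula in the description.

```python
# Hypothetical sketch of the dynamic chunk-size calculation described above.
# Names (token_budget, num_active, block_seq_stride) follow the PR description.

def compute_chunk_size(token_budget: int, num_active: int, block_seq_stride: int) -> int:
    """Split the token budget evenly across active requests, rounded
    down to a multiple of the block sequence stride."""
    per_request = token_budget // num_active
    return per_request // block_seq_stride * block_seq_stride

# Example: a 2048-token budget shared by 3 active requests with a stride
# of 32 gives floor(682 / 32) * 32 = 672 tokens per chunk.
print(compute_chunk_size(2048, 3, 32))
```

Rounding down to a stride multiple keeps each chunk aligned to full KV-cache blocks, which is why the division happens in two steps rather than one.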

@codecov-commenter commented Oct 15, 2025

⚠️ Please install the Codecov GitHub app to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
⚠️ Please upload report for BASE (main@2cb50fc). Learn more about missing BASE report.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2518   +/-   ##
=======================================
  Coverage        ?   77.55%           
=======================================
  Files           ?      264           
  Lines           ?    25198           
  Branches        ?        0           
=======================================
  Hits            ?    19543           
  Misses          ?     5655           
  Partials        ?        0           

☔ View full report in Codecov by Sentry.

@zeeshanhaque21 zeeshanhaque21 marked this pull request as ready for review October 16, 2025 00:17
@stbaione stbaione self-requested a review October 16, 2025 13:26
@stbaione (Contributor) left a comment:

May want to add a case to accuracy_test and smoke tests for validation

@zeeshanhaque21 (Member, Author) commented:
I'll add the accuracy and smoke tests after the IREE issue with dynamic batch sizes is fixed.

@stbaione stbaione self-requested a review October 20, 2025 14:36
@stbaione (Contributor) left a comment:

Looks good, just a couple questions

chunk_block_size=None,
)

async def prepare_args(self, batch_size: int) -> List[sfnp.device_array]:

Given the recent changes, I think this prepare_args produces the same results as the existing PrefillTask.prepare_args function, so we can just use the existing PrefillTask.

"Export from `sharktank` with `--has-prefill-position` for full trie prefix sharing benefits."
)

batch_mode = server_params.batch_mode

It looks like chunk_block_size isn't taken into consideration when extend_attention is used.

It might be good to log a warning if both have a value, noting that chunk_block_size is ignored when using extend_attention.
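One way the suggested guard could look, as a rough sketch. The `server_params` attribute names here are assumptions based on the names discussed in this thread, not the actual Shortfin config surface.

```python
import logging

logger = logging.getLogger(__name__)

# Hypothetical guard; `use_extend_attention` and `chunk_block_size` are
# assumed attribute names, not Shortfin's actual config fields.
def resolve_chunk_block_size(server_params):
    """Return the effective chunk_block_size, warning when it would be
    ignored because extend-attention computes chunk sizes dynamically."""
    if server_params.use_extend_attention and server_params.chunk_block_size is not None:
        logger.warning(
            "chunk_block_size is ignored when extend_attention is enabled; "
            "chunk sizes are computed dynamically at scheduling time."
        )
        return None
    return server_params.chunk_block_size
```

The warning fires once at configuration time, so users who set both options learn immediately which one takes effect.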

seq_lens = torch.empty(bs_min, dtype=torch.int64)

print(f"Exporting prefill_bs{bs}")
# Use different naming for extend-attention mode to avoid confusion

What's the reasoning for adding this change?
