Reply: xref this existing discussion for the same topic: #607
Hey everyone,
I wanted to kick off a discussion about bringing prefill/decode disaggregated serving into the core GIE project.
Reference: llm-d/llm-d-inference-scheduler#356
As some of you know, this logic already exists and works in the llm-d project. The idea is to upstream it so we can create a standard protocol that other model servers can hook into. This would be a huge win for interoperability and would directly benefit adopters beyond llm-d.
The "Why"
Disaggregated serving is a key optimization for LLM inference. By splitting the compute-bound prefill phase from the memory-bandwidth-bound decode phase, we can scale those resources independently and get much better hardware utilization.
By standardizing this in GIE, we can:
- Define one protocol that any model server can hook into, instead of llm-d-specific plumbing
- Improve interoperability across the ecosystem
- Let every GIE adopter scale prefill and decode resources independently
The Proposed Protocol
The protocol we've been using in llm-d is pretty straightforward and would be a great starting point: the scheduler makes two picks per request, one prefill worker and one decode worker, routes the request to the decode worker, and conveys the selected prefill worker along with the request (in llm-d this is done via a request header) so the decode side can trigger the remote prefill.
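To make the handoff concrete, here's a minimal sketch of the routing decision. The header name `x-prefiller-url` follows llm-d's convention as I understand it, but treat both the name and the exact flow as illustrative rather than a settled GIE protocol:

```go
package main

import "fmt"

// Illustrative header name; llm-d conveys the chosen prefill worker
// this way, but the final GIE protocol would need to standardize it.
const prefillerURLHeader = "x-prefiller-url"

// routingDecision captures the outcome of a disaggregated scheduling
// cycle: route the request to the decode worker, and tell it which
// prefill worker to use via a request header.
type routingDecision struct {
	TargetURL string            // decode worker the gateway routes to
	Headers   map[string]string // extra headers injected into the request
}

func disaggregate(prefillURL, decodeURL string) routingDecision {
	return routingDecision{
		TargetURL: decodeURL,
		Headers:   map[string]string{prefillerURLHeader: prefillURL},
	}
}

func main() {
	d := disaggregate("http://prefill-0:8000", "http://decode-3:8000")
	fmt.Printf("route to %s with headers %v\n", d.TargetURL, d.Headers)
}
```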
The great thing is, this should all be doable with the existing scheduler plugin architecture. We won't need to touch the core framework. It would mostly be a new ProfileHandler that orchestrates two separate scheduling profiles (prefill-profile and decode-profile).
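Here's a rough sketch of how such a handler could orchestrate the two profiles. The types below are simplified stand-ins, not the actual GIE plugin interfaces; the point is just the two-pick flow:

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the scheduler framework; the real GIE
// plugin interfaces differ, this only sketches the orchestration.
type Pod struct{ URL string }

type Profile struct {
	Name string
	Pick func(candidates []Pod) (Pod, error) // filters/scorers/picker collapsed
}

// pdHandler runs the prefill-profile and decode-profile in sequence,
// mirroring the proposed two-pick scheduling cycle.
type pdHandler struct {
	prefill Profile
	decode  Profile
}

func (h pdHandler) Schedule(prefillPods, decodePods []Pod) (headers map[string]string, target Pod, err error) {
	p, err := h.prefill.Pick(prefillPods)
	if err != nil {
		return nil, Pod{}, err
	}
	d, err := h.decode.Pick(decodePods)
	if err != nil {
		return nil, Pod{}, err
	}
	// The decode pod is the routing target; the prefill pick travels
	// as request metadata (e.g. a header, as in the llm-d protocol).
	return map[string]string{"x-prefiller-url": p.URL}, d, nil
}

func firstAvailable(candidates []Pod) (Pod, error) {
	if len(candidates) == 0 {
		return Pod{}, errors.New("no candidate pods")
	}
	return candidates[0], nil // a real profile would score candidates
}

func main() {
	h := pdHandler{
		prefill: Profile{Name: "prefill-profile", Pick: firstAvailable},
		decode:  Profile{Name: "decode-profile", Pick: firstAvailable},
	}
	hdrs, target, _ := h.Schedule(
		[]Pod{{URL: "http://prefill-0:8000"}},
		[]Pod{{URL: "http://decode-3:8000"}},
	)
	fmt.Println("target:", target.URL, "headers:", hdrs)
}
```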
Next Steps
This is the first step toward a formal GIEP. The plan would be to tackle this in two phases:
Milestone 1: Get the protocol defined and upstream the existing llm-d logic.
Milestone 2 (Future): Look into more advanced scheduling algorithms (e.g., deciding when to disaggregate based on sequence length, etc.); a toy sketch of a length-based heuristic is below.
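For Milestone 2, a length-based policy could be as simple as the toy heuristic below; the threshold and the token-count input are made-up placeholders, not tuned values:

```go
package main

import "fmt"

// shouldDisaggregate is a toy Milestone-2-style heuristic: only pay
// the disaggregation overhead when the prompt is long enough that a
// separate prefill pool actually helps.
func shouldDisaggregate(promptTokens int) bool {
	const minPrefillTokens = 256 // hypothetical cutoff, not a tuned value
	return promptTokens >= minPrefillTokens
}

func main() {
	for _, n := range []int{32, 1024} {
		fmt.Printf("prompt=%d tokens -> disaggregate=%v\n", n, shouldDisaggregate(n))
	}
}
```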
I'd greatly appreciate any feedback on this. Does this protocol seem reasonable? Any potential gotchas we should be thinking about?
Thanks!