Standardize a Protocol for Prefill/Decode Disaggregated Serving #1650

yangligt2 · 2025-09-25T17:20:41Z

yangligt2
Sep 25, 2025

Hey everyone,

I wanted to kick off a discussion about bringing prefill/decode disaggregated serving into the core GIE project.
Reference: llm-d/llm-d-inference-scheduler#356
As some of you know, this logic already exists and works in the llm-d project. The idea is to upstream it so we can create a standard protocol that other model servers can hook into. This would be a huge win for interoperability and would directly help all other adopters than llm-d.

The "Why"

Disaggregated serving is a key optimization for LLMs. By splitting the compute-heavy prefill from the memory-heavy decode steps, we can scale those resources independently and get much better hardware utilization.
By standardizing this in GIE, we can:

Let different model servers (sglang, vLLM, etc.) use this optimization out-of-the-box with IGW.
Keep the complex orchestration logic inside the EPP, so the model servers/inference frameworks themselves can stay simpler.

The Proposed Protocol

The protocol we've been using in llm-d is pretty straightforward and would be a great starting point:

The EPP picks a pair of servers: one for prefill and one for decode.
The EPP adds a header: It injects the address of the prefill server into the request (e.g., X-Prefill-Endpoint: :).
The EPP routes to the decode server: The gateway sends the request to the chosen decode server.
The decode server coordinates: It's then up to the decode server to read the header and pull the KV cache from the prefill server.
The great thing is, this should all be doable with the existing scheduler plugin architecture. We won't need to touch the core framework. It would mostly be a new ProfileHandler that orchestrates two separate scheduling profiles (prefill-profile and decode-profile).

Next Steps

This is the first step toward a formal GIEP. The plan would be to tackle this in two phases:

Milestone 1: Get the protocol defined and upstream the existing llm-d logic.
Milestone 2 (Future): Look into more advanced scheduling algorithms (e.g., deciding when to disaggregate based on sequence length, and etc).

Greatly appreciate any feedback on this. Does this protocol seem reasonable? Any potential gotchas we should be thinking about?

Thanks!

danehans · 2025-10-02T17:36:26Z

danehans
Oct 2, 2025
Maintainer

xref this existing discussion for the same topic: #607

1 reply

danehans Oct 2, 2025
Maintainer

The EPP adds a header: It injects the address of the prefill server into the request (e.g., X-Prefill-Endpoint: :).

See the existing EPP proto spec. We should following the current pref "x-gateway- prefix, e.g. "x-gateway-prefill-endpoint" which would represent a list of <ip:port>,<ip:port>,... prefill endpoint tuples. This should only be required when the EPP is enabled with the disagg PD plugin.

The EPP routes to the decode server: The gateway sends the request to the chosen decode server.

Note that the EPP does not "route". It only picks endpoints and then provides the endpoints to the gateway. The gateway is responsible for "routing" the request to an endpoint from the EPP-provided list. The gateway is also responsible for communicating back to the EPP, the endpoint that served the request. I anticipate a similar feedback loop will be needed with disagg PD.

The EPP proto spec defines ^ in more detail and will need to be updated based on the details of your proposal that supports this discussion (please make sure to link your proposal PR to this discussion).

The EPP picks a pair of servers: one for prefill and one for decode.

Does the existing "x-gateway-destination-endpoint" represent the decode server? If so, then is it a matter of the gateway passing the "x-gateway-prefill-endpoint" header along to the decode endpoint?

danehans · 2025-11-19T17:45:36Z

danehans
Nov 19, 2025
Maintainer

@yangligt2 are you still planning on submitting a formal GIEP? If so, please link to this discussion. Thank you!

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Standardize a Protocol for Prefill/Decode Disaggregated Serving #1650

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Standardize a Protocol for Prefill/Decode Disaggregated Serving #1650

Uh oh!

yangligt2 Sep 25, 2025

The "Why"

The Proposed Protocol

Next Steps

Replies: 2 comments · 1 reply

Uh oh!

danehans Oct 2, 2025 Maintainer

Uh oh!

danehans Oct 2, 2025 Maintainer

Uh oh!

danehans Nov 19, 2025 Maintainer

yangligt2
Sep 25, 2025

Replies: 2 comments 1 reply

danehans
Oct 2, 2025
Maintainer

danehans Oct 2, 2025
Maintainer

danehans
Nov 19, 2025
Maintainer