KEP-766: Add DisaggregatedSet proposal #767
hasB4K wants to merge 5 commits into kubernetes-sigs:main
Conversation
/ok-to-test |
Can you run this script https://github.com/kubernetes-sigs/lws/blob/main/hack/update-toc.sh to fix the formatting in the KEP? |
This KEP proposes adding DisaggDeployment as a higher-level API for managing disaggregated inference workloads. DisaggDeployment orchestrates multiple LeaderWorkerSets (prefill and decode) with coordinated lifecycle management, including two-dimensional rolling updates and service orchestration. Signed-off-by: Mathis Felardos <mathis@mistral.ai>
And please sign the CLA.
Update KEP metadata with actual reviewers and approvers, and remove the feature gate from Alpha requirements since DisaggDeployment doesn't require a feature gate. Signed-off-by: Mathis Felardos <mathis@mistral.ai>
Signed-off-by: Mathis Felardos <mathis@mistral.ai>
> ### Non-Goals
>
> 1. **Auto-scaling support**: HPA/VPA integration or automatic scaling based on inference load metrics is out of scope.
You can still autoscale at the prefill/decode level, right?
Yes, I think it should work. I haven't tested it in my implementation, but I don't see any reason why it couldn't work. I put it as Non-Goal since it's not the focus of the operator for now, and I actually think that scaling Prefill and Decode independently makes more sense than scaling the entire DisaggDeployment directly.
> I actually think that scaling Prefill and Decode independently

Yeah, that's the one I was referring to: still being able to scale independently while treating it as a unit.
> ```go
> // LeaderWorkerTemplate defines the template for leader/worker pods.
> // +required
> LeaderWorkerTemplate LeaderWorkerTemplate `json:"leaderWorkerTemplate"`
> ```
Is there a way to pass annotations that will be injected into the LWS object? The exclusive-placement feature relies on that to work: https://lws.sigs.k8s.io/docs/concepts/#exclusive-lws-to-topology-placement
An idea would be to add a field here called ExclusivePlacement and if set, have the controller inject the annotation. Though would need some sort of config to set subgroup-exclusive-placement as well.
If it doesn't work out of the box in my implementation, we could add it. I'll check tomorrow morning whether we can pass annotations through without changing anything.
Yeah it is a future work type of thing, just thinking out loud. We can help with the implementation as well.
> **Example 2: Asymmetric Replicas** (9 prefill, 3 decode → 9 prefill, 3 decode, maxSurge=1):
>
> When prefill and decode have different replica counts, the algorithm computes steps based on the larger dimension (9 in this case). The smaller side (decode) progresses proportionally slower:
If I'm understanding correctly, this means that during updates we are keeping track of four different LWS instances, correct?
Can you expand on why you opted for this implementation instead of, for example, using the partition field at the LWS level, similar to how the partition field is used in StatefulSet? Is the only gap #710?
You can add it to the alternatives considered section, and we can discuss later if there are any improvements we can make at the LWS level to simplify this algorithm, or if it is worth it at all.
Okay, I was not aware of the partition field. I just read about it, and maybe I'm wrong here, but I don't think it would work:

1. **Revision-aware traffic routing**: DisaggDeployment is designed for disaggregated inference, where a load balancer (e.g., a modified LLM-d Endpoint Picker) must route prefill requests to backends whose decode counterparts are on the same revision. With a separate LWS (and Service) per revision, each pod's revision is explicit via labels (`disaggdeployment.x-k8s.io/revision`). The LB can count backends per revision across both pools and distribute traffic proportionally. With partition-based updates, pods within the same LWS have different templates based on ordinal, making revision-aware routing significantly more complex. The goal is to avoid having a prefill talk to an incompatible decode; both sides must be treated as totally incompatible. This special routing requires work on the LLM-d side.
2. **LWS as a read-only resource**: Treating LWS as a read-only resource (like a ReplicaSet) makes more sense, since you want to update prefill and decode at different paces depending on the step you are at. It's a tied update. I don't think you can achieve this with partition.
3. **Ops observability**: It's simpler for ops observability. You can see directly at which stage your update is, since you can see the version right away during updates (see screenshot).
Thanks for adding the section in the KEP. A few follow up questions that I have:
> With partition-based updates, pods within the same LWS have different templates based on ordinal, making revision-aware routing significantly more complex
I'm assuming that the labels are injected at the LWS level, but can't we achieve the same logic if the disagg revision label is injected into the leader pod instead? Then those above the partition will have the new revision label and those below will have the old one, so the router can still do the mapping between old and new versions. This might depend on the router's implementation, but I don't think it makes a difference at what level the label is injected.
> as you want to update Prefill and Decode at different paces depending on the step you are at. It's a tied update. I don't think you can achieve this with partition.
Why not? I don't see the difference between scaling prefill & decode vs changing the partition value independently.
> Ops observability
This, I think, would be the main benefit. But I wonder how much it matters for this to be easily readable; neither StatefulSet nor LWS has it.
Just to make it clear, this is not a blocker for merging the KEP or the code implementation. I just think it is worth discussing for the future whether there's a simpler path here. The algorithm needed for rolling updates is already complex, and it would be easier to understand if we stuck to existing Kubernetes concepts like partition instead of maintaining four LWS instances.
> 4. **Configuration drift**: Without a unified resource, prefill and decode configurations can drift apart, leading to subtle incompatibilities.
>
> ### Goals
Have there been any thoughts on how this would work with Kueue?
What I am wondering is: if this DisaggDeployment is quota-limited, do you want to avoid creating the LWSs?
In my understanding, Kueue works via annotations, right? I would say it works out of the box, and if adding annotations doesn't work directly, we could add support.
See:
#767 (comment)
Well, Kueue works with annotations for LWS because it has an internal controller for it: https://github.com/kubernetes-sigs/kueue/tree/main/pkg/controller/jobs/leaderworkerset
But I guess I'm mostly curious about things like Kueue's TopologyAwareScheduling (i.e. #665) with disaggregated serving.
Well, I'm not sure 🙂 - I guess we will need to experiment
Address reviewer feedback explaining why we use multiple LWS per revision instead of the LWS partition field: - Revision-aware traffic routing for LLM-d Endpoint Picker - LWS as read-only resource (like Deployment/ReplicaSet pattern) - Simpler ops observability during rolling updates
> ```go
> // +kubebuilder:default=1
> Replicas *int32 `json:"replicas,omitempty"`
>
> // LeaderWorkerTemplate defines the template for leader/worker pods.
> ```
Have you tested if setting size = 1 for both the prefill and the decode server works?
You mean replicas==1 for both sides? Yes I did, and it works.
> DisaggDeployment simplifies the deployment of disaggregated LLM inference by:
> - Managing two LeaderWorkerSets (prefill and decode) as a single logical unit
> - Providing coordinated two-dimensional rolling updates across both sides
> - Automatically creating Services when both sides are ready
We have generally avoided bundling/instantiating easily composable resources. For example, in upstream, you don't see the workload apis (like Deployment, StatefulSet etc.) automatically create a Service.
The service is optional, and I thought it was okay to be able to create one since LWS already does this.
But anyway, we can discuss this. It was easier for handling our scheduler on our side, but I understand your concerns.
LWS only creates a headless service to trigger programming DNS entries for the individual workers. This is useful to standardize discovery among the leader-workers of a group (e.g., typically workers need to know the network address of their leader; for other frameworks it could be the opposite, with the leader needing to know the network addresses of its workers).
> 4. **Traffic management / routing**: Integration with service meshes or ingress controllers for traffic splitting during rollouts is out of scope.
>
> 5. **Generic group support**: Supporting arbitrary groups beyond prefill/decode (e.g., `decode-long-context`, `encode`) is deferred to a future enhancement.
I think we should at least make the API future-proof to support that.
I think it's already future-proof, since:

```yaml
apiVersion: disaggregatedset.x-k8s.io/v1alpha1
kind: DisaggregatedSet
metadata:
  name: my-llm
spec:
  prefill:
    replicas: 3                # Number of LWS groups
    rolloutStrategy:
      maxSurge: 1              # Extra replicas during update
      maxUnavailable: 0        # Don't remove until new is ready
    serviceTemplate:           # Optional: create Service when ready
      spec:
        ports:
        - port: 8080
    leaderWorkerTemplate:
      size: 2                  # 1 leader + 1 worker per group
      workerTemplate:
        spec:
          containers:
          - name: prefill
            image: vllm:latest
            resources:
              limits:
                nvidia.com/gpu: 1
  decode:
    replicas: 6                # Often more decode than prefill
    rolloutStrategy:
      maxSurge: 2
      maxUnavailable: 1
    serviceTemplate:
      spec:
        ports:
        - port: 8080
    leaderWorkerTemplate:
      size: 1                  # Single pod per group
      workerTemplate:
        spec:
          containers:
          - name: decode
            image: vllm:latest
            resources:
              limits:
                nvidia.com/gpu: 1
  decode-long-context:         # <== adding a special group should work out of the box in the config.
    [ ... ]
```

What I mean is that prefill and decode are currently hardcoded, but they could simply be treated as a standalone "side" with a name (either prefill or decode).
So unless I'm missing something (which is totally possible), I think it's already future-proof?
This is still hardcoding prefill and decode as part of the API; we could instead have a list with named entries to make the API more generic. But we can follow up on that possibility when we merge it to run with the LWS controller.
I guess you are suggesting this solution:

```go
type DisaggregatedSetSpec struct {
    Sides []DisaggSideConfig `json:"sides"`
}

type DisaggSideConfig struct {
    Name string `json:"name"` // "prefill", "decode", etc.
    // ... rest of fields
}
```

YAML:

```yaml
spec:
  sides:
  - name: prefill
    replicas: 10
    ...
  - name: decode
    replicas: 2
    ...
  - name: custom_stage
    replicas: 5
```

I have to say that this is an easy change, and I could do it (while only supporting prefill and decode for now).

What I had in mind, however, was this:

**Option 2: Pure Map (Breaking Change)**

```go
type DisaggregatedSetSpec struct {
    Sides map[string]*DisaggSideConfig `json:"sides"`
}
```

YAML:

```yaml
spec:
  sides:
    prefill: <lws1>
    decode: <lws2>
    custom_stage: <lws3>
```

This would work with older YAML configs too.
> we could instead have a list with named entries to make the api more generic
If we go this route, we should make it clear somewhere that we only support synced rolling updates between the first two items in the list, not all of them.
> #### Story 3: Coordinated Service Discovery
>
> As an application developer, I want Services to be automatically created for my prefill and decode endpoints, but only when both sides are ready to serve traffic, so that my application doesn't route requests to partially-ready infrastructure.
I do not think relying on the existence of the Service resource is the proper approach to address this problem. The P/D coordinator should handle this case, not the workload api.
The service is optional, and I thought it was okay to be able to create one since LWS already does this.
But anyway, we can discuss this. It was easier for handling our scheduler on our side, but I understand your concerns.
> `old + new <= target + maxSurge`
>
> 5. **Scale-Up Priority**: New replicas are scaled up before old replicas are scaled down, ensuring capacity is never below the minimum.
Are we not concerned about limited hardware capacity? Does this mean the cluster must have capacity for at least one extra replica to conduct an update (vs. doing it in place)? Can we set maxSurge=0 and maxUnavailable=1 to trigger an in-place rollout?
Yes, this works, and it's part of the test suite. We simply favor maxSurge over maxUnavailable. We could change this behavior if you want.
No, I think preferring maxSurge seems like the better semantics.
> | 5 | 1 | 1 | 3 | 3 |
> | 6 | 0 | 0 | 3 | 3 |
>
> **Example 2: Asymmetric Replicas** (9 prefill, 3 decode → 9 prefill, 3 decode, maxSurge=1):
Can we spell out the maxSurge semantics? Is it shared across prefill and decode, or per side?
The implementation has been pushed here 🙂: I wanted this PR to be on top of that branch, but since my branch is on my fork, I couldn't do it 😬 (it created a PR on my forked repo instead 🫤)
What type of PR is this?
/kind feature
/kind api-change
What this PR does / why we need it
This KEP proposes adding DisaggDeployment as a higher-level API for managing disaggregated inference workloads. DisaggDeployment orchestrates multiple LeaderWorkerSets (prefill and decode) with coordinated lifecycle management, including two-dimensional rolling updates and service orchestration.
Fixes #766
Special notes for your reviewer
As discussed offline, the implementation of this KEP already exists, and will be pushed tomorrow. I will change my implementation based on the review of this KEP and the review of the coming implementation.
Does this PR introduce a user-facing change?
Yes and no. It's a new API.