
KEP-766: Add DisaggregatedSet proposal#767

Open
hasB4K wants to merge 5 commits into kubernetes-sigs:main from hasB4K:kep-disagg-deployment
Conversation


@hasB4K hasB4K commented Mar 5, 2026

This KEP proposes adding DisaggDeployment as a higher-level API for managing disaggregated inference workloads. DisaggDeployment orchestrates multiple LeaderWorkerSets (prefill and decode) with coordinated lifecycle management, including two-dimensional rolling updates and service orchestration.

What type of PR is this?

/kind feature
/kind api-change

What this PR does / why we need it

This KEP proposes adding DisaggDeployment as a higher-level API for managing disaggregated inference workloads. DisaggDeployment orchestrates multiple LeaderWorkerSets (prefill and decode) with coordinated lifecycle management, including two-dimensional rolling updates and service orchestration.

Fixes #766

Special notes for your reviewer

As discussed offline, the implementation of this KEP already exists, and will be pushed tomorrow. I will change my implementation based on the review of this KEP and the review of the coming implementation.

Does this PR introduce a user-facing change?

Yes and no. It's a new API.

@k8s-ci-robot k8s-ci-robot added the kind/feature Categorizes issue or PR as related to a new feature. label Mar 5, 2026
@k8s-ci-robot
Contributor

@hasB4K: The label(s) kind//kind cannot be applied, because the repository doesn't have them.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the kind/api-change Categorizes issue or PR as related to adding, removing, or otherwise changing an API label Mar 5, 2026
@linux-foundation-easycla

linux-foundation-easycla bot commented Mar 5, 2026

CLA Signed

The committers listed above are authorized under a signed CLA.

@k8s-ci-robot k8s-ci-robot requested review from ahg-g and ardaguclu March 5, 2026 17:02
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: hasB4K
Once this PR has been reviewed and has the lgtm label, please assign edwinhr716 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Contributor

Welcome @hasB4K!

It looks like this is your first PR to kubernetes-sigs/lws 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/lws has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@netlify

netlify bot commented Mar 5, 2026

Deploy Preview for kubernetes-sigs-lws ready!

🔨 Latest commit: f9718f4
🔍 Latest deploy log: https://app.netlify.com/projects/kubernetes-sigs-lws/deploys/69ae95647eb31500092a641a
😎 Deploy Preview: https://deploy-preview-767--kubernetes-sigs-lws.netlify.app

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Mar 5, 2026
@k8s-ci-robot
Contributor

Hi @hasB4K. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot k8s-ci-robot added cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Mar 5, 2026
@Edwinhr716
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Mar 5, 2026
@Edwinhr716
Contributor

Can you run this script https://github.com/kubernetes-sigs/lws/blob/main/hack/update-toc.sh to fix the formatting in the KEP?

This KEP proposes adding DisaggDeployment as a higher-level API for
managing disaggregated inference workloads. DisaggDeployment orchestrates
multiple LeaderWorkerSets (prefill and decode) with coordinated lifecycle
management, including two-dimensional rolling updates and service
orchestration.

Signed-off-by: Mathis Felardos <mathis@mistral.ai>
@hasB4K hasB4K force-pushed the kep-disagg-deployment branch from 98fa70f to a6d7172 Compare March 5, 2026 18:00
@kannon92
Contributor

kannon92 commented Mar 5, 2026

And please sign the CLA.

Update KEP metadata with actual reviewers and approvers, and remove
the feature gate from Alpha requirements since DisaggDeployment
doesn't require a feature gate.

Signed-off-by: Mathis Felardos <mathis@mistral.ai>
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. and removed cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Mar 5, 2026
Signed-off-by: Mathis Felardos <mathis@mistral.ai>
@hasB4K hasB4K requested a review from Edwinhr716 March 5, 2026 19:24

### Non-Goals

1. **Auto-scaling support**: HPA/VPA integration or automatic scaling based on inference load metrics is out of scope.
Contributor

You can still autoscale at the prefill / decode level right?

Author
@hasB4K Mar 5, 2026

Yes, I think it should work. I haven't tested it in my implementation, but I don't see any reason why it couldn't. I put it as a Non-Goal since it's not the focus of the operator for now, and I actually think that scaling Prefill and Decode independently makes more sense than scaling the entire DisaggDeployment directly.

Contributor

I actually think that scaling Prefill and Decode independently

Yeah, that's the one I was referring to: still being able to scale independently while treating it as a unit.


// LeaderWorkerTemplate defines the template for leader/worker pods.
// +required
LeaderWorkerTemplate LeaderWorkerTemplate `json:"leaderWorkerTemplate"`
Contributor

Is there a way to pass annotations that will be injected to the LWS object? The exclusive-placement feature relies on that to work https://lws.sigs.k8s.io/docs/concepts/#exclusive-lws-to-topology-placement

Contributor

An idea would be to add a field here called ExclusivePlacement and if set, have the controller inject the annotation. Though would need some sort of config to set subgroup-exclusive-placement as well.

Author

If it doesn't work out of the box in my implementation, we could add it. I'll see tomorrow morning if we can add annotations without changing anything.

Contributor

Yeah, it's a future-work type of thing, just thinking out loud. We can help with the implementation as well.
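For illustration, here is a minimal sketch of the idea floated above: an ExclusivePlacement-style field whose value the controller copies into the generated LWS annotations. The field and helper names are hypothetical; the annotation key is the one documented for LWS exclusive topology placement.

```go
package main

import "fmt"

// Annotation key documented for LWS exclusive topology placement.
const exclusiveTopologyAnnotation = "leaderworkerset.sigs.k8s.io/exclusive-topology"

// buildLWSAnnotations merges user-supplied annotations with the
// exclusive-placement annotation derived from a hypothetical
// ExclusivePlacement field on the per-side spec.
func buildLWSAnnotations(userAnnotations map[string]string, exclusiveTopologyKey string) map[string]string {
	out := make(map[string]string, len(userAnnotations)+1)
	for k, v := range userAnnotations {
		out[k] = v
	}
	if exclusiveTopologyKey != "" {
		out[exclusiveTopologyAnnotation] = exclusiveTopologyKey
	}
	return out
}

func main() {
	anns := buildLWSAnnotations(
		map[string]string{"team": "inference"},
		"cloud.google.com/gke-nodepool",
	)
	fmt.Println(anns[exclusiveTopologyAnnotation])
}
```

A subgroup-exclusive variant would presumably need a second, similar field, as the discussion above notes.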

**Example 2: Asymmetric Replicas** (9 prefill, 3 decode → 9 prefill, 3 decode, maxSurge=1):

When prefill and decode have different replica counts, the algorithm computes steps based on the larger dimension (9 in this case). The smaller side (decode) progresses proportionally slower:
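The proportional stepping described above can be sketched as follows (a minimal illustration, not the KEP's actual algorithm; the function name is made up): the larger side defines the number of steps, and the smaller side's updated count is floored proportionally, so it lags until the larger side catches up.

```go
package main

import "fmt"

// updatedAtStep returns how many replicas of one side should be on the new
// revision after `step` out of `totalSteps`, where totalSteps follows the
// larger dimension. Integer division floors the smaller side so it
// progresses proportionally slower.
func updatedAtStep(replicas, step, totalSteps int) int {
	if totalSteps == 0 {
		return replicas
	}
	return replicas * step / totalSteps
}

func main() {
	// 9 prefill, 3 decode: 9 steps total, decode advances every third step.
	for step := 0; step <= 9; step++ {
		fmt.Printf("step %d: prefill=%d decode=%d\n",
			step, updatedAtStep(9, step, 9), updatedAtStep(3, step, 9))
	}
}
```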

Contributor

If I'm understanding correctly, this means that during updates we are keeping track of four different LWS instances correct?

Can you expand more on why you opted for this implementation instead of for example using the partition field in LWS similar to how we use the partition field in Sts at the LWS level? Is the only gap #710?

You can add it to the alternatives considered section, and we can discuss later if there are any improvements we can make at the LWS level to simplify this algorithm, or if it is worth it at all.

Author

Okay, I was not aware of the partition field. I just read about it, and maybe I'm wrong here, but I don't think it would work:

  • Revision-aware traffic routing: DisaggDeployment is designed for disaggregated inference, where a load balancer (e.g., a modified LLM-d Endpoint Picker) must route prefill requests to backends whose decode counterparts are on the same revision. With separate LWS (and service) per revision, each pod's revision is explicit via labels (disaggdeployment.x-k8s.io/revision). The LB can count backends per revision across both pools and distribute traffic proportionally. With partition-based updates, pods within the same LWS have different templates based on ordinal, making revision-aware routing significantly more complex. The goal is to avoid having a prefill talk to an incompatible decode. Both sides must be treated as totally incompatible. This special routing requires work on the LLM-d side.

  • LWS as a Read-Only resource: Treating LWS as a Read-Only resource (like a ReplicaSet) makes more sense, as you want to update Prefill and Decode at different paces depending on the step you are at. It's a tied update. I don't think you can achieve this with partition.

  • Ops observability: It's simpler for ops observability. You can see directly which stage your update is at, since you can see the version right away during updates. (see screenshot)

[screenshot omitted]

Author

I added a section about this

Contributor
@Edwinhr716 Mar 6, 2026

Thanks for adding the section in the KEP. A few follow up questions that I have:

With partition-based updates, pods within the same LWS have different templates based on ordinal, making revision-aware routing significantly more complex

I'm assuming that the labels are injected at the LWS level, but can't we achieve the same logic if the disagg revision label is injected into the leader pod instead? Then those above the partition will have the newer revision label, and those below will have the old label, so the router can still map between the old version and the new version. This might depend on the router's implementation, but I don't think it makes a difference at what level the label is injected.

as you want to update Prefill and Decode at different paces depending on the step you are at. It's a tied update. I don't think you can achieve this with partition.

Why not? I don't see the difference between scaling prefill & decode vs changing the partition value independently.

Ops observability

This is I think would be the main benefit. But I wonder how much it matters to have this be easily readable. Neither StatefulSet nor LWS have this.

Just to make it clear, this is not a blocker for merging the KEP or the code implementation. I just think it is worth discussing whether there's a simpler path here for the future. The algorithm needed for rolling updates is already complex, and it would be easier to understand if we stuck to existing Kubernetes concepts like partition instead of maintaining four LWS instances.
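To make the routing argument above concrete, here is a purely illustrative sketch (the endpoint-picker logic is not part of this KEP, and the helper name is made up): with one LWS per revision, a router can bucket backends by the `disaggdeployment.x-k8s.io/revision` label mentioned earlier and keep prefill requests paired with decode backends on the same revision.

```go
package main

import "fmt"

const revisionLabel = "disaggdeployment.x-k8s.io/revision"

// endpointsByRevision groups backend indices by their revision label so a
// router can pair prefill and decode backends from the same revision and
// distribute traffic proportionally per revision.
func endpointsByRevision(podLabels []map[string]string) map[string][]int {
	out := map[string][]int{}
	for i, labels := range podLabels {
		rev := labels[revisionLabel]
		out[rev] = append(out[rev], i)
	}
	return out
}

func main() {
	pods := []map[string]string{
		{revisionLabel: "rev-1"},
		{revisionLabel: "rev-2"},
		{revisionLabel: "rev-1"},
	}
	fmt.Println(endpointsByRevision(pods)["rev-1"])
}
```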


4. **Configuration drift**: Without a unified resource, prefill and decode configurations can drift apart, leading to subtle incompatibilities.

### Goals
Contributor

Has there been any thoughts on how this would work with Kueue?

What I am wondering is if this DisaggDeployment is quota limited do you want to avoid creating the LWS's?

Author

In my understanding, Kueue works via annotations, right? I would say that it works out of the box, and if adding an annotation doesn't work directly, we could add support:

see:
#767 (comment)

Contributor

Well, Kueue works with annotations for LWS because it has an internal controller for it: https://github.com/kubernetes-sigs/kueue/tree/main/pkg/controller/jobs/leaderworkerset

But I guess I'm mostly curious about things like Kueue's TopologyAwareScheduling (i.e. #665) with disaggregated serving.

Author

Well, I'm not sure 🙂 - I guess we will need to experiment

Contributor

Yeah, np. It's not blocking.

@hasB4K hasB4K force-pushed the kep-disagg-deployment branch 2 times, most recently from 4764822 to 5eb8a46 Compare March 6, 2026 18:21
Address reviewer feedback explaining why we use multiple LWS per
revision instead of the LWS partition field:

- Revision-aware traffic routing for LLM-d Endpoint Picker
- LWS as read-only resource (like Deployment/ReplicaSet pattern)
- Simpler ops observability during rolling updates
@hasB4K hasB4K force-pushed the kep-disagg-deployment branch from 5eb8a46 to fcc3b9c Compare March 6, 2026 18:27
// +kubebuilder:default=1
Replicas *int32 `json:"replicas,omitempty"`

// LeaderWorkerTemplate defines the template for leader/worker pods.
Contributor

Have you tested if setting size = 1 for both the prefill and the decode server works?

Author

You mean replicas == 1 for both sides? Yes, I did, and it works.

DisaggDeployment simplifies the deployment of disaggregated LLM inference by:
- Managing two LeaderWorkerSets (prefill and decode) as a single logical unit
- Providing coordinated two-dimensional rolling updates across both sides
- Automatically creating Services when both sides are ready
Contributor

We have generally avoided bundling/instantiating easily composable resources. For example, in upstream, you don't see the workload apis (like Deployment, StatefulSet etc.) automatically create a Service.

Author

The service is optional, and I thought it was okay to be able to create one since LWS already does this.

But anyway, we can discuss this. It was easier for handling our scheduler on our side, but I understand your concerns.

Contributor

LWS only creates a headless service to trigger programming DNS entries for the individual workers. This is useful to standardize discovery among the leader-workers of a group (e.g., typically workers need to know the network address of their leader; for other frameworks it could be the opposite, where the leader needs to know the network addresses of its workers).


4. **Traffic management / routing**: Integration with service meshes or ingress controllers for traffic splitting during rollouts is out of scope.

5. **Generic group support**: Supporting arbitrary groups beyond prefill/decode (e.g., `decode-long-context`, `encode`) is deferred to a future enhancement.
Contributor

I think we should at least make the API future-proof to support that.

Author

I think it's already future-proof since:

  apiVersion: disaggregatedset.x-k8s.io/v1alpha1
  kind: DisaggregatedSet
  metadata:
    name: my-llm
  spec:
    prefill:
      replicas: 3                      # Number of LWS groups
      rolloutStrategy:
        maxSurge: 1                    # Extra replicas during update
        maxUnavailable: 0              # Don't remove until new is ready
      serviceTemplate:                 # Optional: create Service when ready
        spec:
          ports:
          - port: 8080
      leaderWorkerTemplate:
        size: 2                        # 1 leader + 1 worker per group
        workerTemplate:
          spec:
            containers:
            - name: prefill
              image: vllm:latest
              resources:
                limits:
                  nvidia.com/gpu: 1

    decode:
      replicas: 6                      # Often more decode than prefill
      rolloutStrategy:
        maxSurge: 2
        maxUnavailable: 1
      serviceTemplate:
        spec:
          ports:
          - port: 8080
      leaderWorkerTemplate:
        size: 1                        # Single pod per group
        workerTemplate:
          spec:
            containers:
            - name: decode
              image: vllm:latest
              resources:
                limits:
                  nvidia.com/gpu: 1

    decode-long-context:  # <== adding a special group should work out of the box in the config.
        [ ... ]

What I mean is that prefill and decode are currently hardcoded, but they could simply be treated as standalone "sides" with a name (either prefill or decode).

So unless I'm missing something (which is totally possible), I think it's already future-proof?

Contributor

This is still hardcoding prefill and decode as part of the API; we could instead have a list with named entries to make the API more generic. But we can follow up on that possibility when we merge it to run with the lws controller.

Author

I guess you are suggesting this solution:

  type DisaggregatedSetSpec struct {
      Sides []DisaggSideConfig `json:"sides"`
  }

  type DisaggSideConfig struct {
      Name string `json:"name"` // "prefill", "decode", etc.
      // ... rest of fields
  }

YAML:

spec:
  sides:
    - name: prefill
      replicas: 10
      ...
    - name: decode
      replicas: 2
      ...
    - name: custom_stage
      replicas: 5

I should say that this is an easy change, and I could do it (while only supporting prefill and decode for now).


What I had in mind, however, was this:

  Option 2: Pure Map (Breaking Change)

  type DisaggregatedSetSpec struct {
      Sides map[string]*DisaggSideConfig `json:"sides"`
  }

YAML:

  spec:
    sides:
      prefill: <lws1>
      decode: <lws2>
      custom_stage: <lws3>

This would work with older YAML configs too.

Contributor

we could instead have a list with named entries to make the api more generic

If we go this route, we should make it clear somewhere that we only support synced rolling updates between the first two items in the list, not all of them.


#### Story 3: Coordinated Service Discovery

As an application developer, I want Services to be automatically created for my prefill and decode endpoints, but only when both sides are ready to serve traffic, so that my application doesn't route requests to partially-ready infrastructure.
Contributor

I do not think relying on the existence of the Service resource is the proper approach to address this problem. The P/D coordinator should handle this case, not the workload api.

Author

The service is optional, and I thought it was okay to be able to create one since LWS already does this.

But anyway, we can discuss this. It was easier for handling our scheduler on our side, but I understand your concerns.

old + new <= target + maxSurge

5. **Scale-Up Priority**: New replicas are scaled up before old replicas are scaled down, ensuring capacity is never below the minimum.
Contributor

Are we not concerned about limited hardware capacity? Does this mean the cluster must have hardware capacity for at least one extra replica to conduct an update (vs. doing it in-place)? Can we set maxSurge = 0 and maxUnavailable = 1 to trigger an in-place rollout?

Author

Yes, this works, and it's part of the test suite. We simply favor maxSurge over maxUnavailable. We could change this behavior if you want.

Contributor

No, I think preferring maxSurge seems like the better semantics.
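The surge-versus-teardown semantics discussed above can be sketched per side (hypothetical helper names; this is an illustration of the quoted invariant, not the KEP's controller code): surge is attempted first, and an old replica may only be torn down while ready capacity stays at or above target minus maxUnavailable.

```go
package main

import "fmt"

// canSurge reports whether another new-revision replica may be created
// without exceeding the invariant old + new <= target + maxSurge.
func canSurge(oldReplicas, newReplicas, target, maxSurge int) bool {
	return oldReplicas+newReplicas < target+maxSurge
}

// canScaleDown reports whether one old replica may be removed while keeping
// ready capacity at or above target - maxUnavailable.
func canScaleDown(readyNew, readyOld, target, maxUnavailable int) bool {
	return readyNew+readyOld-1 >= target-maxUnavailable
}

func main() {
	// maxSurge=1: one extra replica is allowed during the update.
	fmt.Println(canSurge(3, 0, 3, 1)) // true
	fmt.Println(canSurge(3, 1, 3, 1)) // false: at the surge ceiling

	// maxSurge=0, maxUnavailable=1: in-place style rollout, tear down first.
	fmt.Println(canSurge(3, 0, 3, 0))     // false: no surge room
	fmt.Println(canScaleDown(0, 3, 3, 1)) // true: may drop to 2 ready
}
```

With maxSurge = 0 and maxUnavailable = 1, canSurge never fires, so the controller would fall through to teardown, matching the in-place rollout behavior described above.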

| 5 | 1 | 1 | 3 | 3 |
| 6 | 0 | 0 | 3 | 3 |

**Example 2: Asymmetric Replicas** (9 prefill, 3 decode → 9 prefill, 3 decode, maxSurge=1):
Contributor

Can we spell out the maxSurge semantics? Is it for both prefill and decode together, or for each?

Author

It's for each side.


hasB4K commented Mar 9, 2026

The implementation has been pushed here 🙂:
#773

I wanted this PR to be on top of that branch, but since my branch is on my fork, I couldn't do it 😬 (it created a PR on my forked repo instead 🫤)

@hasB4K hasB4K changed the title KEP-766: Add DisaggDeployment proposal KEP-766: Add DisaggregatedSet proposal Mar 9, 2026

Development

Successfully merging this pull request may close these issues.

Add DisaggDeployment CRD for coordinated prefill/decode workload management

6 participants