KEP-5598: Opportunistic scheduling cache #5599
base: master
Conversation
bwsalmon commented on Oct 1, 2025
- One-line PR description: First version of the KEP for scheduler cache
- Issue link: Opportunistic scheduling cache #5598
- Other comments:
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: bwsalmon. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Welcome @bwsalmon!
Hi @bwsalmon. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
* We are not adding gang scheduling of any kind in this KEP. This is purely a performance
improvement, although we hope the work on this KEP will help us with gang scheduling as we build it.

## Proposal
Historically we had something called the "Equivalence Class Cache" (ecache) that seems to be almost exactly what you're doing now.
We need a bit more archeology to find it, but I think this is the PR that removed at least some stage of it in 2018:
kubernetes/kubernetes#71399
I can't remember exactly what the primary reasons behind that were, but I would like to see a dedicated section describing:
- what problems we had back then (that made us decide to remove it)
- how this proposal is different and why we won't hit the same ones
Excellent thought. I'll look up ecache and see what I can find.
Here's another goody from the meeting where it seems to have been discussed early on:
"Gang Scheduling: API design is almost finalized. We will arrange with the rest of API approvers and hopefully we can have an alpha version of the API in 1.12."
Added a section on the eCache. PTAL.
node off the list. We then go down the nominated node path. Just as we would with a pod
with a NominatedNodeName in it, we only re-evaluate the feasibility of this node before using it.

Since we assume 1-pod-per-node, we know that the node used by the current pod
How are we going to verify that it's indeed a 1-pod-per-node configuration?
I think that just relying on users is not enough here...
So the current thought is this (I'll make this more explicit here).
The user has to opt in via the scheduling profile; we should mention this clearly there.
If they get it wrong, it will not necessarily lead to incorrect behavior, just a somewhat shifted scoring. So I think the cost is comparatively low.
I'm open to more tracking if we think that is appropriate. We can tell if a pod class is 1-pod-per-node, so we can flag that case if we'd like, and we should also be able to tell if multiple non-daemon-set pods land on the same node after the fact.
Thoughts on how much you think is necessary?
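For concreteness, here is a rough sketch of the after-the-fact check discussed above (a hypothetical helper, not something in the KEP): flag any node that ends up hosting more than one non-DaemonSet pod.

```go
package schedcache

import (
	corev1 "k8s.io/api/core/v1"
)

// flagMultiPodNodes returns the names of nodes that host more than one
// non-DaemonSet pod, i.e. nodes where the 1-pod-per-node assumption does not
// actually hold. Hypothetical helper, for illustration only.
func flagMultiPodNodes(pods []corev1.Pod) []string {
	perNode := map[string]int{}
	for i := range pods {
		p := &pods[i]
		if p.Spec.NodeName == "" || ownedByDaemonSet(p) {
			continue // unscheduled pods and daemon set pods are expected
		}
		perNode[p.Spec.NodeName]++
	}
	var flagged []string
	for node, count := range perNode {
		if count > 1 {
			flagged = append(flagged, node)
		}
	}
	return flagged
}

func ownedByDaemonSet(p *corev1.Pod) bool {
	for _, ref := range p.OwnerReferences {
		if ref.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}
```

Whether a flagged node should only produce a log line or metric, or should disable the cache for that profile, is exactly the open question above.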
@bwsalmon - please sign the CLA (and I'm happy to take the production readiness approval for it)
Cool. I tried to but it forwarded me to some Google reps to sign it... I sent the request but haven't heard back. Maybe I did it wrong?
If plugin changes prove to be an issue, we could codify the signature as a new "Scheduling" object that only has a subset
of the fields of the pod. Plugins that "opt-in" could only be given access to this reduced scheduling object, and we could then use the entire scheduling object as the signature. This would make it more or less impossible for the signature and plugins to be out of sync, and would
naturally surface new dependencies as additions to the scheduling object. However, as we expect plugin changes to be relatively
modest, we don't believe the complexity of making the interface changes is worth the risk today.
Duplicate paragraph R323-R326
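As an illustration of the alternative described in that paragraph, a minimal sketch of what such a reduced scheduling object could look like (type and field names here are hypothetical, not from the KEP):

```go
package schedcache

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
)

// SchedulingView is a hypothetical reduced view of a Pod holding only the
// fields plugins are allowed to read. If plugins were handed this struct
// instead of the full Pod, hashing it would by construction cover everything
// the plugins can depend on, so the signature could not silently drift.
type SchedulingView struct {
	SchedulerName string
	Tolerations   []corev1.Toleration
	NodeSelector  map[string]string
	NodeAffinity  *corev1.NodeAffinity
	Resources     corev1.ResourceList // aggregated requests
	Volumes       []corev1.Volume     // excluding ConfigMap/Secret volumes
}

// Signature hashes a canonical encoding of the view. It only needs to be
// comparable within one running scheduler instance, not across versions.
func (v *SchedulingView) Signature() (string, error) {
	b, err := json.Marshal(v) // map keys are sorted by encoding/json, deterministic enough here
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}
```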
Note that the signature does not need to be stable across versions, or even invocations of the scheduler.
It only needs to be comparable between pods on a given running scheduler instance.

* DynamicResources: For now we mark a pod unsignable if it has dynamic resource claims. We should improve this in the future, since most |
Why will this plugin mark pods as unsignable?
* VolumeBinding: Same as NodeVolumeLimits.
* VolumeRestrictions: Same as NodeVolumeLimits.
* VolumeZone: Same as NodeVolumeLimits.
Do we need to take defaultPreemption into consideration?
extending the production code to implement this enhancement.
-->

- `<package>`: `<date>` - `<test coverage>`
Please add the test files that this KEP will add/update, along with the current coverage for the existing ones.
- a search in the Kubernetes bug triage tool (https://storage.googleapis.com/k8s-triage/index.html)
-->

- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)
Same, I expect we will add new integration tests as well.
[existing list]: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
-->

- [ ] Feature gate (also fill in values in `kep.yaml`)
Please add the feature gate and update it in the kep.yaml
as well.
/cc @sanposhiho @macsko
/label lead-opted-in
# The target maturity stage in the current dev cycle for this KEP.
# If the purpose of this KEP is to deprecate a user-visible feature
# and a Deprecated feature gates are added, they should be deprecated|disabled|removed.
stage: alpha|beta|stable
Suggested change:
- stage: alpha|beta|stable
+ stage: alpha
To simplify the first version, we only attempt this optimization for pods that have a single pod
of this type on a node, and we exclude pods that use complex constraints like
PodTopologySpread and PodAffinity / AntiAffinity. We also have tight
eviction rules on the cache (O(seconds)) to ensure it doesn't get stale. To tell if
Suggested change:
- eviction rules on the cache (O(seconds)) to ensure it doesn't get stale. To tell if
+ invalidation rules on the cache (O(seconds)) to ensure it doesn't get stale. To tell if
Another change is the shift towards 1-pod-per-node in batch and ML environments. Many
of these environments (among others) only attempt to run a single customer pod on each node, along
with a complement of daemon set pods. This simplifies our scheduling needs significantly.
Suggested change:
- with a complement of daemon set pods. This simplifies our scheduling needs significantly.
+ with a complement of daemon set pods. This simplifies our scheduling needs significantly, as it allows reusing not only filtering but also scoring results.
cannot be used by subsequent pods (of any signature).
Thus we remove the host from all signatures in the cache. The cache is built
in a way that makes it easy to remove entries by either pod signature or host
so we can efficiently invalide entries. If we are not in a 1-pod-per-node
Suggested change:
- so we can efficiently invalide entries. If we are not in a 1-pod-per-node
+ so we can efficiently invalidate entries. If we are not in a 1-pod-per-node
in a way that makes it easy to remove entries by either pod signature or host
so we can efficiently invalide entries. If we are not in a 1-pod-per-node
we could get "sub-optimal" results if the node just used is the best node for
some other pod, but this should be the only issue.
Suggested change:
- some other pod, but this should be the only issue.
+ any of the following pods, but this should be the only issue.
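For illustration, a rough sketch of a cache shaped the way this section describes (all names hypothetical): entries are keyed by pod signature, a reverse index lets a just-used node be removed from every entry, and a short TTL provides the aggressive invalidation mentioned elsewhere in the KEP.

```go
package schedcache

import (
	"slices"
	"sync"
	"time"
)

// sigCache sketches a cache of ranked feasible nodes per pod signature, with a
// reverse index so a node can be dropped from every entry once it is used
// (1-pod-per-node: a just-used node is unavailable to all signatures).
type sigCache struct {
	mu        sync.Mutex
	ttl       time.Duration
	bySig     map[string]*entry          // signature -> ranked nodes (best first)
	nodeToSig map[string]map[string]bool // node -> signatures whose entry lists it
}

type entry struct {
	nodes   []string
	created time.Time
}

// Pop returns the best cached node for a signature and removes that node from
// every cached entry. Entries older than ttl are treated as invalid.
func (c *sigCache) Pop(sig string, now time.Time) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.bySig[sig]
	if !ok || now.Sub(e.created) > c.ttl || len(e.nodes) == 0 {
		delete(c.bySig, sig)
		return "", false
	}
	node := e.nodes[0]
	c.removeNodeLocked(node)
	return node, true
}

// removeNodeLocked uses the reverse index to drop one node from every
// signature entry that lists it.
func (c *sigCache) removeNodeLocked(node string) {
	for sig := range c.nodeToSig[node] {
		if e, ok := c.bySig[sig]; ok {
			e.nodes = slices.DeleteFunc(e.nodes, func(n string) bool { return n == node })
		}
	}
	delete(c.nodeToSig, node)
}

// InvalidateSignature drops everything cached for one signature, e.g. when a
// cluster event could change its results.
func (c *sigCache) InvalidateSignature(sig string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.bySig, sig)
}
```

An Add method would populate both maps symmetrically, and a miss (or a failed re-check of the popped node) would fall back to the normal full scheduling cycle.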
The pod scheduling signature is used to determine if two pods are "the same"
from a scheduling perspective. In specific, what this means is that any pod
with the given signature will get the same scores / feasibility results from
@sanposhiho @macsko This is an important element of the proposal that we should review carefully. The assumption here is that assigning a pod does not change the feasibility or scoring results for "the same" following pods.
By caching the results of the first pod, we are in fact capturing the cluster state and ignoring any cluster changes that happen in the meantime (there is no guaranteed ordering of the following vs other pods). It should be equivalent to scheduling those pods in a single scheduling cycle.
Note that each pod's cached assignment will be reassessed in the scheduling cycle, so the only potential consequence is that new feasible nodes that appeared in the meantime may be missed or scores may be different. The aggressive cache invalidation timeout is supposed to mitigate this.
more complex constraints. While we expect the cache to always be on, many pods will not be
able to take advantage of it.
* We are not adding gang scheduling of any kind in this KEP. This is purely a performance
improvement, although we hope the work on this KEP will help us with gang scheduling as we build it.
Suggested change:
- improvement, although we hope the work on this KEP will help us with gang scheduling as we build it.
+ improvement without adding dependency on the Workload API [KEP-4671](https://github.com/kubernetes/enhancements/pull/5558), although we hope the work on this KEP will help us with gang scheduling as we build it.
* Pod affinity rules (affinity or anti-affinity)
* Topology spread rules (including inherited rules from the system default). We should attempt to lift this constraint in the future.

To construct a signature, we add a new function for each plugin to implement.
Suggested change:
- To construct a signature, we add a new function for each plugin to implement.
+ To allow non in-tree plugins to construct a signature, we add a new framework function to implement.
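To make this concrete, here is a sketch of what such a hook might look like (interface and function names are hypothetical, not from the KEP): each plugin contributes an opaque fragment or declares the pod unsignable, and the framework hashes the fragments together.

```go
package schedcache

import (
	"crypto/sha256"
	"encoding/hex"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// SignaturePlugin is a hypothetical optional interface a plugin could
// implement to contribute a fragment to the pod scheduling signature.
type SignaturePlugin interface {
	// PodSignature returns the plugin's contribution, or ok=false if the pod
	// cannot be signed (e.g. its results depend on other pods on the node).
	PodSignature(pod *v1.Pod) (fragment []byte, ok bool)
}

// computeSignature folds the per-plugin fragments into one hash. A single
// unsignable plugin makes the whole pod unsignable; plugins that do not
// implement the interface contribute nothing.
func computeSignature(pod *v1.Pod, plugins []framework.Plugin) (string, bool) {
	h := sha256.New()
	for _, p := range plugins {
		sp, ok := p.(SignaturePlugin)
		if !ok {
			continue
		}
		frag, ok := sp.PodSignature(pod)
		if !ok {
			return "", false
		}
		h.Write([]byte(p.Name())) // namespace fragments per plugin to avoid collisions
		h.Write(frag)
	}
	return hex.EncodeToString(h.Sum(nil)), true
}
```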
resource claims will allow for a signature.
* ImageLocality: We use the canonicalized image names from the Volumes as the signature.
* InterPodAffinity: If either the PodAffinity or PodAntiAffinity fields are set, the pod is marked unsignable, otherwise an empty signature.
* NodeAffinity: We use the NodeAffinity and NodeSelector fields as the signature.
NodeAffinity might be set also in scheduler profile (https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity-per-scheduling-profile), so either the profile should be captured in the signature or NodeAffinity from the scheduler configuration.
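Following on from that point, a sketch (again hypothetical, in the shape of the interface above) of a NodeAffinity fragment that folds in the profile's AddedAffinity from NodeAffinityArgs, so pods evaluated under different profiles do not share a signature:

```go
package schedcache

import (
	"encoding/json"

	v1 "k8s.io/api/core/v1"
)

// nodeAffinitySignature sketches a NodeAffinity fragment covering both the
// pod's own fields and the profile-level AddedAffinity from the scheduler
// configuration (NodeAffinityArgs).
func nodeAffinitySignature(pod *v1.Pod, addedAffinity *v1.NodeAffinity) ([]byte, bool) {
	payload := struct {
		NodeSelector  map[string]string
		Affinity      *v1.Affinity
		AddedAffinity *v1.NodeAffinity // from the scheduler profile, not the pod
	}{
		NodeSelector:  pod.Spec.NodeSelector,
		Affinity:      pod.Spec.Affinity,
		AddedAffinity: addedAffinity,
	}
	b, err := json.Marshal(payload)
	if err != nil {
		return nil, false // treat unencodable specs as unsignable
	}
	return b, true
}
```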
* NodeResourcesFit: We use the output of the computePodResourceRequest function as the signature.
* NodeUnschedulable: We use the Tolerations field as the signature.
* NodeVolumeLimits: We use all Volume information except from Volumes of type ConfigMap or Secret.
* PodTopologySpread: If the PodTopologySpread field is set, or it is not set but a default set of rules are applied, we mark the pod unsignable, otherwise it returns an empty signature.
or it is not set but a default set of rules are applied
How shall we tell whether the default rules apply to a pod?