KEP-5598: Opportunistic scheduling cache #5599
base: master
Conversation
bwsalmon commented on Oct 1, 2025
- One-line PR description: First version of the KEP for scheduler cache
- Issue link: Opportunistic scheduling cache #5598
- Other comments:
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: bwsalmon. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
Welcome @bwsalmon!
Hi @bwsalmon. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test`. Once the patch is verified, the new status will be reflected by the `ok-to-test` label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
* We are not adding gang scheduling of any kind in this KEP. This is purely a performance
improvement, although we hope the work on this KEP will help us with gang scheduling as we build it.

## Proposal
Historically we had something called the "Equivalence Class Cache" (ecache) that seems to be almost exactly what you're doing now.
We need a bit more archeology to find it, but I think this is the PR that removed at least some stage of it in 2018:
kubernetes/kubernetes#71399
I can't remember exactly what the primary reasons behind that were, but I would like to see a dedicated section describing:
- what problems we had back then (that made us decide to remove it)
- how this proposal is different and why we won't hit the same ones
Excellent thought. I'll look up ecache and see what I can find.
Here's another goody from the meeting where it seems to have been discussed early on:
"Gang Scheduling: API design is almost finalized. We will arrange with the rest of API approvers and hopefully we can have an alpha version of the API in 1.12."
Added a section on the eCache. PTAL.
node off the list. We then go down the nominated node path. Just as we would with a pod
with a NominatedNodeName in it, we only re-evaluate the feasibility of this node before using it.

Since we assume 1-pod-per-node, we know that the node used by the current pod
How are we going to verify that it's indeed a 1-pod-per-node configuration?
I think that just relying on users is not enough here...
So the current thought is this (I'll make this more explicit here).
The user has to opt in via the scheduling profile; we should mention this clearly there.
If they get it wrong, it will not necessarily lead to incorrect behavior, just a somewhat shifted scoring. So I think the cost is comparatively low.
I'm open to more tracking if we think that is appropriate. We can tell if a pod class is 1-pod-per-node, so we can flag that case if we'd like, and we should also be able to tell if multiple non-daemon-set pods land on the same node after the fact.
Thoughts on how much you think is necessary?
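For concreteness, here is a rough sketch of the after-the-fact check discussed above (a hypothetical helper, not something in the KEP): flag any node that ends up hosting more than one non-DaemonSet pod.

```go
package schedcache

import (
	corev1 "k8s.io/api/core/v1"
)

// flagMultiPodNodes returns the names of nodes that host more than one
// non-DaemonSet pod, i.e. nodes where the 1-pod-per-node assumption does not
// actually hold. Hypothetical helper, for illustration only.
func flagMultiPodNodes(pods []corev1.Pod) []string {
	perNode := map[string]int{}
	for i := range pods {
		p := &pods[i]
		if p.Spec.NodeName == "" || ownedByDaemonSet(p) {
			continue // unscheduled pods and daemon set pods are expected
		}
		perNode[p.Spec.NodeName]++
	}
	var flagged []string
	for node, count := range perNode {
		if count > 1 {
			flagged = append(flagged, node)
		}
	}
	return flagged
}

func ownedByDaemonSet(p *corev1.Pod) bool {
	for _, ref := range p.OwnerReferences {
		if ref.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}
```

Whether a flagged node should only produce a log line or metric, or should disable the cache for that profile, is exactly the open question above.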
@bwsalmon - please sign the CLA (and I'm happy to take the production readiness approval for it)
Cool. I tried to but it forwarded me to some Google reps to sign it... I sent the request but haven't heard back. Maybe I did it wrong?
If plugin changes prove to be an issue, we could codify the signature as a new "Scheduling" object that only has a subset
of the fields of the pod. Plugins that "opt-in" could only be given access to this reduced scheduling object, and we could then use the entire scheduling object as the signature. This would make it more or less impossible for the signature and plugins to be out of sync, and would
naturally surface new dependencies as additions to the scheduling object. However, as we expect plugin changes to be relatively
modest, we don't believe the complexity of making the interface changes is worth the risk today.
Duplicate paragraph R323-R326
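As an illustration of the alternative described in that paragraph, a minimal sketch of what such a reduced scheduling object could look like (type and field names here are hypothetical, not from the KEP):

```go
package schedcache

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
)

// SchedulingView is a hypothetical reduced view of a Pod holding only the
// fields plugins are allowed to read. If plugins were handed this struct
// instead of the full Pod, hashing it would by construction cover everything
// the plugins can depend on, so the signature could not silently drift.
type SchedulingView struct {
	SchedulerName string
	Tolerations   []corev1.Toleration
	NodeSelector  map[string]string
	NodeAffinity  *corev1.NodeAffinity
	Resources     corev1.ResourceList // aggregated requests
	Volumes       []corev1.Volume     // excluding ConfigMap/Secret volumes
}

// Signature hashes a canonical encoding of the view. It only needs to be
// comparable within one running scheduler instance, not across versions.
func (v *SchedulingView) Signature() (string, error) {
	b, err := json.Marshal(v) // map keys are sorted by encoding/json, deterministic enough here
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}
```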
Note that the signature does not need to be stable across versions, or even invocations of the scheduler.
It only needs to be comparable between pods on a given running scheduler instance.

* DynamicResources: For now we mark a pod unsignable if it has dynamic resource claims. We should improve this in the future, since most |
Why will this plugin mark pods as unsignable?
* VolumeBinding: Same as NodeVolumeLimits.
* VolumeRestrictions: Same as NodeVolumeLimits.
* VolumeZone: Same as NodeVolumeLimits.
Do we need to take defaultPreemption into consideration?
extending the production code to implement this enhancement.
-->

- `<package>`: `<date>` - `<test coverage>`
Please add the test files that this KEP will add/update, along with the current coverage for the existing ones.
- a search in the Kubernetes bug triage tool (https://storage.googleapis.com/k8s-triage/index.html)
-->

- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)
Same, I expect we will add new integration tests as well.
[existing list]: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
-->

- [ ] Feature gate (also fill in values in `kep.yaml`)
Please add the feature gate and update it in the kep.yaml
as well.
/cc @sanposhiho @macsko
/label lead-opted-in
# The target maturity stage in the current dev cycle for this KEP.
# If the purpose of this KEP is to deprecate a user-visible feature
# and a Deprecated feature gates are added, they should be deprecated|disabled|removed.
stage: alpha|beta|stable
Suggested change:
- stage: alpha|beta|stable
+ stage: alpha
To simplify the first version, we only attempt this optimization for pods that have a single pod
of this type on a node, and we exclude pods that use complex constraints like
PodTopologySpread and PodAffinity / AntiAffinity. We also have tight
eviction rules on the cache (O(seconds)) to ensure it doesn't get stale. To tell if
Suggested change:
- eviction rules on the cache (O(seconds)) to ensure it doesn't get stale. To tell if
+ invalidation rules on the cache (O(seconds)) to ensure it doesn't get stale. To tell if
Another change is the shift towards 1-pod-per-node in batch and ML environments. Many
of these environments (among others) only attempt to run a single customer pod on each node, along
with a complement of daemon set pods. This simplifies our scheduling needs significantly.
Suggested change:
- with a complement of daemon set pods. This simplifies our scheduling needs significantly.
+ with a complement of daemon set pods. This simplifies our scheduling needs significantly, as it allows reusing not only filtering but also scoring results.
cannot be used by subsequent pods (of any signature).
Thus we remove the host from all signatures in the cache. The cache is built
in a way that makes it easy to remove entries by either pod signature or host
so we can efficiently invalide entries. If we are not in a 1-pod-per-node
Suggested change:
- so we can efficiently invalide entries. If we are not in a 1-pod-per-node
+ so we can efficiently invalidate entries. If we are not in a 1-pod-per-node
in a way that makes it easy to remove entries by either pod signature or host
so we can efficiently invalide entries. If we are not in a 1-pod-per-node
we could get "sub-optimal" results if the node just used is the best node for
some other pod, but this should be the only issue.
Suggested change:
- some other pod, but this should be the only issue.
+ any of the following pods, but this should be the only issue.
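For illustration, a rough sketch of a cache shaped the way this section describes (all names hypothetical): entries are keyed by pod signature, a reverse index lets a just-used node be removed from every entry, and a short TTL provides the aggressive invalidation mentioned elsewhere in the KEP.

```go
package schedcache

import (
	"slices"
	"sync"
	"time"
)

// sigCache sketches a cache of ranked feasible nodes per pod signature, with a
// reverse index so a node can be dropped from every entry once it is used
// (1-pod-per-node: a just-used node is unavailable to all signatures).
type sigCache struct {
	mu        sync.Mutex
	ttl       time.Duration
	bySig     map[string]*entry          // signature -> ranked nodes (best first)
	nodeToSig map[string]map[string]bool // node -> signatures whose entry lists it
}

type entry struct {
	nodes   []string
	created time.Time
}

// Pop returns the best cached node for a signature and removes that node from
// every cached entry. Entries older than ttl are treated as invalid.
func (c *sigCache) Pop(sig string, now time.Time) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, ok := c.bySig[sig]
	if !ok || now.Sub(e.created) > c.ttl || len(e.nodes) == 0 {
		delete(c.bySig, sig)
		return "", false
	}
	node := e.nodes[0]
	c.removeNodeLocked(node)
	return node, true
}

// removeNodeLocked uses the reverse index to drop one node from every
// signature entry that lists it.
func (c *sigCache) removeNodeLocked(node string) {
	for sig := range c.nodeToSig[node] {
		if e, ok := c.bySig[sig]; ok {
			e.nodes = slices.DeleteFunc(e.nodes, func(n string) bool { return n == node })
		}
	}
	delete(c.nodeToSig, node)
}

// InvalidateSignature drops everything cached for one signature, e.g. when a
// cluster event could change its results.
func (c *sigCache) InvalidateSignature(sig string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.bySig, sig)
}
```

An Add method would populate both maps symmetrically, and a miss (or a failed re-check of the popped node) would fall back to the normal full scheduling cycle.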
The pod scheduling signature is used to determine if two pods are "the same"
from a scheduling perspective. In specific, what this means is that any pod
with the given signature will get the same scores / feasibility results from
@sanposhiho @macsko This is an important element of the proposal that we should review carefully. The assumption here is that assigning a pod does not change the feasibility or scoring results for "the same" following pods.
By caching the results of the first pod, we are in fact capturing the cluster state and ignoring any cluster changes that happen in the meantime (there is no guaranteed ordering of the following vs other pods). It should be equivalent to scheduling those pods in a single scheduling cycle.
Note that each pod's cached assignment will be reassessed in the scheduling cycle, so the only potential consequence is that new feasible nodes that appeared in the meantime may be missed or scores may be different. The aggressive cache invalidation timeout is supposed to mitigate this.
more complex constraints. While we expect the cache to always be on, many pods will not be
able to take advantage of it.
* We are not adding gang scheduling of any kind in this KEP. This is purely a performance
improvement, although we hope the work on this KEP will help us with gang scheduling as we build it.
Suggested change:
- improvement, although we hope the work on this KEP will help us with gang scheduling as we build it.
+ improvement without adding dependency on the Workload API [KEP-4671](https://github.com/kubernetes/enhancements/pull/5558), although we hope the work on this KEP will help us with gang scheduling as we build it.
* Pod affinity rules (affinity or anti-affinity)
* Topology spread rules (including inherited rules from the system default). We should attempt to lift this constraint in the future.

To construct a signature, we add a new function for each plugin to implement.
Suggested change:
- To construct a signature, we add a new function for each plugin to implement.
+ To allow non in-tree plugins to construct a signature, we add a new framework function to implement.
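To make this concrete, here is a sketch of what such a hook might look like (interface and function names are hypothetical, not from the KEP): each plugin contributes an opaque fragment or declares the pod unsignable, and the framework hashes the fragments together.

```go
package schedcache

import (
	"crypto/sha256"
	"encoding/hex"

	v1 "k8s.io/api/core/v1"
	"k8s.io/kubernetes/pkg/scheduler/framework"
)

// SignaturePlugin is a hypothetical optional interface a plugin could
// implement to contribute a fragment to the pod scheduling signature.
type SignaturePlugin interface {
	// PodSignature returns the plugin's contribution, or ok=false if the pod
	// cannot be signed (e.g. its results depend on other pods on the node).
	PodSignature(pod *v1.Pod) (fragment []byte, ok bool)
}

// computeSignature folds the per-plugin fragments into one hash. A single
// unsignable plugin makes the whole pod unsignable; plugins that do not
// implement the interface contribute nothing.
func computeSignature(pod *v1.Pod, plugins []framework.Plugin) (string, bool) {
	h := sha256.New()
	for _, p := range plugins {
		sp, ok := p.(SignaturePlugin)
		if !ok {
			continue
		}
		frag, ok := sp.PodSignature(pod)
		if !ok {
			return "", false
		}
		h.Write([]byte(p.Name())) // namespace fragments per plugin to avoid collisions
		h.Write(frag)
	}
	return hex.EncodeToString(h.Sum(nil)), true
}
```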
resource claims will allow for a signature.
* ImageLocality: We use the canonicalized image names from the Volumes as the signature.
* InterPodAffinity: If either the PodAffinity or PodAntiAffinity fields are set, the pod is marked unsignable, otherwise an empty signature.
* NodeAffinity: We use the NodeAffinity and NodeSelector fields as the signature.
NodeAffinity might be set also in scheduler profile (https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity-per-scheduling-profile), so either the profile should be captured in the signature or NodeAffinity from the scheduler configuration.
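Following on from that point, a sketch (again hypothetical, in the shape of the interface above) of a NodeAffinity fragment that folds in the profile's AddedAffinity from NodeAffinityArgs, so pods evaluated under different profiles do not share a signature:

```go
package schedcache

import (
	"encoding/json"

	v1 "k8s.io/api/core/v1"
)

// nodeAffinitySignature sketches a NodeAffinity fragment covering both the
// pod's own fields and the profile-level AddedAffinity from the scheduler
// configuration (NodeAffinityArgs).
func nodeAffinitySignature(pod *v1.Pod, addedAffinity *v1.NodeAffinity) ([]byte, bool) {
	payload := struct {
		NodeSelector  map[string]string
		Affinity      *v1.Affinity
		AddedAffinity *v1.NodeAffinity // from the scheduler profile, not the pod
	}{
		NodeSelector:  pod.Spec.NodeSelector,
		Affinity:      pod.Spec.Affinity,
		AddedAffinity: addedAffinity,
	}
	b, err := json.Marshal(payload)
	if err != nil {
		return nil, false // treat unencodable specs as unsignable
	}
	return b, true
}
```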
* NodeResourcesFit: We use the output of the computePodResourceRequest function as the signature.
* NodeUnschedulable: We use the Tolerations field as the signature.
* NodeVolumeLimits: We use all Volume information except from Volumes of type ConfigMap or Secret.
* PodTopologySpread: If the PodTopologySpread field is set, or it is not set but a default set of rules are applied, we mark the pod unsignable, otherwise it returns an empty signature.
or it is not set but a default set of rules are applied
How shall we tell whether the default rules apply to a pod?