Conversation

bwsalmon

@bwsalmon bwsalmon commented Oct 1, 2025

  • One-line PR description: First version of the KEP for scheduler cache
  • Other comments:


linux-foundation-easycla bot commented Oct 1, 2025

CLA Not Signed

@k8s-ci-robot k8s-ci-robot added the kind/kep Categorizes KEP tracking issues and PRs modifying the KEP directory label Oct 1, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: bwsalmon
Once this PR has been reviewed and has the lgtm label, please assign sanposhiho for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Oct 1, 2025
@k8s-ci-robot k8s-ci-robot requested a review from dom4ha October 1, 2025 23:11
@k8s-ci-robot
Contributor

Welcome @bwsalmon!

It looks like this is your first PR to kubernetes/enhancements 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes/enhancements has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. 😃

@k8s-ci-robot k8s-ci-robot requested a review from macsko October 1, 2025 23:11
@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Oct 1, 2025
@k8s-ci-robot
Contributor

Hi @bwsalmon. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Oct 1, 2025
@helayoty helayoty moved this to Needs Review in SIG Scheduling Oct 2, 2025
* We are not adding gang scheduling of any kind in this KEP. This is purely a performance
improvement, although we hope the work on this KEP will help us with gang scheduling as we build it.

## Proposal
Member

Historically we've had something that was called "Equivalence Class Cache" (ecache) that seemed to be almost exactly what you're doing now.
We need a bit more archeology to find it, but I think this is the PR that removed at least some stage of it in 2018:
kubernetes/kubernetes#71399

I can't remember exactly what the primary reasons behind that were, but I would like to see a dedicated section describing:

  • what problems we had back then (that led us to remove it)
  • how this proposal is different and why we won't hit the same ones

Author

Excellent thought. I'll look up ecache and see what I can find.

Author

Here's another goody from the meeting where it seems to have been discussed early on:

"Gang Scheduling: API design is almost finalized. We will arrange with the rest of API approvers and hopefully we can have an alpha version of the API in 1.12."

Author

Added a section on the eCache. PTAL.

node off the list. We then go down the nominated node path. Just as we would with a pod
with a NominatedNodeName in it, we only re-evaluate the feasibility of this node before using it.

Since we assume 1-pod-per-node, we know that the node used by the current pod
Member

How are we going to verify that it's indeed a pod-per-node configuration?

I think that just relying on users is not enough here...

Author

So the current thought is this (I'll make this more explicit here).

The user has to opt in via the scheduling profile; we should mention this clearly there.

If they get it wrong, it will not necessarily lead to incorrect behavior, just somewhat shifted scoring. So I think the cost is comparatively low.

I'm open to more tracking if we think that is appropriate. We can tell if a pod class is 1-pod-per-node, so we can flag that case if we'd like, and we should also be able to tell if multiple non-DaemonSet pods land on the same nodes after the fact.

Thoughts on how much you think is necessary?
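
For illustration, a minimal sketch of such an after-the-fact check, assuming the scheduler (or an external checker) already has a mapping from node name to the pods bound to it; the helper names and the reliance on owner references are assumptions, not KEP content:

```go
package check

import v1 "k8s.io/api/core/v1"

// isDaemonSetPod reports whether the pod is owned by a DaemonSet.
func isDaemonSetPod(pod *v1.Pod) bool {
	for _, ref := range pod.OwnerReferences {
		if ref.Kind == "DaemonSet" {
			return true
		}
	}
	return false
}

// multiPodNodes returns nodes carrying more than one workload (non-DaemonSet)
// pod, i.e. nodes where the 1-pod-per-node assumption did not hold.
func multiPodNodes(podsByNode map[string][]*v1.Pod) []string {
	var violating []string
	for node, pods := range podsByNode {
		workload := 0
		for _, p := range pods {
			if !isDaemonSetPod(p) {
				workload++
			}
		}
		if workload > 1 {
			violating = append(violating, node)
		}
	}
	return violating
}
```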

@wojtek-t
Member

wojtek-t commented Oct 3, 2025

@bwsalmon - please sign the CLA

(and I'm happy to take the production readiness approval for it)

@wojtek-t wojtek-t self-assigned this Oct 3, 2025
@bwsalmon
Author

bwsalmon commented Oct 3, 2025

@bwsalmon - please sign the CLA

(and I'm happy to take the production readiness approval for it)

Cool. I tried to but it forwarded me to some Google reps to sign it... I sent the request but haven't heard back. Maybe I did it wrong?

Comment on lines +340 to +343
If plugin changes prove to be an issue, we could codify the signature as a new "Scheduling" object that only has a subset
of the fields of the pod. Plugins that "opt-in" could only be given access to this reduced scheduling object, and we could then use the entire scheduling object as the signature. This would make it more or less impossible for the signature and plugins to be out of sync, and would
naturally surface new dependencies as additions to the scheduling object. However, as we expect plugin changes to be relatively
modest, we don't believe the complexity of making the interface changes is worth the risk today.
Member

Duplicate paragraph R323-R326
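
For a sense of what that alternative "Scheduling" object could look like, here is a minimal sketch; the type name and field selection are assumptions, and the KEP explicitly defers this approach:

```go
package framework

import v1 "k8s.io/api/core/v1"

// SchedulingView is a hypothetical reduced view of a Pod that opted-in plugins
// would receive instead of the full object. Hashing this struct as the
// signature keeps signature and plugins in sync by construction, and any new
// plugin dependency surfaces as a new field here.
type SchedulingView struct {
	Tolerations  []v1.Toleration   // TaintToleration / NodeUnschedulable inputs
	NodeSelector map[string]string // NodeAffinity inputs
	NodeAffinity *v1.NodeAffinity  // node affinity only; pod (anti-)affinity stays excluded
	Requests     v1.ResourceList   // aggregated requests, as computed for NodeResourcesFit
	Volumes      []v1.Volume       // volume plugins, minus ConfigMap/Secret volumes
}
```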

Note that the signature does not need to be stable across versions, or even invocations of the scheduler.
It only needs to be comparable between pods on a given running scheduler instance.

* DynamicResources: For now we mark a pod unsignable if it has dynamic resource claims. We should improve this in the future, since most
Member

Why will this plugin mark pods as unsignable?

* VolumeBinding: Same as NodeVolumeLimits.
* VolumeRestrictions: Same as NodeVolumeLimits.
* VolumeZone: Same as NodeVolumeLimits.

Member

Do we need to take defaultPreemption into consideration?

extending the production code to implement this enhancement.
-->

- `<package>`: `<date>` - `<test coverage>`
Member

Please add the test files that this KEP will add/update, along with the current coverage for the existing ones.

- a search in the Kubernetes bug triage tool (https://storage.googleapis.com/k8s-triage/index.html)
-->

- [test name](https://github.com/kubernetes/kubernetes/blob/2334b8469e1983c525c0c6382125710093a25883/test/integration/...): [integration master](https://testgrid.k8s.io/sig-release-master-blocking#integration-master?include-filter-by-regex=MyCoolFeature), [triage search](https://storage.googleapis.com/k8s-triage/index.html?test=MyCoolFeature)
Member

Same, I expect we will add new integration tests as well.

[existing list]: https://kubernetes.io/docs/reference/command-line-tools-reference/feature-gates/
-->

- [ ] Feature gate (also fill in values in `kep.yaml`)
Member

Please add the feature gate and update it in the kep.yaml as well.

@dom4ha
Member

dom4ha commented Oct 6, 2025

/cc @sanposhiho @macsko

@dom4ha
Member

dom4ha commented Oct 6, 2025

/label lead-opted-in
/milestone v1.35

@k8s-ci-robot k8s-ci-robot added this to the v1.35 milestone Oct 6, 2025
@k8s-ci-robot k8s-ci-robot added the lead-opted-in Denotes that an issue has been opted in to a release label Oct 6, 2025
# The target maturity stage in the current dev cycle for this KEP.
# If the purpose of this KEP is to deprecate a user-visible feature
# and a Deprecated feature gates are added, they should be deprecated|disabled|removed.
stage: alpha|beta|stable
Member

Suggested change
stage: alpha|beta|stable
stage: alpha

To simplify the first version, we only attempt this optimization for pods that have a single pod
of this type on a node, and we exclude pods that use complex constraints like
PodTopologySpread and PodAffinity / AntiAffinity. We also have tight
eviction rules on the cache (O(seconds)) to ensure it doesn't get stale. To tell if
Member

Suggested change
eviction rules on the cache (O(seconds)) to ensure it doesn't get stale. To tell if
invalidation rules on the cache (O(seconds)) to ensure it doesn't get stale. To tell if
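
As a minimal sketch of that opt-out, assuming a hypothetical helper (the name `signable` and the exact field checks are illustrative, mirroring the constraints listed in the excerpt above):

```go
package cache

import v1 "k8s.io/api/core/v1"

// signable reports whether a pod is simple enough to receive a scheduling
// signature; pods using the excluded constraints always take the normal,
// uncached scheduling path.
func signable(pod *v1.Pod) bool {
	if a := pod.Spec.Affinity; a != nil && (a.PodAffinity != nil || a.PodAntiAffinity != nil) {
		return false // pod affinity / anti-affinity excluded in the first version
	}
	if len(pod.Spec.TopologySpreadConstraints) > 0 {
		return false // explicit topology spread excluded (default spread handled separately)
	}
	return true
}
```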


Another change is the shift towards 1-pod-per-node in batch and ML environments. Many
of these environments (among others) only attempt to run a single customer pod on each node, along
with a complement of daemon set pods. This simplifies our scheduling needs significantly.
Member

Suggested change
with a complement of daemon set pods. This simplifies our scheduling needs significantly.
with a complement of daemon set pods. This simplifies our scheduling needs significantly, as it allows reusing not only filtering but also scoring results.

cannot be used by subsequent pods (of any signature).
Thus we remove the host from all signatures in the cache. The cache is built
in a way that makes it easy to remove entries by either pod signature or host
so we can efficiently invalide entries. If we are not in a 1-pod-per-node
Member

Suggested change
so we can efficiently invalide entries. If we are not in a 1-pod-per-node
so we can efficiently invalidate entries. If we are not in a 1-pod-per-node
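
A minimal sketch of a cache indexed both ways, so entries can be dropped by signature or by node; the names here are assumptions, not the KEP's code:

```go
package cache

import "sync"

type cachedResult struct {
	score int64 // feasibility, timestamps, etc. would live here as well
}

// resultCache keeps two indexes over the same entries so that invalidation is
// cheap whichever key we have in hand.
type resultCache struct {
	mu     sync.Mutex
	bySig  map[string]map[string]cachedResult // signature -> node -> result
	byNode map[string]map[string]struct{}     // node -> signatures referencing it
}

// invalidateNode drops every entry referencing the node, e.g. right after a
// pod was assumed onto it under the 1-pod-per-node assumption.
func (c *resultCache) invalidateNode(node string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for sig := range c.byNode[node] {
		delete(c.bySig[sig], node)
	}
	delete(c.byNode, node)
}

// invalidateSignature drops every entry for a signature, e.g. on timeout.
func (c *resultCache) invalidateSignature(sig string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for node := range c.bySig[sig] {
		delete(c.byNode[node], sig)
	}
	delete(c.bySig, sig)
}
```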

in a way that makes it easy to remove entries by either pod signature or host
so we can efficiently invalide entries. If we are not in a 1-pod-per-node
we could get "sub-optimal" results if the node just used is the best node for
some other pod, but this should be the only issue.
Member

Suggested change
some other pod, but this should be the only issue.
any of the following pods, but this should be the only issue.


The pod scheduling signature is used to determine if two pods are "the same"
from a scheduling perspective. Specifically, this means that any pod
with the given signature will get the same scores / feasibility results from
Member

@sanposhiho @macsko This is an important element of the proposal that we should review carefully. The assumption here is that assigning a pod does not change the feasibility or scoring results for "the same" following pods.

By caching the results of the first pod, we are in fact capturing the cluster state and ignoring any cluster changes that happen in the meantime (there is no guaranteed ordering of the following pods vs. other pods). It should be equivalent to scheduling those pods in a single scheduling cycle.

Note that each pod's cached assignment will be reassessed in the scheduling cycle, so the only potential consequence is that new feasible nodes that appeared in the meantime may be missed, or scores may be different. The aggressive cache invalidation timeout is supposed to mitigate this.
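
As a sketch of how such a bounded-staleness lookup could behave (the type, the TTL handling, and the hint-then-reverify flow are assumptions for illustration):

```go
package cache

import (
	"sync"
	"time"
)

// bestNodeCache stores a per-signature "best node" hint. A hit only
// short-circuits node selection; the scheduling cycle still re-checks the
// feasibility of the hinted node, and entries older than the TTL are ignored
// so stale cluster state is bounded by the invalidation timeout.
type bestNodeCache struct {
	mu      sync.Mutex
	entries map[string]entry // keyed by pod scheduling signature
	ttl     time.Duration    // "O(seconds)"; the exact value is an assumption
}

type entry struct {
	node       string
	insertedAt time.Time
}

// hint returns the cached node for the signature, or ok=false if the entry is
// missing or expired, in which case the pod falls back to a full scheduling pass.
func (c *bestNodeCache) hint(sig string, now time.Time) (node string, ok bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	e, found := c.entries[sig]
	if !found || now.Sub(e.insertedAt) > c.ttl {
		delete(c.entries, sig)
		return "", false
	}
	return e.node, true
}
```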

more complex constraints. While we expect the cache to always be on, many pods will not be
able to take advantage of it.
* We are not adding gang scheduling of any kind in this KEP. This is purely a performance
improvement, although we hope the work on this KEP will help us with gang scheduling as we build it.
Member

Suggested change
improvement, although we hope the work on this KEP will help us with gang scheduling as we build it.
improvement without adding a dependency on the Workload API
[KEP-4671](https://github.com/kubernetes/enhancements/pull/5558), although we hope the work on this KEP will help us with gang scheduling as we build it.

* Pod affinity rules (affinity or anti-affinity)
* Topology spread rules (including inherited rules from the system default). We should attempt to lift this constraint in the future.

To construct a signature, we add a new function for each plugin to implement.
Member

Suggested change
To construct a signature, we add a new function for each plugin to implement.
To allow out-of-tree plugins to construct a signature, we add a new framework function to implement.
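
A minimal sketch of what such a per-plugin hook could look like; the interface name, method signature, and concatenation scheme are assumptions, not the KEP's definition:

```go
package framework

import v1 "k8s.io/api/core/v1"

// SignaturePlugin would be implemented by plugins that can summarize the pod
// fields their Filter/Score results depend on.
type SignaturePlugin interface {
	// Signature returns the plugin's contribution to the pod scheduling
	// signature, or ok=false if the pod cannot be signed (e.g. it uses
	// constraints the plugin cannot summarize, such as pod anti-affinity).
	Signature(pod *v1.Pod) (sig string, ok bool)
}

// podSignature concatenates the per-plugin contributions; a single unsignable
// verdict makes the whole pod ineligible for the cache.
func podSignature(pod *v1.Pod, plugins []SignaturePlugin) (string, bool) {
	var full string
	for _, p := range plugins {
		part, ok := p.Signature(pod)
		if !ok {
			return "", false
		}
		full += part + "|"
	}
	return full, true
}
```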

resource claims will allow for a signature.
* ImageLocality: We use the canonicalized image names from the Volumes as the signature.
* InterPodAffinity: If either the PodAffinity or PodAntiAffinity fields are set, the pod is marked unsignable, otherwise an empty signature.
* NodeAffinity: We use the NodeAffinity and NodeSelector fields as the signature.
Member

NodeAffinity might also be set in the scheduler profile (https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#node-affinity-per-scheduling-profile), so either the profile should be captured in the signature, or the NodeAffinity from the scheduler configuration should be.

* NodeResourcesFit: We use the output of the computePodResourceRequest function as the signature.
* NodeUnschedulable: We use the Tolerations field as the signature.
* NodeVolumeLimits: We use all Volume information except from Volumes of type ConfigMap or Secret.
* PodTopologySpread: If the PodTopologySpread field is set, or it is not set but a default set of rules are applied, we mark the pod unsignable, otherwise it returns an empty signature.
Member

or it is not set but a default set of rules are applied

How shall we tell whether the default rules apply to a pod?
