KEP-5759: Memory Manager Hugepages Availability Verification #5753
Conversation
This KEP proposes enhancing the Memory Manager's Static policy to verify OS-reported free hugepage availability during pod admission.

Problem: The Memory Manager only tracks hugepage allocations for Guaranteed QoS pods. Burstable/BestEffort pods can consume hugepages without being tracked, causing subsequent Guaranteed pods to be admitted but fail at runtime when hugepages are exhausted.

Solution:
- Add a FreePages field to cadvisor's HugePagesInfo (PR google/cadvisor#3804)
- Verify OS-reported free hugepages during Allocate() in the Static policy
- Reject pods when insufficient free hugepages are available

Related: kubernetes/kubernetes#134395
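A minimal sketch of the proposed admission-time check. The names here (`hugepageRequest`, `verifyFreeHugepages`, the `freePages` callback) are illustrative, not the actual kubelet API; the real integration point would be the Static policy's Allocate() path described above.

```go
// Sketch only: illustrates the shape of the proposed check, not the actual
// kubelet API. The freePages callback stands in for however the OS-reported
// value is obtained (direct sysfs read or a fresh cadvisor query).
package memorymanager

import "fmt"

// hugepageRequest describes what a container asks for on a candidate NUMA node.
type hugepageRequest struct {
	numaNode int
	pageSize uint64 // bytes, e.g. 2 * 1024 * 1024 for 2Mi pages
	numPages uint64
}

// verifyFreeHugepages rejects the allocation when the OS reports fewer free
// pages of the requested size than the container needs on the chosen NUMA node.
func verifyFreeHugepages(req hugepageRequest, freePages func(numaNode int, pageSize uint64) (uint64, error)) error {
	free, err := freePages(req.numaNode, req.pageSize)
	if err != nil {
		return fmt.Errorf("reading free hugepages for NUMA node %d: %w", req.numaNode, err)
	}
	if free < req.numPages {
		return fmt.Errorf("insufficient free hugepages on NUMA node %d: requested %d, OS reports %d free",
			req.numaNode, req.numPages, free)
	}
	return nil
}
```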
/remove-area kubelet

/cc

/ok-to-test
ffromani left a comment:
Thanks for your contribution! I'm in favor of improving the accounting and making the memory manager/kubelet more predictable. I think we can benefit from some clarifications before we deep-dive into further details.
> 1. Burstable or BestEffort pods consume hugepages (via hugetlbfs mounts or
>    `mmap` with `MAP_HUGETLB`) without being tracked by the Memory Manager
this makes me think we need a better accounting/validation mechanism in general, not just for the memory manager, because the very issue we are attacking here is also relevant to burstable pods, and to some extent to best-effort pods.
Agreed. This KEP focuses on the Memory Manager Static policy as a targeted fix, but the underlying issue of consistent hugepage accounting across QoS classes is worth discussing as a broader improvement. Perhaps a follow-up KEP for unified hugepage tracking?
> This creates a problem when:
> 1. Burstable or BestEffort pods consume hugepages (via hugetlbfs mounts or
>    `mmap` with `MAP_HUGETLB`) without being tracked by the Memory Manager
> 2. External processes or other system components consume hugepages
One of the key assumptions of how the kubelet operates in general is that it is the sole owner of the node. There is some leeway in some cases: we can pre-partition CPUs and make the kubelet assume it is the sole owner of the resource pool it got when started, and we probably should do the same for hugepages. But in general, dynamic co-sharing of resources (the kubelet racing with other daemons or programs) is not supported, and it's unlikely it ever will be.
Understood. To clarify: the issue here isn't external daemons racing with the kubelet. Both pods are managed by the kubelet and properly request hugepages. The gap is internal: the scheduler tracks hugepages at the node level, while the Memory Manager tracks them per NUMA node but only for Guaranteed pods. The Burstable pod's hugepages are tracked by the scheduler but not by the Memory Manager's Static policy.
> 1. Burstable or BestEffort pods consume hugepages (via hugetlbfs mounts or
>    `mmap` with `MAP_HUGETLB`) without being tracked by the Memory Manager
> 2. External processes or other system components consume hugepages
> 3. The Memory Manager's internal state becomes stale or inconsistent with reality
how can this happen? do we have examples or scenarios?
Yes! See kubernetes/kubernetes#134395 for the real-world scenario:
- m6id.32xlarge with 2 NUMA nodes, 16GB of 2MB hugepages per node
- Burstable pod requests ~12GB of 2MB hugepages → scheduled and runs
- Guaranteed pod requests ~12GB of 2MB hugepages → admitted to NUMA node 1
- Memory Manager thought node1 had ~15.2GB free
- OS actually reported node1 had only ~3.2GB free
The gap: Memory Manager only tracks Guaranteed pods for NUMA placement.
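The OS-reported value in that scenario comes from sysfs. Below is a minimal, standalone sketch of reading it directly; the path layout is the standard Linux hugetlb sysfs interface, and the helper name is illustrative.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// freeHugepages reads the kernel's per-NUMA-node free hugepage count from, e.g.,
// /sys/devices/system/node/node1/hugepages/hugepages-2048kB/free_hugepages
// (2Mi pages on NUMA node 1).
func freeHugepages(numaNode, pageSizeKB int) (uint64, error) {
	path := fmt.Sprintf(
		"/sys/devices/system/node/node%d/hugepages/hugepages-%dkB/free_hugepages",
		numaNode, pageSizeKB)
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	// In the scenario above, the Memory Manager's state showed ~15.2GB available
	// on node 1 while this OS-reported value corresponded to only ~3.2GB of free 2MB pages.
	free, err := freeHugepages(1, 2048)
	if err != nil {
		fmt.Println("read error:", err)
		return
	}
	fmt.Printf("NUMA node 1 free 2MB hugepages: %d (~%.1f GB)\n", free, float64(free)*2/1024)
}
```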
> ### Goals
>
> - Verify OS-reported free hugepages during pod admission for the Static policy
> - Reject pods requesting hugepages when insufficient free hugepages are available
we need to be mindful that this will create another opportunity for rejection loops like kubernetes/kubernetes#84869
Good point. I'll review that issue. The key difference here is that the rejection would be based on actual OS state (sysfs free_hugepages), not internal tracking discrepancy. This should make the rejection more accurate and actionable - the message would indicate "insufficient free hugepages on NUMA node X" rather than a vague resource conflict.
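A rough sketch of how such a rejection could surface, using a local struct that mirrors the shape of the kubelet's pod admission result (Admit/Reason/Message); the reason string and helper are illustrative, not a final format.

```go
package main

import "fmt"

// podAdmitResult mirrors the shape of the kubelet's admission result
// (Admit / Reason / Message). Defined locally here to keep the sketch standalone.
type podAdmitResult struct {
	Admit   bool
	Reason  string
	Message string
}

// rejectForHugepages builds the kind of actionable rejection discussed above:
// it names the NUMA node, page size, requested count, and the OS-reported free
// count, rather than a vague resource conflict.
func rejectForHugepages(numaNode int, pageSize string, requested, free uint64) podAdmitResult {
	return podAdmitResult{
		Admit:  false,
		Reason: "InsufficientFreeHugepages", // illustrative reason string
		Message: fmt.Sprintf(
			"insufficient free hugepages on NUMA node %d: requested %d pages of %s, OS reports %d free",
			numaNode, requested, pageSize, free),
	}
}

func main() {
	fmt.Println(rejectForHugepages(1, "2Mi", 6144, 1638).Message)
}
```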
this is true, but it's still very likely that controllers will just create runaway pods, because this is another kubelet-local rejection the scheduler doesn't predict or expect, caused by an information imbalance between the scheduler and the kubelet (cc @wojtek-t - we talked about this in the context of kubelet-driven pod reschedules)
Force-pushed: 5f71eb8 → fed79ac
Key changes:
- Update milestones to v1.36/v1.37/v1.38
- Clarify sysfs reading: add GetCurrentHugepagesInfo() for fresh reads (GetMachineInfo() is cached at startup and would be stale)
- Add "Integration with Topology Manager" section with policy behavior table
- Add "Interaction with CPU Manager" section
- Address reserved hugepages (free_hugepages is the correct metric)
- Expand race condition discussion with failure handling details
- Rewrite Story 2 as "Rapid Pod Churn" with a clear timeline
- Add "Static policy only" note (None policy not applicable)
- Specify error message format with an example
- Add kubelet restart behavior note
- Update Risks table with new mitigations
- Fix unit test description (removed nil reference)
- Update TOC with new sections
- Link enhancement issue kubernetes#5759

Related: kubernetes#5759
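A sketch of the cadvisor-based reading described in this update. `GetMachineInfo()` is the existing cached call; `GetCurrentHugepagesInfo()` is the method proposed in google/cadvisor#3804, and its signature and return shape here are assumptions for illustration, as are the simplified types.

```go
// Sketch only: simplified stand-ins for the kubelet's cadvisor wrapper types.
package memorymanager

// hugePagesInfo mirrors cadvisor's per-size hugepage record; FreePages is the
// proposed new field.
type hugePagesInfo struct {
	PageSize  uint64 // kB, e.g. 2048
	NumPages  uint64
	FreePages uint64
}

type machineInfo struct {
	HugePages []hugePagesInfo
}

// cadvisorProvider is an assumed interface shape, not the real kubelet one.
type cadvisorProvider interface {
	// Existing call: machine info is gathered at startup and cached, so its
	// hugepage counts can be stale by admission time.
	GetMachineInfo() (machineInfo, error)
	// Proposed call: re-reads sysfs so FreePages reflects the current OS state.
	// Assumed here to be keyed by NUMA node ID.
	GetCurrentHugepagesInfo() (map[int][]hugePagesInfo, error)
}

// currentFreePages returns the fresh OS-reported free count for one NUMA node
// and page size, falling back to zero if that size is not present.
func currentFreePages(c cadvisorProvider, numaNode int, pageSizeKB uint64) (uint64, error) {
	byNode, err := c.GetCurrentHugepagesInfo()
	if err != nil {
		return 0, err
	}
	for _, hp := range byNode[numaNode] {
		if hp.PageSize == pageSizeKB {
			return hp.FreePages, nil
		}
	}
	return 0, nil
}
```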
Force-pushed: fed79ac → 9a89040
/retitle KEP-5759: Memory Manager Hugepages Availability Verification
- Add two implementation approaches: Option A (direct sysfs) and Option B (cadvisor)
- Present pros/cons for each option neutrally for KEP review
- Remove cadvisor-specific sections, replace with an options discussion
- Add Observability section with metrics, events, logs, alerting
- Update TOC to pass CI verification
- Update KEP number to 5759 throughout

The choice between implementation approaches is left to KEP reviewers based on maintainability preferences and timeline considerations.
Force-pushed: c40cb0b → 8e6ae09
Thanks @srikalyan for leading this effort. I'm in general supportive of this memory manager enhancement and, pending further review and elaboration, I do see the benefit of the proposed approach of checking free hugepages. Because there's some time left before the 1.36 cycle begins, I'd like to explore other options to solve this problem before we commit to the proposed direction. I'll have another review iteration ASAP.
@ffromani Happy new year to you. May I request another review?
ffromani left a comment:
thanks for the updates. The next step is to bring this up on the larger sig-node and in the 1.36 SIG planning. I think this work would be well accepted by the SIG, but let's make sure.
> @@ -0,0 +1,697 @@
> # KEP-5759: Memory Manager Hugepages Availability Verification
once we agree at the sig-node level about this work, we need to add the prod-readiness tracking file.
I think this work deserves to be discussed at a sig-node meeting for coordination.
> From [issue #134395](https://github.com/kubernetes/kubernetes/issues/134395),
> on an m6id.32xlarge instance with 2 NUMA nodes:
This alone justifies the fix. The memory manager and the kubelet admission process failed to honor the pod contract: the pod was admitted, letting the workload believe that resources were available and allocatable when they were not.
So this part is fine. What I'm circling around is the implications. I'm still thinking about whether we need to fix the admission phase in general, and whether we should mitigate (or fix) the scheduler problem (https://github.com/kubernetes/enhancements/pull/5753/files#r2649077040)
> - Track hugepage usage by Burstable or BestEffort pods in the Memory Manager
> - Modify scheduler behavior or add hugepage awareness to the scheduler
> - Provide hugepage reservation or preemption mechanisms
> - Support platforms other than Linux
thanks. It's possible sig-windows and sig-node are fine with a linux-first (or linux-only) solution, but it's good to ask the question nevertheless
> **Desired behavior**: The Guaranteed pod admission fails immediately with a clear
> error indicating insufficient free hugepages, allowing the scheduler to try
> another node or the administrator to take corrective action.
the alternative I'm thinking about is to extend the memory manager and admission logic to listen to each and every pod admission and track where (= which NUMA node) the hugepages are allocated from. However, if the kubelet doesn't enforce a cpuset.mems restriction, there's no way to know where the hugepages are going to be taken from until the container processes are running, i.e. past the admission stage. Therefore, the proposed approach of checking the actual free resources before each and every allocation attempt seems to be the best compromise (if not the only possible approach) in the current architecture.
We should probably document this in the "discarded alternatives" section.
> - **Race condition window**: A window exists between verification and actual
>   container startup where hugepages could be consumed by another process. This is
I'm still unconvinced this is a real problem, because of the fundamental assumption that the kubelet owns the node. Therefore no other relevant process should take resources behind the kubelet's back. We can enhance the kubelet pre-partition/reservation logic to consider only a portion of the hugepages, mimicking the --reserved-cpus logic.
Re: --reserved-hugepages: this could help for external consumers, but it doesn't solve the Burstable/Guaranteed tracking gap, since both are kubelet-owned. The reservation approach would require users to manually reserve hugepages equal to their Burstable workloads, which defeats the purpose of dynamic scheduling.
> **Why this is still valuable**: Without verification, the failure window spans
> from pod scheduling to container startup (seconds to minutes). With verification,
> the window is reduced to milliseconds between the sysfs read and container start.
> The vast majority of failures are prevented.
as commented above, it's not just a startup failure or increased time to go running; a more serious issue is that the kubelet/workload contract is breached. The (too implicit) contract is that once a pod is admitted, the requested resources are available. The very issue linked here demonstrates this is not the case until we fix this behavior.
> # The milestone at which this feature was, or is targeted to be, at each stage.
> milestone:
>   alpha: "v1.36"
pending SIG discussion and approval, there's a good chance this work can start as beta per recent KEP graduation guidelines. The change is quite self-contained and targeted, so it qualifies.
How do you recommend I approach this?
- Add ffromani, derekwaynecarr, mrunalp as reviewers
- Add dchen1107 as approver (sig-node OWNERS)
Force-pushed: 2be55a9 → 36099e3
Hi @srikalyan, SIG Node meets weekly on Tuesdays at 10:00 PT (Pacific Time), so you can attend this week's meeting to discuss more with the SIG Node tech leads and chairs. The Zoom link and details are here: https://github.com/kubernetes/community/tree/master/sig-node. This KEP already has the /lead-opted-in and /milestone v1.36 labels from SIG Node, so I think the first deadline we will target is the Production Readiness Freeze: 4th February 2026 (AoE) / Thursday 5th February 2026, 12:00 UTC.
Thank you Wendy, will join this Tuesday.
Hello Wendy, I am sorry for the late notice, I will be able to join around 10:45 am, is that ok? You have a great day :).
Thanks,
-SK.
Summary
This KEP proposes enhancing the Memory Manager's Static policy to verify OS-reported free hugepage availability during pod admission.
Problem
The Memory Manager only tracks hugepage allocations for Guaranteed QoS pods. Burstable/BestEffort pods can consume hugepages (via hugetlbfs mounts or `mmap` with `MAP_HUGETLB`) without being tracked, causing subsequent Guaranteed pods to be admitted but fail at runtime when hugepages are exhausted.
Solution
- Add a `FreePages` field to `HugePagesInfo` with a new `GetCurrentHugepagesInfo()` method for fresh sysfs reads (PR: "Add FreePages to HugePagesInfo for hugepage availability reporting", google/cadvisor#3804); see the struct sketch below
- Verify free hugepages during `Allocate()` in the Static policy
Related
kubernetes/kubernetes#134395
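For context, a sketch of the proposed cadvisor struct change: the existing fields carry the page size and total count, `FreePages` is the addition proposed in google/cadvisor#3804, and the JSON tag shown for it is an assumption following the existing naming convention.

```go
// Sketch of the proposed addition to cadvisor's HugePagesInfo.
package v1

type HugePagesInfo struct {
	PageSize  uint64 `json:"page_size"`  // page size in kB, e.g. 2048
	NumPages  uint64 `json:"num_pages"`  // total pages configured
	FreePages uint64 `json:"free_pages"` // proposed: pages currently free per the kernel
}
```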
KEP Metadata
- MemoryManagerHugepagesVerification
- /sig node
- /kind kep