1717 - [ Risks and Mitigations] ( #risks-and-mitigations )
1818- [ Design Details] ( #design-details )
1919 - [ Implementation Overview] ( #implementation-overview )
20- - [ cadvisor Changes] ( #cadvisor-changes )
20+ - [ Implementation Approaches] ( #implementation-approaches )
21+ - [ Option A: Direct sysfs Reading in Memory Manager] ( #option-a-direct-sysfs-reading-in-memory-manager )
22+ - [ Option B: Add Fresh-Read Method to cadvisor] ( #option-b-add-fresh-read-method-to-cadvisor )
23+ - [ sysfs Interface] ( #sysfs-interface )
2124 - [ Memory Manager Changes] ( #memory-manager-changes )
2225 - [ Integration with Topology Manager] ( #integration-with-topology-manager )
2326 - [ Interaction with CPU Manager] ( #interaction-with-cpu-manager )
2427 - [ Observability] ( #observability )
28+ - [ Metrics] ( #metrics )
29+ - [ Events] ( #events )
30+ - [ Kubelet Logs] ( #kubelet-logs )
31+ - [ Alerting Recommendations] ( #alerting-recommendations )
2532 - [ Test Plan] ( #test-plan )
2633 - [ Prerequisite testing updates] ( #prerequisite-testing-updates )
2734 - [ Unit tests] ( #unit-tests )
4451- [ Drawbacks] ( #drawbacks )
4552- [ Alternatives] ( #alternatives )
4653 - [ Alternative 1: Track all pod hugepage usage] ( #alternative-1-track-all-pod-hugepage-usage )
47- - [ Alternative 2: Query sysfs directly in Memory Manager] ( #alternative-2-query-sysfs-directly-in-memory-manager )
48- - [ Alternative 3: Scheduler-level hugepage awareness] ( #alternative-3-scheduler-level-hugepage-awareness )
54+ - [ Alternative 2: Scheduler-level hugepage awareness] ( #alternative-2-scheduler-level-hugepage-awareness )
4955<!-- /toc -->
5056
5157## Release Signoff Checklist
@@ -144,14 +150,14 @@ Memory Manager wasn't tracking.
144150## Proposal
145151
146152Enhance the Memory Manager's Static policy to verify actual hugepage availability
147- by querying sysfs during pod admission. This involves :
153+ by querying sysfs during pod admission:
148154
149- 1 . ** cadvisor enhancement** : Add a ` FreePages ` field to ` HugePagesInfo ` struct
150- that reports free hugepages per NUMA node, read from sysfs
155+ **Memory Manager enhancement**: During `Allocate()` in the Static policy,
156+ verify that OS-reported free hugepages (read from sysfs) meet or exceed the
157+ requested amount before admitting the pod.
151158
152- 2 . ** Memory Manager enhancement** : During ` Allocate() ` in the Static policy,
153- verify that OS-reported free hugepages meet or exceed the requested amount
154- before admitting the pod
159+ See [ Implementation Approaches] ( #implementation-approaches ) for options on how
160+ the sysfs reading is performed.
155161
156162### Current Admission Flow
157163
@@ -257,56 +263,62 @@ Job B can be rescheduled to another node with sufficient hugepages.
257263
258264### Implementation Overview
259265
260- The implementation consists of two parts:
266+ The core enhancement is adding a `verifyOSHugepagesAvailability()` function to
267+ the Memory Manager's Static policy, called during `Allocate()`. This function
268+ reads fresh hugepage availability and rejects the pod when free hugepages are insufficient.
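
The check itself might look roughly like the sketch below. It is only an illustration under stated assumptions: `checkHugepages`, its parameters, and the `freeByNode` map are hypothetical names that do not come from the KEP, and the real `verifyOSHugepagesAvailability()` takes a pod and container and handles each requested hugepage size.

```go
package main

import "fmt"

// checkHugepages is a hypothetical, simplified stand-in for the comparison step
// of verifyOSHugepagesAvailability(). freeByNode maps a NUMA node ID to the
// number of free hugepages of one size on that node (assumed to be freshly read).
func checkHugepages(freeByNode map[int]uint64, candidateNodes []int, requestedPages uint64) error {
	// Sum free hugepages across the candidate NUMA nodes only.
	var total uint64
	for _, node := range candidateNodes {
		total += freeByNode[node]
	}

	// Reject when the OS reports fewer free pages than the container requests.
	if total < requestedPages {
		return fmt.Errorf("insufficient free hugepages on NUMA nodes %v: requested %d, free %d",
			candidateNodes, requestedPages, total)
	}
	return nil
}

func main() {
	free := map[int]uint64{0: 4, 1: 10}

	// Node 0 alone cannot satisfy a request for 8 pages.
	fmt.Println(checkHugepages(free, []int{0}, 8))

	// Nodes 0 and 1 together can.
	fmt.Println(checkHugepages(free, []int{0, 1}, 8))
}
```

Returning an error instead of a boolean lets the caller surface the requested and available counts in the admission failure message.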
261269
262- 1 . ** cadvisor** : Add ` FreePages uint64 ` field to ` HugePagesInfo ` struct, populated
263- from sysfs. Also expose a method to read current free hugepages on-demand.
270+ ### Implementation Approaches
264271
265- 2 . ** kubelet Memory Manager** : Add ` verifyOSHugepagesAvailability() ` function
266- called during ` Allocate() ` that reads ** fresh** hugepage availability from sysfs.
272+ There are two approaches for reading free hugepages:
267273
268- ** Important** : cadvisor's ` GetMachineInfo() ` is called once at startup and cached.
269- The ` FreePages ` field in cached machine info would be stale. Therefore, verification
270- must read sysfs directly during each ` Allocate() ` call, not rely on cached values.
271- We will add a ` GetCurrentHugepagesInfo() ` method to cadvisor's ` Manager ` interface
272- that performs a fresh sysfs read.
274+ #### Option A: Direct sysfs Reading in Memory Manager
273275
274- ### cadvisor Changes
276+ Read sysfs directly in the Memory Manager without cadvisor changes.
275277
276- ** Struct update** :
277- ``` go
278- type HugePagesInfo struct {
279- // huge page size (in kB)
280- PageSize uint64 ` json:"page_size"`
281- // number of huge pages
282- NumPages uint64 ` json:"num_pages"`
283- // number of free huge pages
284- FreePages uint64 ` json:"free_pages"`
285- }
286- ```
278+ ** Pros:**
279+ - No external dependencies on critical admission path
280+ - Simple implementation (~10 lines of sysfs reading)
281+ - Faster to implement and merge (single repo)
282+ - Memory Manager already reads memory topology from sysfs (precedent)
287283
288- ** New method on Manager interface** :
289- ``` go
290- // GetCurrentHugepagesInfo returns fresh hugepage info per NUMA node by reading sysfs.
291- // This is separate from GetMachineInfo() which returns cached startup data.
292- func (m *manager ) GetCurrentHugepagesInfo () (map [int ][]HugePagesInfo , error )
293- ```
284+ ** Cons:**
285+ - Duplicates sysfs reading logic (though trivial)
286+ - Other cadvisor consumers don't benefit
287+
288+ #### Option B: Add Fresh-Read Method to cadvisor
289+
290+ Add ` GetCurrentHugepagesInfo() ` method to cadvisor that reads sysfs on-demand.
291+
292+ **Note**: The result of cadvisor's existing `GetMachineInfo()` is cached at startup, so
293+ a `FreePages` field added there would hold stale values. A new method for fresh reads
294+ would be required.
295+
296+ ** Pros:**
297+ - Single source of truth for hugepage info
298+ - Benefits other cadvisor consumers
299+ - Cleaner abstraction
294300
295- The `FreePages` field is populated by reading from:
301+ ** Cons:**
302+ - Cross-repo dependency (cadvisor PR must merge first)
303+ - Adds API surface to cadvisor
304+ - Longer timeline
305+
306+ The choice between options should be made during KEP review based on
307+ maintainability preferences and timeline considerations.
308+
309+ ### sysfs Interface
310+
311+ Regardless of approach, free hugepages are read from:
296312```
297313/sys/devices/system/node/node<N>/hugepages/hugepages-<size>kB/free_hugepages
298314```
299315
300316** Note on reserved hugepages** : Linux tracks ` resv_hugepages ` (reserved but not
301- yet faulted). For this implementation, we use `free_hugepages` directly because:
317+ yet faulted). We use ` free_hugepages ` directly because:
302318- Reserved pages are committed to specific processes
303319- A new pod cannot use reserved pages
304320- ` free_hugepages ` accurately reflects what's available for new allocations
305321
306- **Note**: Since sysfs is always available on Linux systems with hugepages configured,
307- we use a simple `uint64` rather than a pointer. A value of 0 means zero free
308- hugepages are available.
309-
310322### Memory Manager Changes
311323
312324During ` Allocate() ` in the Static policy:
@@ -317,7 +329,7 @@ func (p *staticPolicy) verifyOSHugepagesAvailability(
317329 pod *v1.Pod,
318330 container *v1.Container,
319331) error {
320- // 1. Call cadvisor's GetCurrentHugepagesInfo() to get fresh sysfs data
332+ // 1. Read free hugepages directly from sysfs for each NUMA node
321333 // 2. For each hugepage size requested by the container:
322334 // a. Sum free hugepages across candidateNUMANodes only
323335 // b. Compare against the requested amount
@@ -421,7 +433,7 @@ to implement this enhancement.
421433##### Prerequisite testing updates
422434
423435- Existing Memory Manager unit tests cover allocation logic
424- - cadvisor tests cover sysfs reading functionality
436+ - For Option B: cadvisor tests cover sysfs reading functionality
425437
426438##### Unit tests
427439
@@ -435,7 +447,7 @@ to implement this enhancement.
435447
436448##### Integration tests
437449
438- - Test Memory Manager with mocked cadvisor returning various FreePages values
450+ - Test Memory Manager with mocked hugepage availability (sysfs or cadvisor depending on chosen approach)
439451- Test admission flow with hugepage verification enabled/disabled
440452
441453##### e2e tests
@@ -483,12 +495,11 @@ will correctly verify against current OS hugepage availability.
483495
484496### Version Skew Strategy
485497
486- The feature is entirely within the kubelet and depends on cadvisor (vendored).
487- No control plane or cross-component version skew concerns.
498+ The feature is entirely within the kubelet. No control plane or cross-component
499+ version skew concerns.
488500
489- Since cadvisor is vendored into kubelet, the kubelet and cadvisor versions are
490- always synchronized. The ` FreePages ` field will be available when the feature
491- gate is enabled.
501+ - ** Option A** : No version skew concerns (direct sysfs reading)
502+ - ** Option B** : Since cadvisor is vendored into kubelet, versions are synchronized
492503
493504## Production Readiness Review Questionnaire
494505
547558
548559###### How can an operator determine if the feature is in use by workloads?
549560
550- - Feature gate is enabled
551- - Pods request hugepages resources
561+ - Feature gate ` MemoryManagerHugepagesVerification ` is enabled
562+ - Metric ` memory_manager_hugepages_verification_total ` is incrementing (indicates verification checks are being performed)
563+ - Pods with Guaranteed QoS requesting hugepages resources are being scheduled
552564
553565###### How can someone using this feature know that it is working for their instance?
554566
@@ -590,16 +602,16 @@ Additional metrics that could be added in Beta:
590602
591603###### Does this feature depend on any specific services running in the cluster?
592604
593- - cadvisor (bundled with kubelet)
594- - Usage: Provides machine info including hugepage free counts
595- - Impact of outage: Verification skipped, graceful degradation
596- - Impact of degraded performance: Slightly increased admission latency
605+ Depends on the implementation approach chosen (see [ Implementation Approaches ] ( #implementation-approaches ) ):
606+
607+ - ** Option A (Direct sysfs) ** : No external dependencies. Reads directly from Linux sysfs.
608+ - ** Option B (cadvisor) ** : Depends on cadvisor (bundled with kubelet) for fresh hugepage reads.
597609
598610### Scalability
599611
600612###### Will enabling / using this feature result in any new API calls?
601613
602- No new API calls. The feature reads from local sysfs and cadvisor machine info .
614+ No new API calls. The feature reads from local sysfs files.
603615
604616###### Will enabling / using this feature result in introducing new API types?
605617
@@ -653,10 +665,10 @@ No impact. The feature operates entirely within kubelet using local sysfs.
653665## Implementation History
654666
655667- 2024-12-24: Initial KEP draft
656- - 2024-12-27: KEP updated based on reviewer feedback
668+ - 2024-12-27: KEP updated based on reviewer feedback; added implementation options
657669- Enhancement issue: https://github.com/kubernetes/enhancements/issues/5759
658670- Related issue: https://github.com/kubernetes/kubernetes/issues/134395
659- - cadvisor PR: https://github.com/google/cadvisor/pull/3804
671+ - cadvisor PR (for Option B): https://github.com/google/cadvisor/pull/3804 (draft)
660672
661673## Drawbacks
662674
@@ -675,16 +687,7 @@ Extend Memory Manager to track hugepage usage by Burstable and BestEffort pods.
675687- Would not catch external (non-Kubernetes) hugepage consumers
676688- Changes the scope and purpose of Memory Manager
677689
678- ### Alternative 2: Query sysfs directly in Memory Manager
679-
680- Read sysfs directly in Memory Manager without cadvisor changes.
681-
682- ** Rejected because** :
683- - Duplicates sysfs reading logic already in cadvisor
684- - cadvisor already provides machine info abstraction
685- - Adding to cadvisor benefits other consumers of machine info
686-
687- ### Alternative 3: Scheduler-level hugepage awareness
690+ ### Alternative 2: Scheduler-level hugepage awareness
688691
689692Add hugepage availability awareness to the Kubernetes scheduler.
690693