Commit 8e6ae09

Present implementation options without recommendation
- Add two implementation approaches: Option A (direct sysfs) and Option B (cadvisor)
- Present pros/cons for each option neutrally for KEP review
- Remove cadvisor-specific sections, replace with options discussion
- Add Observability section with metrics, events, logs, alerting
- Update TOC to pass CI verification
- Update KEP number to 5759 throughout

The choice between implementation approaches is left to KEP reviewers based on maintainability preferences and timeline considerations.
1 parent 9a89040 commit 8e6ae09

1 file changed

+73
-70
lines changed
  • keps/sig-node/5759-memory-manager-hugepages-verification

keps/sig-node/5759-memory-manager-hugepages-verification/README.md

Lines changed: 73 additions & 70 deletions
@@ -17,11 +17,18 @@
1717
- [Risks and Mitigations](#risks-and-mitigations)
1818
- [Design Details](#design-details)
1919
- [Implementation Overview](#implementation-overview)
20-
- [cadvisor Changes](#cadvisor-changes)
20+
- [Implementation Approaches](#implementation-approaches)
21+
- [Option A: Direct sysfs Reading in Memory Manager](#option-a-direct-sysfs-reading-in-memory-manager)
22+
- [Option B: Add Fresh-Read Method to cadvisor](#option-b-add-fresh-read-method-to-cadvisor)
23+
- [sysfs Interface](#sysfs-interface)
2124
- [Memory Manager Changes](#memory-manager-changes)
2225
- [Integration with Topology Manager](#integration-with-topology-manager)
2326
- [Interaction with CPU Manager](#interaction-with-cpu-manager)
2427
- [Observability](#observability)
28+
- [Metrics](#metrics)
29+
- [Events](#events)
30+
- [Kubelet Logs](#kubelet-logs)
31+
- [Alerting Recommendations](#alerting-recommendations)
2532
- [Test Plan](#test-plan)
2633
- [Prerequisite testing updates](#prerequisite-testing-updates)
2734
- [Unit tests](#unit-tests)
@@ -44,8 +51,7 @@
4451
- [Drawbacks](#drawbacks)
4552
- [Alternatives](#alternatives)
4653
- [Alternative 1: Track all pod hugepage usage](#alternative-1-track-all-pod-hugepage-usage)
47-
- [Alternative 2: Query sysfs directly in Memory Manager](#alternative-2-query-sysfs-directly-in-memory-manager)
48-
- [Alternative 3: Scheduler-level hugepage awareness](#alternative-3-scheduler-level-hugepage-awareness)
54+
- [Alternative 2: Scheduler-level hugepage awareness](#alternative-2-scheduler-level-hugepage-awareness)
4955
<!-- /toc -->
5056

5157
## Release Signoff Checklist
@@ -144,14 +150,14 @@ Memory Manager wasn't tracking.
144150
## Proposal
145151

146152
Enhance the Memory Manager's Static policy to verify actual hugepage availability
147-
by querying sysfs during pod admission. This involves:
153+
by querying sysfs during pod admission:
148154

149-
1. **cadvisor enhancement**: Add a `FreePages` field to `HugePagesInfo` struct
150-
that reports free hugepages per NUMA node, read from sysfs
155+
**Memory Manager enhancement**: During `Allocate()` in the Static policy,
156+
verify that the OS-reported free hugepage count (read from sysfs) meets or exceeds the
157+
requested amount before admitting the pod.
151158

152-
2. **Memory Manager enhancement**: During `Allocate()` in the Static policy,
153-
verify that OS-reported free hugepages meet or exceed the requested amount
154-
before admitting the pod
159+
See [Implementation Approaches](#implementation-approaches) for options on how
160+
the sysfs reading is performed.
155161

156162
### Current Admission Flow
157163

@@ -257,56 +263,62 @@ Job B can be rescheduled to another node with sufficient hugepages.
257263

258264
### Implementation Overview
259265

260-
The implementation consists of two parts:
266+
The core enhancement is adding a `verifyOSHugepagesAvailability()` function to
267+
the Memory Manager's Static policy, called during `Allocate()`. This function
268+
reads fresh hugepage availability and rejects the pod when too few hugepages are free.
261269

262-
1. **cadvisor**: Add `FreePages uint64` field to `HugePagesInfo` struct, populated
263-
from sysfs. Also expose a method to read current free hugepages on-demand.
270+
### Implementation Approaches
264271

265-
2. **kubelet Memory Manager**: Add `verifyOSHugepagesAvailability()` function
266-
called during `Allocate()` that reads **fresh** hugepage availability from sysfs.
272+
There are two approaches for reading free hugepages:
267273

268-
**Important**: cadvisor's `GetMachineInfo()` is called once at startup and cached.
269-
The `FreePages` field in cached machine info would be stale. Therefore, verification
270-
must read sysfs directly during each `Allocate()` call, not rely on cached values.
271-
We will add a `GetCurrentHugepagesInfo()` method to cadvisor's `Manager` interface
272-
that performs a fresh sysfs read.
274+
#### Option A: Direct sysfs Reading in Memory Manager
273275

274-
### cadvisor Changes
276+
Read sysfs directly in the Memory Manager without cadvisor changes.
275277

276-
**Struct update**:
277-
```go
278-
type HugePagesInfo struct {
279-
// huge page size (in kB)
280-
PageSize uint64 `json:"page_size"`
281-
// number of huge pages
282-
NumPages uint64 `json:"num_pages"`
283-
// number of free huge pages
284-
FreePages uint64 `json:"free_pages"`
285-
}
286-
```
278+
**Pros:**
279+
- No external dependencies on critical admission path
280+
- Simple implementation (~10 lines of sysfs reading)
281+
- Faster to implement and merge (single repo)
282+
- Memory Manager already reads memory topology from sysfs (precedent)
287283

288-
**New method on Manager interface**:
289-
```go
290-
// GetCurrentHugepagesInfo returns fresh hugepage info per NUMA node by reading sysfs.
291-
// This is separate from GetMachineInfo() which returns cached startup data.
292-
func (m *manager) GetCurrentHugepagesInfo() (map[int][]HugePagesInfo, error)
293-
```
284+
**Cons:**
285+
- Duplicates sysfs reading logic (though trivial)
286+
- Other cadvisor consumers don't benefit
287+
288+
#### Option B: Add Fresh-Read Method to cadvisor
289+
290+
Add `GetCurrentHugepagesInfo()` method to cadvisor that reads sysfs on-demand.
291+
292+
**Note**: cadvisor's existing `GetMachineInfo()` is cached at startup, so simply
293+
adding a `FreePages` field there would yield stale values. A new method for fresh reads
294+
would be required.
295+
296+
**Pros:**
297+
- Single source of truth for hugepage info
298+
- Benefits other cadvisor consumers
299+
- Cleaner abstraction
294300

295-
The `FreePages` field is populated by reading from:
301+
**Cons:**
302+
- Cross-repo dependency (cadvisor PR must merge first)
303+
- Adds API surface to cadvisor
304+
- Longer timeline
305+
306+
The choice between options should be made during KEP review based on
307+
maintainability preferences and timeline considerations.
308+
309+
### sysfs Interface
310+
311+
Regardless of approach, free hugepages are read from:
296312
```
297313
/sys/devices/system/node/node<N>/hugepages/hugepages-<size>kB/free_hugepages
298314
```
299315

300316
**Note on reserved hugepages**: Linux tracks `resv_hugepages` (reserved but not
301-
yet faulted). For this implementation, we use `free_hugepages` directly because:
317+
yet faulted). We use `free_hugepages` directly because:
302318
- Reserved pages are committed to specific processes
303319
- A new pod cannot use reserved pages
304320
- `free_hugepages` accurately reflects what's available for new allocations
305321

306-
**Note**: Since sysfs is always available on Linux systems with hugepages configured,
307-
we use a simple `uint64` rather than a pointer. A value of 0 means zero free
308-
hugepages are available.
309-
310322
### Memory Manager Changes
311323

312324
During `Allocate()` in the Static policy:
@@ -317,7 +329,7 @@ func (p *staticPolicy) verifyOSHugepagesAvailability(
317329
pod *v1.Pod,
318330
container *v1.Container,
319331
) error {
320-
// 1. Call cadvisor's GetCurrentHugepagesInfo() to get fresh sysfs data
332+
// 1. Read free hugepages directly from sysfs for each NUMA node
321333
// 2. For each hugepage size requested by the container:
322334
// a. Sum free hugepages across candidateNUMANodes only
323335
// b. Compare against the requested amount
@@ -421,7 +433,7 @@ to implement this enhancement.
421433
##### Prerequisite testing updates
422434

423435
- Existing Memory Manager unit tests cover allocation logic
424-
- cadvisor tests cover sysfs reading functionality
436+
- For Option B: cadvisor tests cover sysfs reading functionality
425437

426438
##### Unit tests
427439

@@ -435,7 +447,7 @@ to implement this enhancement.
435447

436448
##### Integration tests
437449

438-
- Test Memory Manager with mocked cadvisor returning various FreePages values
450+
- Test Memory Manager with mocked hugepage availability (sysfs or cadvisor depending on chosen approach)
439451
- Test admission flow with hugepage verification enabled/disabled
440452

441453
##### e2e tests
@@ -483,12 +495,11 @@ will correctly verify against current OS hugepage availability.
483495

484496
### Version Skew Strategy
485497

486-
The feature is entirely within the kubelet and depends on cadvisor (vendored).
487-
No control plane or cross-component version skew concerns.
498+
The feature is entirely within the kubelet. No control plane or cross-component
499+
version skew concerns.
488500

489-
Since cadvisor is vendored into kubelet, the kubelet and cadvisor versions are
490-
always synchronized. The `FreePages` field will be available when the feature
491-
gate is enabled.
501+
- **Option A**: No version skew concerns (direct sysfs reading)
502+
- **Option B**: Since cadvisor is vendored into kubelet, versions are synchronized
492503

493504
## Production Readiness Review Questionnaire
494505

@@ -547,8 +558,9 @@ No.
547558

548559
###### How can an operator determine if the feature is in use by workloads?
549560

550-
- Feature gate is enabled
551-
- Pods request hugepages resources
561+
- Feature gate `MemoryManagerHugepagesVerification` is enabled
562+
- Metric `memory_manager_hugepages_verification_total` is incrementing (indicates verification checks are being performed)
563+
- Pods with Guaranteed QoS requesting hugepages resources are being scheduled
552564

553565
###### How can someone using this feature know that it is working for their instance?
554566

@@ -590,16 +602,16 @@ Additional metrics that could be added in Beta:
590602

591603
###### Does this feature depend on any specific services running in the cluster?
592604

593-
- cadvisor (bundled with kubelet)
594-
- Usage: Provides machine info including hugepage free counts
595-
- Impact of outage: Verification skipped, graceful degradation
596-
- Impact of degraded performance: Slightly increased admission latency
605+
Depends on the implementation approach chosen (see [Implementation Approaches](#implementation-approaches)):
606+
607+
- **Option A (Direct sysfs)**: No external dependencies. Reads directly from Linux sysfs.
608+
- **Option B (cadvisor)**: Depends on cadvisor (bundled with kubelet) for fresh hugepage reads.
597609

598610
### Scalability
599611

600612
###### Will enabling / using this feature result in any new API calls?
601613

602-
No new API calls. The feature reads from local sysfs and cadvisor machine info.
614+
No new API calls. The feature reads from local sysfs files.
603615

604616
###### Will enabling / using this feature result in introducing new API types?
605617

@@ -653,10 +665,10 @@ No impact. The feature operates entirely within kubelet using local sysfs.
653665
## Implementation History
654666

655667
- 2024-12-24: Initial KEP draft
656-
- 2024-12-27: KEP updated based on reviewer feedback
668+
- 2024-12-27: KEP updated based on reviewer feedback; added implementation options
657669
- Enhancement issue: https://github.com/kubernetes/enhancements/issues/5759
658670
- Related issue: https://github.com/kubernetes/kubernetes/issues/134395
659-
- cadvisor PR: https://github.com/google/cadvisor/pull/3804
671+
- cadvisor PR (for Option B): https://github.com/google/cadvisor/pull/3804 (draft)
660672

661673
## Drawbacks
662674

@@ -675,16 +687,7 @@ Extend Memory Manager to track hugepage usage by Burstable and BestEffort pods.
675687
- Would not catch external (non-Kubernetes) hugepage consumers
676688
- Changes the scope and purpose of Memory Manager
677689

678-
### Alternative 2: Query sysfs directly in Memory Manager
679-
680-
Read sysfs directly in Memory Manager without cadvisor changes.
681-
682-
**Rejected because**:
683-
- Duplicates sysfs reading logic already in cadvisor
684-
- cadvisor already provides machine info abstraction
685-
- Adding to cadvisor benefits other consumers of machine info
686-
687-
### Alternative 3: Scheduler-level hugepage awareness
690+
### Alternative 2: Scheduler-level hugepage awareness
688691

689692
Add hugepage availability awareness to the Kubernetes scheduler.
690693
