Add FreePages to HugePagesInfo for hugepage availability reporting #3804
base: master
Conversation
This change adds a FreePages field to HugePagesInfo, populated from
/sys/devices/system/node/node<N>/hugepages/hugepages-<size>kB/free_hugepages

This enables consumers like the Kubernetes Memory Manager to verify actual hugepage availability during pod admission, rather than only tracking allocations, which can miss consumption by untracked workloads.

The field uses *uint64 with omitempty to distinguish between:
- nil: free_hugepages data unavailable (file missing or unreadable)
- 0: zero free hugepages available
- N: N free hugepages available

Related: kubernetes/kubernetes#134395
This KEP proposes enhancing the Memory Manager's Static policy to verify OS-reported free hugepages availability during pod admission.

Problem: The Memory Manager only tracks hugepage allocations for Guaranteed QoS pods. Burstable/BestEffort pods can consume hugepages without being tracked, causing subsequent Guaranteed pods to be admitted but fail at runtime when hugepages are exhausted.

Solution:
- Add FreePages field to cadvisor's HugePagesInfo (PR google/cadvisor#3804)
- Verify OS-reported free hugepages during Allocate() in Static policy
- Reject pods when insufficient free hugepages are available

Related: kubernetes/kubernetes#134395
Based on KEP review feedback, I'm considering changing

Rationale: On Linux systems with hugepages configured, the sysfs interface (

Current implementation: Uses

Proposed change: Use plain

What are your thoughts on this? I'm happy to update the PR either way based on cadvisor's conventions and your preference.

cc @iwankgb
@srikalyan, are you planning to expose this metric via the Prometheus endpoint?
I would love to, but I'm waiting on the KEP. If the KEP is not a blocker, I'm happy to do it soon.
Exposes free hugepage count per NUMA node via Prometheus endpoint:
- Adds machine_node_hugepages_free gauge metric with node_id and page_size labels
- Only emits metrics when FreePages data is available (nil-safe)
- Follows same pattern as existing machine_node_hugepages_count metric

This enables monitoring and alerting on hugepage availability across NUMA nodes, complementing the HugePagesInfo.FreePages field added in the previous commit.
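The nil-safe emission described in this commit can be illustrated with a hand-rolled rendering of the Prometheus text format. This sketch deliberately avoids the Prometheus client library and is not cadvisor's collector code: `nodeHugePages` and `renderFreeMetric` are hypothetical names, only the two labels named above are shown, and the label values (numeric node ID, page size in kB) are assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// nodeHugePages is a simplified, hypothetical stand-in for per-node data.
type nodeHugePages struct {
	NodeID     int
	PageSizeKB uint64
	FreePages  *uint64 // nil when free_hugepages could not be read
}

// renderFreeMetric writes one sample line per entry in the Prometheus text
// exposition format, skipping entries whose FreePages is nil.
func renderFreeMetric(entries []nodeHugePages) string {
	var b strings.Builder
	for _, e := range entries {
		if e.FreePages == nil {
			continue // nil-safe: emit nothing rather than a misleading 0
		}
		fmt.Fprintf(&b, "machine_node_hugepages_free{node_id=\"%d\",page_size=\"%d\"} %d\n",
			e.NodeID, e.PageSizeKB, *e.FreePages)
	}
	return b.String()
}

func main() {
	free := uint64(512)
	fmt.Print(renderFreeMetric([]nodeHugePages{
		{NodeID: 0, PageSizeKB: 2048, FreePages: &free},
		{NodeID: 1, PageSizeKB: 2048, FreePages: nil}, // skipped
	}))
}
```

Skipping nil entries means an absent series signals "unknown", so alerts on a low free count do not fire for nodes where the data simply could not be read.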
@iwankgb Done! I've added the machine_node_hugepages_free metric. It follows the same pattern as machine_node_hugepages_count.
This enables monitoring/alerting on hugepage availability alongside the existing total count metric.
Summary
- Add FreePages *uint64 field to HugePagesInfo struct, populated from /sys/devices/system/node/node<N>/hugepages/hugepages-<size>kB/free_hugepages
- Use *uint64 with omitempty to distinguish between "0 free pages" and "data unavailable"
- Add machine_node_hugepages_free Prometheus metric to expose free hugepage count per NUMA node

Motivation
The Kubernetes Static Memory Manager currently only tracks hugepage allocations for Guaranteed QoS pods. However, Burstable and BestEffort pods can consume hugepages (via hugetlbfs mounts or mmap with MAP_HUGETLB) without being tracked. This causes Guaranteed pods to be admitted based on stale allocation data, only to fail at runtime when hugepages are exhausted.
By exposing free_hugepages from sysfs, consumers can verify actual OS-reported availability before making admission decisions.

Design
The field uses *uint64 with omitempty (following v2 convention) to distinguish:
- nil: free_hugepages data unavailable (file missing or unreadable)
- 0: zero free hugepages available
- N: N free hugepages available

This allows consumers to detect when the data isn't available and fall back appropriately.
Note: Since GetMachineInfo() is cached at startup, the FreePages value represents point-in-time data. Consumers requiring real-time availability may need to read sysfs directly or use a dedicated fresh-read method (pending KEP outcome).

Prometheus Metric
The new metric machine_node_hugepages_free exposes the free hugepage count per NUMA node. Its labels match machine_node_hugepages_count for easy correlation, and it is only emitted when FreePages data is available (nil-safe).

Test Plan
- GetHugePagesFree() in sysfs
- TestGetHugePagesInfo to verify FreePages is correctly populated
- omitempty behavior
- TestGetHugePagesFree() for Prometheus metric extraction
- TestPrometheusMachineCollector expected output

Related