srikalyan commented Dec 24, 2025

Summary

  • Adds FreePages *uint64 field to HugePagesInfo struct, populated from /sys/devices/system/node/node<N>/hugepages/hugepages-<size>kB/free_hugepages
  • Uses pointer type with omitempty to distinguish between "0 free pages" and "data unavailable"
  • Adds machine_node_hugepages_free Prometheus metric to expose free hugepage count per NUMA node
  • Enables consumers like the Kubernetes Memory Manager to verify actual hugepage availability during pod admission

Motivation

The Kubernetes Static Memory Manager currently only tracks hugepage allocations for Guaranteed QoS pods. However, Burstable and BestEffort pods can consume hugepages (via hugetlbfs mounts or mmap with MAP_HUGETLB) without being tracked. This causes Guaranteed pods to be admitted based on stale allocation data, only to fail at runtime when hugepages are exhausted.

By exposing free_hugepages from sysfs, consumers can verify actual OS-reported availability before making admission decisions.

Design

The field uses *uint64 with omitempty (following v2 convention) to distinguish:

  • nil: free_hugepages data unavailable (file missing or unreadable)
  • 0: zero free hugepages available
  • N: N free hugepages available

This allows consumers to detect when the data isn't available and fall back appropriately.

Note: Since GetMachineInfo() is cached at startup, the FreePages value represents point-in-time data. Consumers requiring real-time availability may need to read sysfs directly or use a dedicated fresh-read method (pending KEP outcome).

Prometheus Metric

New metric machine_node_hugepages_free exposes free hugepage count:

# HELP machine_node_hugepages_free Number of free hugepages on NUMA node.
# TYPE machine_node_hugepages_free gauge
machine_node_hugepages_free{node_id="0",page_size="2048",...} 512
machine_node_hugepages_free{node_id="1",page_size="1048576",...} 2

Labels match machine_node_hugepages_count for easy correlation. The metric is only emitted when FreePages data is available (nil-safe).
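The nil-safe emission behavior can be sketched as follows; the types and helper are simplified stand-ins for cadvisor's MachineInfo and collector, using stdlib formatting rather than the Prometheus client library:

```go
package main

import "fmt"

// hugePages and node are simplified stand-ins for cadvisor's
// MachineInfo structures, not the real types.
type hugePages struct {
	pageSizeKB uint64
	freePages  *uint64 // nil means free_hugepages was unavailable
}

type node struct {
	id        int
	hugePages []hugePages
}

// emitFree renders machine_node_hugepages_free samples in Prometheus
// exposition format, skipping entries whose freePages is nil.
func emitFree(nodes []node) []string {
	var out []string
	for _, n := range nodes {
		for _, hp := range n.hugePages {
			if hp.freePages == nil {
				continue // data unavailable: emit nothing rather than a fake 0
			}
			out = append(out, fmt.Sprintf(
				`machine_node_hugepages_free{node_id="%d",page_size="%d"} %d`,
				n.id, hp.pageSizeKB, *hp.freePages))
		}
	}
	return out
}

func main() {
	free := uint64(512)
	nodes := []node{
		{id: 0, hugePages: []hugePages{{pageSizeKB: 2048, freePages: &free}}},
		{id: 1, hugePages: []hugePages{{pageSizeKB: 1048576}}}, // nil: skipped
	}
	for _, line := range emitFree(nodes) {
		fmt.Println(line)
	}
}
```

Skipping the sample entirely when the pointer is nil is what keeps "unavailable" distinguishable from "zero free pages" on the metrics endpoint as well.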

Test Plan

  • Added unit tests for GetHugePagesFree() in sysfs
  • Updated TestGetHugePagesInfo to verify FreePages is correctly populated
  • Verified JSON serialization with omitempty behavior
  • Added TestGetHugePagesFree() for Prometheus metric extraction
  • Updated TestPrometheusMachineCollector expected output
  • All existing tests pass

Related

This change adds a FreePages field to HugePagesInfo, populated from
/sys/devices/system/node/node<N>/hugepages/hugepages-<size>kB/free_hugepages

This enables consumers like the Kubernetes Memory Manager to verify
actual hugepage availability during pod admission, rather than only
tracking allocations which can miss consumption by untracked workloads.

The field uses *uint64 with omitempty to distinguish between:
- nil: free_hugepages data unavailable (file missing or unreadable)
- 0: zero free hugepages available
- N: N free hugepages available

Related: kubernetes/kubernetes#134395
srikalyan added a commit to srikalyan/enhancements that referenced this pull request Dec 24, 2025
This KEP proposes enhancing the Memory Manager's Static policy to
verify OS-reported free hugepages availability during pod admission.

Problem:
The Memory Manager only tracks hugepage allocations for Guaranteed QoS
pods. Burstable/BestEffort pods can consume hugepages without being
tracked, causing subsequent Guaranteed pods to be admitted but fail
at runtime when hugepages are exhausted.

Solution:
- Add FreePages field to cadvisor's HugePagesInfo (PR google/cadvisor#3804)
- Verify OS-reported free hugepages during Allocate() in Static policy
- Reject pods when insufficient free hugepages are available

Related: kubernetes/kubernetes#134395
@srikalyan (Author)
Based on KEP review feedback, I'm considering changing FreePages from *uint64 to uint64.

Rationale: On Linux systems with hugepages configured, the sysfs interface (/sys/devices/system/node/node<N>/hugepages/hugepages-<size>kB/free_hugepages) is always available. We don't need to distinguish between "0 free hugepages" and "data unavailable" since sysfs won't be unavailable.

Current implementation: Uses *uint64 with omitempty to distinguish nil (unavailable) from 0 (zero free).

Proposed change: Use plain uint64. A value of 0 simply means zero free hugepages.

What are your thoughts on this? I'm happy to update the PR either way based on cadvisor's conventions and your preference.

cc @iwankgb

@srikalyan srikalyan marked this pull request as draft December 27, 2025 17:35
iwankgb (Collaborator) commented Jan 16, 2026

@srikalyan, are you planning to expose this metric via Prometheus endpoint?

@srikalyan (Author)

> @srikalyan, are you planning to expose this metric via Prometheus endpoint?

I would love to; I've been waiting on the KEP. If the KEP is not a blocker, I'm happy to do it soon.

Exposes free hugepage count per NUMA node via Prometheus endpoint:
- Adds machine_node_hugepages_free gauge metric with node_id and page_size labels
- Only emits metrics when FreePages data is available (nil-safe)
- Follows same pattern as existing machine_node_hugepages_count metric

This enables monitoring and alerting on hugepage availability across
NUMA nodes, complementing the HugePagesInfo.FreePages field added in
the previous commit.
@srikalyan (Author)
@iwankgb Done! I've added the machine_node_hugepages_free Prometheus metric in the latest commit.

The metric follows the same pattern as machine_node_hugepages_count:

  • Name: machine_node_hugepages_free
  • Type: Gauge
  • Labels: node_id, page_size (same as machine_node_hugepages_count)
  • Behavior: Only emits when FreePages data is available (nil-safe)

Example output:

# HELP machine_node_hugepages_free Number of free hugepages on NUMA node.
# TYPE machine_node_hugepages_free gauge
machine_node_hugepages_free{node_id="0",page_size="2048",...} 512
machine_node_hugepages_free{node_id="1",page_size="1048576",...} 2

This enables monitoring/alerting on hugepage availability alongside the existing total count metric.
