
CP-309998: ignore small amount of pages in other nodes #6884

Merged
edwintorok merged 1 commit into master from private/marcusg/cp-309998
Feb 4, 2026

Conversation


@mg12 mg12 commented Feb 3, 2026

This is an optimisation of the node-count calculation until we fix all the small Xen-internal allocations that are made for the VM outside the guest's main memory. It synchronises the behaviour with the RRD VM.numa_nodes calculation.

rrdp-squeezed currently uses a threshold of >4096 pages when counting
the number of nodes. This ignores the small internal data structures
that Xen or other kernel devices may sometimes allocate for the VM
outside the node where the VM's main memory is allocated.

This is a temporary fix until CP-311303 accounts for the small number
of pages that sometimes appear outside the VM's main node. In
experiments it is usually a single-digit number, often 1; the maximum
observed was around 2200 pages.

Without this fix, the VM.numa_nodes calculation differs from the one
returned by rrdp-squeezed, and VM.numa_nodes is over-sensitive to
these small numbers of pages in other nodes.

@mg12 mg12 requested review from edwintorok and lindig February 3, 2026 17:02

psafont commented Feb 4, 2026

Could you explain in the commit why 16 MiB is used? I think it's because it's the minimum allowed, but it would be good to confirm and document it.


mg12 commented Feb 4, 2026

4096 is used because it's the value used in the RRD behaviour of the VM.numa_nodes calculation, so that both code paths (RRD and the VM.numa_nodes field) return the same result.

@lindig lindig left a comment

I agree with @psafont that this needs a comment here explaining what this calculation does.


Signed-off-by: Marcus Granado <marcus.granado@cloud.com>
@mg12 mg12 force-pushed the private/marcusg/cp-309998 branch from b662272 to 464de44 on February 4, 2026 12:18

mg12 commented Feb 4, 2026

I've updated the PR description and the commit message to convey more clearly what this calculation does.

@edwintorok

dss_numa_info should call XenctrlExt.DomainNuma.state to calculate the number of NUMA nodes and the available memory, sharing the code with xenopsd. That way workarounds like this can be centralized, code duplication avoided, and we won't reintroduce this difference when dropping the workaround.

Let's merge this workaround for now and then open another PR to share the code and reduce the tech debt being introduced here.

@edwintorok edwintorok added this pull request to the merge queue Feb 4, 2026
Merged via the queue into master with commit ba43e00 Feb 4, 2026
28 checks passed

edwintorok commented Feb 4, 2026

To avoid a magic constant (4096), we could perhaps ignore pages up to memory_overhead + video_memory instead (we only claim the main memory; any overhead, the p2m pages, etc. could be allocated from somewhere else if we ran out).

Try filtering out all zero-sized nodes in the VM's memory usage, then sort the per-node memory in decreasing order, and use a fold to go through the nodes, subtracting from a running main_memory_used and incrementing main_memory_nodes. Once main_memory_used hits 0, then from the next NUMA node (if any) increment overhead_memory_nodes instead, and stop incrementing main_memory_nodes.
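A minimal sketch of that fold; the names (classify_nodes, main_memory, per_node_pages) are illustrative, not from the xapi codebase:

```ocaml
(* Hypothetical sketch of the suggested classification: the largest
   nodes that together satisfy the VM's main memory claim count as
   main-memory nodes; any remaining non-empty nodes are overhead. *)
let classify_nodes ~main_memory per_node_pages =
  let sorted =
    per_node_pages
    |> List.filter (fun pages -> pages > 0L)      (* drop zero-sized nodes *)
    |> List.sort (fun a b -> Int64.compare b a)   (* decreasing order *)
  in
  let _, main_nodes, overhead_nodes =
    List.fold_left
      (fun (remaining, main_nodes, overhead_nodes) pages ->
        if remaining > 0L then
          (* still covering the VM's main memory claim *)
          (Int64.sub remaining pages, main_nodes + 1, overhead_nodes)
        else
          (* claim already satisfied: anything further is overhead *)
          (remaining, main_nodes, overhead_nodes + 1))
      (main_memory, 0, 0) sorted
  in
  (main_nodes, overhead_nodes)
```

With main_memory = 1_000_000L pages and per-node usage [999_000L; 2_200L; 1L; 0L], this yields 2 main-memory nodes (the first two together cover the claim) and 1 overhead node.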

There is a xapi function to compute the overhead; rrdp probably can't use it directly, but it might be able to link memory.ml and calculate the overhead itself.

@lindig lindig left a comment

Still think there should be a code comment: (* see commit message *)

in
let optimised = nodes = 1 || nodes < host_nodes in
{optimised; nodes; memory}
with Failure _ -> default

While you are here: is this unexpected and should be logged?


4 participants