
Conversation

@klihub
Collaborator

@klihub klihub commented Dec 3, 2025

Note: this PR is stacked on #601, which should be reviewed first.

This PR updates the topology-aware policy accounting and allocation algorithm to allow slicing an idle shared CPU pool empty for exclusive CPU allocations.
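
For illustration only, here is a minimal Go sketch of the accounting idea (the type, field, and function names are hypothetical simplifications, not the actual topology-aware policy code): an idle shared pool, one with no outstanding shared grants, may be sliced empty for exclusive CPUs, while a pool with shared users keeps enough whole CPUs behind to cover its grants.

```go
// Minimal sketch of the idle-pool slicing rule; hypothetical types and
// names, not the real topology-aware policy implementation.
package main

import "fmt"

// pool captures just the accounting bits relevant here: the CPUs in the
// shared pool and the milli-CPUs currently granted out of it as shared CPU.
type pool struct {
	name             string
	sharableCPUs     int // whole CPUs in the shared pool
	sharedGrantMilli int // milli-CPUs granted from the shared pool
}

// sliceableCPUs returns how many whole CPUs may be carved out of the shared
// pool for exclusive allocations. An idle pool (no shared grants) may be
// sliced empty; otherwise enough whole CPUs must stay behind to cover the
// outstanding shared grants.
func sliceableCPUs(p pool) int {
	if p.sharedGrantMilli == 0 {
		return p.sharableCPUs // idle: the pool may be sliced empty
	}
	keep := (p.sharedGrantMilli + 999) / 1000 // round up to whole CPUs
	if keep >= p.sharableCPUs {
		return 0
	}
	return p.sharableCPUs - keep
}

func main() {
	idle := pool{name: "socket #7", sharableCPUs: 1, sharedGrantMilli: 0}
	busy := pool{name: "socket #0", sharableCPUs: 4, sharedGrantMilli: 3000}
	fmt.Printf("%s: %d sliceable\n", idle.name, sliceableCPUs(idle)) // 1 sliceable
	fmt.Printf("%s: %d sliceable\n", busy.name, sliceableCPUs(busy)) // 1 sliceable
}
```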

@klihub klihub force-pushed the devel/allow-slicing-idle-pools-empty branch from 40ee962 to 93489be on December 3, 2025 10:35
@klihub klihub force-pushed the devel/allow-slicing-idle-pools-empty branch from 93489be to 5f734aa on December 4, 2025 07:33
@klihub klihub marked this pull request as ready for review December 5, 2025 08:51
Collaborator

@askervin askervin left a comment


I ran one more test in my environment to validate a case where burstable containers are assigned to a socket-level shared pool instead of leaf (NUMA) nodes, and guaranteed containers then flow in:

CONTCOUNT=2 CPUREQ=3000m CPULIM=6000m create burstable
CPU=4 MEM=100M CONTCOUNT=2 create guaranteed
report allowed
verify 'disjoint_sets(cpus["pod0c0"],cpus["pod0c1"],cpus["pod1c0"],cpus["pod1c1"])'

This works as expected. Finally, I pushed it over the limit by changing the guaranteed pod to CPU=3 MEM=100M CONTCOUNT=3 create guaranteed, which I first thought would succeed. But then again, the last container in this pod would empty a socket-level shared pool that is running a burstable container, so in the end it was expected to fail.

The reason I mention these tests is that if you find the passing test would truly add value for catching regressions in the future, let's have it, too. But if not, I'm happy to merge #601 and #602 as is.

In other words, LGTM.

Thanks for these patches @klihub! I think this is better than just lipstick. :)

@askervin
Collaborator

askervin commented Dec 5, 2025

On top of these PRs, the e2e test in #599 fails, whereas on top of #598 it passed (most likely because the reserved shared pool CPUs ended up empty).

The reason is that kube-proxy, which is the only besteffort reserved pod running in our initial setup, gets assigned to the shared root pool instead of the reserved pool (cpuset:1535) like all the other reserved containers. Therefore allocating 5 exclusive CPUs fails, as it should.

D: [              policy              ] <post-alloc><virtual root>
D: [              policy              ] <post-alloc>  - <root capacity: CPU: reserved:1535 (1000m), sharable:0-2,511,4095 (5000m), MemLimit: 13.65G>
D: [              policy              ] <post-alloc>  - <root allocatable: CPU: reserved:1535 (allocatable: 101m), grantedReserved:899m, sharable:0-2,511,4095 (allocatable:4999m)/sliceable:0-2,511,4095 (5000m), MemLimit: 13.52G>
D: [              policy              ] <post-alloc>  - normal memory: 0,2,7
D: [              policy              ] <post-alloc>  - PMEM memory: 1,3,4,5,6
D: [              policy              ] <post-alloc>    + <grant for kube-system/kube-apiserver-s8c4k-fedora-42-containerd/kube-apiserver from root: cputype: reserved, reserved: 1535 (250m), shared: 0-2,511,4095 (0m), memory: nodes{0-7} (0.00)>
D: [              policy              ] <post-alloc>    + <grant for kube-system/kube-scheduler-s8c4k-fedora-42-containerd/kube-scheduler from root: cputype: reserved, reserved: 1535 (100m), shared: 0-2,511,4095 (0m), memory: nodes{0-7} (0.00)>
D: [              policy              ] <post-alloc>    + <grant for kube-system/kube-controller-manager-s8c4k-fedora-42-containerd/kube-controller-manager from root: cputype: reserved, reserved: 1535 (199m), shared: 0-2,511,4095 (0m), memory: nodes{0-7} (0.00)>
D: [              policy              ] <post-alloc>    + <grant for kube-system/etcd-s8c4k-fedora-42-containerd/etcd from root: cputype: reserved, reserved: 1535 (100m), shared: 0-2,511,4095 (0m), memory: nodes{0-7} (0.00)>
D: [              policy              ] <post-alloc>    + <grant for kube-system/kube-proxy-7gjs6/kube-proxy from root: **cputype: reserved, shared: 0-2,511,4095** (0m), memory: nodes{0-7} (0.00)>
D: [              policy              ] <post-alloc>    + <grant for kube-system/coredns-66bc5c9577-4v4mt/coredns from root: cputype: reserved, reserved: 1535 (100m), shared: 0-2,511,4095 (0m), memory: nodes{0-7} (69.90M)>
D: [              policy              ] <post-alloc>    + <grant for kube-system/coredns-66bc5c9577-dq845/coredns from root: cputype: reserved, reserved: 1535 (100m), shared: 0-2,511,4095 (0m), memory: nodes{0-7} (69.90M)>
D: [              policy              ] <post-alloc>    + <grant for kube-system/nri-resource-policy-topology-aware-7g4hb/nri-resource-policy-topology-aware from root: cputype: reserved, reserved: 1535 (50m), shared: 0-2,511,4095 (0m), memory: nodes{0-7} (0.00)>
D: [              policy              ] <post-alloc>  - children:
D: [              policy              ] <post-alloc>    <socket #0>
D: [              policy              ] <post-alloc>      - <socket #0 capacity: CPU: sharable:0-2,511 (4000m), MemLimit: 10.23G>
D: [              policy              ] <post-alloc>      - <socket #0 allocatable: CPU: sharable:0-2,511 (allocatable:4000m)/sliceable:0-2,511 (4000m), MemLimit: 10.23G>
D: [              policy              ] <post-alloc>      - normal memory: 0
D: [              policy              ] <post-alloc>      - PMEM memory: 1,3,4,5,6
D: [              policy              ] <post-alloc>      - parent: <root>
D: [              policy              ] <post-alloc>    <socket #2>
D: [              policy              ] <post-alloc>      - <socket #2 capacity: CPU: reserved:1535 (1000m), MemLimit: 10.73G>
D: [              policy              ] <post-alloc>      - <socket #2 allocatable: CPU: reserved:1535 (allocatable: 101m), MemLimit: 10.73G>
D: [              policy              ] <post-alloc>      - normal memory: 2
D: [              policy              ] <post-alloc>      - PMEM memory: 1,3,4,5,6
D: [              policy              ] <post-alloc>      - parent: <root>
D: [              policy              ] <post-alloc>    <socket #7>
D: [              policy              ] <post-alloc>      - <socket #7 capacity: CPU: sharable:4095 (1000m), MemLimit: 10.56G>
D: [              policy              ] <post-alloc>      - <socket #7 allocatable: CPU: sharable:4095 (allocatable:1000m)/sliceable:4095 (1000m), MemLimit: 10.56G>
D: [              policy              ] <post-alloc>      - normal memory: 7
D: [              policy              ] <post-alloc>      - PMEM memory: 1,3,4,5,6
D: [              policy              ] <post-alloc>      - parent: <root>

Giving kube-proxy a CPU request, for instance 10m, would solve the problem. Then it runs on the reserved CPU, too, and allocating all 5 free CPUs works fine.

Probably I'll just need to modify the test... unless we want to run besteffort reserved containers on reserved CPUs, too.
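
As a rough sketch of the option mentioned above (running besteffort reserved containers on reserved CPUs, too), the classification could look something like the following; the names and the namespace-based check are assumptions for illustration, not the actual policy implementation:

```go
// Sketch of classifying besteffort containers in reserved namespaces as
// reserved-CPU workloads; hypothetical names, not the real policy code.
package main

import "fmt"

type cpuClass int

const (
	cpuNormal cpuClass = iota
	cpuReserved
)

type container struct {
	namespace  string
	cpuRequest int // milli-CPUs; zero for a besteffort container
}

// reservedNamespaces lists namespaces whose workloads belong on reserved
// CPUs (kube-system in a typical setup).
var reservedNamespaces = map[string]bool{"kube-system": true}

// cpuClassFor picks the CPU class for a container. The point of the sketch:
// a besteffort container (zero CPU request) in a reserved namespace still
// lands in the reserved class instead of falling through to the shared
// root pool.
func cpuClassFor(c container) cpuClass {
	if reservedNamespaces[c.namespace] {
		return cpuReserved // even when c.cpuRequest == 0
	}
	return cpuNormal
}

func main() {
	kubeProxy := container{namespace: "kube-system", cpuRequest: 0}
	fmt.Println("kube-proxy reserved:", cpuClassFor(kubeProxy) == cpuReserved) // true
}
```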

@klihub
Collaborator Author

klihub commented Dec 5, 2025

Probably I'll just need to modify the test... unless we want to run besteffort reserved containers on reserved CPUs, too.

@askervin I think we want to do that, and I thought we already did. I'll try to check why it does not end up there...

@askervin
Collaborator

askervin commented Dec 12, 2025

@klihub, I merged #601, but it seems I can't cleanly edit/merge this PR so that GitHub would realize the stacked commits are already there.

(I don't dare to try what the "web editor" conflict resolution would look like... I'm afraid it might create duplicates of the already merged stacked commits, with very questionable-looking changes in them.)

But I think we could merge this once it can be done cleanly, and handle the issue of the besteffort-reserved container not going to reserved CPUs separately.

@klihub
Collaborator Author

klihub commented Dec 12, 2025

@klihub, I merged #601, but it seems I can't cleanly edit/merge this PR so that GitHub would realize the stacked commits are already there.

(I don't dare to try what the "web editor" conflict resolution would look like... I'm afraid it might create duplicates of the already merged stacked commits, with very questionable-looking changes in them.)

But I think we could merge this once it can be done cleanly, and handle the issue of the besteffort-reserved container not going to reserved CPUs separately.

@askervin Thanks! Just gimme a sec and I'll rebase.

Allow slicing idle shared pools empty for exclusive allocations.

Signed-off-by: Krisztian Litkey <[email protected]>
@klihub klihub force-pushed the devel/allow-slicing-idle-pools-empty branch from 5f734aa to f0b7b26 on December 12, 2025 06:59
Collaborator

@marquiz marquiz left a comment


LGTM

@askervin askervin merged commit f8824ba into containers:main Dec 12, 2025
9 checks passed
@klihub klihub deleted the devel/allow-slicing-idle-pools-empty branch December 16, 2025 11:15