Commit 58c8107

Merge pull request #48797 from ffromani/issue-38121-cpu-manager
node: start moving the resource management docs to concepts
2 parents 01eccc6 + 96a2e17 commit 58c8107

2 files changed: +289 -197 lines changed

content/en/docs/concepts/policy/node-resource-managers.md

Lines changed: 275 additions & 3 deletions
@@ -13,10 +13,282 @@ In order to support latency-critical and high-throughput workloads, Kubernetes o

<!-- body -->

## Hardware topology alignment policies

_Topology Manager_ is a kubelet component that aims to coordinate the set of components that are
responsible for these optimizations. The overall resource management process is governed using
the policy you specify.
To learn more, read [Control Topology Management Policies on a Node](/docs/tasks/administer-cluster/topology-manager/).
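
For illustration, a minimal sketch of how a topology alignment policy is selected; the `topologyManagerPolicy` field belongs to the [KubeletConfiguration](/docs/reference/config-api/kubelet-config.v1beta1/) API, and the value chosen here is only an example:

```yaml
# Sketch of a kubelet configuration fragment.
# The field name is real; the chosen policy value is illustrative
# (other values: none, best-effort, restricted).
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
topologyManagerPolicy: "single-numa-node"
```
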
## Policies for assigning CPUs to Pods

{{< feature-state feature_gate_name="CPUManager" >}}

Once a Pod is bound to a Node, the kubelet on that node may need to either multiplex the existing
hardware (for example, sharing CPUs across multiple Pods) or allocate hardware by dedicating some
resource (for example, assigning one or more CPUs for a Pod's exclusive use).

By default, the kubelet uses [CFS quota](https://en.wikipedia.org/wiki/Completely_Fair_Scheduler)
to enforce pod CPU limits. When the node runs many CPU-bound pods, the workload can move to different CPU cores depending on
whether the pod is throttled and which CPU cores are available at scheduling time. Many workloads are not sensitive to this migration and thus
work fine without any intervention.

However, in workloads where CPU cache affinity and scheduling latency significantly affect performance, the kubelet allows alternative CPU
management policies to determine some placement preferences on the node.
This is implemented using the _CPU Manager_ and its policy.
There are two available policies:

- `none`: the `none` policy explicitly enables the existing default CPU
  affinity scheme, providing no affinity beyond what the OS scheduler does
  automatically. Limits on CPU usage for
  [Guaranteed pods](/docs/concepts/workloads/pods/pod-qos/) and
  [Burstable pods](/docs/concepts/workloads/pods/pod-qos/)
  are enforced using CFS quota.
- `static`: the `static` policy allows containers in `Guaranteed` pods with integer CPU
  `requests` access to exclusive CPUs on the node. This exclusivity is enforced
  using the [cpuset cgroup controller](https://www.kernel.org/doc/Documentation/cgroup-v2.txt).

{{< note >}}
System services such as the container runtime and the kubelet itself can continue to run on these exclusive CPUs. The exclusivity only extends to other pods.
{{< /note >}}

CPU Manager doesn't support offlining and onlining of CPUs at runtime.
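
For illustration, a minimal sketch of selecting the CPU Manager policy, assuming the kubelet is configured through a [KubeletConfiguration](/docs/reference/config-api/kubelet-config.v1beta1/) file:

```yaml
# Sketch: enabling the static CPU management policy.
# "none" is the default; "static" enables exclusive CPU assignment for
# Guaranteed pods with integer CPU requests.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: "static"
```

Note that changing the CPU manager policy on a node that has already run workloads generally also involves draining the node and removing the kubelet's CPU manager state file (by default `/var/lib/kubelet/cpu_manager_state`) before restarting the kubelet; the [CPU Manager Policies](/docs/tasks/administer-cluster/cpu-management-policies/) task page describes this procedure.
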
### Static policy

The static policy enables finer-grained CPU management and exclusive CPU assignment.
This policy manages a shared pool of CPUs that initially contains all CPUs in the
node. The number of exclusively allocatable CPUs is equal to the total
number of CPUs in the node minus any CPU reservations set by the kubelet configuration.
CPUs reserved by these options are taken, in integer quantity, from the initial shared pool in ascending order by physical
core ID. This shared pool is the set of CPUs on which any containers in
`BestEffort` and `Burstable` pods run. Containers in `Guaranteed` pods with fractional
CPU `requests` also run on CPUs in the shared pool. Only containers that are
both part of a `Guaranteed` pod and have integer CPU `requests` are assigned
exclusive CPUs.

{{< note >}}
The kubelet requires a CPU reservation greater than zero when the static policy is enabled.
This is because a zero CPU reservation would allow the shared pool to become empty.
{{< /note >}}
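
For illustration, a minimal sketch of a reservation that keeps the shared pool non-empty; the reservation can be expressed through `kubeReserved` and/or `systemReserved` (or an explicit `reservedSystemCPUs` list), and the quantities below are only examples:

```yaml
# Sketch: with the static policy, reserve some CPU so the shared pool
# never becomes empty. Here 500m + 500m adds up to 1 full CPU, which is
# taken from the shared pool in integer quantity.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: "static"
kubeReserved:
  cpu: "500m"
systemReserved:
  cpu: "500m"
```
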
As `Guaranteed` pods whose containers fit the requirements for being statically
assigned are scheduled to the node, CPUs are removed from the shared pool and
placed in the cpuset for the container. CFS quota is not used to bound
the CPU usage of these containers as their usage is bound by the scheduling domain
itself. In other words, the number of CPUs in the container cpuset is equal to the integer
CPU `limit` specified in the pod spec. This static assignment increases CPU
affinity and decreases context switches due to throttling for the CPU-bound
workload.

Consider the containers in the following pod specs:

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
```

The pod above runs in the `BestEffort` QoS class because no resource `requests` or
`limits` are specified. It runs in the shared pool.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"
```

The pod above runs in the `Burstable` QoS class because resource `requests` do not
equal `limits` and the `cpu` quantity is not specified. It runs in the shared
pool.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "100Mi"
        cpu: "1"
```

The pod above runs in the `Burstable` QoS class because resource `requests` do not
equal `limits`. It runs in the shared pool.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "200Mi"
        cpu: "2"
```

The pod above runs in the `Guaranteed` QoS class because `requests` are equal to `limits`,
and the container's CPU limit is an integer greater than or equal to one.
The `nginx` container is granted 2 exclusive CPUs.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "1.5"
      requests:
        memory: "200Mi"
        cpu: "1.5"
```

The pod above runs in the `Guaranteed` QoS class because `requests` are equal to `limits`,
but the container's CPU limit is a fraction. It runs in the shared pool.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
```

The pod above runs in the `Guaranteed` QoS class because only `limits` are specified
and `requests` are set equal to `limits` when not explicitly specified. The
container's CPU limit is an integer greater than or equal to one, so the `nginx`
container is granted 2 exclusive CPUs.

#### Static policy options {#cpu-policy-static--options}

The behavior of the static policy can be fine-tuned using the CPU Manager policy options.
The following policy options exist for the static CPU management policy:
{{/* options in alphabetical order */}}

`align-by-socket` (alpha, hidden by default)
: Align CPUs by physical package / socket boundary, rather than logical NUMA boundaries (available since Kubernetes v1.25)

`distribute-cpus-across-cores` (alpha, hidden by default)
: Allocate virtual cores, sometimes called hardware threads, across different physical cores (available since Kubernetes v1.31)

`distribute-cpus-across-numa` (alpha, hidden by default)
: Spread CPUs across different NUMA domains, aiming for an even balance between the selected domains (available since Kubernetes v1.23)

`full-pcpus-only` (beta, visible by default)
: Always allocate full physical cores (available since Kubernetes v1.22)

You can toggle groups of options on and off based upon their maturity level
using the following feature gates:

* `CPUManagerPolicyBetaOptions` (default enabled). Disable to hide beta-level options.
* `CPUManagerPolicyAlphaOptions` (default disabled). Enable to show alpha-level options.

You will still have to enable each option using the `cpuManagerPolicyOptions` field in the
kubelet configuration file.

For more detail about the individual options you can configure, read on.
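
For illustration, a sketch of how policy options and their feature gates fit together in the kubelet configuration file; the particular combination of options shown is only an example:

```yaml
# Sketch: enabling CPU Manager policy options. Alpha-level options also
# require the CPUManagerPolicyAlphaOptions feature gate.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: "static"
featureGates:
  CPUManagerPolicyAlphaOptions: true
cpuManagerPolicyOptions:
  full-pcpus-only: "true"               # beta, visible by default
  distribute-cpus-across-numa: "true"   # alpha, needs the gate above
```
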
##### `full-pcpus-only`

If the `full-pcpus-only` policy option is specified, the static policy will always allocate full physical cores.
By default, without this option, the static policy allocates CPUs using a topology-aware best-fit allocation.
On SMT-enabled systems, the policy can allocate individual virtual cores, which correspond to hardware threads.
This can lead to different containers sharing the same physical cores; this behaviour in turn contributes
to the [noisy neighbours problem](https://en.wikipedia.org/wiki/Cloud_computing_issues#Performance_interference_and_noisy_neighbors).
With the option enabled, the pod will be admitted by the kubelet only if the CPU request of all its containers
can be fulfilled by allocating full physical cores.
If the pod does not pass admission, it will be put in the `Failed` state with the message `SMTAlignmentError`.

##### `distribute-cpus-across-numa`

If the `distribute-cpus-across-numa` policy option is specified, the static
policy will evenly distribute CPUs across NUMA nodes in cases where more than
one NUMA node is required to satisfy the allocation.
By default, the `CPUManager` will pack CPUs onto one NUMA node until it is
filled, with any remaining CPUs simply spilling over to the next NUMA node.
This can cause undesired bottlenecks in parallel code relying on barriers (and
similar synchronization primitives), as this type of code tends to run only as
fast as its slowest worker (which is slowed down by the fact that fewer CPUs
are available on at least one NUMA node).
By distributing CPUs evenly across NUMA nodes, application developers can more
easily ensure that no single worker suffers from NUMA effects more than any
other, improving the overall performance of these types of applications.

##### `align-by-socket`

If the `align-by-socket` policy option is specified, CPUs will be considered
aligned at the socket boundary when deciding how to allocate CPUs to a
container. By default, the `CPUManager` aligns CPU allocations at the NUMA
boundary, which could result in performance degradation if CPUs need to be
pulled from more than one NUMA node to satisfy the allocation. Although it
tries to ensure that all CPUs are allocated from the _minimum_ number of NUMA
nodes, there is no guarantee that those NUMA nodes will be on the same socket.
By directing the `CPUManager` to explicitly align CPUs at the socket boundary
rather than the NUMA boundary, we are able to avoid such issues. Note that this
policy option is not compatible with the `TopologyManager` `single-numa-node`
policy, and does not apply to hardware where the number of sockets is greater
than the number of NUMA nodes.

##### `distribute-cpus-across-cores`

If the `distribute-cpus-across-cores` policy option is specified, the static policy
will attempt to allocate virtual cores (hardware threads) across different physical cores.
By default, the `CPUManager` tends to pack CPUs onto as few physical cores as possible,
which can lead to contention among CPUs on the same physical core and result
in performance bottlenecks. By enabling the `distribute-cpus-across-cores` policy option,
the static policy ensures that CPUs are distributed across as many physical cores
as possible, reducing the contention on the same physical core and thereby
improving overall performance. However, it is important to note that this strategy
might be less effective when the system is heavily loaded. Under such conditions,
the benefit of reducing contention diminishes. Conversely, the default behavior
can help in reducing inter-core communication overhead, potentially providing
better performance under high load conditions.

##### `strict-cpu-reservation`

The `reservedSystemCPUs` parameter in [KubeletConfiguration](/docs/reference/config-api/kubelet-config.v1beta1/),
or the deprecated kubelet command line option `--reserved-cpus`, defines an explicit CPU set for OS system daemons
and Kubernetes system daemons. More details of this parameter can be found on the
[Explicitly Reserved CPU List](/docs/tasks/administer-cluster/reserve-compute-resources/#explicitly-reserved-cpu-list) page.
By default, this isolation is implemented only for guaranteed pods with integer CPU requests, not for burstable and best-effort pods
(nor for guaranteed pods with fractional CPU requests). Admission only compares the CPU requests against the allocatable CPUs.
Since the CPU limit can be higher than the request, the default behaviour allows burstable and best-effort pods to use up the capacity
of `reservedSystemCPUs`, which can cause host OS services to starve in real-life deployments.
If the `strict-cpu-reservation` policy option is enabled, the static policy will not allow
any workload to use the CPU cores specified in `reservedSystemCPUs`.
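
For illustration, a sketch that combines an explicit reserved CPU list with this option; the CPU IDs are examples, and depending on the option's maturity level in your Kubernetes version you may also need the corresponding `CPUManagerPolicyAlphaOptions` or `CPUManagerPolicyBetaOptions` feature gate:

```yaml
# Sketch: keeping the reserved CPUs strictly off-limits to all pods.
# The CPU IDs are illustrative; pick cores that match your hardware.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: "static"
reservedSystemCPUs: "0,1"
cpuManagerPolicyOptions:
  strict-cpu-reservation: "true"
```
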
## Memory Management Policies

{{< feature-state feature_gate_name="MemoryManager" >}}

The Kubernetes *Memory Manager* enables the feature of guaranteed memory (and hugepages)
allocation for pods in the `Guaranteed` {{< glossary_tooltip text="QoS class" term_id="qos-class" >}}.

The Memory Manager employs a hint generation protocol to yield the most suitable NUMA affinity for a pod.
The Memory Manager feeds the central manager (*Topology Manager*) with these affinity hints.
Based on both the hints and the Topology Manager policy, the pod is rejected or admitted to the node.

Moreover, the Memory Manager ensures that the memory which a pod requests
is allocated from a minimum number of NUMA nodes.
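
For illustration, a sketch of the related kubelet configuration, assuming the `Static` Memory Manager policy; the NUMA node ID and reserved quantity below are examples only:

```yaml
# Sketch: enabling the Static memory manager policy and reserving memory
# on NUMA node 0 for system use. Quantities are illustrative.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
memoryManagerPolicy: "Static"
reservedMemory:
- numaNode: 0
  limits:
    memory: "1Gi"
```

In practice, the reserved quantities must be consistent with the node's overall memory reservation; the [Memory Manager](/docs/tasks/administer-cluster/memory-manager/) task page describes the full requirements.
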
## Other resource managers

The configuration of individual managers is elaborated in dedicated documents:

- [Device Manager](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-integration-with-the-topology-manager)

This commit removes the former links to [CPU Manager Policies](/docs/tasks/administer-cluster/cpu-management-policies/)
and [Memory Manager Policies](/docs/tasks/administer-cluster/memory-manager/) from this list.
