
Commit 96a2e17

node: start moving resource management docs to concepts
We have reached a point where the existing CPU management task page is quite hard to follow. Start moving the resource management concepts to the concept page. We begin with the CPU management policies, the worst offender right now. Over time, the plan is to move all the concepts from tasks into the concepts page.

Signed-off-by: Francesco Romani <[email protected]>
1 parent 69cce03 commit 96a2e17

File tree

2 files changed: +289 −197 lines changed


content/en/docs/concepts/policy/node-resource-managers.md

Lines changed: 275 additions & 3 deletions
@@ -13,10 +13,282 @@ In order to support latency-critical and high-throughput workloads, Kubernetes o

<!-- body -->

-The main manager, the Topology Manager, is a Kubelet component that co-ordinates the overall resource management process through its [policy](/docs/tasks/administer-cluster/topology-manager/).

## Hardware topology alignment policies

_Topology Manager_ is a kubelet component that aims to coordinate the set of components that are
responsible for these optimizations. The overall resource management process is governed using
the policy you specify.
To learn more, read [Control Topology Management Policies on a Node](/docs/tasks/administer-cluster/topology-manager/).
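
As an illustration only, here is a minimal sketch of how a policy could be selected in the kubelet configuration file; the `single-numa-node` value and the `container` scope are example choices, not recommendations:

```yaml
# KubeletConfiguration fragment (sketch): selecting a Topology Manager policy.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Example policy; other accepted values are none, best-effort, and restricted.
topologyManagerPolicy: single-numa-node
# Alignment can be evaluated per container (the default) or per pod.
topologyManagerScope: container
```
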
## Policies for assigning CPUs to Pods

{{< feature-state feature_gate_name="CPUManager" >}}

Once a Pod is bound to a Node, the kubelet on that node may need to either multiplex the existing
hardware (for example, sharing CPUs across multiple Pods) or allocate hardware by dedicating some
resource (for example, assigning one or more CPUs for a Pod's exclusive use).

By default, the kubelet uses [CFS quota](https://en.wikipedia.org/wiki/Completely_Fair_Scheduler)
to enforce pod CPU limits. When the node runs many CPU-bound pods, the workload can move to different CPU cores depending on
whether the pod is throttled and which CPU cores are available at scheduling time. Many workloads are not sensitive to this migration and thus
work fine without any intervention.

However, in workloads where CPU cache affinity and scheduling latency significantly affect workload performance, the kubelet allows alternative CPU
management policies to determine some placement preferences on the node.
This is implemented using the _CPU Manager_ and its policy.
There are two available policies:

- `none`: the `none` policy explicitly enables the existing default CPU
  affinity scheme, providing no affinity beyond what the OS scheduler does
  automatically. Limits on CPU usage for
  [Guaranteed pods](/docs/concepts/workloads/pods/pod-qos/) and
  [Burstable pods](/docs/concepts/workloads/pods/pod-qos/)
  are enforced using CFS quota.
- `static`: the `static` policy allows containers in `Guaranteed` pods with integer CPU
  `requests` access to exclusive CPUs on the node. This exclusivity is enforced
  using the [cpuset cgroup controller](https://www.kernel.org/doc/Documentation/cgroup-v2.txt).

{{< note >}}
System services such as the container runtime and the kubelet itself can continue to run on these exclusive CPUs. The exclusivity only extends to other pods.
{{< /note >}}

CPU Manager doesn't support offlining and onlining of CPUs at runtime.
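
As a minimal sketch (not a complete configuration), the policy is selected through the `cpuManagerPolicy` field in the kubelet configuration file; `static` is shown here purely as an example:

```yaml
# KubeletConfiguration fragment (sketch): choosing a CPU Manager policy.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# "none" is the default; "static" enables exclusive CPU assignment for
# Guaranteed pods with integer CPU requests.
cpuManagerPolicy: static
```
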
### Static policy

The static policy enables finer-grained CPU management and exclusive CPU assignment.
This policy manages a shared pool of CPUs that initially contains all CPUs in the
node. The amount of exclusively allocatable CPUs is equal to the total
number of CPUs in the node minus any CPU reservations set by the kubelet configuration.
CPUs reserved by these options are taken, in integer quantity, from the initial shared pool in ascending order by physical
core ID. This shared pool is the set of CPUs on which any containers in
`BestEffort` and `Burstable` pods run. Containers in `Guaranteed` pods with fractional
CPU `requests` also run on CPUs in the shared pool. Only containers that are
both part of a `Guaranteed` pod and have integer CPU `requests` are assigned
exclusive CPUs.

{{< note >}}
The kubelet requires a CPU reservation greater than zero when the static policy is enabled.
This is because a zero CPU reservation would allow the shared pool to become empty.
{{< /note >}}
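
For illustration, here is a hedged sketch of how such a reservation could be expressed; the quantities and CPU IDs are placeholders, not sizing advice:

```yaml
# KubeletConfiguration fragment (sketch): reserving CPU capacity so that the
# shared pool cannot be emptied by exclusive allocations.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
# Reserve capacity by quantity (placeholder values)...
systemReserved:
  cpu: "1"
kubeReserved:
  cpu: "1"
# ...or, alternatively, pin the reservation to explicit CPU IDs:
# reservedSystemCPUs: "0,1"
```
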
As `Guaranteed` pods whose containers fit the requirements for being statically
assigned are scheduled to the node, CPUs are removed from the shared pool and
placed in the cpuset for the container. CFS quota is not used to bound
the CPU usage of these containers as their usage is bound by the scheduling domain
itself. In other words, the number of CPUs in the container cpuset is equal to the integer
CPU `limit` specified in the pod spec. This static assignment increases CPU
affinity and decreases context switches due to throttling for the CPU-bound
workload.

Consider the containers in the following pod specs:

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
```

The pod above runs in the `BestEffort` QoS class because no resource `requests` or
`limits` are specified. It runs in the shared pool.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
      requests:
        memory: "100Mi"
```

The pod above runs in the `Burstable` QoS class because resource `requests` do not
equal `limits` and the `cpu` quantity is not specified. It runs in the shared
pool.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "100Mi"
        cpu: "1"
```

The pod above runs in the `Burstable` QoS class because resource `requests` do not
equal `limits`. It runs in the shared pool.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
      requests:
        memory: "200Mi"
        cpu: "2"
```

The pod above runs in the `Guaranteed` QoS class because `requests` are equal to `limits`,
and the container's resource limit for the CPU resource is an integer greater than
or equal to one. The `nginx` container is granted 2 exclusive CPUs.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "1.5"
      requests:
        memory: "200Mi"
        cpu: "1.5"
```

The pod above runs in the `Guaranteed` QoS class because `requests` are equal to `limits`,
but the container's resource limit for the CPU resource is a fraction. It runs in
the shared pool.

```yaml
spec:
  containers:
  - name: nginx
    image: nginx
    resources:
      limits:
        memory: "200Mi"
        cpu: "2"
```

The pod above runs in the `Guaranteed` QoS class because only `limits` are specified
and `requests` are set equal to `limits` when not explicitly specified. The
container's resource limit for the CPU resource is an integer greater than or
equal to one, so the `nginx` container is granted 2 exclusive CPUs.

#### Static policy options {#cpu-policy-static--options}

The behavior of the static policy can be fine-tuned using the CPU Manager policy options.
The following policy options exist for the static CPU management policy:
{{/* options in alphabetical order */}}

`align-by-socket` (alpha, hidden by default)
: Align CPUs by physical package / socket boundary, rather than logical NUMA boundaries (available since Kubernetes v1.25)

`distribute-cpus-across-cores` (alpha, hidden by default)
: Allocate virtual cores, sometimes called hardware threads, across different physical cores (available since Kubernetes v1.31)

`distribute-cpus-across-numa` (alpha, hidden by default)
: Spread CPUs across different NUMA domains, aiming for an even balance between the selected domains (available since Kubernetes v1.23)

`full-pcpus-only` (beta, visible by default)
: Always allocate full physical cores (available since Kubernetes v1.22)

You can toggle groups of options on and off based upon their maturity level
using the following feature gates:

* `CPUManagerPolicyBetaOptions` (default enabled). Disable to hide beta-level options.
* `CPUManagerPolicyAlphaOptions` (default disabled). Enable to show alpha-level options.

You will still have to enable each option using the `cpuManagerPolicyOptions` field in the
kubelet configuration file.
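
As a hedged example of how these pieces fit together (the specific options shown are placeholders, not recommendations):

```yaml
# KubeletConfiguration fragment (sketch): enabling CPU Manager policy options.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
featureGates:
  # Only needed for alpha-level options; beta-level options are visible by default.
  CPUManagerPolicyAlphaOptions: true
cpuManagerPolicyOptions:
  # Beta-level option, visible by default.
  full-pcpus-only: "true"
  # Alpha-level option, requires the feature gate above.
  distribute-cpus-across-numa: "true"
```
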
For more detail about the individual options you can configure, read on.

##### `full-pcpus-only`

If the `full-pcpus-only` policy option is specified, the static policy will always allocate full physical cores.
By default, without this option, the static policy allocates CPUs using a topology-aware best-fit allocation.
On SMT enabled systems, the policy can allocate individual virtual cores, which correspond to hardware threads.
This can lead to different containers sharing the same physical cores; this behaviour in turn contributes
to the [noisy neighbours problem](https://en.wikipedia.org/wiki/Cloud_computing_issues#Performance_interference_and_noisy_neighbors).
With the option enabled, the pod will be admitted by the kubelet only if the CPU request of all its containers
can be fulfilled by allocating full physical cores.
If the pod does not pass admission, it will be put in the `Failed` state with the message `SMTAlignmentError`.

##### `distribute-cpus-across-numa`

If the `distribute-cpus-across-numa` policy option is specified, the static
policy will evenly distribute CPUs across NUMA nodes in cases where more than
one NUMA node is required to satisfy the allocation.
By default, the `CPUManager` will pack CPUs onto one NUMA node until it is
filled, with any remaining CPUs simply spilling over to the next NUMA node.
This can cause undesired bottlenecks in parallel code relying on barriers (and
similar synchronization primitives), as this type of code tends to run only as
fast as its slowest worker (which is slowed down by the fact that fewer CPUs
are available on at least one NUMA node).
By distributing CPUs evenly across NUMA nodes, application developers can more
easily ensure that no single worker suffers from NUMA effects more than any
other, improving the overall performance of these types of applications.

##### `align-by-socket`

If the `align-by-socket` policy option is specified, CPUs will be considered
aligned at the socket boundary when deciding how to allocate CPUs to a
container. By default, the `CPUManager` aligns CPU allocations at the NUMA
boundary, which could result in performance degradation if CPUs need to be
pulled from more than one NUMA node to satisfy the allocation. Although it
tries to ensure that all CPUs are allocated from the _minimum_ number of NUMA
nodes, there is no guarantee that those NUMA nodes will be on the same socket.
By directing the `CPUManager` to explicitly align CPUs at the socket boundary
rather than the NUMA boundary, we are able to avoid such issues. Note that this
policy option is not compatible with the `TopologyManager` `single-numa-node`
policy, and does not apply to hardware where the number of sockets is greater
than the number of NUMA nodes.

##### `distribute-cpus-across-cores`

If the `distribute-cpus-across-cores` policy option is specified, the static policy
will attempt to allocate virtual cores (hardware threads) across different physical cores.
By default, the `CPUManager` tends to pack CPUs onto as few physical cores as possible,
which can lead to contention among CPUs on the same physical core and result
in performance bottlenecks. By enabling the `distribute-cpus-across-cores` policy option,
the static policy ensures that CPUs are distributed across as many physical cores
as possible, reducing the contention on the same physical core and thereby
improving overall performance. However, it is important to note that this strategy
might be less effective when the system is heavily loaded. Under such conditions,
the benefit of reducing contention diminishes. Conversely, the default behavior
can help in reducing inter-core communication overhead, potentially providing
better performance under high load conditions.

##### `strict-cpu-reservation`

The `reservedSystemCPUs` parameter in [KubeletConfiguration](/docs/reference/config-api/kubelet-config.v1beta1/),
or the deprecated kubelet command line option `--reserved-cpus`, defines an explicit CPU set for OS system daemons
and Kubernetes system daemons. More details of this parameter can be found on the
[Explicitly Reserved CPU List](/docs/tasks/administer-cluster/reserve-compute-resources/#explicitly-reserved-cpu-list) page.
By default, this isolation is implemented only for Guaranteed pods with integer CPU requests, not for Burstable and BestEffort pods
(or Guaranteed pods with fractional CPU requests). Admission only compares the CPU requests against the allocatable CPUs.
Because the CPU limit can be higher than the request, the default behaviour allows Burstable and BestEffort pods to use up the capacity
of `reservedSystemCPUs`, which can cause host OS services to starve in real-life deployments.
If the `strict-cpu-reservation` policy option is enabled, the static policy will not allow
any workload to use the CPU cores specified in `reservedSystemCPUs`.
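
A hedged configuration sketch follows; the CPU IDs are placeholders, and depending on your Kubernetes version this option may additionally require the alpha- or beta-level policy options feature gate described above:

```yaml
# KubeletConfiguration fragment (sketch): keeping reserved CPUs strictly
# off-limits to every pod, not only to exclusively-allocated containers.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
# Placeholder CPU IDs reserved for OS and Kubernetes system daemons.
reservedSystemCPUs: "0,1"
cpuManagerPolicyOptions:
  strict-cpu-reservation: "true"
```
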
## Memory Management Policies

{{< feature-state feature_gate_name="MemoryManager" >}}

The Kubernetes *Memory Manager* enables the feature of guaranteed memory (and hugepages)
allocation for pods in the `Guaranteed` {{< glossary_tooltip text="QoS class" term_id="qos-class" >}}.

The Memory Manager employs a hint generation protocol to yield the most suitable NUMA affinity for a pod.
The Memory Manager feeds the central manager (*Topology Manager*) with these affinity hints.
Based on both the hints and the Topology Manager policy, the pod is rejected or admitted to the node.

Moreover, the Memory Manager ensures that the memory which a pod requests
is allocated from a minimum number of NUMA nodes.
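
As a hedged illustration of where this fits in the kubelet configuration (the policy value is real, but the sizes and NUMA node IDs below are placeholders that must match your node's reservation settings):

```yaml
# KubeletConfiguration fragment (sketch): enabling the Static Memory Manager
# policy and declaring memory pre-reserved for system use, per NUMA node.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
memoryManagerPolicy: Static
reservedMemory:
  - numaNode: 0
    limits:
      memory: 1Gi
```
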
## Other resource managers

The configuration of individual managers is elaborated in dedicated documents:

-- [CPU Manager Policies](/docs/tasks/administer-cluster/cpu-management-policies/)
- [Device Manager](/docs/concepts/extend-kubernetes/compute-storage-net/device-plugins/#device-plugin-integration-with-the-topology-manager)
-- [Memory Manager Policies](/docs/tasks/administer-cluster/memory-manager/)

0 commit comments