Commit 2286455

Jay Pipes and Travisivart committed
add how-to guide for disabling cgroups v2
Adds a HOWTO guide explaining how to apply a Daemonset to Nexus Kubernetes worker nodes that replaces the cgroups v2 kernel boot parameters with the cgroups v1 parameters and reboots the machine.

Work item: https://dev.azure.com/msazuredev/AzureForOperatorsIndustry/_workitems/edit/881134

Co-authored-by: Travis Neely <[email protected]>
Signed-off-by: Jay Pipes <[email protected]>
1 parent c93a9c6 commit 2286455

2 files changed: +215 -0 lines changed

articles/operator-nexus/TOC.yml

Lines changed: 2 additions & 0 deletions
@@ -89,6 +89,8 @@
     href: howto-kubernetes-cluster-aad-rbac.md
   - name: Connect to the cluster
     href: howto-kubernetes-cluster-connect.md
+  - name: Disable cgroupsv2 in Nexus Kubernetes 1.27+
+    href: howto-disable-cgroupsv2.md
   - name: Nexus Virtual Machine
     expanded: false
     items:
articles/operator-nexus/howto-disable-cgroupsv2.md

Lines changed: 213 additions & 0 deletions
@@ -0,0 +1,213 @@
1+
---
2+
title: "Azure Operator Nexus: Disable cgroupsv2 on a Nexus Kubernetes Node"
3+
description: How-to guide for disabling support for cgroupsv2 on a Nexus Kubernetes Node
4+
author: jaypipes
5+
ms.author: jaypipes
6+
ms.service: azure-operator-nexus
7+
ms.topic: how-to
8+
ms.date: 09/18/2023
9+
ms.custom: template-how-to
10+
---
11+
12+
# Disable `cgroupsv2` on Nexus Kubernetes Node
13+
14+
[Control groups][cgroups], or "`cgroups`", allow the Linux operating system to
15+
allocate resources--CPU shares, memory, I/O, etc.--to a hierarchy of operating
16+
system processes. These resources can be isolated from other processes and in
17+
this way enable containerization of workloads.
18+
19+
An enhanced version 2 of control groups ("[cgroupsv2][cgroups2]") was included
20+
in Linux kernel 4.5. The primary difference between the original `cgroups` v1
21+
and the newer `cgroups` v2 is that only a single hierarchy of `cgroups` is
22+
allowed in the `cgroups` v2. In addition to this single-hierarchy difference,
23+
`cgroups` v2 makes some backwards-incompatible changes to the pseudo-filesystem
24+
that `cgroups` v1 used, for example removing the `tasks` pseudofile and the
25+
`clone_children` functionality.
26+
27+
Some applications may rely on older `cgroups` v1 behavior, however, and this
28+
documentation explains how to disable `cgroups` v2 on newer Linux operating
29+
system images used for Operator Nexus Kubernetes worker nodes.
30+
31+
[cgroups]: https://en.wikipedia.org/wiki/Cgroups
32+
[cgroups2]: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html
33+
34+
## Nexus Kubernetes 1.27 and beyond
35+
36+
While Kubernetes 1.25 [added support][k8s-cgroupsv2] for `cgroups` v2 within
37+
the kubelet, in order for `cgroups` v2 to be used it must be enabled in the
38+
Linux kernel.
39+
40+
Operator Nexus Kubernetes worker nodes run special versions of Microsoft Azure
41+
Linux (previously called CBL Mariner OS) that correspond to the Kubernetes
42+
version enabled by that image. The Linux OS image for worker nodes *enables*
43+
`cgroups` v2 by default in Nexus Kubernetes version 1.27.
44+
45+
`cgroups` v2 *isn't enabled* in versions of Nexus Kubernetes *before* 1.27.
46+
Therefore, on those earlier versions you don't need to perform the steps in this guide to disable
47+
`cgroups` v2.
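
To confirm which `cgroups` version a node is actually running, you can inspect the filesystem type mounted at `/sys/fs/cgroup/`. This is the same check that the `DaemonSet` later in this guide performs. The following is a minimal sketch, not part of any Nexus tooling: the function names `cgroup_version_from_fstype` and `local_cgroup_version` are hypothetical, and the sketch assumes it runs on the Linux node itself.

```bash
#!/bin/bash

# Hypothetical helper: map a filesystem type name, as printed by
# `stat -fc %T /sys/fs/cgroup/`, to a cgroups version string.
cgroup_version_from_fstype() {
    case "$1" in
        cgroup2fs) echo "v2" ;;       # unified hierarchy (cgroups v2)
        tmpfs)     echo "v1" ;;       # per-controller hierarchies (cgroups v1)
        *)         echo "unknown" ;;  # anything else is not classified here
    esac
}

# Hypothetical helper: report the cgroups version of the local machine.
local_cgroup_version() {
    cgroup_version_from_fstype "$(stat -fc %T /sys/fs/cgroup/)"
}
```

After sourcing the snippet on a node, calling `local_cgroup_version` prints `v2` when the steps in this guide apply to that node.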
48+
49+
[k8s-cgroupsv2]: https://kubernetes.io/blog/2022/08/31/cgroupv2-ga-1-25/
50+
51+
## Prerequisites
52+
53+
Before proceeding with this how-to guide, it's recommended that you:
54+
55+
* Refer to the Nexus Kubernetes cluster [QuickStart guide][qs] for a
56+
comprehensive overview and steps involved.
57+
* Ensure that you meet the prerequisites outlined in that guide for a smooth
58+
implementation of the guide.
59+
60+
[qs]: ./quickstarts-kubernetes-cluster-deployment-bicep.md
61+
62+
## Apply the `cgroups` v2-disabling `DaemonSet`
63+
64+
> [!WARNING]
65+
> If you perform this step on a Kubernetes cluster that already has workloads
66+
> running on it, any workloads that are running on Kubernetes cluster nodes
67+
> will be terminated because the `DaemonSet` reboots the host machine.
68+
> Therefore, it is highly recommended that you apply this `DaemonSet` on a new
69+
> Nexus Kubernetes cluster before workloads are scheduled on it.
70+
71+
Copy the following `DaemonSet` definition to a file on a computer where you can
72+
execute `kubectl` commands against the Nexus Kubernetes cluster on which you
73+
wish to disable `cgroups` v2.
74+
75+
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: revert-cgroups
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: revert-cgroups
  template:
    metadata:
      labels:
        name: revert-cgroups
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: cgroup-version
                    operator: NotIn
                    values:
                      - v1
      tolerations:
        - operator: Exists
          effect: NoSchedule
      containers:
        - name: revert-cgroups
          image: mcr.microsoft.com/cbl-mariner/base/core:1.0
          command:
            - nsenter
            - --target
            - "1"
            - --mount
            - --uts
            - --ipc
            - --net
            - --pid
            - --
            - bash
            - -exc
            - |
              CGROUP_VERSION=`stat -fc %T /sys/fs/cgroup/`
              if [ "$CGROUP_VERSION" == "cgroup2fs" ]; then
                echo "Using v2, reverting..."
                sed -i 's/systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all/systemd.unified_cgroup_hierarchy=0/' /boot/grub2/grub.cfg
                reboot
              fi

              sleep infinity
          securityContext:
            privileged: true
      hostNetwork: true
      hostPID: true
      hostIPC: true
      terminationGracePeriodSeconds: 0
```
133+
134+
Then apply the `DaemonSet`:
135+
136+
```bash
137+
kubectl apply -f /path/to/daemonset.yaml
138+
```
139+
140+
The above `DaemonSet` applies to all Kubernetes worker nodes in the cluster
141+
except ones where a `cgroup-version=v1` label has been applied. For those
142+
worker nodes with `cgroups` v2 enabled, the `DaemonSet` modifies the boot
143+
configuration of the Linux kernel and reboots the machine.
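
The boot configuration change is a single `sed` substitution against `/boot/grub2/grub.cfg`. As a rough illustration, the sketch below applies the same substitution to text on standard input, so nothing on disk is modified. The function name `revert_cgroup_args` and the `sample` kernel command line are hypothetical and chosen only for this example; a real `grub.cfg` entry will look different.

```bash
#!/bin/bash

# Apply the same substitution the DaemonSet runs against grub.cfg, but to
# text read from standard input instead of editing a file in place.
revert_cgroup_args() {
    sed 's/systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all/systemd.unified_cgroup_hierarchy=0/'
}

# Hypothetical kernel command line, for illustration only.
sample='linux /boot/vmlinuz root=/dev/sda2 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all quiet'

# Print the line as it would read after the substitution.
echo "$sample" | revert_cgroup_args
```

Note that the pattern only matches when both parameters appear next to each other in that exact order. A line that doesn't contain that sequence passes through unchanged.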
144+
145+
You can monitor the rollout of the `DaemonSet` and its effects by executing the
146+
following script:
147+
148+
```bash
#!/bin/bash

set -x

# Set the DaemonSet name and label key-value pair
DAEMONSET_NAME="revert-cgroups"
NAMESPACE="kube-system"
LABEL_KEY="cgroup-version"
LABEL_VALUE="v1"
# The DaemonSet pods run bash with -x, so a pod that has reached its final
# command logs a trace line containing "sleep infinity".
LOG_PATTERN="sleep infinity"

# Function to count the pods that have completed
check_pods_completed() {
    local pods_completed=0
    local pod_list pod logs

    # Get the list of DaemonSet pods
    pod_list=$(kubectl get pods -n "${NAMESPACE}" -l name="${DAEMONSET_NAME}" -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')

    # Loop through each pod
    for pod in $pod_list; do

        # Get the logs from the pod
        logs=$(kubectl logs -n "${NAMESPACE}" "${pod}")

        # Check if the logs contain the specified pattern
        if [[ $logs == *"${LOG_PATTERN}"* ]]; then
            ((pods_completed++))
        fi

    done

    # Print the number of completed pods for the caller to capture
    echo "$pods_completed"
}

# Loop until all pods are completed
while true; do
    pods_completed=$(check_pods_completed)

    # Get the total number of pods
    total_pods=$(kubectl get pods -n "${NAMESPACE}" -l name="${DAEMONSET_NAME}" --no-headers | wc -l)

    if [ "$pods_completed" -eq "$total_pods" ]; then
        echo "All pods are completed."
        break
    else
        echo "Waiting for pods to complete ($pods_completed/$total_pods)..."
        sleep 10
    fi
done

# Once all pods are completed, add the label to the nodes
node_list=$(kubectl get pods -n "${NAMESPACE}" -l name="${DAEMONSET_NAME}" -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u)

for node in $node_list; do
    kubectl label nodes "${node}" "${LABEL_KEY}=${LABEL_VALUE}"
    echo "Added label '${LABEL_KEY}:${LABEL_VALUE}' to node '${node}'."
done

echo "Script completed."
```
210+
211+
The above script labels the nodes that have had `cgroups` v2 disabled. This
212+
labeling removes the `DaemonSet` pod from nodes that have already been rebooted
213+
with the `cgroups` v1 kernel settings.
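
Once the script finishes, you may want to confirm that every node carries the label. The following sketch filters the output of `kubectl get nodes -L cgroup-version --no-headers` and prints the nodes that still lack `cgroup-version=v1`. The function name `nodes_without_v1_label` is hypothetical, and the sketch assumes the default `kubectl get nodes` column layout (name, status, roles, age, version), with the label value appended as a sixth column.

```bash
#!/bin/bash

# Read `kubectl get nodes -L cgroup-version --no-headers` output on stdin and
# print the names of nodes whose label column (the sixth field) is not "v1".
# An unlabeled node has an empty sixth field, so it is printed too.
nodes_without_v1_label() {
    awk '$6 != "v1" { print $1 }'
}

# Example against a live cluster (requires kubectl access):
#   kubectl get nodes -L cgroup-version --no-headers | nodes_without_v1_label
```

If the function prints nothing, all nodes have the label and the `DaemonSet` no longer targets any of them.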
