---
title: "Azure Operator Nexus: Disable cgroupsv2 on a Nexus Kubernetes Node"
description: How-to guide for disabling support for cgroupsv2 on a Nexus Kubernetes Node
author: jaypipes
ms.author: jaypipes
ms.service: azure-operator-nexus
ms.topic: how-to
ms.date: 09/18/2023
ms.custom: template-how-to
---

# Disable `cgroupsv2` on a Nexus Kubernetes node

[Control groups][cgroups], or "`cgroups`", allow the Linux operating system to
allocate resources, such as CPU shares, memory, and I/O, to a hierarchy of
operating system processes. These resources can be isolated from other
processes, in this way enabling containerization of workloads.

An enhanced version 2 of control groups ("[cgroupsv2][cgroups2]") was included
in Linux kernel 4.5. The primary difference between the original `cgroups` v1
and the newer `cgroups` v2 is that `cgroups` v2 allows only a single hierarchy
of `cgroups`. In addition to this single-hierarchy difference, `cgroups` v2
makes some backwards-incompatible changes to the pseudo-filesystem that
`cgroups` v1 used, for example removing the `tasks` pseudofile and the
`clone_children` functionality.

Some applications may rely on older `cgroups` v1 behavior, however. This
documentation explains how to disable `cgroups` v2 on newer Linux operating
system images used for Operator Nexus Kubernetes worker nodes.

[cgroups]: https://en.wikipedia.org/wiki/Cgroups
[cgroups2]: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

## Nexus Kubernetes 1.27 and beyond

While Kubernetes 1.25 [added support][k8s-cgroupsv2] for `cgroups` v2 in the
kubelet, `cgroups` v2 must also be enabled in the Linux kernel before it can
be used.

Operator Nexus Kubernetes worker nodes run special versions of Microsoft Azure
Linux (previously called CBL-Mariner OS) that correspond to the Kubernetes
version enabled by that image. The Linux OS image for worker nodes *enables*
`cgroups` v2 by default in Nexus Kubernetes version 1.27.

`cgroups` v2 *isn't enabled* in versions of Nexus Kubernetes *before* 1.27, so
on those versions you don't need to perform the steps in this guide.

[k8s-cgroupsv2]: https://kubernetes.io/blog/2022/08/31/cgroupv2-ga-1-25/

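You can check which `cgroups` version a node is using by inspecting the
filesystem type mounted at `/sys/fs/cgroup`; this is the same check the
`DaemonSet` later in this guide performs. A minimal sketch, run directly on a
node (for example, over SSH):

```shell
#!/bin/bash
# stat -f reports the filesystem type of the mount: "cgroup2fs" means the
# unified cgroups v2 hierarchy is active, while "tmpfs" indicates the
# legacy cgroups v1 layout.
fstype=$(stat -fc %T /sys/fs/cgroup/)
if [ "$fstype" = "cgroup2fs" ]; then
    echo "cgroups v2 is enabled"
else
    echo "cgroups v1 is in use (filesystem type: ${fstype})"
fi
```
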
## Prerequisites

Before proceeding with this how-to guide, it's recommended that you:

 * Refer to the Nexus Kubernetes cluster [QuickStart guide][qs] for a
   comprehensive overview and the steps involved.
 * Ensure that you meet the outlined prerequisites for a smooth
   implementation of the guide.

[qs]: ./quickstarts-kubernetes-cluster-deployment-bicep.md

## Apply cgroupsv2-disabling `DaemonSet`

> [!WARNING]
> If you perform this step on a Kubernetes cluster that already has workloads
> running on it, any workloads running on the cluster's nodes are terminated,
> because the `DaemonSet` reboots the host machine. Therefore, it's highly
> recommended that you apply this `DaemonSet` on a new Nexus Kubernetes
> cluster, before workloads are scheduled on it.

Copy the following `DaemonSet` definition to a file on a computer where you can
execute `kubectl` commands against the Nexus Kubernetes cluster on which you
wish to disable `cgroups` v2.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: revert-cgroups
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: revert-cgroups
  template:
    metadata:
      labels:
        name: revert-cgroups
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cgroup-version
                operator: NotIn
                values:
                - v1
      tolerations:
      - operator: Exists
        effect: NoSchedule
      containers:
      - name: revert-cgroups
        image: mcr.microsoft.com/cbl-mariner/base/core:1.0
        command:
        - nsenter
        - --target
        - "1"
        - --mount
        - --uts
        - --ipc
        - --net
        - --pid
        - --
        - bash
        - -exc
        - |
          CGROUP_VERSION=$(stat -fc %T /sys/fs/cgroup/)
          if [ "$CGROUP_VERSION" == "cgroup2fs" ]; then
            echo "Using v2, reverting..."
            sed -i 's/systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all/systemd.unified_cgroup_hierarchy=0/' /boot/grub2/grub.cfg
            reboot
          fi

          sleep infinity
        securityContext:
          privileged: true
      hostNetwork: true
      hostPID: true
      hostIPC: true
      terminationGracePeriodSeconds: 0
```

And apply the `DaemonSet`:

```bash
kubectl apply -f /path/to/daemonset.yaml
```

The preceding `DaemonSet` applies to all Kubernetes worker nodes in the cluster
except those where a `cgroup-version=v1` label has been applied. For worker
nodes with `cgroups` v2 enabled, the `DaemonSet` modifies the boot
configuration of the Linux kernel and reboots the machine.

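To see what the `DaemonSet`'s `sed` expression does, you can run the same
substitution over a sample kernel command line. The sample string below is
illustrative only; the actual contents of `/boot/grub2/grub.cfg` on a node
differ:

```shell
#!/bin/bash
# Illustrative kernel command line as it might appear on a cgroups v2 image
# (sample text only, not the real grub.cfg contents)
cmdline='linux /boot/vmlinuz root=/dev/sda2 systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all ro'

# The same substitution the DaemonSet applies in place with sed -i: tell
# systemd to use the legacy (v1) hierarchy and drop the flag disabling v1
echo "${cmdline}" | sed 's/systemd.unified_cgroup_hierarchy=1 cgroup_no_v1=all/systemd.unified_cgroup_hierarchy=0/'
```

After the substitution, the node boots with
`systemd.unified_cgroup_hierarchy=0`, so systemd mounts the legacy `cgroups`
v1 hierarchy on the next boot.
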
You can monitor the rollout of the `DaemonSet` and its effects by executing the
following script:

```bash
#!/bin/bash

set -x

# Set the DaemonSet name and label key-value pair
DAEMONSET_NAME="revert-cgroups"
NAMESPACE="kube-system"
LABEL_KEY="cgroup-version"
LABEL_VALUE="v1"
LOG_PATTERN="sleep infinity"

# Function to check if all pods are completed
check_pods_completed() {
    local pods_completed=0

    # Get the list of DaemonSet pods
    pod_list=$(kubectl get pods -n "${NAMESPACE}" -l name="${DAEMONSET_NAME}" -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}')

    # Loop through each pod
    for pod in $pod_list; do

        # Get the logs from the pod
        logs=$(kubectl logs -n "${NAMESPACE}" "${pod}")

        # Check if the logs contain the specified pattern
        if [[ $logs == *"${LOG_PATTERN}"* ]]; then
            ((pods_completed++))
        fi

    done

    # Return the number of completed pods
    echo $pods_completed
}

# Loop until all pods are completed
while true; do
    pods_completed=$(check_pods_completed)

    # Get the total number of pods
    total_pods=$(kubectl get pods -n "${NAMESPACE}" -l name="${DAEMONSET_NAME}" --no-headers | wc -l)

    if [ "$pods_completed" -eq "$total_pods" ]; then
        echo "All pods are completed."
        break
    else
        echo "Waiting for pods to complete ($pods_completed/$total_pods)..."
        sleep 10
    fi
done

# Once all pods are completed, add the label to the nodes
node_list=$(kubectl get pods -n "${NAMESPACE}" -l name="${DAEMONSET_NAME}" -o jsonpath='{range .items[*]}{.spec.nodeName}{"\n"}{end}' | sort -u)

for node in $node_list; do
    kubectl label nodes "${node}" "${LABEL_KEY}=${LABEL_VALUE}"
    echo "Added label '${LABEL_KEY}:${LABEL_VALUE}' to node '${node}'."
done

echo "Script completed."
```
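
The completion check in the script relies on bash pattern matching against the
pod logs: because the `DaemonSet` container runs `bash -exc`, every command is
traced to the pod's log, so the trace line for `sleep infinity` appears once a
node has been reverted (or skipped because it already uses `cgroups` v1). The
check in isolation, using a simulated log:

```shell
#!/bin/bash
LOG_PATTERN="sleep infinity"

# Simulated pod log: with bash -x tracing, each executed command is echoed
# with a leading "+", so the final trace line marks that the revert finished
logs=$'+ stat -fc %T /sys/fs/cgroup/\n+ CGROUP_VERSION=tmpfs\n+ sleep infinity'

if [[ $logs == *"${LOG_PATTERN}"* ]]; then
    echo "pod completed"
else
    echo "pod still reverting"
fi
```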

The preceding script labels the nodes that have had `cgroups` v2 disabled.
This labeling removes the `DaemonSet` from nodes that have already been
rebooted with the `cgroups` v1 kernel settings.