---
layout: blog
title: "The Cloud Controller Manager Chicken and Egg Problem"
date: 2025-02-14
slug: cloud-controller-manager-chicken-egg-problem
author: >
  Antonio Ojea,
  Michael McCune
---

Kubernetes 1.31
[completed the largest migration in Kubernetes history][migration-blog], removing the in-tree
cloud provider. While the component migration is now done, this leaves some additional
complexity for users and installer projects (for example, kOps or Cluster API). We will go
over those additional steps and failure points and make recommendations for cluster owners.
This migration was complex, and some logic had to be extracted from the core components into
four new subsystems.

1. **Cloud controller manager** ([KEP-2392][kep2392])
2. **API server network proxy** ([KEP-1281][kep1281])
3. **kubelet credential provider plugins** ([KEP-2133][kep2133])
4. **Storage migration to use [CSI][csi]** ([KEP-625][kep625])

The [cloud controller manager is part of the control plane][ccm]. It is a critical component
that replaces some functionality that existed previously in the kube-controller-manager and the
kubelet.

One of the most critical functionalities of the cloud controller manager is the node controller,
which is responsible for the initialization of the nodes.

As you can see in the following diagram, when the **kubelet** starts, it registers the `Node`
object with the apiserver, tainting the node so it can be processed first by the
cloud-controller-manager. The initial `Node` object is missing the cloud-provider-specific
information, such as the node addresses and the labels with the zone, region, and instance type
of the node.

```mermaid
sequenceDiagram
  autonumber
  rect rgb(191, 223, 255)
  Kubelet->>+Kube-apiserver: Create Node
  Note over Kubelet: Taint:<br/> node.cloudprovider.kubernetes.io
  Kube-apiserver->>-Kubelet: Node Created
  end
  Note over Kube-apiserver: Node is Not Ready<br/> Tainted, Missing Node Addresses*, ...
  Note over Kube-apiserver: Send Updates
  rect rgb(200, 150, 255)
  Kube-apiserver->>+Cloud-controller-manager: Watch: New Node Created
  Note over Cloud-controller-manager: Initialize Node:<br/>Cloud Provider Labels, Node Addresses, ...
  Cloud-controller-manager->>-Kube-apiserver: Update Node
  end
  Note over Kube-apiserver: Node is Ready
```
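
To make the starting state concrete, here is a rough sketch of what the `Node` object can look
like right after the kubelet registers it with `--cloud-provider=external`. The node name and
the exact set of empty fields are illustrative and depend on your kubelet configuration:

```yaml
# Illustrative only: a freshly registered Node before cloud-controller-manager initialization.
apiVersion: v1
kind: Node
metadata:
  name: worker-01            # hypothetical node name
  labels:
    kubernetes.io/hostname: worker-01
    # Cloud-provider labels such as topology.kubernetes.io/region,
    # topology.kubernetes.io/zone and node.kubernetes.io/instance-type
    # are not set yet; the cloud-controller-manager adds them later.
spec:
  taints:
  - key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"
    effect: NoSchedule
status:
  # Node addresses may be missing or incomplete until the
  # cloud-controller-manager populates them from the cloud API.
  addresses: []
```

Once the cloud-controller-manager has looked the node up in the cloud provider's API, it fills
in the missing labels and addresses and removes the taint.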

This new initialization process adds some latency to node readiness. Previously, the kubelet
was able to initialize the node at the same time it created the node. Since the logic has moved
to the cloud-controller-manager, this can cause a [chicken and egg problem][chicken-and-egg]
during cluster bootstrapping for those Kubernetes architectures that do not deploy the
controller manager in the same way as the other components of the control plane, which are
commonly run as static pods, standalone binaries, or DaemonSets/Deployments with tolerations
for the taints and `hostNetwork` enabled (more on this below).

## Examples of the dependency problem

As noted above, it is possible during bootstrapping for the cloud-controller-manager to be
unschedulable and, as such, the cluster will not initialize properly. The following are a few
concrete examples of how this problem can manifest and the root causes for why they might
occur.

These examples assume you are running your cloud-controller-manager using a Kubernetes resource
(e.g. Deployment, DaemonSet, or similar) to control its lifecycle. Because these methods
rely on Kubernetes to schedule the cloud-controller-manager, care must be taken to ensure it
will schedule properly.

### Example: Cloud controller manager not scheduling due to uninitialized taint

As [noted in the Kubernetes documentation][kubedocs0], when the kubelet is started with the command line
flag `--cloud-provider=external`, its corresponding `Node` object will have a no schedule taint
named `node.cloudprovider.kubernetes.io/uninitialized` added. Because the cloud-controller-manager
is responsible for removing the no schedule taint, this can create a situation where a
cloud-controller-manager that is being managed by a Kubernetes resource, such as a `Deployment`
or `DaemonSet`, may not be able to schedule.

If the cloud-controller-manager is not able to be scheduled during the initialization of the
control plane, then the resulting `Node` objects will all have the
`node.cloudprovider.kubernetes.io/uninitialized` no schedule taint. It also means that this taint
will not be removed as the cloud-controller-manager is responsible for its removal. If the no
schedule taint is not removed, then critical workloads, such as the container network interface
controllers, will not be able to schedule, and the cluster will be left in an unhealthy state.
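
The way out of this deadlock is to make sure the cloud-controller-manager's pod spec tolerates
the taint, so that it can land on an uninitialized node and then remove the taint itself. A
minimal sketch of that toleration (matching the fuller example later in this post) looks like
this:

```yaml
# Toleration that lets the cloud-controller-manager schedule onto
# nodes that still carry the uninitialized taint.
tolerations:
- key: node.cloudprovider.kubernetes.io/uninitialized
  operator: Exists
  effect: NoSchedule
```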

### Example: Cloud controller manager not scheduling due to not-ready taint

The next example would be possible in situations where the container network interface (CNI) is
waiting for IP address information from the cloud-controller-manager (CCM), and the CCM does not
tolerate the taint that the CNI would remove.

The [Kubernetes documentation describes][kubedocs1] the `node.kubernetes.io/not-ready` taint as follows:

> "The Node controller detects whether a Node is ready by monitoring its health and adds or removes this taint accordingly."

One of the conditions that can lead to a `Node` resource having this taint is when the container
network has not yet been initialized on that node. As the cloud-controller-manager is responsible
for adding the IP addresses to a `Node` resource, and the IP addresses are needed by the container
network controllers to properly configure the container network, it is possible in some
circumstances for a node to become permanently stuck as not ready and uninitialized.

This situation occurs for a similar reason as the first example, although in this case, the
`node.kubernetes.io/not-ready` taint is used with the no execute effect and thus will cause the
cloud-controller-manager not to run on the node with the taint. If the cloud-controller-manager is
not able to execute, then it will not initialize the node. This will cascade into the container
network controllers not being able to run properly, and the node will end up carrying both the
`node.cloudprovider.kubernetes.io/uninitialized` and `node.kubernetes.io/not-ready` taints,
leaving the cluster in an unhealthy state.
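
In other words, a node stuck in this state carries both taints at once, and the one with the
`NoExecute` effect is what keeps the cloud-controller-manager from running there at all. On such
a node, the taints would look roughly like this (illustrative):

```yaml
# Taints on a node that is stuck both uninitialized and not ready.
spec:
  taints:
  - key: node.cloudprovider.kubernetes.io/uninitialized
    value: "true"
    effect: NoSchedule
  - key: node.kubernetes.io/not-ready
    effect: NoExecute
```

Tolerating the `node.kubernetes.io/not-ready` taint in the cloud-controller-manager manifest, as
recommended below, breaks this cycle.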

## Our Recommendations

There is no one “correct way” to run a cloud-controller-manager. The details will depend on the
specific needs of the cluster administrators and users. When planning your clusters and the
lifecycle of the cloud-controller-managers, please consider the following guidance:

For cloud-controller-managers running in the same cluster they are managing:

1. Use host network mode, rather than the pod network: in most cases, a cloud controller manager
   will need to communicate with an API service endpoint associated with the infrastructure.
   Setting `hostNetwork` to true will ensure that the cloud controller is using the host
   networking instead of the container network and, as such, will have the same network access as
   the host operating system. It will also remove the dependency on the networking plugin. This
   will ensure that the cloud controller has access to the infrastructure endpoint (always check
   your networking configuration against your infrastructure provider’s instructions).
2. Use a scalable resource type. `Deployments` and `DaemonSets` are useful for controlling the
   lifecycle of a cloud controller. They make it easy to run multiple copies for redundancy
   as well as to use Kubernetes scheduling to ensure proper placement in the cluster. When using
   these primitives to control the lifecycle of your cloud controllers and running multiple
   replicas, you must remember to enable leader election, or else your controllers will collide
   with each other, which could lead to nodes not being initialized in the cluster.
3. Target the controller manager containers to the control plane. There might exist other
   controllers which need to run outside the control plane (for example, Azure’s node manager
   controller). Still, the controller managers themselves should be deployed to the control plane.
   Use a node selector or affinity stanza to direct the scheduling of cloud controllers to the
   control plane to ensure that they are running in a protected space. Cloud controllers are vital
   to adding and removing nodes in a cluster as they form a link between Kubernetes and the
   physical infrastructure. Running them on the control plane will help to ensure that they run
   with a similar priority as other core cluster controllers and that they have some separation
   from non-privileged user workloads.
   1. It is worth noting that an anti-affinity stanza to prevent cloud controllers from running
      on the same host is also very useful to ensure that a single node failure will not degrade
      the cloud controller performance.
4. Ensure that the tolerations allow operation. Use tolerations on the manifest for the cloud
   controller container to ensure that it will schedule to the correct nodes and that it can run
   in situations where a node is initializing. This means that cloud controllers should tolerate
   the `node.cloudprovider.kubernetes.io/uninitialized` taint, and they should also tolerate any
   taints associated with the control plane (for example, `node-role.kubernetes.io/control-plane`
   or `node-role.kubernetes.io/master`). It can also be useful to tolerate the
   `node.kubernetes.io/not-ready` taint to ensure that the cloud controller can run even when the
   node is not yet available for health monitoring.

For cloud-controller-managers that will not be running on the cluster they manage (for example,
in a hosted control plane on a separate cluster), the rules are much more constrained by the
dependencies of the environment of the cluster running the cloud-controller-manager. The advice
for running on a self-managed cluster may not be appropriate as the types of conflicts and network
constraints will be different. Please consult the architecture and requirements of your topology
for these scenarios.

### Example

This is an example of a Kubernetes Deployment highlighting the guidance shown above. It is
important to note that this is for demonstration purposes only; for production use, please
consult your cloud provider’s documentation.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app.kubernetes.io/name: cloud-controller-manager
    kubernetes.io/description: "Cloud controller manager for my infrastructure"
  name: cloud-controller-manager
  namespace: kube-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app.kubernetes.io/name: cloud-controller-manager
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app.kubernetes.io/name: cloud-controller-manager
    spec:
      containers: # the container details will depend on your specific cloud controller manager
      - name: cloud-controller-manager
        command:
        - /bin/my-infrastructure-cloud-controller-manager
        - --leader-elect=true
        - -v=1
        image: registry/my-infrastructure-cloud-controller-manager:latest
        resources:
          requests:
            cpu: 200m
            memory: 50Mi
      hostNetwork: true # these Pods are part of the control plane
      nodeSelector:
        node-role.kubernetes.io/control-plane: ""
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - topologyKey: "kubernetes.io/hostname"
            labelSelector:
              matchLabels:
                app.kubernetes.io/name: cloud-controller-manager
      tolerations:
      - effect: NoSchedule
        key: node-role.kubernetes.io/control-plane
        operator: Exists
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
        operator: Exists
      - effect: NoExecute
        key: node.kubernetes.io/unreachable
        operator: Exists
        tolerationSeconds: 120
      - effect: NoExecute
        key: node.kubernetes.io/not-ready
        operator: Exists
        tolerationSeconds: 120
      - effect: NoSchedule
        key: node.cloudprovider.kubernetes.io/uninitialized
        operator: Exists
      - effect: NoSchedule
        key: node.kubernetes.io/not-ready
        operator: Exists
```

When deciding how to deploy your cloud controller manager it is worth noting that
cluster-proportional, or resource-based, pod autoscaling is not recommended. Running multiple
replicas of a cloud controller manager is good practice for ensuring high availability and
redundancy, but it does not contribute to better performance. In general, only a single instance
of a cloud controller manager will be reconciling a cluster at any given time.
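
That single active instance is chosen through leader election, which is commonly coordinated
through a `Lease` object in the `kube-system` namespace. The exact name and namespace depend on
your provider's flags and defaults, so treat the following as an illustrative sketch only:

```yaml
# Illustrative Lease used for cloud-controller-manager leader election.
# Only the replica named in holderIdentity is actively reconciling; the
# other replicas stand by until the lease expires or is released.
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: cloud-controller-manager   # hypothetical; depends on provider defaults
  namespace: kube-system
spec:
  holderIdentity: cloud-controller-manager-7d5c9b6f4d-abcde_c0ffee00-1111-2222-3333-444444444444
  leaseDurationSeconds: 15
```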

[migration-blog]: /blog/2024/05/20/completing-cloud-provider-migration/
[kep2392]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-cloud-provider/2392-cloud-controller-manager/README.md
[kep1281]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/1281-network-proxy
[kep2133]: https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/2133-kubelet-credential-providers
[csi]: https://github.com/container-storage-interface/spec?tab=readme-ov-file#container-storage-interface-csi-specification-
[kep625]: https://github.com/kubernetes/enhancements/blob/master/keps/sig-storage/625-csi-migration/README.md
[ccm]: /docs/concepts/architecture/cloud-controller/
[chicken-and-egg]: /docs/tasks/administer-cluster/running-cloud-controller/#chicken-and-egg
[kubedocs0]: /docs/tasks/administer-cluster/running-cloud-controller/#running-cloud-controller-manager
[kubedocs1]: /docs/reference/labels-annotations-taints/#node-kubernetes-io-not-ready