---
title: In place propagation of changes affecting Kubernetes objects only
authors:
- "@fabriziopandini"
- "@sbueringer"
reviewers:
- "@oscar"
- "@vincepri"
creation-date: 2022-02-10
last-updated: 2022-02-26
status: implementable
replaces:
superseded-by:

---

# In place propagation of changes affecting Kubernetes objects only

## Glossary

Refer to the [Cluster API Book Glossary](https://cluster-api.sigs.k8s.io/reference/glossary.html).

**In-place mutable fields**: fields whose changes only impact Kubernetes objects and/or controller behaviour, but do not
mutate in any way the provider infrastructure or the software running on it. In-place mutable fields are propagated
in place by CAPI controllers to avoid the more elaborate mechanics of a rollout by replacement.
They include metadata, MinReadySeconds, NodeDrainTimeout, NodeVolumeDetachTimeout and NodeDeletionTimeout, but the list
is not exhaustive and can be expanded in the future.

## Summary

This document discusses how labels, annotations and other fields impacting only Kubernetes objects or controller behaviour (e.g. NodeDrainTimeout)
propagate from ClusterClass to KubeadmControlPlane/MachineDeployments and ultimately to Machines.

## Motivation

Managing labels on Kubernetes nodes has been a long-standing [issue](https://github.com/kubernetes-sigs/cluster-api/issues/493) in Cluster API.

The following challenges have been identified through various iterations:

- Define how labels propagate from Machine to Node.
- Define how labels and annotations propagate from ClusterClass to KubeadmControlPlane/MachineDeployments and ultimately to Machines.
- Define how to prevent label and annotation propagation from triggering unnecessary rollouts.

The first point is being addressed by [Label Sync Between Machine and underlying Kubernetes Nodes](./20220927-label-sync-between-machine-and-nodes.md),
while this document tackles the remaining two points.

During a preliminary exploration we identified that the two challenges above also apply to other fields impacting only Kubernetes objects or
controller behaviour (see e.g. [Support to propagate properties in-place from MachineDeployments to Machines](https://github.com/kubernetes-sigs/cluster-api/issues/5880)).

As a consequence we have decided to expand this work to consider how to propagate labels, annotations and fields impacting only Kubernetes objects or
controller behaviour, as well as this related issue: [Labels and annotations for MachineDeployments and KubeadmControlPlane created by topology controller](https://github.com/kubernetes-sigs/cluster-api/issues/7006).

### Goals

- Define how labels and annotations propagate from ClusterClass to KubeadmControlPlane/MachineDeployments and ultimately to Machines.
- Define how fields impacting only Kubernetes objects or controller behaviour propagate from ClusterClass to KubeadmControlPlane/
  MachineDeployments, and ultimately to Machines.
- Define how to prevent the propagation of labels, annotations and other fields impacting only Kubernetes objects or controller behaviour
  from triggering unnecessary rollouts.

### Non-Goals

- Discuss the immutability core design principle in Cluster API (on the contrary, this proposal makes immutability even better by improving
  the criteria for when we trigger Machine rollouts).
- Support in-place mutation for components or settings that exist on Machines (this proposal focuses only on labels, annotations and other
  fields impacting only Kubernetes objects or controller behaviour).

### Future-Goals

- Expand propagation rules to include MachinePools after the [MachinePool Machines proposal](./20220209-machinepool-machines.md) is implemented.

## Proposal

### User Stories

#### Story 1

As a cluster admin/user, I would like a declarative and secure means by which to assign roles to my nodes via Cluster topology metadata
(for Clusters with ClusterClass).

As a cluster admin/user, I would like a declarative and secure means by which to assign roles to my nodes via KubeadmControlPlane and
MachineDeployments (for Clusters without ClusterClass).

#### Story 2

As a cluster admin/user, I would like to change labels or annotations on Machines without triggering Machine rollouts.

#### Story 3

As a cluster admin/user, I would like to change nodeDrainTimeout on Machines without triggering Machine rollouts.

#### Story 4

As a cluster admin/user, I would like to set autoscaler labels for MachineDeployments by changing Cluster topology metadata
(for Clusters with ClusterClass).

### Implementation Details/Notes/Constraints

### Metadata propagation

The following schema represents how metadata propagation works today (also documented in the [book](https://cluster-api.sigs.k8s.io/developer/architecture/controllers/metadata-propagation.html)).


With this proposal we are suggesting improvements to metadata propagation, as described in the following schema:


The following paragraphs provide more details about the proposed changes.

#### 1. Label Sync Between Machine and underlying Kubernetes Nodes

As discussed in [Label Sync Between Machine and underlying Kubernetes Nodes](./20220927-label-sync-between-machine-and-nodes.md), we propagate only
labels with a well-known prefix or a well-known domain from the Machine to the corresponding Kubernetes Node.
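
For illustration, a minimal sketch of a Machine carrying a node role label; per the referenced proposal, only labels within the well-known prefixes/domains are synced to the Node, while other labels stay on the Machine (names and values below are hypothetical):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Machine
metadata:
  name: worker-machine-0
  labels:
    # Falls within the well-known prefixes/domains defined by the label sync proposal,
    # so it is expected to be synced to the corresponding Node.
    node-role.kubernetes.io/worker: ""
    # Hypothetical user label outside the well-known prefixes/domains:
    # it stays on the Machine and is not synced to the Node.
    environment: dev
```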

#### 2. Labels/Annotations always reconciled

All the labels/annotations previously set only on creation are now going to be continuously reconciled;
in order to prevent unnecessary rollouts, metadata propagation should happen in-place;
see [in-place propagation](#in-place-propagation) below for more details.

Note: As of today the topology controller already propagates ClusterClass and Cluster topology metadata changes in-place when possible
in order to avoid unnecessary template rotation with the consequent Machine rollout; we do not foresee changes to this logic.

#### 3. and 4. Set top-level labels/annotations for ControlPlane and MachineDeployments created from a ClusterClass

Labels and annotations from ClusterClass and Cluster.topology are going to be propagated to the top-level labels and annotations of
ControlPlane and MachineDeployment objects, as shown in the example below.

This addresses [Labels and annotations for MachineDeployments and KubeadmControlPlane created by topology controller](https://github.com/kubernetes-sigs/cluster-api/issues/7006).

Note: The proposed solution avoids adding additional metadata fields to ClusterClass and Cluster.topology, but
this has the disadvantage that it is not possible to differentiate top-level labels/annotations from Machine ones;
given the discussion on the issue above this isn't a requirement.
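
A minimal sketch of where these labels/annotations are defined in Cluster.topology (cluster name, class and values are hypothetical); with this proposal they would be propagated to the top-level metadata of the generated KubeadmControlPlane and MachineDeployments, in addition to the Machines:

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
  name: my-cluster
spec:
  topology:
    class: my-cluster-class
    version: v1.25.0
    controlPlane:
      metadata:
        labels:
          # Hypothetical label: expected to land on the KubeadmControlPlane
          # top-level metadata as well as on the control plane Machines.
          team: platform
    workers:
      machineDeployments:
      - class: default-worker
        name: md-0
        metadata:
          annotations:
            # Hypothetical annotation: expected to land on the MachineDeployment
            # top-level metadata as well as on its Machines.
            example.com/owner: team-a
```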

### Propagation of fields impacting only Kubernetes objects or controller behaviour

In addition to labels and annotations, there are also other fields that flow down from ClusterClass to KubeadmControlPlane/MachineDeployments and
ultimately to Machines.

Some of them can be treated like labels and annotations, because they impact only Kubernetes objects or controller behaviour, but
not the actual Machine itself - including its infrastructure and the software running on it (in-place mutable fields).
Examples are `MinReadySeconds`, `NodeDrainTimeout`, `NodeVolumeDetachTimeout` and `NodeDeletionTimeout`.

Propagation of changes to those fields will be implemented using the same [in-place propagation](#in-place-propagation) mechanism implemented
for metadata, as sketched in the example below.
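
A minimal MachineDeployment excerpt highlighting which fields are in-place mutable under this proposal (values are hypothetical, and required fields such as the bootstrap and infrastructure references are omitted):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: md-0
  labels:
    environment: dev             # metadata: in-place mutable
spec:
  clusterName: my-cluster
  minReadySeconds: 10            # in-place mutable
  template:
    metadata:
      labels:
        environment: dev         # in-place mutable, propagated to Machines
    spec:
      clusterName: my-cluster
      version: v1.25.0           # NOT in-place mutable: changing it triggers a rollout
      nodeDrainTimeout: 5m       # in-place mutable
      nodeVolumeDetachTimeout: 5m # in-place mutable
      nodeDeletionTimeout: 10s   # in-place mutable
```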

### In-place propagation

With in-place propagation we are referring to a mechanism that updates existing Kubernetes objects, like MachineSets or Machines, instead of
creating new objects with the updated fields and then deleting the current ones.

The main benefit of this approach is that it prevents unnecessary rollouts of the corresponding infrastructure, with the consequent creation/
deletion of a Kubernetes Node and drain/rescheduling of the workloads hosted on the Machine being deleted.

**Important!** In-place propagation of changes as defined above applies only to metadata changes or to fields impacting only Kubernetes objects
or controller behaviour. This approach cannot be used to apply changes to the infrastructure hosting a Machine, to the OS or to any software
installed on it, including Kubernetes components (kubelet, static pods, CRI, etc.).

Implementing in-place propagation has two distinct challenges:

- The current rules defining when MachineDeployments or KubeadmControlPlane trigger a rollout should be modified in order to ignore metadata and
  other fields that are going to be propagated in-place.

- When implementing the reconcile loop that performs in-place propagation, it is required to avoid impacting other components applying
  labels or annotations to the same object. For example, when reconciling labels on a Machine, Cluster API should take care of reconciling
  only the labels it manages, without changing any label applied by users or by other controllers on the same Machine.

#### MachineDeployment rollouts

The MachineDeployment controller determines when a rollout is required using a "semantic equality" comparison between the current MachineDeployment
spec and the corresponding MachineSet spec.

While implementing this proposal we should change the definition of "semantic equality" in order to exclude metadata and fields that
should be updated in-place.

On top of that we should also account for the use case where, after deploying the new "semantic equality" rule, there are already one or more
MachineSets matching the MachineDeployment. Today in this case Cluster API deterministically picks the oldest of them.

When exploring the solution for this proposal we discovered that the above approach can cause turbulence in the Cluster, because it does not
take into account which MachineSet the existing Machines belong to. As a consequence a Cluster API upgrade could lead to a rollout with Machines moving from
one "semantically equal" MachineSet to another, which is an unnecessary operation.

In order to prevent this we are modifying the MachineDeployment controller to pick the "semantically equal" MachineSet with the most
Machines, thus avoiding or minimizing turbulence in the Cluster.

##### What about the hash label

The MachineDeployment controller relies on a label with a hash value to identify Machines belonging to a MachineSet; the hash value
is also used as the suffix of the MachineSet name.

Currently the hash is computed using an algorithm that considers the same set of fields used to determine "semantic equality" between the current
MachineDeployment spec and the corresponding MachineSet spec.

When exploring the solution for this proposal, we decided that the above algorithm can be simplified by using a simple random string
plus a check that ensures the random string is not already taken by an existing MachineSet (for this MachineDeployment).

The main benefit of this change is that we decouple "semantic equality" from computing a UID used to identify Machines
belonging to a MachineSet, thus making the code easier to understand and simplifying future changes to the rollout rules.
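
A sketch of what this could look like for a MachineDeployment named `md-0` (the label key, suffix and field layout shown are illustrative, not a committed API):

```yaml
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineSet
metadata:
  # The suffix is now a random string (checked for uniqueness among the
  # MachineDeployment's MachineSets) instead of a hash of the spec.
  name: md-0-x7k2p
  labels:
    machine-template-hash: x7k2p
spec:
  clusterName: my-cluster
  selector:
    matchLabels:
      machine-template-hash: x7k2p
  template:
    metadata:
      labels:
        # Machines created from this MachineSet carry the same value,
        # which is how they are matched to the MachineSet.
        machine-template-hash: x7k2p
```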

#### KCP rollouts

The KCP controller determines when a rollout is required using a "semantic equality" comparison between the current KCP
object and the corresponding Machine objects.

The "semantic equality" implementation is pretty complex, but for the sake of this proposal only a few details are relevant:

- Rollout is triggered if a Machine doesn't have all the labels and annotations in spec.machineTemplate.metadata.
- Rollout is triggered if the KubeadmConfig linked to a Machine doesn't have all the labels and annotations in spec.machineTemplate.metadata.

While implementing this proposal, the above rules should be dropped and replaced by in-place updates of labels & annotations.
Please also note that the current rules do not detect when a label/annotation is removed from spec.machineTemplate.metadata,
and thus users are required to remove labels/annotations manually; this is considered a bug and the new implementation
should account for this use case.

Also, according to the current "semantic equality" rules, changes to nodeDrainTimeout, nodeVolumeDetachTimeout and nodeDeletionTimeout are
applied only to new Machines (they don't trigger a rollout). While implementing this proposal, we should make sure that
those changes are propagated to existing Machines as well, without triggering a rollout; see the sketch below.
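
A minimal KubeadmControlPlane excerpt showing the fields involved (values are hypothetical, and required fields such as kubeadmConfigSpec and the infrastructure reference are omitted); under this proposal, changing any of the in-place mutable fields would be reflected on existing control plane Machines without a rollout:

```yaml
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
  name: my-cluster-control-plane
spec:
  replicas: 3
  version: v1.25.0               # NOT in-place mutable: changing it triggers a rollout
  machineTemplate:
    metadata:
      labels:
        team: platform           # in-place mutable, propagated to Machines
    nodeDrainTimeout: 5m         # in-place mutable
    nodeVolumeDetachTimeout: 5m  # in-place mutable
    nodeDeletionTimeout: 10s     # in-place mutable
```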

#### Avoiding conflicts with other components

While doing [in-place propagation](#in-place-propagation), and thus continuously reconciling info from one Kubernetes
object to another, we are also reconciling values in maps, e.g. labels or annotations.

This creates some challenges. Assume that we want to reconcile the following labels from a MachineDeployment to a Machine:

```yaml
labels:
  a: a
  b: b
```

After the first reconciliation, the Machine gets the above labels.
Now assume that we remove label `b` from the MachineDeployment; the expected set of labels is

```yaml
labels:
  a: a
```

But the Machine still has the label `b`, and the controller cannot remove it, because at this stage there is no
clear signal allowing it to detect whether this label has been applied by Cluster API, by the user, or by another controller.

In order to properly manage this use case, i.e. co-authored maps, the solution available in the API server is
to use [Server Side Apply](https://kubernetes.io/docs/reference/using-api/server-side-apply/) patches.
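
For illustration, with SSA the API server records which field manager owns each label, which is the signal the controller needs in order to delete `b`. A sketch of the managedFields entry that could be recorded on the Machine after Cluster API applies the labels above (the manager name is hypothetical):

```yaml
metadata:
  managedFields:
  - manager: capi-machineset    # hypothetical field manager name used by the controller
    operation: Apply
    apiVersion: cluster.x-k8s.io/v1beta1
    fieldsType: FieldsV1
    fieldsV1:
      f:metadata:
        f:labels:
          f:a: {}
          # ownership of b is tracked; dropping b from the next applied
          # configuration removes it from the Machine
          f:b: {}
```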

Based on previous experience with introducing SSA in the topology controller, this change requires a lot of testing
and validation. Some cases that should be specifically verified include:

- introducing SSA patches on an already existing object (and ensuring that SSA takes over ownership of managed labels/annotations properly)
- using SSA patches on objects after move or velero backup/restore (and ensuring that SSA takes over ownership of managed labels/annotations properly)

However, while those use cases must be verified during implementation, it is assumed that using API server
built-in capabilities is a stronger, long-term solution than any other alternative.

## Alternatives

### To not use SSA for [in-place propagation](#in-place-propagation) and be authoritative on labels and annotations

If Cluster API uses regular patches instead of SSA patches (a well-tested path in Cluster API), Cluster API could
be implemented to be authoritative on labels and annotations. That means that all labels and annotations would have to
be propagated from higher-level objects (e.g. all of a Machine's labels should be set on the MachineSet, and so
on up the propagation chain).

This is not considered acceptable, because users and other controllers must be able to apply their own
labels to any Kubernetes object, including the ones managed by Cluster API.

### To not use SSA for [in-place propagation](#in-place-propagation) and do not delete labels/annotations

If Cluster API uses regular patches instead of SSA patches, but without being authoritative, Cluster API could
be implemented to add new labels from higher-level objects (e.g. a new label added to a MachineSet is added to
the corresponding Machines) and to enforce label values from higher-level objects.

But, as explained in [avoiding conflicts with other components](#avoiding-conflicts-with-other-components), with
this approach there is no way to determine whether a label/annotation has been applied by Cluster API, by the user, or by another controller,
and thus automatic label/annotation deletion cannot be implemented.

This approach is not considered ideal, because it transfers the ownership of label and annotation deletion
to users or other controllers, which is not a nice user experience.

### To not use SSA for [in-place propagation](#in-place-propagation) and use status fields to track labels previously applied by CAPI

If Cluster API uses regular patches instead of SSA patches, without being authoritative, it is possible to implement
a DIY solution for tracking label ownership based on status fields or annotations.

This approach is not considered ideal, because e.g. status fields do not survive move/backup and restore, and, taking
a step back, this is sort of re-implementing SSA or a subset of it.

### Change more propagation rules

While working on the set of changes proposed above, a set of optional changes to the existing propagation rules has been
identified; however, considering that the most complex part of this proposal is implementing [in-place propagation](#in-place-propagation),
it was decided to implement only the few, most critical changes to the propagation rules.

Nevertheless we are documenting the optional changes dropped from the scope of this iteration for future reference.


Optional changes:

- 4b: Simplify MachineDeployment to MachineSet label propagation
  Leveraging the change introduced in 4, it is possible to simplify MachineDeployment to MachineSet label propagation,
  which currently mimics Deployment to ReplicaSet label propagation. The downside of this change is that it would no longer be
  possible to have different labels/annotations on MachineDeployment & MachineSet.

- 5a and 5b: Propagate ClusterClass and Cluster.topology to templates
  This change makes ClusterClass and Cluster.topology labels/annotations propagate to templates as well.
  Please note that this change requires further discussion, because:
  - The contract with providers should be extended to add optional metadata fields where necessary.
  - It should be defined how to detect whether a template for a specific provider has the optional metadata fields,
    and this is tricky because Cluster API doesn't have detailed knowledge of providers' types.
  - InfrastructureMachineTemplates are immutable in a lot of providers, so we have to discuss how/if we should
    be able to mutate InfrastructureMachineTemplate.spec.template.metadata.

## Implementation History

- [ ] 10/03/2022: First Draft of this document