|
| 1 | +--- |
| 2 | +title: "Enhancing Meltdown Protection with Dependency-Watchdog Annotations" |
| 3 | +linkTitle: "Enhancing Meltdown Protection with Dependency-Watchdog Annotations" |
| 4 | +newsSubtitle: June 25, 2025 |
| 5 | +publishdate: 2025-06-25 |
| 6 | +authors: |
| 7 | +- avatar: https://avatars.githubusercontent.com/ashwani2k |
| 8 | + login: ashwani2k |
| 9 | + name: Ashwani Kumar |
| 10 | +aliases: ["/blog/2025/06/25/enhancing-meltdown-protection-with-dependency-watchdog-annotations"] |
| 11 | +--- |
| 12 | + |
| 13 | +Gardener's `dependency-watchdog` is a crucial component for ensuring cluster stability. During infrastructure-level outages where worker nodes cannot communicate with the control plane, it activates a "meltdown protection" mechanism. This involves scaling down key control plane components like the `machine-controller-manager` (MCM), `cluster-autoscaler` (CA), and `kube-controller-manager` (KCM) to prevent them from taking incorrect actions based on stale information, such as deleting healthy nodes that are only temporarily unreachable. |
| 14 | + |
| 15 | +### The Challenge: Premature Scale-Up During Reconciliation |
| 16 | + |
| 17 | +Previously, a potential race condition could undermine this protection. While `dependency-watchdog` scaled down the necessary components, a concurrent `Shoot` reconciliation, whether triggered manually by an operator or by other events, could misinterpret the situation. The reconciliation logic, unaware that the scale-down was a deliberate protective measure, would attempt to restore the "desired" state by scaling the `machine-controller-manager` and `cluster-autoscaler` back up. |
| 18 | + |
| 19 | +This premature scale-up could have serious consequences. An active `machine-controller-manager`, for instance, might see nodes in an `Unknown` state due to the ongoing outage and incorrectly decide to delete them, defeating the entire purpose of the meltdown protection. |
| 20 | + |
| 21 | +### The Solution: A New Annotation for Clearer Signaling |
| 22 | + |
| 23 | +To address this, Gardener now uses a more explicit signaling mechanism between `dependency-watchdog` and `gardenlet`. When `dependency-watchdog` scales down a deployment as part of its meltdown protection, it now adds the following annotation to the resource: |
| 24 | + |
| 25 | +`dependency-watchdog.gardener.cloud/meltdown-protection-active` |
| 26 | + |
| 27 | +This annotation serves as a clear, persistent signal that the component has been intentionally scaled down for safety. |
| 28 | + |
| 29 | +### How It Works |
| 30 | + |
| 31 | +The `gardenlet` component has been updated to recognize and respect this new annotation. During a `Shoot` reconciliation, before scaling any deployment, `gardenlet` now checks for the presence of the `dependency-watchdog.gardener.cloud/meltdown-protection-active` annotation. |
| 32 | + |
| 33 | +If the annotation is found, `gardenlet` will not scale up the deployment. Instead, it preserves the current replica count set by `dependency-watchdog`, ensuring that the meltdown protection remains effective until the underlying infrastructure issue is resolved and `dependency-watchdog` itself restores the components. This change makes the meltdown protection mechanism more robust and prevents unintended node deletions during any degradation of connectivity between the nodes and control plane. |
| 34 | + |
| 35 | +Additionally on a deployment which has `dependency-watchdog.gardener.cloud/meltdown-protection-active` annotation set, if the operator decides to ignores such a deployment from meltdown consideration by annotating it with `dependency-watchdog.gardener.cloud/ignore-scaling`, then for such deployments `dependency-watchdog` shall remove the `dependency-watchdog.gardener.cloud/meltdown-protection-active` annotation and the deployment shall be considered for scale-up as part of next shoot reconciliation. The operator can also explicitly scale up such a deployment and not wait for the next shoot reconciliation. |
| 36 | + |
| 37 | +### For More Information |
| 38 | + |
| 39 | +To dive deeper into the implementation details, you can review the changes in the corresponding pull request. |
| 40 | + |
| 41 | +* **[GitHub PR #12272](https://github.com/gardener/gardener/pull/12272)** |
| 42 | +* **[Watch the demo](https://youtu.be/kcXSyloteSs?t=501)** |
0 commit comments