---
title: "Enhancing Meltdown Protection with Dependency-Watchdog Annotations"
linkTitle: "Enhancing Meltdown Protection with Dependency-Watchdog Annotations"
newsSubtitle: June 25, 2025
publishdate: 2025-06-25
authors:
  - avatar: https://avatars.githubusercontent.com/ashwani2k
    login: ashwani2k
    name: Ashwani Kumar
aliases: ["/blog/2025/06/25/enhancing-meltdown-protection-with-dependency-watchdog-annotations"]
---
Gardener's `dependency-watchdog` is a crucial component for ensuring cluster stability. During infrastructure-level outages where worker nodes cannot communicate with the control plane, it activates a "meltdown protection" mechanism. This involves scaling down key control plane components like the `machine-controller-manager` (MCM), `cluster-autoscaler` (CA), and `kube-controller-manager` (KCM) to prevent them from taking incorrect actions based on stale information, such as deleting healthy nodes that are only temporarily unreachable.
### The Challenge: Premature Scale-Up During Reconciliation
Previously, a potential race condition could undermine this protection. While `dependency-watchdog` scaled down the necessary components, a concurrent `Shoot` reconciliation, whether triggered manually by an operator or by other events, could misinterpret the situation. The reconciliation logic, unaware that the scale-down was a deliberate protective measure, would attempt to restore the "desired" state by scaling the `machine-controller-manager` and `cluster-autoscaler` back up.
This premature scale-up could have serious consequences. An active `machine-controller-manager`, for instance, might see nodes in an `Unknown` state due to the ongoing outage and incorrectly decide to delete them, defeating the entire purpose of the meltdown protection.
### The Solution: A New Annotation for Clearer Signaling
To address this, Gardener now uses a more explicit signaling mechanism between `dependency-watchdog` and `gardenlet`. When `dependency-watchdog` scales down a deployment as part of its meltdown protection, it now adds the following annotation to the resource:
`dependency-watchdog.gardener.cloud/meltdown-protection-active`
This annotation serves as a clear, persistent signal that the component has been intentionally scaled down for safety.
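For illustration, a scaled-down deployment in the Shoot's control plane namespace might look like the following sketch. The namespace, replica count, and the empty annotation value are assumptions for this example, not values taken from the actual implementation:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: machine-controller-manager
  namespace: shoot--my-project--my-shoot   # hypothetical Shoot namespace
  annotations:
    # signals that dependency-watchdog scaled this deployment down on purpose
    dependency-watchdog.gardener.cloud/meltdown-protection-active: ""
spec:
  replicas: 0   # scaled down by dependency-watchdog during the outage
```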
### How It Works
The `gardenlet` component has been updated to recognize and respect this new annotation. During a `Shoot` reconciliation, before scaling any deployment, `gardenlet` now checks for the presence of the `dependency-watchdog.gardener.cloud/meltdown-protection-active` annotation.
If the annotation is found, `gardenlet` will not scale up the deployment. Instead, it preserves the current replica count set by `dependency-watchdog`, ensuring that the meltdown protection remains effective until the underlying infrastructure issue is resolved and `dependency-watchdog` itself restores the components. This change makes the meltdown protection mechanism more robust and prevents unintended node deletions during any degradation of connectivity between the nodes and control plane.
Additionally, if an operator decides to exclude a deployment that carries the `dependency-watchdog.gardener.cloud/meltdown-protection-active` annotation from meltdown consideration by annotating it with `dependency-watchdog.gardener.cloud/ignore-scaling`, `dependency-watchdog` removes the `meltdown-protection-active` annotation from that deployment, and the deployment becomes eligible for scale-up during the next `Shoot` reconciliation. The operator can also scale such a deployment up explicitly rather than waiting for the next reconciliation.
### For More Information
To dive deeper into the implementation details, you can review the changes in the corresponding pull request.
* **[GitHub PR #12272](https://github.com/gardener/gardener/pull/12272)**
* **[Watch the demo](https://youtu.be/kcXSyloteSs?t=501)**
