@thunderboltsid thunderboltsid commented Jul 11, 2025

Adds a new controller that monitors cluster.status.failureDomains and triggers a rollout on the KubeadmControlPlane when a failure domain currently in use by control plane machines is disabled or removed.

The controller is enabled by default and can be disabled with --failure-domain-rollout-enabled=false.
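
A minimal sketch of how an on-by-default boolean flag like this can be wired up, using only Go's standard flag package. The parseEnabled helper and the "manager" flag set name are hypothetical, and the project may well use a different flag library; only the flag name and its default come from the description above.

```go
package main

import (
	"flag"
	"fmt"
)

// parseEnabled is a hypothetical helper: it parses the given arguments and
// reports whether the failure domain rollout controller should be started.
func parseEnabled(args []string) (bool, error) {
	fs := flag.NewFlagSet("manager", flag.ContinueOnError)
	// Default is true, matching "enabled by default" in the description.
	enabled := fs.Bool(
		"failure-domain-rollout-enabled",
		true,
		"enable the failure domain rollout controller",
	)
	if err := fs.Parse(args); err != nil {
		return false, err
	}
	return *enabled, nil
}

func main() {
	on, _ := parseEnabled(nil)
	off, _ := parseEnabled([]string{"--failure-domain-rollout-enabled=false"})
	fmt.Println("no flag:", on)
	fmt.Println("explicitly disabled:", off)
}
```

Boolean flags need the --name=false form to be switched off, which is why the description spells out the value.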

How has this been tested?

  • Create 4 failure domains: fd-1, fd-2, fd-3, fd-4
  • Create a cluster with control plane on 3 failure domains fd-1, fd-2, fd-3
  • Remove fd-3 from the control plane configuration and observe a rollout to fd-1 & fd-2 only
  • Add fd-3 again and observe a rollout to fd-1, fd-2, and fd-3
  • Add fd-4 to control plane configuration and observe no rollout takes place
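
The rule stated in the description (roll out when a failure domain in use by control plane machines is disabled or removed) can be sketched as follows. FailureDomain and needsRollout are hypothetical, simplified names, not the controller's actual types, and the sketch does not cover the re-add case from the fourth test step, which this rule alone does not explain.

```go
package main

import "fmt"

// FailureDomain is a hypothetical, simplified stand-in for an entry in
// cluster.status.failureDomains.
type FailureDomain struct {
	// ControlPlane reports whether the domain may host control plane machines.
	ControlPlane bool
}

// needsRollout reports whether any failure domain currently in use is
// missing from the available set or no longer enabled for the control plane.
func needsRollout(available map[string]FailureDomain, inUse []string) bool {
	for _, name := range inUse {
		fd, ok := available[name]
		if !ok || !fd.ControlPlane {
			return true
		}
	}
	return false
}

func main() {
	inUse := []string{"fd-1", "fd-2", "fd-3"}
	available := map[string]FailureDomain{
		"fd-1": {ControlPlane: true},
		"fd-2": {ControlPlane: true},
		"fd-3": {ControlPlane: true},
	}
	fmt.Println("all in-use domains enabled:", needsRollout(available, inUse))

	// Removing a domain that machines still run on should trigger a rollout.
	delete(available, "fd-3")
	fmt.Println("fd-3 removed:", needsRollout(available, inUse))

	// Adding a domain that no machine uses should not.
	available["fd-3"] = FailureDomain{ControlPlane: true}
	available["fd-4"] = FailureDomain{ControlPlane: true}
	fmt.Println("fd-4 added:", needsRollout(available, inUse))
}
```

Because the check iterates over the domains in use rather than the domains available, adding an unused domain such as fd-4 never triggers a rollout on its own.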

@thunderboltsid thunderboltsid force-pushed the issue/auto-kcp-rollout branch 8 times, most recently from 7875587 to e8b2f8d Compare July 14, 2025 16:35
@thunderboltsid thunderboltsid force-pushed the issue/auto-kcp-rollout branch from e8b2f8d to 2be0669 Compare July 14, 2025 17:57
@thunderboltsid thunderboltsid changed the title [WIP] feat(controllers): add failure domain rollout controller feat(controllers): add failure domain rollout controller Jul 14, 2025
@github-actions github-actions bot added feature and removed feature labels Jul 14, 2025
@thunderboltsid thunderboltsid changed the title feat(controllers): add failure domain rollout controller feat(failuredomains): add failure domain rollout controller Jul 14, 2025
@github-actions github-actions bot added feature and removed feature labels Jul 14, 2025
@thunderboltsid thunderboltsid requested a review from yanhua121 July 14, 2025 22:32
@thunderboltsid thunderboltsid force-pushed the issue/auto-kcp-rollout branch 2 times, most recently from 3f97778 to 0a5a2e3 Compare July 15, 2025 12:12
- fix issues where a cluster or KCP is still being reconciled by the
  original controller
- fix a race where a machine marked for deletion was factored into the
  rollout decision
- remove KCP watch
- refactor reconciler logic into helpers
- add exhaustive unit tests for helpers
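
One way to avoid the race described above is to skip machines that already carry a deletion timestamp when collecting the failure domains in use. This is a sketch under assumptions: Machine, newMachine, and failureDomainsInUse are hypothetical names, and the merged helpers may be structured differently.

```go
package main

import (
	"fmt"
	"time"
)

// Machine is a hypothetical, simplified view of a control plane machine.
type Machine struct {
	Name              string
	FailureDomain     string
	DeletionTimestamp *time.Time // non-nil once the machine is marked for deletion
}

// newMachine is a hypothetical constructor used to build example machines.
func newMachine(name, failureDomain string, deleting bool) Machine {
	m := Machine{Name: name, FailureDomain: failureDomain}
	if deleting {
		now := time.Now()
		m.DeletionTimestamp = &now
	}
	return m
}

// failureDomainsInUse returns the distinct failure domains of machines that
// are not being deleted, in first-seen order.
func failureDomainsInUse(machines []Machine) []string {
	seen := make(map[string]struct{})
	var inUse []string
	for _, m := range machines {
		if m.DeletionTimestamp != nil || m.FailureDomain == "" {
			continue // a machine on its way out should not influence the decision
		}
		if _, ok := seen[m.FailureDomain]; ok {
			continue
		}
		seen[m.FailureDomain] = struct{}{}
		inUse = append(inUse, m.FailureDomain)
	}
	return inUse
}

func main() {
	machines := []Machine{
		newMachine("cp-0", "fd-1", false),
		newMachine("cp-1", "fd-2", false),
		newMachine("cp-2", "fd-3", true), // already marked for deletion
	}
	fmt.Println(failureDomainsInUse(machines))
}
```

Without the filter, a machine being replaced during an ongoing rollout would keep its old failure domain in the in-use set and could trigger another rollout.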

@dlipovetsky dlipovetsky left a comment

In a future PR, we can add unit tests to cover the error cases that are currently not covered.

@thunderboltsid thunderboltsid merged commit 0199374 into main Jul 15, 2025
22 checks passed
@thunderboltsid thunderboltsid deleted the issue/auto-kcp-rollout branch July 15, 2025 22:17
jimmidyson added a commit that referenced this pull request Jul 23, 2025
🤖 I have created a release *beep* *boop*
---


## 0.32.0 (2025-07-23)

<!-- Release notes generated using configuration in .github/release.yaml
at main -->

## What's Changed
### Exciting New Features 🎉
* feat: Validate the configured failure domain(s) exist and valid by
@yanhua121 in
#1208
* feat: Add context to preflight check messages by @dlipovetsky in
#1210
* feat: Skip FD dependent preflight checks when failureDomain configured
by @yanhua121 in
#1213
* feat(failuredomains): add failure domain rollout controller by
@thunderboltsid in
#1207
* feat: Add more context to MetalLB config apply conflict errors by
@dlipovetsky in
#1225
* feat: Add scale from zero cluster-autoscaler support by @jimmidyson in
#1227
### Fixes 🔧
* fix: Fix typo in field name; use one-line strings to prevent future
typos by @dlipovetsky in
#1206
* fix: Nuanced image Kubernetes version check errors by @dlipovetsky in
#1211
* fix(helm): add failuredomain rollout controller config to helm chart
by @thunderboltsid in
#1214
* fix(ccm): Update Nutanix CCM to v0.5.2 by @thunderboltsid in
#1220
* fix(preflight): check storage containers on all failure domains by
@thunderboltsid in
#1215
* fix(preflight): ensure MDs without overrides are also checked by
@thunderboltsid in
#1216
* fix: Use JSONPath in check result fields by @dlipovetsky in
#1221
* fix: machineDetails fields "cluster" and "subnets" should be optional
by @yanhua121 in
#1217
* fix(auto-cert-renewal): adds 0 as valid value for daysBeforeExpiry by
@atulv7 in
#1218
* fix: Namespacesync copies resources after partial copy failure by
@dlipovetsky in
#1228
### Other Changes
* build: update mindthegap version by @dkoshkin in
#1212

## New Contributors
* @atulv7 made their first contribution in
#1218

**Full Changelog**:
v0.31.1...v0.32.0

---
This PR was generated with [Release
Please](https://github.com/googleapis/release-please). See
[documentation](https://github.com/googleapis/release-please#release-please).

---------

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Jimmi Dyson <[email protected]>