@@ -121,7 +121,7 @@ NAME CLUSTER REPLICAS READY UPDATE
 machinedeployment.cluster.x-k8s.io/demo-sm0 demo 1 1 1 0 Running 11d v1.24.2
 
 NAME PHASE AGE VERSION
-cluster.cluster.x-k8s.io/demo Provisioned 11d
+cluster.cluster.x-k8s.io/demo Provisioned 11d
 
 NAME CLUSTER NODENAME PROVIDERID PHASE AGE VERSION
 machine.cluster.x-k8s.io/demo-control-plane-7p8zv demo demo-control-plane-7d76d0be-z6dm8 openstack:///f687f926-3cee-4550-91e5-32c2885708b0 Running 11d v1.24.2
@@ -133,7 +133,7 @@ NAME CLUSTER
 kubeadmcontrolplane.controlplane.cluster.x-k8s.io/demo-control-plane demo true true 3 3 3 0 11d v1.24.2
 
 NAME CLUSTER READY NETWORK SUBNET BASTION IP
-openstackcluster.infrastructure.cluster.x-k8s.io/demo demo true 4b6b2722-ee5b-40ec-8e52-a6610e14cc51 73e22c49-10b8-4763-af2f-4c0cce007c82
+openstackcluster.infrastructure.cluster.x-k8s.io/demo demo true 4b6b2722-ee5b-40ec-8e52-a6610e14cc51 73e22c49-10b8-4763-af2f-4c0cce007c82
 
 NAME CLUSTER INSTANCESTATE READY PROVIDERID MACHINE
 openstackmachine.infrastructure.cluster.x-k8s.io/demo-control-plane-7d76d0be-d2mcr demo ACTIVE true openstack:///ea91f79a-8abb-4cb9-a2ea-8f772568e93c demo-control-plane-9skvh
@@ -167,6 +167,52 @@ kubectl -n capo-system logs deploy/capo-controller-manager
 kubectl -n capi-addon-system logs deploy/cluster-api-addon-provider
 ```
 
+### Recovering clusters stuck in a failed state after network disruption
+
+If the underlying cloud infrastructure has undergone maintenance or suffered
+temporary networking problems, clusters can get stuck in a `Failed` state even
+after the network has recovered and the cluster is otherwise fully functional.
+This can happen when `failureMessage` and `failureReason` are set, which
+Cluster API mistakenly interprets as an unrecoverable error and therefore
+changes the cluster's status to `Failed`. There are ongoing discussions in the
+Kubernetes community about resolving this mistaken interpretation of transient
+networking errors, but for now the failed status must be cleared manually.
+
+If you think this is the case, you can check for affected clusters with the
+following command:
+
+``` command title="On the K3s node, targeting the HA cluster if deployed"
+$ kubectl get cluster.cluster.x-k8s.io --all-namespaces -o json | jq -r '.items[] | "\(.metadata.name): \(.status.failureMessage) \(.status.failureReason)"'
+```
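+
+For example, the output might look like the following, where `demo` is healthy
+and `demo1` is affected (the failure message and reason shown here are
+illustrative and will vary):
+
+```
+demo: null null
+demo1: <message describing the network disruption> UpdateError
+```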
+
+Clusters where one or both of the `failure{Message,Reason}` fields are not
+`null` are affected.
+You can reset the status for an individual cluster by removing the failure
+message and reason fields with
+`kubectl edit --subresource=status clusters.cluster.x-k8s.io/<cluster-name>`.
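+In the editor this opens, the fields to delete sit at the top level of the
+cluster's status and look roughly like this (field values illustrative):
+
+``` yaml
+status:
+  # ...other status fields...
+  failureMessage: <message describing the network disruption>
+  failureReason: UpdateError
+```
+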
+Alternatively, you can apply a patch to all workload clusters at once using the
+following command:
+
+``` command title="On the K3s node, targeting the HA cluster if deployed"
+# Extract the list of failed clusters and generate the required `kubectl patch` command for each one
+$ kubectl get cluster.cluster.x-k8s.io --all-namespaces -o json \
+    | jq -r '.items[] | select(.status.failureMessage or .status.failureReason) | "kubectl patch cluster.cluster.x-k8s.io \(.metadata.name) -n \(.metadata.namespace) --type=merge --subresource=status -p '\''{\"status\": {\"failureMessage\": null, \"failureReason\": null}}'\''"'
+kubectl patch cluster.cluster.x-k8s.io demo1 -n az-demo --type=merge --subresource=status -p '{"status": {"failureMessage": null, "failureReason": null}}'
+kubectl patch cluster.cluster.x-k8s.io demo2 -n az-demo --type=merge --subresource=status -p '{"status": {"failureMessage": null, "failureReason": null}}'
+kubectl patch cluster.cluster.x-k8s.io demo3 -n az-demo --type=merge --subresource=status -p '{"status": {"failureMessage": null, "failureReason": null}}'
+kubectl patch cluster.cluster.x-k8s.io demo4 -n az-demo --type=merge --subresource=status -p '{"status": {"failureMessage": null, "failureReason": null}}'
+
+# The same command piped into `sh` so that the generated `kubectl patch` commands are executed
+$ kubectl get cluster.cluster.x-k8s.io --all-namespaces -o json \
+    | jq -r '.items[] | select(.status.failureMessage or .status.failureReason) | "kubectl patch cluster.cluster.x-k8s.io \(.metadata.name) -n \(.metadata.namespace) --type=merge --subresource=status -p '\''{\"status\": {\"failureMessage\": null, \"failureReason\": null}}'\''"' \
+    | sh
+cluster.cluster.x-k8s.io/demo1 patched
+cluster.cluster.x-k8s.io/demo2 patched
+cluster.cluster.x-k8s.io/demo3 patched
+cluster.cluster.x-k8s.io/demo4 patched
+```
+
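+Once the failure fields have been cleared, the affected clusters should leave
+the `Failed` state and report the `Provisioned` phase again. This can be
+verified with the following command (output illustrative):
+
+``` command title="On the K3s node, targeting the HA cluster if deployed"
+$ kubectl get cluster.cluster.x-k8s.io --all-namespaces
+NAMESPACE   NAME    PHASE         AGE   VERSION
+az-demo     demo1   Provisioned   11d
+```
+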
 ## Accessing tenant clusters
 
 The kubeconfigs for all tenant clusters are stored as secrets. First, you need