/kind bug
What steps did you take and what happened:
- Create a kind cluster as the management cluster for this bug repro. Run `clusterctl init --infrastructure v1.2.1` to initialize the providers. Make sure that `~/.cluster-api/clusterctl.yaml` contains the relevant values needed for cluster initialization before running this command. (A command-level sketch of these steps is included after the log excerpts below.)
- Use clusterctl to generate a cluster YAML for the CAPX provider type. For the sake of this bug report, let the CAPI cluster referred to in the label `cluster.x-k8s.io/cluster-name` be called `capx-cluster`. Keep the worker MachineDeployment (MD) and the KubeadmControlPlane (KCP) pointing to the same NutanixMachineTemplate, say `capx-cluster-mt-0`, and keep both the worker node count and the control-plane node count at 1.
- Change the image name of the NutanixMachineTemplate, under `.spec.template.spec.image.name`, to a random string so that the image does not exist in Prism Central (PC).
- Deploy this cluster YAML on the kind management cluster. Note the status of `capx-cluster` after a few minutes: it will say the cluster is in the `Provisioned` state. This is incorrect! It must be in either the `Provisioning` state or a `Failed` state.
- Next, delete this cluster object using `kubectl delete cl capx-cluster`. The command will hang because finalizers are set on the `capx-cluster` object. Now open another terminal with the same cluster context as the kind management cluster used earlier and check the logs of the CAPI controller manager and the CAPX controller manager.
- The CAPI controller manager will report:
I0511 23:54:36.485905 1 machine_controller.go:318] "Deleting Kubernetes Node associated with Machine is not allowed" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/capx-cluster-kcp-kj4cp" namespace="default" name="capx-cluster-kcp-kj4cp" reconcileID=861994a7-ea7e-421d-8ed5-cbb0b5f95fb1 KubeadmControlPlane="default/capx-cluster-kcp" Cluster="default/capx-cluster" Node="" cause="cluster is being deleted"
E0511 23:54:36.510267 1 controller.go:329] "Reconciler error" err="machines.cluster.x-k8s.io \"capx-cluster-kcp-kj4cp\" not found" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="default/capx-cluster-kcp-kj4cp" namespace="default" name="capx-cluster-kcp-kj4cp" reconcileID=861994a7-ea7e-421d-8ed5-cbb0b5f95fb1
I0511 23:54:36.548737 1 cluster_controller.go:329] "Cluster still has descendants - need to requeue" controller="cluster" controllerGroup="cluster.x-k8s.io" controllerKind="Cluster" Cluster="default/capx-cluster" namespace="default" name="capx-cluster" reconcileID=5bf051b7-537e-422c-9750-a4d148a73867 infrastructureRef="capx-cluster"
- The CAPX controller manager will report:
I0512 00:00:06.558316 1 nutanixcluster_controller.go:122] NutanixCluster[namespace: default, name: capx-cluster] Reconciling the NutanixCluster.
I0512 00:00:06.558407 1 nutanixcluster_controller.go:157] NutanixCluster[namespace: default, name: capx-cluster] Fetched the owner Cluster: capx-cluster
I0512 00:00:06.558616 1 nutanixcluster_controller.go:333] Credential ref is kind Secret for cluster capx-cluster
E0512 00:00:06.558636 1 nutanixcluster_controller.go:342] error occurred while fetching cluster capx-cluster secret for credential ref: Secret "capx-cluster" not found
E0512 00:00:06.558650 1 nutanixcluster_controller.go:178] NutanixCluster[namespace: default, name: capx-cluster] error occurred while reconciling credential ref for cluster capx-cluster: error occurred while fetching cluster capx-cluster secret for credential ref: Secret "capx-cluster" not found
I0512 00:00:06.559019 1 nutanixcluster_controller.go:172] NutanixCluster[namespace: default, name: capx-cluster] Patched NutanixCluster. Status: {Ready:true FailureDomains:map[] Conditions:[{Type:ClusterCategoryCreated Status:False Severity:Info LastTransitionTime:2023-05-11 23:54:38 +0000 UTC Reason:Deleting Message:} {Type:CredentialRefSecretOwnerSet Status:False Severity:Error LastTransitionTime:2023-05-11 23:54:38 +0000 UTC Reason:CredentialRefSecretOwnerSetFailed Message:error occurred while fetching cluster capx-cluster secret for credential ref: Secret "capx-cluster" not found} {Type:PrismClientInit Status:True Severity: LastTransitionTime:2023-05-11 23:47:53 +0000 UTC Reason: Message:}] FailureReason:<nil> FailureMessage:<nil>}
1.6838496065590758e+09 ERROR Reconciler error {"controller": "nutanixcluster", "controllerGroup": "infrastructure.cluster.x-k8s.io", "controllerKind": "NutanixCluster", "NutanixCluster": {"name":"capx-cluster","namespace":"default"}, "namespace": "default", "name": "capx-cluster", "reconcileID": "93f1b33d-9269-4d15-b94a-64f2e08bcc72", "error": "error occurred while fetching cluster capx-cluster secret for credential ref: Secret \"capx-cluster\" not found"}
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:326
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:273
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:234
Essentially, the reconciler is looking for the Secret object, but the object does not exist because it was already deleted.
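For reference, the repro steps above roughly correspond to the following commands. This is a minimal sketch: the kind cluster name, the generated manifest file name, the `--kubernetes-version` value, the `nutanix:` provider prefix on `clusterctl init`, and the CAPX controller namespace/deployment names are assumptions for illustration, not values taken from this report.

```sh
# Management cluster and provider initialization (provider version per the report).
kind create cluster --name capx-mgmt                 # cluster name is an assumption
clusterctl init --infrastructure nutanix:v1.2.1      # ~/.cluster-api/clusterctl.yaml must already hold the CAPX variables

# Generate the workload cluster manifest: 1 control-plane node, 1 worker node.
clusterctl generate cluster capx-cluster \
  --kubernetes-version v1.25.3 \
  --control-plane-machine-count 1 \
  --worker-machine-count 1 > capx-cluster.yaml       # file name is an assumption

# Edit capx-cluster.yaml so that the NutanixMachineTemplate (capx-cluster-mt-0)
# has .spec.template.spec.image.name set to a name that does not exist in PC,
# then deploy it to the management cluster.
kubectl apply -f capx-cluster.yaml

# After a few minutes the cluster incorrectly reports Provisioned.
kubectl get cluster capx-cluster

# The delete hangs on finalizers; inspect the controller logs from another terminal.
kubectl delete cl capx-cluster
kubectl logs -n capi-system deployment/capi-controller-manager
kubectl logs -n capx-system deployment/capx-controller-manager   # namespace/name assumed
```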
What did you expect to happen:
- The cluster should not change to `Provisioned` status. It must be in either the `Provisioning` state or a `Failure` state.
- The cluster secret must not be deleted first. If it has already been deleted, the reconciler must skip looking it up in the delete logic. (See the verification sketch below.)
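As a quick check of the ordering problem described in the second point, the following can be run while `kubectl delete cl capx-cluster` is still hanging. This is a sketch assuming the credential ref points at a Secret named `capx-cluster` in the `default` namespace, as the CAPX log output above suggests.

```sh
# The credential Secret is already gone while the NutanixCluster is still being deleted.
kubectl get secret capx-cluster -n default            # returns "NotFound"

# The NutanixCluster still carries a deletionTimestamp and its finalizer, so the
# reconciler keeps retrying the Secret lookup on every reconcile and keeps failing.
kubectl get nutanixcluster capx-cluster -n default \
  -o jsonpath='{.metadata.deletionTimestamp}{"\n"}{.metadata.finalizers}{"\n"}'
```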
Anything else you would like to add:
None
Environment:
- Cluster-api-provider-nutanix version: v1.2.1
- Kubernetes version (use `kubectl version`): v1.25.3
- OS (e.g. from `/etc/os-release`): "CentOS Linux 7 (Core)"