Skip to content

Commit 27e72e4

Browse files
committed
feat(chart): add automatic Skyhook resource cleanup on helm uninstall
Add pre-delete hook to automatically clean up Skyhook and DeploymentPolicy resources during helm uninstall, eliminating the manual step previously required to avoid reinstall issues. Enabled by default with configurable timeout (120s). Can be disabled with cleanup.enabled=false.
1 parent 26e8920 commit 27e72e4

File tree

12 files changed

+214
-38
lines changed

12 files changed

+214
-38
lines changed

README.md

Lines changed: 39 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -141,19 +141,53 @@ kubectl wait --for=jsonpath='{.status.status}'=complete skyhook/skyhook-sample -
141141
kubectl describe skyhook skyhook-sample
142142
```
143143

144-
### ⚠️ Important: Before Uninstalling **(really only important if you want to reinstall)**
144+
### Uninstalling
145145

146-
**Always delete all Skyhook Custom Resources before running `helm uninstall`:**
146+
**Automatic Cleanup (Default):** By default, the Helm chart includes a pre-delete hook that automatically cleans up all Skyhook and DeploymentPolicy resources before uninstalling:
147147

148148
```bash
149-
# Delete all Skyhook resources first (REQUIRED before uninstall)
150-
kubectl delete skyhooks --all --all-namespaces
149+
# Uninstall the chart (cleanup happens automatically)
150+
helm uninstall skyhook --namespace skyhook
151+
```
152+
153+
The pre-delete hook will:
154+
- Delete all Skyhook resources
155+
- Delete all DeploymentPolicy resources
156+
- Complete quickly if no resources exist
157+
- Wait for finalizers to be processed if resources exist
158+
- Proceed with uninstall even if cleanup times out (job deadline: 2 minutes)
159+
160+
**Configuration Options:**
161+
162+
To disable automatic cleanup and manage resources manually:
163+
164+
```bash
165+
helm install skyhook ./chart --namespace skyhook --set cleanup.enabled=false
166+
```
167+
168+
To adjust the job timeout:
169+
170+
```bash
171+
helm install skyhook ./chart --namespace skyhook \
172+
--set cleanup.jobTimeoutSeconds=180
173+
```
174+
175+
**Manual Cleanup (if needed):**
176+
177+
If you disabled automatic cleanup or need to clean up resources manually:
178+
179+
```bash
180+
# Delete all Skyhook resources first
181+
kubectl delete skyhooks --all
182+
183+
# Delete all DeploymentPolicy resources
184+
kubectl delete deploymentpolicies --all
151185

152186
# Then uninstall the chart
153187
helm uninstall skyhook --namespace skyhook
154188
```
155189

156-
**Why?** If you `helm uninstall` while Skyhook CRs still exist, finalizers will leave the CRD in a broken state, causing reinstalls to fail.
190+
**Why cleanup matters:** If you uninstall while Skyhook CRs with finalizers still exist, it can leave resources in a broken state that may cause reinstall issues.
157191

158192
## Monitoring and Troubleshooting
159193

chart/README.md

Lines changed: 51 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -37,6 +37,8 @@ Settings | Description | Default |
3737
| imagePullSecret | the secret used to pull the operator controller image, agent image, and package images. | "" |
3838
| estimatedPackageCount | estimated number of packages to be installed on the cluster, this is used to calculate the resources for the operator controller. | 1 |
3939
| estimatedNodeCount | estimated number of nodes in the cluster, this is used to calculate the resources for the operator controller | 1 |
40+
| cleanup.enabled | Automatically delete all Skyhook and DeploymentPolicy resources during helm uninstall. Recommended to prevent orphaned CRs. | true |
41+
| cleanup.jobTimeoutSeconds | Hard deadline for the entire cleanup job during uninstall. The job will be killed if it exceeds this time. | 120 |
4042

4143
### NOTES
4244
- **estimatedPackageCount** and **estimatedNodeCount** are used to size the resource requirements. Default setting should be good for nodes > 1000 and packages 1-2 or nodes > 500 and packages >= 4. If your approaching this size deployment it would make sense to set these. You can also override them by explicitly with `controllerManager.manager.resources` the values file has an example.
@@ -70,3 +72,52 @@ This Helm chart follows independent versioning from the operator and agent compo
7072
### Chart Version vs App Version
7173
- **Chart version** (`version` in Chart.yaml): Tracks changes to chart templates, values, and configuration (NOTE: agent version in set in the values.)
7274
- **App version** (`appVersion` in Chart.yaml): Recommended stable operator version for this chart release
75+
76+
## Uninstalling
77+
78+
### Automatic Cleanup (Default Behavior)
79+
80+
By default, the Helm chart includes a pre-delete hook that automatically cleans up all Skyhook and DeploymentPolicy custom resources before uninstalling. This prevents orphaned resources that could cause issues during reinstallation.
81+
82+
```bash
83+
# Uninstall with automatic cleanup (default)
84+
helm uninstall skyhook --namespace skyhook
85+
```
86+
87+
The pre-delete hook will:
88+
- Delete all Skyhook resources cluster-wide
89+
- Delete all DeploymentPolicy resources cluster-wide
90+
- Wait for finalizers to be processed
91+
- Proceed with uninstall even if cleanup times out (job deadline: 2 minutes, configurable via `cleanup.jobTimeoutSeconds`)
92+
93+
### Disabling Automatic Cleanup
94+
95+
If you need to preserve Skyhook resources during uninstall (e.g., for backup/migration scenarios), disable the cleanup feature:
96+
97+
```yaml
98+
# values.yaml
99+
cleanup:
100+
enabled: false
101+
```
102+
103+
When disabled, you must manually delete resources before uninstalling to avoid issues:
104+
105+
```bash
106+
# Manual cleanup when automatic cleanup is disabled
107+
kubectl delete skyhooks --all
108+
kubectl delete deploymentpolicies --all
109+
helm uninstall skyhook --namespace skyhook
110+
```
111+
112+
### Configuring Timeout Values
113+
114+
For large clusters or when resources have complex finalizers, you may need to adjust the job timeout:
115+
116+
```yaml
117+
# values.yaml
118+
cleanup:
119+
enabled: true
120+
jobTimeoutSeconds: 180 # 3 minutes total job deadline
121+
```
122+
123+
**Note:** The job will be killed if it exceeds `jobTimeoutSeconds`. The default of 120 seconds (2 minutes) should be sufficient for most clusters.
Lines changed: 79 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,79 @@
1+
{{- if .Values.cleanup.enabled }}
2+
apiVersion: batch/v1
3+
kind: Job
4+
metadata:
5+
name: "{{ include "chart.fullname" . }}-skyhook-cleanup"
6+
namespace: "{{ .Release.Namespace }}"
7+
annotations:
8+
"helm.sh/hook": pre-delete
9+
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
10+
"helm.sh/hook-weight": "-5"
11+
spec:
12+
activeDeadlineSeconds: {{ .Values.cleanup.jobTimeoutSeconds }}
13+
template:
14+
spec:
15+
restartPolicy: Never
16+
automountServiceAccountToken: false
17+
serviceAccountName: {{ include "chart.fullname" . }}-controller-manager
18+
{{- with .Values.controllerManager.tolerations }}
19+
tolerations:
20+
{{- toYaml . | nindent 6 }}
21+
{{- end }}
22+
securityContext:
23+
runAsNonRoot: true
24+
runAsUser: 10001
25+
seccompProfile:
26+
type: RuntimeDefault
27+
volumes:
28+
- name: kube-api-access
29+
projected:
30+
defaultMode: 420
31+
sources:
32+
- serviceAccountToken:
33+
path: token
34+
expirationSeconds: 3607
35+
- configMap:
36+
items:
37+
- key: ca.crt
38+
path: ca.crt
39+
name: kube-root-ca.crt
40+
- downwardAPI:
41+
items:
42+
- fieldRef:
43+
apiVersion: v1
44+
fieldPath: metadata.namespace
45+
path: namespace
46+
containers:
47+
- name: cleanup
48+
image: {{ .Values.webhook.removalImage | default "bitnami/kubectl" }}{{- if .Values.webhook.removalDigest }}{{- if .Values.webhook.removalTag }}:{{ .Values.webhook.removalTag | default "latest" }}@{{ .Values.webhook.removalDigest }}{{- else }}@{{ .Values.webhook.removalDigest }}{{- end }}{{- else }}:{{ .Values.webhook.removalTag | default "latest" }}{{- end }}
49+
imagePullPolicy: Always
50+
securityContext:
51+
allowPrivilegeEscalation: false
52+
readOnlyRootFilesystem: true
53+
capabilities:
54+
drop:
55+
- ALL
56+
seccompProfile:
57+
type: RuntimeDefault
58+
volumeMounts:
59+
- mountPath: /var/run/secrets/kubernetes.io/serviceaccount
60+
name: kube-api-access
61+
readOnly: true
62+
resources:
63+
limits:
64+
cpu: {{ .Values.limitRange.default.cpu }}
65+
memory: {{ .Values.limitRange.default.memory }}
66+
requests:
67+
cpu: {{ .Values.limitRange.defaultRequest.cpu }}
68+
memory: {{ .Values.limitRange.defaultRequest.memory }}
69+
command:
70+
- /bin/sh
71+
- -c
72+
- |
73+
# Delete all Skyhook resources (cluster-scoped)
74+
# Using || true to ensure job succeeds even if delete times out
75+
kubectl delete skyhooks --all --ignore-not-found || true
76+
# Delete all DeploymentPolicy resources (cluster-scoped)
77+
kubectl delete deploymentpolicies --all --ignore-not-found || true
78+
echo "Cleanup completed"
79+
{{- end }}

chart/templates/cleanup-webhook-job.yaml

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@ metadata:
88
"helm.sh/hook": pre-delete
99
"helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
1010
spec:
11+
activeDeadlineSeconds: {{ .Values.cleanup.jobTimeoutSeconds }}
1112
template:
1213
spec:
1314
restartPolicy: Never

chart/values.yaml

Lines changed: 9 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -161,6 +161,15 @@ webhook:
161161
removalTag: latest
162162
removalDigest: "sha256:1bc359beb3ae3982591349df11db50b0917b0596e8bed8ab9cf0c8a84a3502d1"
163163

164+
## cleanup: Configuration for pre-delete cleanup jobs
165+
cleanup:
166+
## enabled: When true, automatically deletes all Skyhook and DeploymentPolicy
167+
## resources before helm uninstall. Recommended to prevent orphaned CRs.
168+
enabled: true
169+
## jobTimeoutSeconds: Hard deadline for the entire cleanup job.
170+
## The job will be killed if it exceeds this time.
171+
jobTimeoutSeconds: 120
172+
164173
metrics:
165174
addServiceAccountBinding: false
166175
serviceAccountName: prometheus

k8s-tests/chainsaw/helm/helm-chart-test/README.md

Lines changed: 11 additions & 8 deletions
Original file line numberDiff line numberDiff line change
@@ -2,24 +2,27 @@
22

33
## Purpose
44

5-
Validates that the Helm chart deploys correctly with custom configurations, including custom deployment names and tolerations.
5+
Validates that the Helm chart deploys correctly with custom configurations, including custom deployment names, tolerations, and automatic cleanup on uninstall.
66

77
## Test Scenario
88

99
1. Reset state from previous runs
10-
2. Install the Helm chart with custom values:
11-
- Different deployment name than `skyhook-operator`
12-
- Custom tolerations
13-
3. Verify the operator is scheduled correctly
14-
4. Apply a skyhook and verify it completes
15-
5. Assert metrics and state are correct
10+
2. Install the Helm chart with a "bad" node taint and verify pods don't schedule
11+
3. Change to a "good" node taint that matches configured tolerations
12+
4. Reinstall the Helm chart (tests uninstall + reinstall flow)
13+
5. Verify the operator is scheduled correctly with tolerations
14+
6. Apply a DeploymentPolicy and Skyhook
15+
7. Verify the Skyhook completes successfully
16+
8. Uninstall the Helm chart (verifies pre-delete hook cleans up Skyhook/DeploymentPolicy resources automatically)
1617

1718
## Key Features Tested
1819

1920
- Custom deployment name support
2021
- Toleration configuration via Helm values
21-
- Operator deployment and scheduling
22+
- Operator deployment and scheduling with node taints
2223
- End-to-end skyhook processing with Helm-deployed operator
24+
- Automatic cleanup of Skyhook and DeploymentPolicy resources during helm uninstall
25+
- Pre-delete hook tolerates node taints (can schedule on same nodes as operator)
2326

2427
## Files
2528

k8s-tests/chainsaw/helm/helm-chart-test/assert-no-schedule.yaml

Lines changed: 1 addition & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
1+
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
22
# SPDX-License-Identifier: Apache-2.0
33
#
44
#
@@ -119,8 +119,6 @@ spec:
119119
terminationMessagePolicy: File
120120
dnsPolicy: ClusterFirst
121121
enableServiceLinks: true
122-
imagePullSecrets:
123-
- name: node-init-secret
124122
preemptionPolicy: PreemptLowerPriority
125123
priority: 0
126124
restartPolicy: Always

k8s-tests/chainsaw/helm/helm-chart-test/assert-scheduled.yaml

Lines changed: 1 addition & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -1,4 +1,4 @@
1-
# SPDX-FileCopyrightText: Copyright (c) 2025 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
1+
# SPDX-FileCopyrightText: Copyright (c) 2026 NVIDIA CORPORATION & AFFILIATES. All rights reserved.
22
# SPDX-License-Identifier: Apache-2.0
33
#
44
#
@@ -119,8 +119,6 @@ spec:
119119
terminationMessagePolicy: File
120120
dnsPolicy: ClusterFirst
121121
enableServiceLinks: true
122-
imagePullSecrets:
123-
- name: node-init-secret
124122
preemptionPolicy: PreemptLowerPriority
125123
priority: 0
126124
restartPolicy: Always

k8s-tests/chainsaw/helm/helm-chart-test/chainsaw-test.yaml

Lines changed: 8 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -62,19 +62,22 @@ spec:
6262
file: skyhook.yaml
6363
- assert:
6464
file: assert-skyhook-complete.yaml
65+
- script:
66+
content: |
67+
## Remove helm chart (pre-delete hook should clean up Skyhooks/DeploymentPolicies)
68+
## If cleanup fails or times out, helm uninstall will fail
69+
../uninstall-helm-chart.sh foobar
6570
finally:
6671
- script:
6772
content: |
68-
## Cleanup deployment policy test resources
73+
## Cleanup any remaining resources
6974
../skyhook-cli reset helm-test-skyhook --confirm 2>/dev/null || true
70-
kubectl delete skyhook helm-test-skyhook --ignore-not-found || true
71-
kubectl delete deploymentpolicy helm-test-policy -n skyhook --ignore-not-found || true
7275
- script:
7376
content: |
7477
## Remove taint from nodes
7578
../untaint-nodes.sh dedicated=bogus:NoSchedule
7679
../untaint-nodes.sh dedicated=system-cpu:NoSchedule
7780
- script:
7881
content: |
79-
## Remove helm chart
80-
../uninstall-helm-chart.sh foobar
82+
## Remove helm chart if still present
83+
../uninstall-helm-chart.sh foobar || true

operator/README.md

Lines changed: 2 additions & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -217,14 +217,14 @@ helm install skyhook-operator ./chart --namespace skyhook
217217
218218
to remove operator from a cluster:
219219
```
220-
## remove operator
220+
## remove operator (automatic cleanup enabled by default)
221221
helm uninstall skyhook-operator --namespace skyhook
222222

223223
## delete CRD
224224
make uninstall
225225
```
226226
227-
**NOTE**: because there is a finalizer on it you need to need to delete the SCRs before uninstalling the CRD or operator. If you remove the operator first, delete the CRD or SCR can hang trying to finalize. Easiest way to fix is re install the operator. You can clean up by hand, but could be some work. cleaning up: configmaps, uncording nodes, removing taints, and deleting running pods
227+
**NOTE**: The Helm chart includes automatic cleanup of Skyhook and DeploymentPolicy resources during uninstall (enabled by default). If you've disabled automatic cleanup (`cleanup.enabled: false`), you must manually delete SCRs before uninstalling to avoid finalizer issues. If you remove the operator before deleting SCRs with finalizers, they can hang. To fix: reinstall the operator, delete resources, then uninstall properly. Manual cleanup may require removing configmaps, uncordoning nodes, removing taints, and deleting running pods.
228228
229229
230230
## Helm Chart and General Config infra

0 commit comments

Comments
 (0)