Commit b7150ad

feat(cleanup): add cleanup instructions and scripts for uninstalling cluster components and destroying infrastructure
1 parent 5474d35 commit b7150ad

3 files changed: 223 additions, 1 deletion

deploy/001-iac/README.md

Lines changed: 89 additions & 0 deletions
@@ -113,6 +113,95 @@ terraform init && terraform apply

See [automation/README.md](automation/README.md) for runbook configuration.

## Destroy Infrastructure

Remove Azure resources deployed by Terraform. Clean up cluster components first.

### Prerequisites

- Cluster components uninstalled (see [002-setup/README.md](../002-setup/README.md#cleanup))
- Terraform state accessible
- Azure CLI authenticated

### Option A: Terraform Destroy

Preserves Terraform state and allows redeployment:

```bash
cd deploy/001-iac

# Preview resources to be destroyed
terraform plan -destroy -var-file=terraform.tfvars

# Destroy infrastructure
terraform destroy -var-file=terraform.tfvars
```

If the VPN was deployed separately, destroy it before the main infrastructure (see Cleanup Order):

```bash
cd vpn
terraform destroy -var-file=terraform.tfvars
```

### Option B: Delete Resource Group

Fastest cleanup method. Deletes the resource group and everything in it:

```bash
# Get resource group name (the output is an object, so emit JSON for jq)
terraform output -json resource_group | jq -r '.name'

# Or check Azure portal / terraform.tfvars for naming pattern
# Default: <resource_prefix>-<environment>-rg

# Delete entire resource group
az group delete --name <resource-group-name> --yes

# For async deletion (returns immediately)
az group delete --name <resource-group-name> --yes --no-wait
```

Resource group deletion removes all contained resources regardless of how they were created. It does not update Terraform state, which still references the deleted resources until it is refreshed or discarded.

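
When the Terraform output is unavailable, the default naming pattern can be applied by hand. The following is a minimal sketch, assuming the `<resource_prefix>-<environment>-rg` pattern from the comment above still holds for your deployment; `default_rg_name` and `delete_rg_if_exists` are hypothetical helper names, not part of this repository.

```bash
# Sketch only: helper names are hypothetical; the naming pattern is taken from
# the "Default: <resource_prefix>-<environment>-rg" comment in this README.

# Build the default resource group name from the two Terraform variables.
default_rg_name() {
  local prefix="$1" environment="$2"
  printf '%s-%s-rg\n' "$prefix" "$environment"
}

# Start an async delete only if Azure reports that the group exists.
delete_rg_if_exists() {
  local rg="$1"
  if [ "$(az group exists --name "$rg")" = "true" ]; then
    az group delete --name "$rg" --yes --no-wait
  else
    echo "Resource group '$rg' not found; nothing to delete." >&2
    return 1
  fi
}
```

For example, `delete_rg_if_exists "$(default_rg_name myprefix dev)"` would target `myprefix-dev-rg`, where `myprefix` and `dev` stand in for your own values.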
### Cleanup Order

Follow this order to avoid dependency failures:

| Order | Component | Command |
|:-----:|-----------|--------|
| 1 | OSMO Backend | `../002-setup/cleanup/uninstall-osmo-backend.sh` |
| 2 | OSMO Control Plane | `../002-setup/cleanup/uninstall-osmo-control-plane.sh` |
| 3 | AzureML Extension | `../002-setup/cleanup/uninstall-azureml-extension.sh` |
| 4 | GPU Infrastructure | `../002-setup/cleanup/uninstall-robotics-charts.sh` |
| 5 | VPN (if deployed) | `cd vpn && terraform destroy -var-file=terraform.tfvars` |
| 6 | Main Infrastructure | `terraform destroy -var-file=terraform.tfvars` |

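
The order in the table can be enforced with a small wrapper that stops at the first failing step. This is a sketch under the assumption that each command is safe to run from `deploy/001-iac`; `run_in_order` is a hypothetical helper, not a script shipped in this repository.

```bash
# Sketch only: run_in_order is a hypothetical helper.
# Runs each command string in sequence and stops at the first failure, so a
# later step never runs after an earlier one has failed.
run_in_order() {
  local step
  for step in "$@"; do
    echo "==> $step"
    if ! bash -c "$step"; then
      echo "Step failed: $step" >&2
      return 1
    fi
  done
}
```

Each row of the table becomes one argument, for example `run_in_order "../002-setup/cleanup/uninstall-osmo-backend.sh" "../002-setup/cleanup/uninstall-osmo-control-plane.sh"`, continuing through step 6. Each step runs in its own `bash -c`, so the `cd vpn` in step 5 does not change the directory for step 6.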
### Troubleshooting Destroy

**Resources stuck deleting**: Some resources (Private Endpoints, AKS) may take 10-15 minutes to delete. List what is still present in the resource group:

```bash
az resource list --resource-group <rg> --query "[].{name:name, type:type}" -o table
```

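
After an async delete, it can be useful to block until the group is actually gone. The sketch below polls `az group exists`; `wait_for_rg_deleted` is a hypothetical helper, and the 30-minute timeout and 30-second interval are illustrative defaults rather than values from this repository.

```bash
# Sketch only: wait_for_rg_deleted is a hypothetical helper.
# Polls `az group exists` until the group is gone or the timeout elapses.
# Usage: wait_for_rg_deleted <rg> [timeout_seconds] [interval_seconds]
wait_for_rg_deleted() {
  local rg="$1" timeout="${2:-1800}" interval="${3:-30}" waited=0
  while [ "$(az group exists --name "$rg")" = "true" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out after ${waited}s waiting for '$rg' to be deleted." >&2
      return 1
    fi
    sleep "$interval"
    waited=$((waited + interval))
  done
  echo "Resource group '$rg' no longer exists."
}
```

A non-zero exit means the group still existed when the timeout was reached, which is a reasonable point to check for locks or failed deletions.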
**Terraform state mismatch**: If resources were manually deleted:

```bash
# Refresh state to match Azure
terraform apply -refresh-only -var-file=terraform.tfvars

# Then destroy
terraform destroy -var-file=terraform.tfvars
```

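
If state still lists a resource that no longer exists in Azure, the entry can be dropped by hand with `terraform state rm`. The following is a minimal sketch; `forget_if_tracked` is a hypothetical helper, and any resource address you pass to it comes from your own `terraform state list` output, not from this README.

```bash
# Sketch only: forget_if_tracked is a hypothetical helper.
# Removes one resource address from Terraform state, but only when the state
# actually tracks that exact address.
forget_if_tracked() {
  local address="$1"
  if terraform state list | grep -Fxq -- "$address"; then
    terraform state rm "$address"
  else
    echo "Not tracked in state: $address" >&2
    return 1
  fi
}
```

Removing an entry only makes Terraform forget the resource; it does not delete anything in Azure, so use it only for resources that are confirmed gone.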
**Locks preventing deletion**: Remove resource locks if present:

```bash
az lock list --resource-group <rg> -o table
az lock delete --name <lock-name> --resource-group <rg>
```

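
When a group carries several locks, the two commands above can be combined into a loop. This sketch assumes that deleting every listed lock is acceptable in your environment and that `az lock delete` accepts the lock IDs returned by `az lock list`; `remove_rg_locks` is a hypothetical helper name.

```bash
# Sketch only: remove_rg_locks is a hypothetical helper.
# Deletes every management lock that `az lock list` reports for the group,
# addressing each lock by its full resource ID.
remove_rg_locks() {
  local rg="$1" lock_id
  az lock list --resource-group "$rg" --query "[].id" -o tsv |
    while IFS= read -r lock_id; do
      [ -n "$lock_id" ] || continue
      echo "Deleting lock: $lock_id"
      # Keep az from consuming the loop's stdin.
      az lock delete --ids "$lock_id" < /dev/null
    done
}
```

Review the output of `az lock list` first; locks are often placed deliberately, and removing them requires sufficient permissions on the scope.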
## Directory Structure

deploy/002-setup/README.md

Lines changed: 73 additions & 1 deletion
@@ -207,4 +207,76 @@ kubectl describe sa osmo-service -n osmo-control-plane
|--------|---------|
| `optional/deploy-volcano-scheduler.sh` | Volcano (alternative to KAI) |
| `optional/validate-gpu-metrics.sh` | GPU metrics verification |
-| `cleanup/uninstall-azureml-extension.sh` | Remove AzureML extension |

## Cleanup

Uninstall scripts in `cleanup/` remove cluster components in reverse deployment order.

### Cleanup Scripts

| Script | Removes |
|--------|---------|
| `cleanup/uninstall-osmo-backend.sh` | Backend operator, workflow namespaces |
| `cleanup/uninstall-osmo-control-plane.sh` | OSMO service, router, web-ui |
| `cleanup/uninstall-azureml-extension.sh` | ML extension, compute target, FICs |
| `cleanup/uninstall-robotics-charts.sh` | GPU Operator, KAI Scheduler |

### Uninstall Order

Run scripts in this order to avoid dependency issues:

```bash
cd cleanup

# 1. OSMO backend (workflows namespace, operator)
./uninstall-osmo-backend.sh

# 2. OSMO control plane (service, router, UI)
./uninstall-osmo-control-plane.sh

# 3. AzureML extension (extension, compute target)
./uninstall-azureml-extension.sh

# 4. GPU infrastructure (operator, scheduler)
./uninstall-robotics-charts.sh
```

### Data Preservation

By default, the uninstall scripts preserve data. Pass these flags to remove it as well:

| Script | Removal Flag | Description |
|--------|--------------|-------------|
| `uninstall-osmo-backend.sh` | `--delete-container` | Deletes blob container with workflow artifacts |
| `uninstall-osmo-control-plane.sh` | `--delete-mek` | Removes encryption key ConfigMap |
| `uninstall-osmo-control-plane.sh` | `--purge-postgres` | Drops OSMO tables from PostgreSQL |
| `uninstall-osmo-control-plane.sh` | `--purge-redis` | Flushes OSMO keys from Redis |
| `uninstall-robotics-charts.sh` | `--delete-namespaces` | Removes gpu-operator, kai-scheduler namespaces |
| `uninstall-robotics-charts.sh` | `--delete-crds` | Removes GPU Operator CRDs |

### Full Component Cleanup

Remove everything including data:

```bash
cd cleanup
./uninstall-osmo-backend.sh --delete-container
./uninstall-osmo-control-plane.sh --purge-postgres --purge-redis --delete-mek
./uninstall-azureml-extension.sh --force
./uninstall-robotics-charts.sh --delete-namespaces --delete-crds
```

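
The per-script flag combinations can be kept in one place so that default and full removal share the same entry point. This is a sketch only: `full_removal_flags`, `run_uninstall`, and the `--purge` switch are hypothetical names introduced for illustration, and the flags themselves are copied from the commands in this README.

```bash
# Sketch only: both helpers and the --purge switch are hypothetical.
# Maps each uninstall script to the data-removal flags listed in this README.
full_removal_flags() {
  case "$1" in
    uninstall-osmo-backend.sh)       echo "--delete-container" ;;
    uninstall-osmo-control-plane.sh) echo "--purge-postgres --purge-redis --delete-mek" ;;
    uninstall-azureml-extension.sh)  echo "--force" ;;
    uninstall-robotics-charts.sh)    echo "--delete-namespaces --delete-crds" ;;
    *) echo "Unknown script: $1" >&2; return 1 ;;
  esac
}

# Run one script from the cleanup directory, with or without removal flags.
# Usage: run_uninstall <script> [--purge]
run_uninstall() {
  local script="$1" mode="${2:-}" flags=""
  if [ "$mode" = "--purge" ]; then
    flags="$(full_removal_flags "$script")" || return 1
  fi
  # Word splitting of $flags is intentional: it holds space-separated options.
  # shellcheck disable=SC2086
  "./$script" $flags
}
```

With this in place, `run_uninstall uninstall-osmo-backend.sh` preserves data, while `run_uninstall uninstall-osmo-backend.sh --purge` passes `--delete-container`.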
### Selective Cleanup

Remove only specific components:

```bash
# OSMO only (preserve AzureML and GPU infrastructure)
./uninstall-osmo-backend.sh
./uninstall-osmo-control-plane.sh

# AzureML only (preserve OSMO)
./uninstall-azureml-extension.sh
```

After cluster cleanup, proceed to [001-iac](../001-iac/README.md#destroy-infrastructure) to destroy Azure infrastructure.
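
Before moving on to infrastructure teardown, it can help to check which namespaces are still present. The sketch below assumes `kubectl` points at the target cluster; `remaining_namespaces` is a hypothetical helper, and the namespace names in the usage note are the ones mentioned in this README.

```bash
# Sketch only: remaining_namespaces is a hypothetical helper.
# Prints each of the given namespaces that still exists in the cluster and
# returns non-zero when at least one remains.
remaining_namespaces() {
  local ns found=0
  for ns in "$@"; do
    if kubectl get namespace "$ns" >/dev/null 2>&1; then
      echo "$ns"
      found=1
    fi
  done
  [ "$found" -eq 0 ]
}
```

For example, `remaining_namespaces osmo-control-plane gpu-operator kai-scheduler`. A listed namespace is not necessarily an error: `uninstall-robotics-charts.sh` keeps its namespaces unless `--delete-namespaces` is passed.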

deploy/README.md

Lines changed: 61 additions & 0 deletions
@@ -43,3 +43,64 @@ For OSMO deployment, see [002-setup/README.md](002-setup/README.md) for authenti
- **Optional**: VPN Gateway for private endpoint access

See the [root README](../README.md) for architecture details.

## Cleanup

Remove deployed components in reverse order. Cluster components must be removed before infrastructure.

| Step | Folder | Description | Time |
|:----:|--------|-------------|------|
| 1 | [002-setup/cleanup](002-setup/cleanup/) | Uninstall Helm charts, extensions, namespaces | 5-10 min |
| 2 | [001-iac](001-iac/) | Terraform destroy or resource group deletion | 10-15 min |

### Partial Cleanup (Cluster Components Only)

Remove OSMO, AzureML, and GPU components while preserving Azure infrastructure:

```bash
cd 002-setup/cleanup

# Uninstall in reverse deployment order
./uninstall-osmo-backend.sh
./uninstall-osmo-control-plane.sh
./uninstall-azureml-extension.sh
./uninstall-robotics-charts.sh
```

See [002-setup/README.md](002-setup/README.md#cleanup) for script options and data preservation.

### Full Teardown

Remove all Azure resources. Choose based on how infrastructure was created.

**Option A: Terraform Destroy** (recommended if using Terraform state)

```bash
# Remove cluster components first
cd 002-setup/cleanup
./uninstall-osmo-backend.sh --delete-container
./uninstall-osmo-control-plane.sh --purge-postgres --purge-redis --delete-mek
./uninstall-azureml-extension.sh
./uninstall-robotics-charts.sh --delete-namespaces --delete-crds

cd ../../001-iac

# Optional: destroy VPN first if deployed (subshell keeps the working directory)
(cd vpn && terraform destroy -var-file=terraform.tfvars)

# Destroy infrastructure
terraform destroy -var-file=terraform.tfvars
```

**Option B: Delete Resource Group** (fastest, deletes everything)

```bash
# Get resource group name from terraform or Azure portal
az group delete --name <resource-group-name> --yes --no-wait
```

This deletes every resource in the group. With `--no-wait`, the command returns immediately and deletion continues in the background. Use when:
- Terraform created the resource group
- You want to remove everything without preserving state
- Terraform state is corrupted or unavailable

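
Because `--yes` suppresses the Azure CLI confirmation prompt, a typed confirmation in front of the delete can guard against targeting the wrong group. This is a minimal sketch; `confirm_rg_delete` is a hypothetical helper, not part of this repository.

```bash
# Sketch only: confirm_rg_delete is a hypothetical guard.
# Succeeds only when the operator retypes the exact resource group name.
confirm_rg_delete() {
  local rg="$1" reply
  printf 'Type the resource group name to confirm deletion of "%s": ' "$rg" >&2
  IFS= read -r reply || return 1
  [ "$reply" = "$rg" ]
}
```

It would be chained in front of the delete, for example `confirm_rg_delete "$rg" && az group delete --name "$rg" --yes --no-wait`, where `$rg` holds the resource group name.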
See [001-iac/README.md](001-iac/README.md#destroy-infrastructure) for detailed options.
