| 1 | +# Decision Log 006: Azure OpenAI Soft-Delete Purge in CI/CD |
| 2 | + |
| 3 | +**Date:** 2025-10-16 |
| 4 | +**Status:** Approved |
| 5 | + |
| 6 | +## Context |
| 7 | + |
| 8 | +Azure OpenAI (Cognitive Services) resources implement a soft-delete feature for data protection. When resources are deleted via `azd down` or Terraform `destroy`, they enter a soft-deleted state and remain in the subscription's "recently deleted" list for a retention period (typically 48 hours). During this time, the resource name is reserved and cannot be reused. |
| 9 | + |
| 10 | +This creates a problem for CI/CD workflows that: |
| 11 | +1. Deploy infrastructure with deterministic resource names based on environment |
| 12 | +2. Run automated tests |
| 13 | +3. Clean up resources after testing |
| 14 | +4. Attempt to redeploy with the same resource names |
| 15 | + |
| 16 | +The second deployment fails because the soft-deleted resource from the first run still reserves the same name. |
| 17 | + |
| 18 | +## Decision |
| 19 | + |
| 20 | +We implement a post-destruction purge step in GitHub Actions workflows to immediately remove soft-deleted Azure OpenAI resources after `azd down` completes. This is implemented directly in workflow files rather than as an azd hook because: |
| 21 | + |
| 22 | +1. **azd Hook Limitations**: Azure Developer CLI (azd) does not currently support `postdown` or `predestroy` hooks |
| 23 | +2. **Workflow Control**: GitHub Actions workflows provide better visibility and control over the purge process |
| 24 | +3. **Error Handling**: Workflow steps can gracefully handle cases where resources are not in soft-delete state |
| 25 | + |
| 26 | +## Implementation |
| 27 | + |
| 28 | +### Terraform Output Addition |
| 29 | + |
| 30 | +Added `openai_resource_name` output to `infra/outputs.tf`: |
| 31 | + |
| 32 | +```hcl |
| 33 | +output "openai_resource_name" { |
| 34 | + description = "The name of the Azure OpenAI resource (for purging soft-deleted resources)" |
| 35 | + value = module.azure_open_ai.resource.name |
| 36 | +} |
| 37 | +``` |
| 38 | + |
| 39 | +This exposes the resource name to azd environment values for use in the purge command. |
| 40 | + |
| 41 | +### Workflow Step Addition |
| 42 | + |
| 43 | +Added "Purge Soft-Deleted Azure OpenAI Resources" step to both: |
| 44 | +- `.github/workflows/azure-dev.yml` (after "Destroy Infrastructure" step) |
| 45 | +- `.github/workflows/azure-dev-down.yml` (after "Azd down" step) |
| 46 | + |
| 47 | +The purge step: |
| 48 | +1. Retrieves resource metadata from `azd env get-values` output |
| 49 | +2. Executes the `az cognitiveservices account purge` command |
| 50 | +3. Continues gracefully if the resource is not found or has already been purged |
| 51 | +4. Only runs when infrastructure destruction occurs (PR builds or manual trigger with `run_azd_down`) |
| 52 | + |
| 53 | +### Command Used |
| 54 | + |
| 55 | +```bash |
| 56 | +az cognitiveservices account purge \ |
| 57 | + --location "$AZURE_REGION" \ |
| 58 | + --resource-group "$RESOURCE_GROUP" \ |
| 59 | + --name "$OPENAI_RESOURCE_NAME" |
| 60 | +``` |
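
The purge step described above (read values from `azd env get-values`, run the purge, continue on failure) could be sketched roughly as follows. This is an illustration under stated assumptions, not the script used in the workflows: the helper names `get_env_value` and `purge_openai` are hypothetical, and the environment keys `AZURE_LOCATION`, `AZURE_RESOURCE_GROUP`, and `openai_resource_name` are assumed to appear in the azd environment under exactly those names.

```shell
# Sketch only: helper names and environment keys are assumptions, not the
# repository's actual workflow script.

# Print the value of one KEY="value" line from `azd env get-values`, without quotes.
get_env_value() {
  key="$1"
  azd env get-values | sed -n "s/^${key}=//p" | tr -d '"'
}

# Purge the soft-deleted account. Always returns 0 so that a missing or
# already-purged resource does not fail the workflow.
purge_openai() {
  name="$(get_env_value openai_resource_name)"    # assumed key (Terraform output)
  location="$(get_env_value AZURE_LOCATION)"      # assumed key
  group="$(get_env_value AZURE_RESOURCE_GROUP)"   # assumed key

  if [ -z "$name" ] || [ -z "$location" ] || [ -z "$group" ]; then
    echo "Purge skipped: resource metadata not found in azd environment."
    return 0
  fi

  if az cognitiveservices account purge \
      --location "$location" \
      --resource-group "$group" \
      --name "$name"; then
    echo "Purged soft-deleted resource: $name"
  else
    echo "Purge skipped: $name not found or already purged."
  fi
  return 0
}
```

In a workflow step, the script would end with a call to `purge_openai`. Treating a failed purge as a skip keeps the step green when nothing is in the soft-deleted state, at the cost of also hiding real failures such as missing permissions, so the step log is worth checking if a later deployment still reports a name conflict.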
| 61 | + |
| 62 | +## Rationale |
| 63 | + |
| 64 | +1. **CI/CD Reliability**: Ensures subsequent deployments with the same resource names succeed without manual intervention |
| 65 | +2. **Cost Optimization**: Immediately releases resources rather than waiting for the retention period to expire |
| 66 | +3. **Clean State**: Prevents accumulation of soft-deleted resources in the subscription |
| 67 | +4. **Automation**: No manual Azure Portal interaction required to purge resources |
| 68 | +5. **Safety**: Error handling prevents workflow failure if the resource has already been purged or is not in a soft-deleted state |
| 69 | + |
| 70 | +## Alternative Approaches Considered |
| 71 | + |
| 72 | +### 1. Dynamic Resource Naming |
| 73 | +**Rejected**: Would require changing naming strategy and complicate resource tracking across deployments |
| 74 | + |
| 75 | +### 2. Manual Purge Between Deployments |
| 76 | +**Rejected**: Defeats the purpose of automated CI/CD and introduces human error |
| 77 | + |
| 78 | +### 3. Terraform Custom Provisioner |
| 79 | +**Rejected**: Terraform provisioners are a last resort, and azd handles destroy operations outside direct Terraform control |
| 80 | + |
| 81 | +### 4. Wait for Retention Period |
| 82 | +**Rejected**: 48-hour wait between deployments is unacceptable for CI/CD velocity |
| 83 | + |
| 84 | +## Consequences |
| 85 | + |
| 86 | +### Positive |
| 87 | +- Automated, reliable CI/CD deployments with consistent resource naming |
| 88 | +- No manual intervention required for resource cleanup |
| 89 | +- Clear audit trail of purge operations in workflow logs |
| 90 | + |
| 91 | +### Negative |
| 92 | +- Adds minor complexity to workflow files |
| 93 | +- Requires Azure CLI authentication in workflows (already present) |
| 94 | +- Purge operation is permanent and cannot be undone |
| 95 | + |
| 96 | +### Neutral |
| 97 | +- Purge step execution time is minimal (typically < 5 seconds) |
| 98 | +- Requires `openai_resource_name` output to be maintained in Terraform |
| 99 | + |
| 100 | +## References |
| 101 | + |
| 102 | +- [Azure Cognitive Services soft-delete documentation](https://learn.microsoft.com/azure/cognitive-services/manage-resources-deletion-recovery) |
| 103 | +- [Azure CLI cognitiveservices account purge command](https://learn.microsoft.com/cli/azure/cognitiveservices/account?view=azure-cli-latest#az-cognitiveservices-account-purge) |
| 104 | +- [Azure Developer CLI hooks documentation](https://learn.microsoft.com/azure/developer/azure-developer-cli/azd-extensibility) |
| 105 | +- GitHub Issue #309: Azure OpenAI Resources Soft-Deleted |