Fix portal pods never becoming ready during AKS rollout#531
aurelianware merged 4 commits into main
Conversation
Root causes of the 600s rollout timeout:

1. Azure AD env vars had both `value: ""` and `valueFrom` — Kubernetes uses `value` when both are present, so TenantId/ClientId/ClientSecret were empty strings, causing auth middleware initialization to fail.
2. Redis and MongoDB were never deployed — the workflow applied portal and service manifests but never deployed the infrastructure they depend on (distributed cache, DataProtection key store).
3. The deploy-aks job only waited for the portal image build (build-push-acr), not the backend service builds (build-push-services), so service image pulls failed with ImagePullBackOff.
4. The memory request/limit (256Mi/512Mi) was too low for a .NET 8 Blazor Server app with 15+ service clients, Redis, MongoDB, and SignalR — increased to 512Mi/1Gi.

https://claude.ai/code/session_01A95Uah18uxLJpuAR5HShNS
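The first root cause can be fixed by dropping the empty `value:` and keeping only `valueFrom`. A minimal sketch of one corrected entry (the Secret name `azure-ad-secret` and its key names are illustrative assumptions, not taken from the PR):

```yaml
# Corrected Azure AD env var: only valueFrom, no conflicting empty value
env:
  - name: AzureAd__TenantId
    valueFrom:
      secretKeyRef:
        name: azure-ad-secret   # assumed Secret name
        key: tenantId           # assumed key name
```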
Pull request overview
This PR addresses AKS rollout failures where the portal pods never reached Ready by correcting Azure AD env var injection, ensuring dependent infrastructure is deployed, and tightening the CI/CD job ordering so services are available when the portal starts.
Changes:
- Remove conflicting `value: ""` entries from Azure AD env vars and increase portal resource requests/limits.
- Update the AKS deploy workflow to wait for both portal and service image builds before deploying.
- Add deployment steps for Redis and MongoDB during AKS rollout.
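The job-ordering change can be sketched as making the deploy job depend on both build jobs. The job names (`deploy-aks`, `build-push-acr`, `build-push-services`) come from the PR description; the surrounding workflow structure is assumed:

```yaml
# deploy-aks must wait for both image builds, not just the portal's
deploy-aks:
  needs: [build-push-acr, build-push-services]
  runs-on: ubuntu-latest
  steps:
    # ... deploy steps ...
```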
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/portal/CloudHealthOffice.Portal/k8s/portal-deployment.yaml | Fix Azure AD env var configuration and increase resource requests/limits to improve readiness stability. |
| .github/workflows/deploy-azure-aks.yml | Ensure the deploy waits on service builds and deploys required infra (Redis/MongoDB) during AKS rollout. |
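The resource increase would look roughly like this in the portal deployment (memory values from the PR; CPU values are not stated there and are omitted here):

```yaml
resources:
  requests:
    memory: "512Mi"   # was 256Mi
  limits:
    memory: "1Gi"     # was 512Mi
```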
```yaml
        run: kubectl apply -f infrastructure/k8s/redis-dataprotection.yaml

      - name: Deploy MongoDB (DataProtection key store)
        run: kubectl apply -f infrastructure/k8s/mongodb-deployment.yaml
```
Applying infrastructure/k8s/mongodb-deployment.yaml will create/overwrite the mongodb-secret Secret in cloudhealthoffice with only username/password keys (see infrastructure/k8s/mongodb-deployment.yaml:69-78). Earlier in this workflow you create mongodb-secret with a connectionString key, and multiple service manifests reference secretKeyRef: name: mongodb-secret, key: connectionString (e.g., src/services/member-service/k8s/member-service-deployment.yaml:45-48). Deploying this manifest in PROD will break those services at runtime when the connectionString key disappears. Use a different Secret name for MongoDB auth, or remove the Secret from the manifest and keep mongodb-secret.connectionString intact for services (or update all consumers consistently).
```diff
-        run: kubectl apply -f infrastructure/k8s/mongodb-deployment.yaml
+        run: sed 's/mongodb-secret/mongodb-auth-secret/g' infrastructure/k8s/mongodb-deployment.yaml | kubectl apply -f -
```
```yaml
      - name: Deploy MongoDB (DataProtection key store)
        run: kubectl apply -f infrastructure/k8s/mongodb-deployment.yaml
```
infrastructure/k8s/mongodb-deployment.yaml embeds a mongodb-secret with a hardcoded password (securepassword123) and admin username (mongodb-deployment.yaml:69-78). Deploying this from the PROD workflow is insecure and makes credential rotation difficult. Prefer creating the MongoDB credentials Secret from GitHub/Azure secrets (or external secret operator) and keep credentials out of the repo/manifests.
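One way to follow this suggestion, sketched against the secret and variable names mentioned later in the thread (`mongodb-auth`, `MONGODB_USERNAME`, `MONGODB_PASSWORD`); the step name is illustrative:

```yaml
      - name: Create MongoDB credentials Secret from GitHub Secrets
        run: |
          kubectl create secret generic mongodb-auth \
            --from-literal=username="${{ secrets.MONGODB_USERNAME }}" \
            --from-literal=password="${{ secrets.MONGODB_PASSWORD }}" \
            -n ${{ env.NAMESPACE }} \
            --dry-run=client -o yaml | kubectl apply -f -
```

The `--dry-run=client -o yaml | kubectl apply -f -` pattern keeps the step idempotent across reruns, unlike a bare `kubectl create secret`, which fails if the Secret already exists.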
```yaml
kubectl rollout status deployment/redis-dataprotection -n ${{ env.NAMESPACE }} --timeout=120s || true
kubectl rollout status statefulset/mongodb -n ${{ env.NAMESPACE }} --timeout=120s || true
```
This step is intended to block until Redis/MongoDB are ready, but both kubectl rollout status ... || true lines will always succeed even if the rollout fails or times out, so the job can continue while dependencies are still unavailable. Drop the || true (or replace with explicit retry/backoff + failure) so readiness failures correctly fail the deployment.
```diff
-kubectl rollout status deployment/redis-dataprotection -n ${{ env.NAMESPACE }} --timeout=120s || true
-kubectl rollout status statefulset/mongodb -n ${{ env.NAMESPACE }} --timeout=120s || true
+kubectl rollout status deployment/redis-dataprotection -n ${{ env.NAMESPACE }} --timeout=120s
+kubectl rollout status statefulset/mongodb -n ${{ env.NAMESPACE }} --timeout=120s
```
…s, fail on infra errors

1. Rename the MongoDB StatefulSet secret from `mongodb-secret` to `mongodb-auth` to avoid overwriting the CI-created `mongodb-secret` (which has the `connectionString` key services depend on).
2. Remove hardcoded credentials (admin/securepassword123) from the mongodb-deployment.yaml manifest; create the `mongodb-auth` secret from GitHub Secrets (MONGODB_USERNAME, MONGODB_PASSWORD) in CI instead.
3. Remove `|| true` from the Redis/MongoDB rollout status checks so infrastructure failures properly fail the deployment job.

https://claude.ai/code/session_01A95Uah18uxLJpuAR5HShNS
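A sketch of how the renamed Secret might be consumed by the MongoDB StatefulSet. The `MONGO_INITDB_ROOT_*` variables are the official mongo image's standard credentials env vars; the surrounding container spec is assumed, not taken from the PR:

```yaml
# MongoDB container picks up credentials from the renamed Secret
env:
  - name: MONGO_INITDB_ROOT_USERNAME
    valueFrom:
      secretKeyRef:
        name: mongodb-auth   # renamed from mongodb-secret
        key: username
  - name: MONGO_INITDB_ROOT_PASSWORD
    valueFrom:
      secretKeyRef:
        name: mongodb-auth
        key: password
```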
…n keys

The portal's MongoDB client already connects to CosmosDB via `cosmosdb-secret.connectionString`, so DataProtection keys are stored there. No need for a separate MongoDB StatefulSet or its credentials. Keeps the Redis deployment (required for distributed cache/sessions).

https://claude.ai/code/session_01A95Uah18uxLJpuAR5HShNS
- Deploy standalone MongoDB StatefulSet as fallback
- If COSMOS_DB_CONNECTION_STRING is set, use CosmosDB
- If not, auto-generate connection string pointing at standalone MongoDB
- mongodb-auth secret defaults to admin/admin if GitHub secrets not set
- No new secrets required for a working deployment

https://claude.ai/code/session_01A95Uah18uxLJpuAR5HShNS
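The fallback logic above could be sketched as a workflow step like the following. The step name, in-cluster service DNS name, and port are assumptions; the admin/admin default comes from the commit message:

```yaml
      - name: Resolve MongoDB connection string
        run: |
          if [ -n "${{ secrets.COSMOS_DB_CONNECTION_STRING }}" ]; then
            # Prefer CosmosDB when a connection string is provided
            CONN="${{ secrets.COSMOS_DB_CONNECTION_STRING }}"
          else
            # Fall back to the standalone MongoDB StatefulSet (default creds admin/admin)
            CONN="mongodb://admin:admin@mongodb.${{ env.NAMESPACE }}.svc.cluster.local:27017"
          fi
          kubectl create secret generic mongodb-secret \
            --from-literal=connectionString="$CONN" \
            -n ${{ env.NAMESPACE }} \
            --dry-run=client -o yaml | kubectl apply -f -
```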