
Fix portal pods never becoming ready during AKS rollout #531

Merged
aurelianware merged 4 commits into main from claude/fix-portal-rollout-timeout-eTHyI
Mar 20, 2026

Conversation

@aurelianware
Owner

Root causes of the 600s rollout timeout:

1. Azure AD env vars had both `value: ""` and `valueFrom` — Kubernetes uses `value` when both are present, so TenantId/ClientId/ClientSecret were empty strings, causing auth middleware initialization to fail.

2. Redis and MongoDB were never deployed — the workflow applied portal and service manifests but never deployed the infrastructure they depend on (distributed cache, DataProtection key store).

3. The deploy-aks job only waited for the portal image build (build-push-acr), not the backend service builds (build-push-services), so service image pulls failed with ImagePullBackOff.

4. The memory request/limit (256Mi/512Mi) was too low for a .NET 8 Blazor Server app with 15+ service clients, Redis, MongoDB, and SignalR — increased to 512Mi/1Gi.
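Sketched as manifest changes, causes 1 and 4 look roughly like the following; the env var name, secret name, and key are assumptions for illustration, not the actual contents of portal-deployment.yaml:

```yaml
# Cause 1: keep ONLY valueFrom. The broken entries also carried
# `value: ""`, which took precedence and yielded empty strings.
env:
  - name: AzureAd__TenantId          # assumed variable name
    valueFrom:
      secretKeyRef:
        name: azure-ad-credentials   # assumed secret name
        key: tenant-id               # assumed key
# Cause 4: raise memory so the Blazor Server app can start and pass probes.
resources:
  requests:
    memory: "512Mi"   # was 256Mi
  limits:
    memory: "1Gi"     # was 512Mi
```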

https://claude.ai/code/session_01A95Uah18uxLJpuAR5HShNS
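On the workflow side, causes 2 and 3 reduce to roughly this sketch; the job names and manifest paths appear in the PR, but the surrounding structure is assumed:

```yaml
# Cause 3: gate deployment on BOTH image-build jobs, not just the portal's.
deploy-aks:
  needs: [build-push-acr, build-push-services]
  steps:
    # Cause 2: deploy the infrastructure the portal depends on.
    - name: Deploy Redis (distributed cache)
      run: kubectl apply -f infrastructure/k8s/redis-dataprotection.yaml
    - name: Deploy MongoDB (DataProtection key store)
      run: kubectl apply -f infrastructure/k8s/mongodb-deployment.yaml
```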


Copilot AI left a comment


Pull request overview

This PR addresses AKS rollout failures where the portal pods never reached Ready by correcting Azure AD env var injection, ensuring dependent infrastructure is deployed, and tightening the CI/CD job ordering so services are available when the portal starts.

Changes:

  • Remove conflicting value: "" entries from Azure AD env vars and increase portal resource requests/limits.
  • Update the AKS deploy workflow to wait for both portal and service image builds before deploying.
  • Add deployment steps for Redis and MongoDB during AKS rollout.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| src/portal/CloudHealthOffice.Portal/k8s/portal-deployment.yaml | Fix Azure AD env var configuration and increase resource requests/limits to improve readiness stability. |
| .github/workflows/deploy-azure-aks.yml | Ensure the deploy waits on service builds and deploys required infra (Redis/MongoDB) during AKS rollout. |

```yaml
        run: kubectl apply -f infrastructure/k8s/redis-dataprotection.yaml

      - name: Deploy MongoDB (DataProtection key store)
        run: kubectl apply -f infrastructure/k8s/mongodb-deployment.yaml
```

Copilot AI Mar 20, 2026


Applying infrastructure/k8s/mongodb-deployment.yaml will create/overwrite the mongodb-secret Secret in cloudhealthoffice with only username/password keys (see infrastructure/k8s/mongodb-deployment.yaml:69-78). Earlier in this workflow you create mongodb-secret with a connectionString key, and multiple service manifests reference secretKeyRef: name: mongodb-secret, key: connectionString (e.g., src/services/member-service/k8s/member-service-deployment.yaml:45-48). Deploying this manifest in PROD will break those services at runtime when the connectionString key disappears. Use a different Secret name for MongoDB auth, or remove the Secret from the manifest and keep mongodb-secret.connectionString intact for services (or update all consumers consistently).
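The collision described above can be pictured as two incompatible shapes fighting over one Secret name; the contents below are illustrative placeholders, not the real values:

```yaml
# Created earlier in the workflow; services read the `connectionString` key:
apiVersion: v1
kind: Secret
metadata:
  name: mongodb-secret
stringData:
  connectionString: "mongodb://..."   # placeholder value
---
# Shipped inside mongodb-deployment.yaml; per the review comment, applying
# it replaces the keys above with username/password only, breaking every
# consumer of `connectionString`:
apiVersion: v1
kind: Secret
metadata:
  name: mongodb-secret
stringData:
  username: admin
  password: securepassword123
```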

Suggested change:

```diff
- run: kubectl apply -f infrastructure/k8s/mongodb-deployment.yaml
+ run: sed 's/mongodb-secret/mongodb-auth-secret/g' infrastructure/k8s/mongodb-deployment.yaml | kubectl apply -f -
```
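As a quick sanity check of that sed rename, here is a self-contained sketch run against a stand-in manifest fragment (not the real file):

```shell
# Minimal illustration of the suggested rename: pipe the manifest through
# sed so the StatefulSet's Secret no longer collides with the CI-created
# mongodb-secret. The manifest text here is a stand-in, not the real file.
manifest='apiVersion: v1
kind: Secret
metadata:
  name: mongodb-secret'

renamed=$(printf '%s\n' "$manifest" | sed 's/mongodb-secret/mongodb-auth-secret/g')
printf '%s\n' "$renamed"
```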

Comment on lines +415 to +416:

```yaml
      - name: Deploy MongoDB (DataProtection key store)
        run: kubectl apply -f infrastructure/k8s/mongodb-deployment.yaml
```

Copilot AI Mar 20, 2026


infrastructure/k8s/mongodb-deployment.yaml embeds a mongodb-secret with a hardcoded password (securepassword123) and admin username (mongodb-deployment.yaml:69-78). Deploying this from the PROD workflow is insecure and makes credential rotation difficult. Prefer creating the MongoDB credentials Secret from GitHub/Azure secrets (or external secret operator) and keep credentials out of the repo/manifests.
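One way to follow that suggestion, consistent with the follow-up commit in this PR (the secret name `mongodb-auth` and the GitHub Secrets names come from that commit; the step shape is assumed):

```yaml
# Create MongoDB credentials from GitHub Secrets instead of committing
# them to the manifest; apply via dry-run so re-runs update in place.
- name: Create MongoDB auth secret from GitHub Secrets
  run: |
    kubectl create secret generic mongodb-auth \
      --namespace "${{ env.NAMESPACE }}" \
      --from-literal=username="${{ secrets.MONGODB_USERNAME }}" \
      --from-literal=password="${{ secrets.MONGODB_PASSWORD }}" \
      --dry-run=client -o yaml | kubectl apply -f -
```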

Comment on lines +420 to +421:

```yaml
          kubectl rollout status deployment/redis-dataprotection -n ${{ env.NAMESPACE }} --timeout=120s || true
          kubectl rollout status statefulset/mongodb -n ${{ env.NAMESPACE }} --timeout=120s || true
```

Copilot AI Mar 20, 2026


This step is intended to block until Redis/MongoDB are ready, but both kubectl rollout status ... || true lines will always succeed even if the rollout fails or times out, so the job can continue while dependencies are still unavailable. Drop the || true (or replace with explicit retry/backoff + failure) so readiness failures correctly fail the deployment.
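The masking effect is easy to demonstrate in a few lines of shell, with `false` standing in for a failed `kubectl rollout status`:

```shell
# `cmd || true` forces exit status 0 even when cmd fails, which is why
# the rollout-status step could never fail the job.
false || true
masked=$?            # 0: the failure is swallowed

unmasked=0
false || unmasked=$? # capture the real exit status without aborting the script
echo "masked=$masked unmasked=$unmasked"   # prints "masked=0 unmasked=1"
```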

Suggested change:

```diff
-kubectl rollout status deployment/redis-dataprotection -n ${{ env.NAMESPACE }} --timeout=120s || true
-kubectl rollout status statefulset/mongodb -n ${{ env.NAMESPACE }} --timeout=120s || true
+kubectl rollout status deployment/redis-dataprotection -n ${{ env.NAMESPACE }} --timeout=120s
+kubectl rollout status statefulset/mongodb -n ${{ env.NAMESPACE }} --timeout=120s
```

claude added 3 commits March 20, 2026 21:37
…s, fail on infra errors

1. Rename MongoDB StatefulSet secret from `mongodb-secret` to `mongodb-auth`
   to avoid overwriting the CI-created `mongodb-secret` (which has the
   `connectionString` key services depend on).

2. Remove hardcoded credentials (admin/securepassword123) from the
   mongodb-deployment.yaml manifest. Create `mongodb-auth` secret from
   GitHub Secrets (MONGODB_USERNAME, MONGODB_PASSWORD) in CI instead.

3. Remove `|| true` from Redis/MongoDB rollout status checks so
   infrastructure failures properly fail the deployment job.

https://claude.ai/code/session_01A95Uah18uxLJpuAR5HShNS
…n keys

The portal's MongoDB client already connects to CosmosDB via
cosmosdb-secret.connectionString, so DataProtection keys are stored
there. No need for a separate MongoDB StatefulSet or its credentials.

Keeps Redis deployment (required for distributed cache/sessions).

https://claude.ai/code/session_01A95Uah18uxLJpuAR5HShNS
- Deploy standalone MongoDB StatefulSet as fallback
- If COSMOS_DB_CONNECTION_STRING is set, use CosmosDB
- If not, auto-generate connection string pointing at standalone MongoDB
- mongodb-auth secret defaults to admin/admin if GitHub secrets not set
- No new secrets required for a working deployment
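A sketch of that fallback logic under stated assumptions — the variable names come from the commit message, while the MongoDB host and port are hypothetical stand-ins for the in-cluster service:

```shell
# Fallback sketch: prefer CosmosDB when its connection string is provided,
# otherwise point at the standalone MongoDB with default credentials.
unset COSMOS_DB_CONNECTION_STRING   # simulate the secret being absent
MONGODB_USERNAME="admin"            # default when GitHub secret is unset
MONGODB_PASSWORD="admin"

if [ -n "${COSMOS_DB_CONNECTION_STRING:-}" ]; then
  MONGO_CONN="$COSMOS_DB_CONNECTION_STRING"   # CosmosDB path
else
  # Auto-generate a connection string for the standalone MongoDB
  # (hypothetical service host "mongodb" on the default port).
  MONGO_CONN="mongodb://${MONGODB_USERNAME}:${MONGODB_PASSWORD}@mongodb:27017"
fi
echo "$MONGO_CONN"
```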

https://claude.ai/code/session_01A95Uah18uxLJpuAR5HShNS
@github-actions

Code Coverage

| Package | Line Rate | Branch Rate |
| --- | --- | --- |
| CloudHealthOffice.Portal | 13% | 3% |
| Summary | 13% (2498 / 18662) | 3% (174 / 5968) |

@aurelianware aurelianware merged commit c54c1d8 into main Mar 20, 2026
59 checks passed
@aurelianware aurelianware deleted the claude/fix-portal-rollout-timeout-eTHyI branch March 20, 2026 21:52