Fix portal pods never becoming ready during AKS rollout#531
aurelianware merged 4 commits into main
Conversation
Root causes of the 600s rollout timeout:

1. Azure AD env vars had both `value: ""` and `valueFrom` — Kubernetes uses `value` when both are present, so TenantId/ClientId/ClientSecret were empty strings, causing auth middleware initialization to fail.
2. Redis and MongoDB were never deployed — the workflow applied portal and service manifests but never deployed the infrastructure they depend on (distributed cache, DataProtection key store).
3. The deploy-aks job only waited for the portal image build (build-push-acr), not the backend service builds (build-push-services), so service image pulls failed with ImagePullBackOff.
4. The memory request/limit (256Mi/512Mi) was too low for a .NET 8 Blazor Server app with 15+ service clients, Redis, MongoDB, and SignalR — increased to 512Mi/1Gi.

https://claude.ai/code/session_01A95Uah18uxLJpuAR5HShNS
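The first root cause can be fixed by dropping the empty `value:` and keeping only `valueFrom`. A minimal sketch of one corrected entry (the Secret name `azure-ad-secret` and its key names are illustrative assumptions, not taken from the PR):

```yaml
# Corrected Azure AD env var: only valueFrom, no conflicting empty value
env:
  - name: AzureAd__TenantId
    valueFrom:
      secretKeyRef:
        name: azure-ad-secret   # assumed Secret name
        key: tenantId           # assumed key name
```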
Pull request overview
This PR addresses AKS rollout failures where the portal pods never reached Ready by correcting Azure AD env var injection, ensuring dependent infrastructure is deployed, and tightening the CI/CD job ordering so services are available when the portal starts.
Changes:
- Remove conflicting `value: ""` entries from Azure AD env vars and increase portal resource requests/limits.
- Update the AKS deploy workflow to wait for both portal and service image builds before deploying.
- Add deployment steps for Redis and MongoDB during AKS rollout.
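The job-ordering change can be sketched as making the deploy job depend on both build jobs. The job names (`deploy-aks`, `build-push-acr`, `build-push-services`) come from the PR description; the surrounding workflow structure is assumed:

```yaml
# deploy-aks must wait for both image builds, not just the portal's
deploy-aks:
  needs: [build-push-acr, build-push-services]
  runs-on: ubuntu-latest
  steps:
    # ... deploy steps ...
```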
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/portal/CloudHealthOffice.Portal/k8s/portal-deployment.yaml | Fix Azure AD env var configuration and increase resource requests/limits to improve readiness stability. |
| .github/workflows/deploy-azure-aks.yml | Ensure the deploy waits on service builds and deploys required infra (Redis/MongoDB) during AKS rollout. |
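The resource increase would look roughly like this in the portal deployment (memory values from the PR; CPU values are not stated there and are omitted here):

```yaml
resources:
  requests:
    memory: "512Mi"   # was 256Mi
  limits:
    memory: "1Gi"     # was 512Mi
```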
```yaml
        run: kubectl apply -f infrastructure/k8s/redis-dataprotection.yaml

      - name: Deploy MongoDB (DataProtection key store)
        run: kubectl apply -f infrastructure/k8s/mongodb-deployment.yaml
```
Applying infrastructure/k8s/mongodb-deployment.yaml will create/overwrite the mongodb-secret Secret in cloudhealthoffice with only username/password keys (see infrastructure/k8s/mongodb-deployment.yaml:69-78). Earlier in this workflow you create mongodb-secret with a connectionString key, and multiple service manifests reference secretKeyRef: name: mongodb-secret, key: connectionString (e.g., src/services/member-service/k8s/member-service-deployment.yaml:45-48). Deploying this manifest in PROD will break those services at runtime when the connectionString key disappears. Use a different Secret name for MongoDB auth, or remove the Secret from the manifest and keep mongodb-secret.connectionString intact for services (or update all consumers consistently).
```diff
-        run: kubectl apply -f infrastructure/k8s/mongodb-deployment.yaml
+        run: sed 's/mongodb-secret/mongodb-auth-secret/g' infrastructure/k8s/mongodb-deployment.yaml | kubectl apply -f -
```
```yaml
      - name: Deploy MongoDB (DataProtection key store)
        run: kubectl apply -f infrastructure/k8s/mongodb-deployment.yaml
```
infrastructure/k8s/mongodb-deployment.yaml embeds a mongodb-secret with a hardcoded password (securepassword123) and admin username (mongodb-deployment.yaml:69-78). Deploying this from the PROD workflow is insecure and makes credential rotation difficult. Prefer creating the MongoDB credentials Secret from GitHub/Azure secrets (or external secret operator) and keep credentials out of the repo/manifests.
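One way to follow this suggestion, sketched against the secret and variable names mentioned later in the thread (`mongodb-auth`, `MONGODB_USERNAME`, `MONGODB_PASSWORD`); the step name is illustrative:

```yaml
      - name: Create MongoDB credentials Secret from GitHub Secrets
        run: |
          kubectl create secret generic mongodb-auth \
            --from-literal=username="${{ secrets.MONGODB_USERNAME }}" \
            --from-literal=password="${{ secrets.MONGODB_PASSWORD }}" \
            -n ${{ env.NAMESPACE }} \
            --dry-run=client -o yaml | kubectl apply -f -
```

The `--dry-run=client -o yaml | kubectl apply -f -` pattern keeps the step idempotent across reruns, unlike a bare `kubectl create secret`, which fails if the Secret already exists.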
```yaml
kubectl rollout status deployment/redis-dataprotection -n ${{ env.NAMESPACE }} --timeout=120s || true
kubectl rollout status statefulset/mongodb -n ${{ env.NAMESPACE }} --timeout=120s || true
```
This step is intended to block until Redis/MongoDB are ready, but both kubectl rollout status ... || true lines will always succeed even if the rollout fails or times out, so the job can continue while dependencies are still unavailable. Drop the || true (or replace with explicit retry/backoff + failure) so readiness failures correctly fail the deployment.
```diff
-kubectl rollout status deployment/redis-dataprotection -n ${{ env.NAMESPACE }} --timeout=120s || true
-kubectl rollout status statefulset/mongodb -n ${{ env.NAMESPACE }} --timeout=120s || true
+kubectl rollout status deployment/redis-dataprotection -n ${{ env.NAMESPACE }} --timeout=120s
+kubectl rollout status statefulset/mongodb -n ${{ env.NAMESPACE }} --timeout=120s
```
…s, fail on infra errors

1. Rename the MongoDB StatefulSet secret from `mongodb-secret` to `mongodb-auth` to avoid overwriting the CI-created `mongodb-secret` (which has the `connectionString` key services depend on).
2. Remove hardcoded credentials (admin/securepassword123) from the mongodb-deployment.yaml manifest; create the `mongodb-auth` secret from GitHub Secrets (MONGODB_USERNAME, MONGODB_PASSWORD) in CI instead.
3. Remove `|| true` from the Redis/MongoDB rollout status checks so infrastructure failures properly fail the deployment job.

https://claude.ai/code/session_01A95Uah18uxLJpuAR5HShNS
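A sketch of how the renamed Secret might be consumed by the MongoDB StatefulSet. The `MONGO_INITDB_ROOT_*` variables are the official mongo image's standard credentials env vars; the surrounding container spec is assumed, not taken from the PR:

```yaml
# MongoDB container picks up credentials from the renamed Secret
env:
  - name: MONGO_INITDB_ROOT_USERNAME
    valueFrom:
      secretKeyRef:
        name: mongodb-auth   # renamed from mongodb-secret
        key: username
  - name: MONGO_INITDB_ROOT_PASSWORD
    valueFrom:
      secretKeyRef:
        name: mongodb-auth
        key: password
```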
…n keys

The portal's MongoDB client already connects to CosmosDB via `cosmosdb-secret.connectionString`, so DataProtection keys are stored there. No need for a separate MongoDB StatefulSet or its credentials. Keeps the Redis deployment (required for distributed cache/sessions).

https://claude.ai/code/session_01A95Uah18uxLJpuAR5HShNS
- Deploy standalone MongoDB StatefulSet as fallback
- If COSMOS_DB_CONNECTION_STRING is set, use CosmosDB
- If not, auto-generate connection string pointing at standalone MongoDB
- mongodb-auth secret defaults to admin/admin if GitHub secrets not set
- No new secrets required for a working deployment

https://claude.ai/code/session_01A95Uah18uxLJpuAR5HShNS
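The fallback logic above could be sketched as a workflow step like the following. The step name, in-cluster service DNS name, and port are assumptions; the admin/admin default comes from the commit message:

```yaml
      - name: Resolve MongoDB connection string
        run: |
          if [ -n "${{ secrets.COSMOS_DB_CONNECTION_STRING }}" ]; then
            # Prefer CosmosDB when a connection string is provided
            CONN="${{ secrets.COSMOS_DB_CONNECTION_STRING }}"
          else
            # Fall back to the standalone MongoDB StatefulSet (default creds admin/admin)
            CONN="mongodb://admin:admin@mongodb.${{ env.NAMESPACE }}.svc.cluster.local:27017"
          fi
          kubectl create secret generic mongodb-secret \
            --from-literal=connectionString="$CONN" \
            -n ${{ env.NAMESPACE }} \
            --dry-run=client -o yaml | kubectl apply -f -
```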