Closed
Description
codejail-service deploys get stuck because a pod fails to come up within the readiness probe retry window. Replacements for deleted pods also fail to come up.
Sometimes pods do come up; we're not sure under which circumstances. It seems like k8s eventually replaces the broken pods.
A/C:
- Give GoCD longer timeouts to allow an overloaded ArgoCD time to say "yes, the sync has finished". Monitor the timeouts to ensure the new value is long enough to absorb the failures but no longer than necessary.
- Possibly reduce priority level on deployment alerts (since they're less reliable)
- File a ticket for 1) reducing ArgoCD staleness and 2) then tightening up our deploy standards again (undoing the above -- search for code references to this ticket)
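
To help pick and monitor the new timeout, here is a minimal sketch of a poll-with-timeout loop that records how long each wait actually took. It is not GoCD's or ArgoCD's real API: `wait_for_sync`, `WaitResult`, and the `check_synced` callback are hypothetical names, and the callback stands in for whatever the pipeline uses to ask ArgoCD whether the sync has finished.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class WaitResult:
    succeeded: bool
    elapsed_seconds: float
    attempts: int


def wait_for_sync(
    check_synced: Callable[[], bool],  # hypothetical: returns True once the sync is reported finished
    timeout_seconds: float,
    poll_interval_seconds: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> WaitResult:
    """Poll check_synced until it returns True or the timeout would be exceeded."""
    start = clock()
    attempts = 0
    while True:
        attempts += 1
        if check_synced():
            return WaitResult(True, clock() - start, attempts)
        elapsed = clock() - start
        # Give up if another full poll interval would overrun the timeout.
        if elapsed + poll_interval_seconds > timeout_seconds:
            return WaitResult(False, elapsed, attempts)
        sleep(poll_interval_seconds)
```

Logging `elapsed_seconds` for every deploy, successes included, would show how close real syncs come to the limit, which is the data needed to decide whether the timeout can be tightened again later.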
Notes:
- This problem wasn't occurring in the first month that codejail-service was deployed. But as of June 23, if we kill a pod, the replacement passes its startup checks within 6 seconds yet apparently fails to respond to readiness and liveness checks after 1-2 minutes.
- We tried increasing retry counts but it didn't help: https://github.com/edx/edx-internal/pull/12996
- We've also been experiencing some Datadog metrics and APM cutouts that have impeded diagnosing this. Unclear if related. (Infrastructure issue?)
- EKS upgrades seem to have fixed some issues with pod readiness/liveness probes.
- Our latest information (as of early July 2025) is that ArgoCD seems to just be overloaded and is providing stale information at times (to both GoCD and in the UI).
Status
Done