(workaround) codejail-service and all k8s pods are not seen by GoCD after coming up #1073

@timmc-edx

Description

codejail-service deploys get stuck because a pod fails to come up within the readiness probe's retry window. Replacements for deleted pods also fail to come up.

Sometimes pods do come up; we're not sure under which circumstances. It seems like k8s eventually replaces the broken pods.

A/C:

  • Give GoCD longer timeouts to allow an overloaded ArgoCD time to say "yes, the sync has finished". Monitor the timeouts to ensure the new value is long enough to ride out these failures without making genuinely failed deploys wait longer than necessary.
  • Possibly reduce priority level on deployment alerts (since they're less reliable)
  • File a ticket for 1) reducing ArgoCD staleness and 2) then tightening up our deploy standards again (undoing the above -- search for code references to this ticket)
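
The first item above amounts to polling with a longer deadline and recording how long each wait took, so the timeout can be tuned later. A minimal sketch in Python, assuming a hypothetical `check_synced` callable that stands in for however GoCD asks ArgoCD whether the sync has finished (the function names and parameters here are illustrative, not taken from our pipeline code):

```python
import time
from typing import Callable


def wait_for_sync(
    check_synced: Callable[[], bool],  # hypothetical: returns True once ArgoCD reports the sync as done
    timeout_s: float,
    poll_interval_s: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Poll until check_synced() returns True and return the elapsed seconds.

    Raises TimeoutError if timeout_s passes without a positive answer.
    """
    start = clock()
    while True:
        if check_synced():
            # The caller can log this value to see how much of the timeout is really used.
            return clock() - start
        elapsed = clock() - start
        if elapsed >= timeout_s:
            raise TimeoutError(f"sync not confirmed after {elapsed:.0f}s")
        # Never sleep past the deadline.
        sleep(min(poll_interval_s, timeout_s - elapsed))
```

Logging the returned elapsed time for each deploy would give the data needed to judge whether the chosen timeout is too short or needlessly long.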

Notes:

  • This problem wasn't occurring in the first month that codejail-service was deployed. But as of June 23, if we kill a pod, the replacement passes its startup checks within 6 seconds yet apparently fails to respond to readiness and liveness checks after 1-2 minutes.
  • We tried increasing retry counts but it didn't help: https://github.com/edx/edx-internal/pull/12996
  • We've also been experiencing some Datadog metrics and APM cutouts that have impeded diagnosing this. Unclear if related. (Infrastructure issue?)
  • EKS upgrades seem to have fixed some issues with pod readiness/liveness probes.
  • Our latest information (as of early July 2025) is that ArgoCD seems to just be overloaded and is providing stale information at times (to both GoCD and in the UI).
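
For context on the retry-count change mentioned above: a Kubernetes probe is marked failed after `failureThreshold` consecutive failures spaced `periodSeconds` apart, so the window a pod has to start responding is roughly their product plus any `initialDelaySeconds`. A rough sketch of that arithmetic follows; the numbers are illustrative assumptions, not the values from our manifests, and the estimate ignores per-probe `timeoutSeconds`:

```python
def probe_failure_window_s(
    period_s: int,
    failure_threshold: int,
    initial_delay_s: int = 0,
) -> int:
    """Approximate seconds before a probe is declared failed.

    Rough estimate only: the kubelet's exact timing also depends on
    timeoutSeconds and on when the first probe fires.
    """
    return initial_delay_s + period_s * failure_threshold


# Hypothetical values: doubling the retry count doubles the window.
before = probe_failure_window_s(period_s=10, failure_threshold=3)
after = probe_failure_window_s(period_s=10, failure_threshold=6)
print(before, after)
```

A wider window only helps if the pod eventually starts answering within it, which is consistent with the retry increase not helping here.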

Status: Done