Description
Reopen of #5863 (I don't seem to have permission to reopen an issue).
Reply to @romange:
@xuekat what is the behavior that you need during the snapshot loading? The datastore won't accept any reads during that time, so how does a health check solve the issue of downtime?
The right approach (and this is what we do in our cloud service) is to use replication for version updates, to have zero-downtime updates.
We're hoping for the health check to pass only once the Dragonfly pod has finished loading the dataset into memory. As the logic currently stands, a Dragonfly pod can pass the health check while it is still loading the dataset and cannot yet serve requests, so the operator kills the old pod before the new pod is able to handle requests.
For now we are working around this by running a very high number of replicas so that at least one pod stays alive during the rollout, but this is suboptimal because it increases our infrastructure costs.
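To make the requested behavior concrete, here is a minimal sketch of the kind of check we have in mind, written in Python purely for illustration. It assumes that Dragonfly, like Redis, answers commands with a `-LOADING ...` error while the snapshot is still being loaded and with `+PONG` once it can serve traffic; I have not verified that this holds for every command or version. The function names, host, and port are hypothetical and not part of Dragonfly or the operator.

```python
import socket


def reply_means_ready(reply: bytes) -> bool:
    """Return True only if the raw RESP reply is a plain PONG.

    Anything else, including an error such as b"-LOADING ..." (assumed to be
    what the server sends while loading a snapshot), counts as not ready.
    """
    return reply.strip() == b"+PONG"


def probe(host: str = "127.0.0.1", port: int = 6379, timeout: float = 1.0) -> bool:
    """Send one PING and report readiness; connection problems mean not ready."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # PING encoded as a RESP array with a single bulk string.
            sock.sendall(b"*1\r\n$4\r\nPING\r\n")
            reply = sock.recv(256)
    except OSError:
        # Refused connection, timeout, reset, etc.
        return False
    return reply_means_ready(reply)
```

A readiness check built on something like `probe()` would exit non-zero while it returns `False`, so the rollout would not move on to the old pod until the new one actually answers requests.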