Replies: 5 comments
hi @levzem - thanks for the issue!
running many prefect servers without a distributed message broker for events / message passing has never been a supported pattern in prefect. unless you've implemented your own message broker, the default experience of running an in-memory message broker on each replica would result in event ordering problems that'd break automations and cause overall consistency problems.
can you elaborate on this? it'd be super helpful to know what your frictions were when adopting redis for an HA setup. relatedly, we've had some user contributions that might improve whatever you were witnessing in this regard
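To make the point about per-replica in-memory brokers concrete, here is a minimal toy sketch. It is not Prefect's implementation: `InMemoryBroker`, `Replica`, and `simulate` are hypothetical names invented for illustration, and the model only assumes that some background consumer reads events that the API replicas publish. It shows one way events can go missing when every process has its own private broker, compared with all processes sharing one (the role a broker like Redis would play).

```python
from collections import deque


class InMemoryBroker:
    """Toy process-local broker (hypothetical, not a Prefect class)."""

    def __init__(self):
        self.queue = deque()

    def publish(self, event):
        self.queue.append(event)

    def drain(self):
        # Return everything published to THIS broker instance and empty it.
        events = list(self.queue)
        self.queue.clear()
        return events


class Replica:
    """Toy API server replica that publishes to whichever broker it was given."""

    def __init__(self, name, broker):
        self.name = name
        self.broker = broker

    def record_task_run(self, task_run_id):
        self.broker.publish({"replica": self.name, "task_run": task_run_id})


def simulate(shared):
    """Return the task run ids that a separate background consumer can see."""
    shared_broker = InMemoryBroker()

    def make_broker():
        # shared=True: every process uses one broker.
        # shared=False: every process gets its own private in-memory broker.
        return shared_broker if shared else InMemoryBroker()

    replicas = [Replica("api-1", make_broker()), Replica("api-2", make_broker())]
    services_broker = make_broker()  # broker seen by the background consumer

    for i, replica in enumerate(replicas):
        replica.record_task_run(f"task-run-{i}")

    # The consumer can only read from the broker it is attached to.
    return [event["task_run"] for event in services_broker.drain()]


isolated = simulate(shared=False)
shared = simulate(shared=True)
print("isolated brokers:", isolated)
print("shared broker:", shared)
```

In this toy model the isolated run sees no events at all, while the shared run sees both. Whether that is the exact mechanism behind the behavior reported in this thread is an assumption; the sketch only illustrates why a broker shared across replicas matters.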
Well I guess we got lucky/unlucky that it used to work without Redis. We don't use automations, so that didn't really affect us.
We run thousands of deployed flows concurrently, and when we tested Redis under our load, the Redis instance was redlining at 100% CPU and 100K+ requests/s. We were using the largest Redis available on GCP (8 CPU, 58 GB) and were unable to scale it further. We anticipate having even more flows in the future, and we are concerned about Redis being a bottleneck given that it is already using all of the available CPU. We would feel a lot more comfortable using Redis if sharding were supported, which it currently isn't.
Issues like this are what made us pay much more attention to execution correctness. When task runs aren't reliably persisted, retries or recovery logic can actually make things worse. Curious if the team has considered failing harder or earlier in cases where persistence guarantees can't be met.
@levzem are you using the Prefect Helm chart ( |
@bdalpe yes we are
Bug summary
We recently upgraded from Prefect 3.4.13 to 3.6.2 and have observed that task runs no longer appear in the Prefect flow dashboard. The task run tab is completely empty. Upon further inspection, the task_run table in the DB no longer has any task runs associated with tasks after the version upgrade (it still writes task runs for running deployed flows). This means task runs are no longer being persisted to the DB.
We self-host all of our Prefect infra on GKE. We have many Prefect server replicas and host the Prefect background services separately. We are unable to reproduce the issue locally.
Upon further testing, we are able to fix the issue by either:
Neither option is viable for us. 1 does not scale to meet our needs and previous testing has shown that 2 does not meet our scaling needs either (due to a lack of sharding support).
This seems like a pretty serious bug, as it breaks the previously supported behavior of running distributed Prefect without Redis. If this is intentional, it should warrant a major version change.
Version info
Additional context
No response