task runs persistence for distributed Prefect servers #20213

levzem · 2025-12-16T00:00:24Z

Bug summary

We recently upgraded from Prefect 3.4.13 to 3.6.2 and have observed that task runs no longer appear in the Prefect flow dashboard. The task run tab is completely empty. Upon further inspection, since the version upgrade the task_run table in the DB no longer has any task_runs associated with tasks (it still writes task runs for running deployed flows). This means task runs are no longer being persisted to the DB.
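
One way to confirm the symptom is to look for flow runs that have no rows in task_run. The sketch below uses an in-memory SQLite database with a trimmed, assumed schema (only `id`, `flow_run_id`, and `name`); the real Prefect tables have many more columns, and the sample rows are invented for illustration.

```python
import sqlite3

# Toy stand-in for the Prefect database. The trimmed schema and the sample
# rows are assumptions made for this illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flow_run (id TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE task_run (id TEXT PRIMARY KEY, flow_run_id TEXT, name TEXT);
    INSERT INTO flow_run VALUES ('f1', 'before-upgrade'), ('f2', 'after-upgrade');
    INSERT INTO task_run VALUES ('t1', 'f1', 'extract-0'), ('t2', 'f1', 'load-0');
""")


def flow_runs_without_task_runs(conn):
    """Return the names of flow runs that have no rows in task_run."""
    rows = conn.execute("""
        SELECT fr.name
        FROM flow_run AS fr
        LEFT JOIN task_run AS tr ON tr.flow_run_id = fr.id
        GROUP BY fr.id, fr.name
        HAVING COUNT(tr.id) = 0
        ORDER BY fr.name
    """).fetchall()
    return [name for (name,) in rows]


missing = flow_runs_without_task_runs(conn)
print(missing)  # ['after-upgrade']
```

A flow that legitimately runs no tasks would also show up in this list, so the result is only a hint, not proof of lost writes.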

We self-host all of our Prefect infra on GKE. We have many Prefect server replicas and host the Prefect background services separately. We are unable to reproduce the issue locally.

Upon further testing, we are able to fix the issue by either:

1. setting the Prefect server replica count to 1 and running the background services in memory
2. standing up a Redis instance

Neither option is viable for us. 1 does not scale to meet our needs and previous testing has shown that 2 does not meet our scaling needs either (due to a lack of sharding support).

This seems like a pretty serious bug as it is breaking previously supported behavior of running distributed Prefect without Redis. If this is intentional, this should warrant a major version change.

Version info

Version:              3.6.2
API version:          0.8.4
Python version:       3.12.8
Git commit:           6e305792
Built:                Fri, Nov 14, 2025 12:15 AM
OS/Arch:              darwin/arm64
Profile:              ephemeral
Server type:          unconfigured
Pydantic version:     2.10.6
Server:
  Database:           sqlite
  SQLite version:     3.47.1
Integrations:
  prefect-gcp:        0.6.11
  prefect-docker:     0.6.1

Additional context

No response

zzstoatzz (Maintainer) · 2025-12-16T02:30:19Z

hi @levzem - thanks for the issue!

> breaking previously supported behavior of running distributed Prefect without Redis

running many prefect servers without a distributed message broker for events / message passing has never been a supported pattern in prefect. unless you've implemented your own message broker, the default experience of running an in-memory message broker on each replica would result in event ordering problems that'd break automations and cause overall consistency problems.
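
To illustrate the point about per-replica in-memory brokers, here is a toy model. It is not Prefect code: `InMemoryBroker` and `run_scenario` are hypothetical names, and the model assumes (the thread does not spell this out) that task run state reaches the database through events consumed by a background service. When each process owns a private queue, the consuming process never sees what the API replicas published.

```python
from collections import deque


class InMemoryBroker:
    """Hypothetical per-process queue; nothing leaves the process that owns it."""

    def __init__(self):
        self.queue = deque()

    def publish(self, event):
        self.queue.append(event)

    def drain(self):
        events = list(self.queue)
        self.queue.clear()
        return events


def run_scenario(api_brokers, services_broker, events):
    """Spread events round-robin over API replicas, then let the services process consume."""
    for i, event in enumerate(events):
        api_brokers[i % len(api_brokers)].publish(event)
    # The services process can only read from the broker it is attached to.
    return services_broker.drain()


events = [f"task-run-{n}" for n in range(4)]

# Case A: every API replica and the services process has a private broker.
isolated = run_scenario([InMemoryBroker(), InMemoryBroker()], InMemoryBroker(), events)

# Case B: one broker object stands in for a shared external broker.
shared = InMemoryBroker()
shared_result = run_scenario([shared, shared], shared, events)

print(isolated)       # []
print(shared_result)  # ['task-run-0', 'task-run-1', 'task-run-2', 'task-run-3']
```

In case A the events are still sitting in the replicas' private queues, which matches the symptom of nothing being written even though no error is raised.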

> 2 does not meet our scaling needs either (due to a lack of sharding support).

can you elaborate on this? it'd be super helpful to know what your frictions were when adopting redis for an HA setup. relatedly, we've had some user contributions that might improve whatever you were witnessing in this regard

levzem (Author) · 2025-12-16T18:45:26Z

@zzstoatzz

Well I guess we got lucky/unlucky that it used to work without Redis. We don't use automations, so that didn't really affect us.

> 2 does not meet our scaling needs either (due to a lack of sharding support).

> can you elaborate on this? it'd be super helpful to know what your frictions were when adopting redis for an HA setup. relatedly, we've had some user contributions that might improve whatever you were witnessing in this regard

We run thousands of deployed flows concurrently, and when we tested Redis under our load, the Redis instance was redlining at 100% CPU and 100K+ requests/s. We were using the largest Redis instance available on GCP (8 CPUs, 58 GB) and were unable to scale it further. We anticipate having even more flows in the future, and we are concerned about Redis becoming a bottleneck given that it is already using all of the available CPU. We would feel a lot more comfortable using Redis if sharding were supported, which it currently isn't.
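
For readers unfamiliar with what sharding would mean here, the sketch below shows how Redis Cluster assigns a key to one of 16384 hash slots using CRC16 (the XMODEM variant, which `binascii.crc_hqx` computes with an initial value of 0). The `shard_for` helper and its equal, contiguous slot ranges are assumptions for illustration; hash tags are ignored, and nothing here implies that Prefect's Redis integration can run against a cluster.

```python
import binascii

HASH_SLOTS = 16384  # number of hash slots in Redis Cluster


def hash_slot(key: str) -> int:
    """CRC16 (XMODEM) of the key modulo 16384; hash tags are ignored in this sketch."""
    return binascii.crc_hqx(key.encode(), 0) % HASH_SLOTS


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard, assuming slots are split into equal contiguous ranges."""
    return hash_slot(key) * num_shards // HASH_SLOTS


slot = hash_slot("123456789")
print(slot)                         # 12739 (0x31C3)
print(shard_for("123456789", 4))    # 3

# Back-of-the-envelope: 100K requests/s spread evenly over 4 shards.
per_shard = 100_000 / 4
print(per_shard)                    # 25000.0
```

A single key always lands on one shard, so this only spreads load when the traffic itself is spread across many keys.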

Zi-Ling · 2025-12-17T12:47:20Z

Issues like this are what made us pay much more attention to execution correctness.

When task runs aren’t reliably persisted, retries or recovery logic can actually make things worse,
because you lose the ability to reason about what truly happened.

Curious if the team has considered failing harder or earlier in cases where persistence guarantees can’t be met.
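
As a rough illustration of what failing earlier could look like, the sketch below rejects a topology at startup when a process-local broker is combined with more than one server process. `ServerTopology`, `UnsafeTopologyError`, and `check_persistence_guarantees` are hypothetical names invented for this example; they are not Prefect settings or APIs.

```python
from dataclasses import dataclass


@dataclass
class ServerTopology:
    """Hypothetical deployment description; not a Prefect settings object."""
    api_replicas: int
    services_in_api_process: bool
    broker: str  # "memory" or "redis" in this sketch


class UnsafeTopologyError(RuntimeError):
    """Raised when events could be stranded in a process-local broker."""


def check_persistence_guarantees(topology: ServerTopology) -> None:
    """Refuse to start if a process-local broker is combined with multiple processes."""
    if topology.broker != "memory":
        return  # a shared broker is visible to every process
    if topology.api_replicas > 1 or not topology.services_in_api_process:
        raise UnsafeTopologyError(
            "in-memory broker with more than one server process: "
            "task run events may never reach the service that persists them"
        )


# Single replica with in-process services: passes silently.
check_persistence_guarantees(ServerTopology(1, True, "memory"))

# Many replicas with separate services and no shared broker: fails at startup.
try:
    check_persistence_guarantees(ServerTopology(3, False, "memory"))
    failed_fast = False
except UnsafeTopologyError:
    failed_fast = True

print(failed_fast)  # True
```

The trade-off is that a hard failure turns a silent data gap into a startup error, which is louder but also blocks setups that happened to work before.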

bdalpe (Maintainer) · 2025-12-17T19:51:22Z

@levzem are you using the Prefect Helm chart (prefect-server) for running on GKE?

levzem (Author) · 2025-12-19T18:21:59Z

@bdalpe yes we are
