Agent workflow recovery based on server-side state #6077
hhamalai wants to merge 17 commits into woodpecker-ci:main
bb6faaa to d1aa97e
rebased commits from main
@hhamalai could you check the linting and openapi failures: https://ci.woodpecker-ci.org/repos/3780/pipeline/31537/35 ? Otherwise this looks quite good to me, apart from the mentioned style discussion.
fixes depguard related linter errors
Codecov Report
@@            Coverage Diff             @@
##             main    #6077      +/-   ##
==========================================
- Coverage   31.66%   30.93%   -0.73%
==========================================
  Files         420      423       +3
  Lines       28413    29089     +676
==========================================
+ Hits         8996     9000       +4
- Misses      18596    19265     +669
- Partials      821      824       +3
0d845b6 to 3e88cd8
Thanks. As this changes the RPC, the version needs to be increased: https://github.com/woodpecker-ci/woodpecker/blob/main/rpc/proto/version.go#L19
Thanks, looks good to me now. @6543 do you want to check again?
I'm not fully convinced by the newly added proto type. Let me rethink and check before I can give a well-founded review.
This PR introduces a workflow recovery mechanism for agents. It allows workflows to resume from their last known state after an agent restart by persisting workflow progress in the server database.
why
In larger deployments, agents must occasionally be updated or scaled. This currently interrupts all CI jobs, because agents keep the workflow execution config in memory and lose it on restart. That is a headache, especially for long-running or critical workflows.
how
This PR introduces a bookkeeping mechanism that records each workflow's progress. The bookkeeping is done in the server database, and agents query step statuses from the server, which lets an agent identify which steps are pending, running, or completed. If the executing agent is lost, the workflow becomes available again from the server queue and a new agent can continue its execution. This enables the pipeline to resume correctly after an agent restart or failure.
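The decision an agent makes after picking up a recovered workflow could be sketched roughly as follows. This is only an illustration of the idea described above, not the code in this PR: `StepState`, `Action`, and `PlanRecovery` are hypothetical names, and the real status values and RPC types may differ.

```go
package main

import "fmt"

// StepState is a hypothetical status as reported by the server.
// The names are illustrative and are not Woodpecker's actual types.
type StepState string

const (
	StatePending StepState = "pending"
	StateRunning StepState = "running"
	StateSuccess StepState = "success"
	StateFailure StepState = "failure"
)

// Action is what a recovering agent does with a step (hypothetical).
type Action string

const (
	ActionStart    Action = "start"    // step has not run yet
	ActionReattach Action = "reattach" // step is already running in the backend
	ActionSkip     Action = "skip"     // step already finished, never run it again
)

// PlanRecovery maps the server-side state of every step to an action.
// Steps the server has no record of are treated as pending.
func PlanRecovery(steps []string, states map[string]StepState) map[string]Action {
	plan := make(map[string]Action, len(steps))
	for _, name := range steps {
		switch states[name] {
		case StateRunning:
			plan[name] = ActionReattach
		case StateSuccess, StateFailure:
			plan[name] = ActionSkip
		default:
			plan[name] = ActionStart
		}
	}
	return plan
}

func main() {
	steps := []string{"clone", "build", "test"}
	// State as it might be stored on the server when the first agent was lost.
	states := map[string]StepState{
		"clone": StateSuccess,
		"build": StateRunning,
		"test":  StatePending,
	}
	plan := PlanRecovery(steps, states)
	for _, name := range steps {
		fmt.Printf("%s: %s\n", name, plan[name])
	}
}
```

Treating finished steps as `skip` regardless of success or failure is what keeps a step from being executed twice in this sketch. How a failed step affects the rest of the workflow is left out here.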
what else
Originally proposed in "feat: workflow recovery for Kubernetes backend agents" #5930, which targeted only the Kubernetes backend. As discussed there, the preference was to persist the recovery state bookkeeping in the server database rather than bind it to backend implementations. With this PR, backends can implement an interface if they support recovery.
Recovered workflows might produce duplicate logging in the UI (the original agent streams logs until it is deleted, and the new agent taking over the workflow will stream the same logs from the beginning). Under no circumstances should the same step be executed twice.