Agent workflow recovery based on server-side state #6077

Open
hhamalai wants to merge 17 commits into woodpecker-ci:main from hhamalai:recovery-feature
@hhamalai
Contributor

@hhamalai hhamalai commented Feb 6, 2026

This PR introduces a workflow recovery mechanism for agents. It allows workflows to resume from their last known state after an agent restart by persisting workflow progress in the server database.

why

In larger deployments, agents must occasionally be updated or scaled. This currently interrupts all CI jobs, because agents keep the workflow execution config in memory and it is lost on restart. That is a headache, especially for long-running or critical workflows.

how

This PR introduces a bookkeeping mechanism that records each workflow's progress. The bookkeeping is done in the server database, and agents query step statuses from the server, which lets an agent identify which steps are pending, running, or completed. If the executing agent is lost, the workflow becomes available again from the server queue and a new agent can continue executing it. This enables the pipeline to resume correctly following an agent restart or failure.
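
As a rough illustration of the idea (not the code in this PR), the decision a recovering agent makes per step could look like the sketch below. All names here (`StepStatus`, `Action`, `planRecovery`) are hypothetical and chosen for the example; the assumption is that the server reports one of the three statuses named above for each step, and that a step the server has no record of is treated as pending.

```go
package main

import "fmt"

// StepStatus is a hypothetical mirror of the step states named above.
type StepStatus string

const (
	StatusPending   StepStatus = "pending"
	StatusRunning   StepStatus = "running"
	StatusCompleted StepStatus = "completed"
)

// Action is what a recovering agent does with a step (hypothetical).
type Action string

const (
	ActionStart    Action = "start"    // never started: run it normally
	ActionReattach Action = "reattach" // already running: wait for it, do not start it again
	ActionSkip     Action = "skip"     // already finished: nothing to do
)

// planRecovery maps each step's server-side status to an action.
// Steps the server has no record of are treated as pending.
func planRecovery(steps []string, serverState map[string]StepStatus) map[string]Action {
	plan := make(map[string]Action, len(steps))
	for _, name := range steps {
		switch serverState[name] {
		case StatusCompleted:
			plan[name] = ActionSkip
		case StatusRunning:
			plan[name] = ActionReattach
		default:
			plan[name] = ActionStart
		}
	}
	return plan
}

func main() {
	steps := []string{"clone", "build", "test"}
	// State as a new agent might see it after the original agent was lost.
	state := map[string]StepStatus{
		"clone": StatusCompleted,
		"build": StatusRunning,
	}
	plan := planRecovery(steps, state)
	for _, s := range steps {
		fmt.Printf("%s: %s\n", s, plan[s])
	}
}
```

The point of the sketch is that only steps the server has never seen started get started; a running step is waited on rather than launched again.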

what else

  • Originally proposed in "feat: workflow recovery for Kubernetes backend agents" (#5930), targeting only the Kubernetes backend. As discussed there, the recovery state bookkeeping should be persisted in the server database and not be bound to backend implementations. With this PR, backends can implement an interface if they support recovery.

  • Recovered workflows might produce duplicate log output in the UI (the original agent streams logs until it is deleted, and the new agent taking over the workflow will stream the same logs from the beginning). Under no circumstances should the same step be executed twice.

@hhamalai
Contributor Author

rebased commits from main

@woodpecker-bot
Contributor

woodpecker-bot commented Feb 10, 2026

Surge PR preview deployment succeeded. View it at https://woodpecker-ci-woodpecker-pr-6077.surge.sh

@qwerty287
Contributor

@hhamalai could you check out linting and openapi: https://ci.woodpecker-ci.org/repos/3780/pipeline/31537/35

Otherwise this looks quite good to me, besides the mentioned style discussion.

@codecov

codecov bot commented Feb 12, 2026

Codecov Report

❌ Patch coverage is 0.55556% with 716 lines in your changes missing coverage. Please review.
✅ Project coverage is 30.93%. Comparing base (2ca6f58) to head (e1e434f).
⚠️ Report is 12 commits behind head on main.

Files with missing lines                     Patch %   Lines
rpc/proto/woodpecker.pb.go                   0.00%     148 Missing ⚠️
pipeline/pipeline.go                         0.00%     115 Missing ⚠️
pipeline/recovery.go                         0.00%     72 Missing ⚠️
agent/rpc/client_grpc.go                     0.00%     67 Missing ⚠️
rpc/proto/woodpecker_grpc.pb.go              0.00%     52 Missing ⚠️
server/rpc/rpc.go                            0.00%     41 Missing ⚠️
pipeline/backend/docker/docker.go            0.00%     39 Missing ⚠️
server/store/datastore/recovery_state.go     0.00%     37 Missing ⚠️
pipeline/backend/kubernetes/kubernetes.go    0.00%     36 Missing ⚠️
server/rpc/server.go                         0.00%     26 Missing ⚠️
... and 11 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6077      +/-   ##
==========================================
- Coverage   31.66%   30.93%   -0.73%     
==========================================
  Files         420      423       +3     
  Lines       28413    29089     +676     
==========================================
+ Hits         8996     9000       +4     
- Misses      18596    19265     +669     
- Partials      821      824       +3     


@qwerty287 qwerty287 requested a review from a team February 13, 2026 09:30
@hhamalai hhamalai force-pushed the recovery-feature branch 2 times, most recently from 0d845b6 to 3e88cd8 Compare February 16, 2026 11:15
@6543 6543 added the feature add new functionality label Feb 17, 2026
@xoxys xoxys changed the title feat: agent workflow recovery based on server-side state Agent workflow recovery based on server-side state Feb 22, 2026
@xoxys xoxys added server backend new backend agent and removed backend new backend labels Feb 22, 2026
@qwerty287
Contributor

Thanks. As this changes the RPC, the version needs to be increased: https://github.com/woodpecker-ci/woodpecker/blob/main/rpc/proto/version.go#L19

@qwerty287
Contributor

Thanks, looks good to me now. @6543 you want to check again?

@6543
Member

6543 commented Feb 25, 2026

I'm not fully convinced by the newly added proto type. Let me rethink and check before I can give a well-founded review.

Labels

agent feature add new functionality server

5 participants