The problem
Add a Durable Workflows capability to Quarkus Flow: persist workflow instance context/state on every task transition (and key lifecycle boundaries) to a durable store. In Kubernetes, multiple replicas can execute the same workflow definition and take over instances left behind after pod crashes or node drains.
This design leverages Kubernetes Lease objects (coordination.k8s.io/v1) as the distributed lock primitive for “instance ownership”. Leases are a lightweight coordination mechanism used by Kubernetes itself for heartbeats and leader election.
Goals
- Persist workflow instance state durably on every task transition:
  - deterministic “last known good” checkpoint
  - resumable from the latest persisted transition
- Prevent concurrent execution of the same workflow instance across replicas.
- Detect and recover orphaned instances (pod died / drained / lost lease renewal).
- Kubernetes-native coordination using Lease API + Fabric8 client.
Kubernetes-side overview
Why Lease objects
Kubernetes provides Lease resources under coordination.k8s.io/v1 that are designed for high-frequency coordination and are used for node heartbeats and leader election.
Relevant Lease spec fields:
- `spec.holderIdentity`: who currently holds the lock
- `spec.leaseDurationSeconds`: expiry window since the last renew
- `spec.renewTime`: last heartbeat/renewal time
- `spec.acquireTime`, `spec.leaseTransitions`: helpful for debugging/observability
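For illustration, a per-instance Lease could look like the following manifest; the name, namespace, and label values are suggestions from this proposal, not a finalized convention:

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: flow-2f1a9c3b-abc123            # derived from workflow name hash + instanceId
  namespace: my-app
  labels:
    io.quarkiverse.flow/workflow: orders
    io.quarkiverse.flow/instance: abc123
spec:
  holderIdentity: my-app-7d9f           # pod currently executing the instance
  leaseDurationSeconds: 15              # expiry window since last renew
  renewTime: "2024-01-01T00:00:00.000000Z"
  acquireTime: "2024-01-01T00:00:00.000000Z"
  leaseTransitions: 0
```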
Lock semantics we want
- Per-workflow-instance Lease:
  - lease name derived from `instanceId` (and optionally a workflow-name hash to avoid collisions)
  - labels to support list/filter (e.g. `io.quarkiverse.flow/workflow=<name>`, `io.quarkiverse.flow/instance=<id>`)
- The “owner” (pod) renews while it is actively executing.
- If the pod dies or is drained mid-flight, renew stops and the lease expires; another pod can acquire and resume.
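As a minimal sketch of the naming and expiry rules above, a helper along these lines could derive Lease names and evaluate expiry. The class and method names here are hypothetical, not part of any existing Quarkus Flow API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Duration;
import java.time.Instant;
import java.util.Locale;

// Hypothetical helper illustrating the per-instance lease conventions.
class InstanceLease {

    // Lease names must be valid DNS-1123 subdomains, so we lowercase the
    // instance id and prefix a short hash of the workflow name to avoid
    // collisions between workflows that reuse instance ids.
    static String leaseName(String workflowName, String instanceId) {
        return "flow-" + shortHash(workflowName) + "-" + instanceId.toLowerCase(Locale.ROOT);
    }

    static String shortHash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 4; i++) sb.append(String.format("%02x", digest[i] & 0xff));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // A lease is expired when renewTime + leaseDurationSeconds lies in the
    // past; another replica may then take over the orphaned instance.
    static boolean isExpired(Instant renewTime, long leaseDurationSeconds, Instant now) {
        return renewTime.plus(Duration.ofSeconds(leaseDurationSeconds)).isBefore(now);
    }
}
```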
RBAC needed
ServiceAccount used by Quarkus Flow needs namespace-scoped permissions for leases:
- API group: `coordination.k8s.io`
- resource: `leases`
- verbs: `get`, `list`, `watch`, `create`, `update`, `patch` (optionally `delete`)
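A minimal namespace-scoped Role and RoleBinding covering these permissions could look like this (resource and ServiceAccount names are placeholders):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: quarkus-flow-lease-manager
  namespace: my-app
rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: quarkus-flow-lease-manager
  namespace: my-app
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: quarkus-flow-lease-manager
subjects:
  - kind: ServiceAccount
    name: quarkus-flow-sa      # the ServiceAccount used by the application pods
    namespace: my-app
```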
Operational reality (drain/termination)
- On node drain / pod eviction, Kubernetes sends SIGTERM and respects termination grace periods.
- We should attempt graceful checkpoint + lease release during shutdown, but correctness must not depend on it.
- The hard guarantee comes from: persisted checkpoints + lease expiry-based takeover.
Quarkus Flow-side design
Proposed module structure
Introduce a new module (suggested):
- `quarkus-flow-durable` (core SPI + runtime integration)

Optional split:
- `quarkus-flow-durable` (store + runtime hooks)
- `quarkus-flow-durable-kubernetes` (Lease-based locking via Fabric8)
Implementation notes
- Use Fabric8 Coordination API for Leases.
- Avoid “list all leases, then decide” where possible:
  - prefer: select candidates from the durable store, then attempt a Lease acquire (a conflict means “someone else got it”)
- Quarkus native: ensure Lease model classes are registered for reflection when needed.
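The conflict-based acquire can be sketched against a small abstraction. A Fabric8-backed implementation would map `create`/`replace` onto the Kubernetes Lease API, where creating an already-existing Lease (or updating with a stale `resourceVersion`) fails with HTTP 409 Conflict. All interface and class names below are illustrative, not the eventual SPI:

```java
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical abstraction over the Lease API; the in-memory version below
// stands in for a Fabric8 client implementation in this sketch.
interface LeaseApi {
    /** Creates the lease; returns false on conflict (someone else holds it). */
    boolean create(String name, String holder, Instant renewTime, long durationSeconds);
    LeaseRecord get(String name);
    /** Replaces the lease only if the observed state still matches (optimistic concurrency). */
    boolean replace(String name, LeaseRecord expected, String newHolder, Instant newRenewTime);
}

record LeaseRecord(String holder, Instant renewTime, long durationSeconds) {}

class LockManager {
    private final LeaseApi api;
    private final String identity;       // e.g. the pod name
    private final long durationSeconds;  // conservative expiry window

    LockManager(LeaseApi api, String identity, long durationSeconds) {
        this.api = api; this.identity = identity; this.durationSeconds = durationSeconds;
    }

    /** Conflict-based acquire: create first, fall back to takeover of expired leases. */
    boolean tryAcquire(String leaseName, Instant now) {
        if (api.create(leaseName, identity, now, durationSeconds)) return true; // fresh lock
        LeaseRecord current = api.get(leaseName);
        if (current == null) return false; // raced with a delete; retry on the next pass
        boolean expired = current.renewTime()
                .plusSeconds(current.durationSeconds()).isBefore(now);
        // Takeover only when expired; replace() guards against concurrent takeovers.
        return expired && api.replace(leaseName, current, identity, now);
    }
}

// In-memory stand-in for illustration/tests only; not the Fabric8 client.
class InMemoryLeaseApi implements LeaseApi {
    private final Map<String, LeaseRecord> leases = new ConcurrentHashMap<>();
    public boolean create(String name, String holder, Instant t, long d) {
        return leases.putIfAbsent(name, new LeaseRecord(holder, t, d)) == null;
    }
    public LeaseRecord get(String name) { return leases.get(name); }
    public boolean replace(String name, LeaseRecord expected, String h, Instant t) {
        return leases.replace(name, expected, new LeaseRecord(h, t, expected.durationSeconds()));
    }
}
```

The same shape works in Mode B below: every replica may call `tryAcquire`, and the Lease conflict decides the single winner.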
Runtime hooks inside Quarkus Flow
We need explicit hook points in the workflow engine:
On every task transition
- Persist checkpoint before moving to next task (or at least before acknowledging completion).
- Persist a “transition counter” / monotonic sequence number (guards against out-of-order updates).
On task start / completion
- Persist lightweight progress markers (useful for “RUNNING but stuck” detection).
On WAIT states
- Persist enough info to resume (expected event, timer due time, correlation keys).
- Consider releasing the Lease when entering WAITING (so any replica can resume when an event arrives), then re-acquire when resuming execution.
On resume trigger (event/timer/manual)
- Re-acquire Lease before executing the next segment.
- Load latest checkpoint from store.
- Continue execution from cursor.
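The transition-counter guard can be sketched with an in-memory stand-in for the store; the `Checkpoint` shape and method names are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical checkpoint record: a monotonic sequence number, the engine's
// position (cursor), and the serialized workflow context.
record Checkpoint(long seq, String cursor, String stateJson) {}

// A checkpoint is accepted only when its sequence number is strictly greater
// than the last persisted one, which protects the store against out-of-order
// or duplicate writes after retries or delayed replicas.
class InMemoryDurableStore {
    private final Map<String, Checkpoint> checkpoints = new ConcurrentHashMap<>();

    boolean saveCheckpoint(String instanceId, Checkpoint next) {
        while (true) {
            Checkpoint prev = checkpoints.get(instanceId);
            if (prev == null) {
                if (checkpoints.putIfAbsent(instanceId, next) == null) return true;
            } else {
                if (next.seq() <= prev.seq()) return false; // stale write, reject
                if (checkpoints.replace(instanceId, prev, next)) return true;
            }
            // lost a race with another writer; re-read and retry
        }
    }

    Checkpoint latest(String instanceId) { return checkpoints.get(instanceId); }
}
```

In a SQL implementation the same guard becomes a conditional update, e.g. `UPDATE ... SET seq = ?, state = ? WHERE instance_id = ? AND seq < ?`.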
Orphan detection + recovery loop (Reconciler)
A small scheduler running in each pod (or a single elected reconciler) that:
- Queries the `DurableStore` for candidates:
  - `status = RUNNING` and `lastHeartbeatAt` older than threshold
  - `status = WAITING` and timer due (if applicable)
  - `status = RETRY_WAIT` and retry due
- For each candidate:
  - `tryAcquire(Lease)` (winner proceeds)
  - `loadCheckpoint()`
  - resume execution
  - renew the lease while actively executing
- On success/failure:
  - update status + checkpoint
  - release the lease when reaching WAITING or a terminal state
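One reconcile pass can be sketched as follows; the three collaborator interfaces are illustrative stand-ins for the eventual `DurableStore`, `LockManager`, and engine resume path:

```java
import java.util.List;

// Hypothetical collaborators; names are illustrative, not the final SPI.
interface OrphanQuery { List<String> findOrphanCandidates(); }      // DurableStore side
interface InstanceLock { boolean tryAcquire(String instanceId); }   // Lease side
interface ResumeAction { void resumeFromLatestCheckpoint(String instanceId); }

class Reconciler {
    private final OrphanQuery store;
    private final InstanceLock locks;
    private final ResumeAction engine;

    Reconciler(OrphanQuery store, InstanceLock locks, ResumeAction engine) {
        this.store = store; this.locks = locks; this.engine = engine;
    }

    /** One pass: query candidates, race on the Lease, resume only as the winner. */
    int reconcileOnce() {
        int resumed = 0;
        for (String instanceId : store.findOrphanCandidates()) {
            if (!locks.tryAcquire(instanceId)) continue; // another replica won the race
            engine.resumeFromLatestCheckpoint(instanceId);
            resumed++;
        }
        return resumed;
    }
}
```

In Mode B every pod would run this loop (with jitter between passes); in Mode A only the elected leader runs it.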
Deployment modes
- Mode A (simpler / less API load): one reconciler leader using a global lease (“reconciler leader election”), while per-instance leases protect execution.
- Mode B (fully distributed): all pods run reconciler; rely on Lease conflicts to prevent duplication (may require jitter/backoff + optional sharding).
Fabric8 + Quarkus integration details
- Use the Fabric8 client to read/create/update Leases (`coordination.k8s.io/v1`).
- Ensure the required RBAC Role/RoleBinding is generated/documented.
- Native image considerations: register Lease model classes for reflection if required.
Failure modes & correctness notes (to document)
- Execution semantics are at-least-once across task boundaries; tasks should be idempotent (or integrated with an outbox pattern) for re-runs to be safe.
- Recovery will resume from the last persisted checkpoint; a task may re-run if the crash happens after side effects but before checkpoint commit.
- Lease expiry is time-based; use conservative durations + renew frequency + jitter.
Deliverables
- New module(s): `quarkus-flow-durable` (+ optional `-kubernetes` split).
- SPI: `DurableStore`, `LockManager`, `Reconciler`.
- Runtime hook integration: checkpoint on task transition + resume path.
- Implementations:
- Redis store (quarkus-redis-client)
- (optional) PostgreSQL store (quarkus-jdbc / reactive-pg)
- Kubernetes Lease lock via Fabric8
- RBAC manifests and docs.
- Integration tests:
- Testcontainers (PG/Redis)
- Fabric8 mock server for Lease behaviors
- “kill pod mid-run” scenario (simulated), verifying takeover
Area(s)
- Runtime / Engine
- DSL / API
- Dev UI / Mermaid
- Persistence
- Messaging / CloudEvents
- Observability (Micrometer / OTEL)
- Docs & examples
- Agentic AI