The problem
Add a Durable Workflows capability to Quarkus Flow: persist workflow instance context/state on every task transition (and key lifecycle boundaries) to a durable store. In Kubernetes, multiple replicas can execute the same workflow definition and take over instances left behind after pod crashes or node drains.
This design leverages Kubernetes Lease objects (coordination.k8s.io/v1) as the distributed lock primitive for “instance ownership”. Leases are a lightweight coordination mechanism used by Kubernetes itself for heartbeats and leader election.
Goals
- Persist workflow instance state durably on every task transition:
  - deterministic “last known good” checkpoint
  - resumable from the latest persisted transition
- Prevent concurrent execution of the same workflow instance across replicas.
- Detect and recover orphaned instances (pod died / drained / lost lease renewal).
- Kubernetes-native coordination using Lease API + Fabric8 client.
Kubernetes-side overview
Why Lease objects
Kubernetes provides Lease resources under coordination.k8s.io/v1 that are designed for high-frequency coordination and are used for node heartbeats and leader election.
Relevant Lease spec fields:
- `spec.holderIdentity`: who currently holds the lock
- `spec.leaseDurationSeconds`: expiry window since the last renew
- `spec.renewTime`: last heartbeat/renewal time
- `spec.acquireTime`, `spec.leaseTransitions`: helpful for debugging/observability
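For illustration, a per-instance Lease could look like the following manifest; the name, namespace, and label values are suggestions from this proposal, not a finalized convention:

```yaml
apiVersion: coordination.k8s.io/v1
kind: Lease
metadata:
  name: flow-2f1a9c3b-abc123            # derived from workflow name hash + instanceId
  namespace: my-app
  labels:
    io.quarkiverse.flow/workflow: orders
    io.quarkiverse.flow/instance: abc123
spec:
  holderIdentity: my-app-7d9f           # pod currently executing the instance
  leaseDurationSeconds: 15              # expiry window since last renew
  renewTime: "2024-01-01T00:00:00.000000Z"
  acquireTime: "2024-01-01T00:00:00.000000Z"
  leaseTransitions: 0
```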
Lock semantics we want
- Per-workflow-instance Lease:
  - lease name derived from `instanceId` (and optionally a workflow-name hash to avoid collisions)
  - labels to support list/filter (e.g. `io.quarkiverse.flow/workflow=<name>`, `io.quarkiverse.flow/instance=<id>`)
- The “owner” (pod) renews while it is actively executing.
- If the pod dies or is drained mid-flight, renew stops and the lease expires; another pod can acquire and resume.
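As a minimal sketch of the naming and expiry rules above, a helper along these lines could derive Lease names and evaluate expiry. The class and method names here are hypothetical, not part of any existing Quarkus Flow API:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.time.Duration;
import java.time.Instant;
import java.util.Locale;

// Hypothetical helper illustrating the per-instance lease conventions.
class InstanceLease {

    // Lease names must be valid DNS-1123 subdomains, so we lowercase the
    // instance id and prefix a short hash of the workflow name to avoid
    // collisions between workflows that reuse instance ids.
    static String leaseName(String workflowName, String instanceId) {
        return "flow-" + shortHash(workflowName) + "-" + instanceId.toLowerCase(Locale.ROOT);
    }

    static String shortHash(String s) {
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                    .digest(s.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 4; i++) sb.append(String.format("%02x", digest[i] & 0xff));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // A lease is expired when renewTime + leaseDurationSeconds lies in the
    // past; another replica may then take over the orphaned instance.
    static boolean isExpired(Instant renewTime, long leaseDurationSeconds, Instant now) {
        return renewTime.plus(Duration.ofSeconds(leaseDurationSeconds)).isBefore(now);
    }
}
```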
RBAC needed
ServiceAccount used by Quarkus Flow needs namespace-scoped permissions for leases:
- API group: `coordination.k8s.io`
- resource: `leases`
- verbs: `get`, `list`, `watch`, `create`, `update`, `patch` (optionally `delete`)
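A minimal namespace-scoped Role and RoleBinding covering these permissions could look like this (resource and ServiceAccount names are placeholders):

```yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: quarkus-flow-lease-manager
  namespace: my-app
rules:
  - apiGroups: ["coordination.k8s.io"]
    resources: ["leases"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: quarkus-flow-lease-manager
  namespace: my-app
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: quarkus-flow-lease-manager
subjects:
  - kind: ServiceAccount
    name: quarkus-flow-sa      # the ServiceAccount used by the application pods
    namespace: my-app
```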
Operational reality (drain/termination)
- On node drain / pod eviction, Kubernetes sends SIGTERM and respects termination grace periods.
- We should attempt graceful checkpoint + lease release during shutdown, but correctness must not depend on it.
- The hard guarantee comes from: persisted checkpoints + lease expiry-based takeover.
Quarkus Flow-side design
Proposed module structure
Introduce a new module (suggested):
- `quarkus-flow-durable` (core SPI + runtime integration)

Optional split:
- `quarkus-flow-durable` (store + runtime hooks)
- `quarkus-flow-durable-kubernetes` (Lease-based locking via Fabric8)
Implementation notes
- Use Fabric8 Coordination API for Leases.
- Avoid “list all leases, then decide” where possible:
  - prefer: select candidates from the durable store, then attempt a Lease acquire (a conflict means “someone else got it”)
- Quarkus native: ensure Lease model classes are registered for reflection when needed.
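The conflict-based acquire can be sketched against a small abstraction. A Fabric8-backed implementation would map `create`/`replace` onto the Kubernetes Lease API, where creating an already-existing Lease (or updating with a stale `resourceVersion`) fails with HTTP 409 Conflict. All interface and class names below are illustrative, not the eventual SPI:

```java
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical abstraction over the Lease API; the in-memory version below
// stands in for a Fabric8 client implementation in this sketch.
interface LeaseApi {
    /** Creates the lease; returns false on conflict (someone else holds it). */
    boolean create(String name, String holder, Instant renewTime, long durationSeconds);
    LeaseRecord get(String name);
    /** Replaces the lease only if the observed state still matches (optimistic concurrency). */
    boolean replace(String name, LeaseRecord expected, String newHolder, Instant newRenewTime);
}

record LeaseRecord(String holder, Instant renewTime, long durationSeconds) {}

class LockManager {
    private final LeaseApi api;
    private final String identity;       // e.g. the pod name
    private final long durationSeconds;  // conservative expiry window

    LockManager(LeaseApi api, String identity, long durationSeconds) {
        this.api = api; this.identity = identity; this.durationSeconds = durationSeconds;
    }

    /** Conflict-based acquire: create first, fall back to takeover of expired leases. */
    boolean tryAcquire(String leaseName, Instant now) {
        if (api.create(leaseName, identity, now, durationSeconds)) return true; // fresh lock
        LeaseRecord current = api.get(leaseName);
        if (current == null) return false; // raced with a delete; retry on the next pass
        boolean expired = current.renewTime()
                .plusSeconds(current.durationSeconds()).isBefore(now);
        // Takeover only when expired; replace() guards against concurrent takeovers.
        return expired && api.replace(leaseName, current, identity, now);
    }
}

// In-memory stand-in for illustration/tests only; not the Fabric8 client.
class InMemoryLeaseApi implements LeaseApi {
    private final Map<String, LeaseRecord> leases = new ConcurrentHashMap<>();
    public boolean create(String name, String holder, Instant t, long d) {
        return leases.putIfAbsent(name, new LeaseRecord(holder, t, d)) == null;
    }
    public LeaseRecord get(String name) { return leases.get(name); }
    public boolean replace(String name, LeaseRecord expected, String h, Instant t) {
        return leases.replace(name, expected, new LeaseRecord(h, t, expected.durationSeconds()));
    }
}
```

The same shape works in Mode B below: every replica may call `tryAcquire`, and the Lease conflict decides the single winner.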
Runtime hooks inside Quarkus Flow
We need explicit hook points in the workflow engine:
On every task transition
- Persist checkpoint before moving to next task (or at least before acknowledging completion).
- Persist a “transition counter” / monotonic sequence number (guards against out-of-order updates).
On task start / completion
- Persist lightweight progress markers (useful for “RUNNING but stuck” detection).
On WAIT states
- Persist enough info to resume (expected event, timer due time, correlation keys).
- Consider releasing the Lease when entering WAITING (so any replica can resume when an event arrives), then re-acquire when resuming execution.
On resume trigger (event/timer/manual)
- Re-acquire Lease before executing the next segment.
- Load latest checkpoint from store.
- Continue execution from cursor.
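The transition-counter guard can be sketched with an in-memory stand-in for the store; the `Checkpoint` shape and method names are illustrative:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical checkpoint record: a monotonic sequence number, the engine's
// position (cursor), and the serialized workflow context.
record Checkpoint(long seq, String cursor, String stateJson) {}

// A checkpoint is accepted only when its sequence number is strictly greater
// than the last persisted one, which protects the store against out-of-order
// or duplicate writes after retries or delayed replicas.
class InMemoryDurableStore {
    private final Map<String, Checkpoint> checkpoints = new ConcurrentHashMap<>();

    boolean saveCheckpoint(String instanceId, Checkpoint next) {
        while (true) {
            Checkpoint prev = checkpoints.get(instanceId);
            if (prev == null) {
                if (checkpoints.putIfAbsent(instanceId, next) == null) return true;
            } else {
                if (next.seq() <= prev.seq()) return false; // stale write, reject
                if (checkpoints.replace(instanceId, prev, next)) return true;
            }
            // lost a race with another writer; re-read and retry
        }
    }

    Checkpoint latest(String instanceId) { return checkpoints.get(instanceId); }
}
```

In a SQL implementation the same guard becomes a conditional update, e.g. `UPDATE ... SET seq = ?, state = ? WHERE instance_id = ? AND seq < ?`.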
Orphan detection + recovery loop (Reconciler)
A small scheduler running in each pod (or a single elected reconciler) that:
- Queries the `DurableStore` for candidates:
  - `status = RUNNING` and `lastHeartbeatAt` older than threshold
  - `status = WAITING` and timer due (if applicable)
  - `status = RETRY_WAIT` and retry due
- For each candidate:
  - `tryAcquire(Lease)` (winner proceeds)
  - `loadCheckpoint()`
  - resume execution
  - renew the lease while actively executing
- On success/failure:
  - update status + checkpoint
  - release the lease when reaching WAITING or a terminal state
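One reconcile pass can be sketched as follows; the three collaborator interfaces are illustrative stand-ins for the eventual `DurableStore`, `LockManager`, and engine resume path:

```java
import java.util.List;

// Hypothetical collaborators; names are illustrative, not the final SPI.
interface OrphanQuery { List<String> findOrphanCandidates(); }      // DurableStore side
interface InstanceLock { boolean tryAcquire(String instanceId); }   // Lease side
interface ResumeAction { void resumeFromLatestCheckpoint(String instanceId); }

class Reconciler {
    private final OrphanQuery store;
    private final InstanceLock locks;
    private final ResumeAction engine;

    Reconciler(OrphanQuery store, InstanceLock locks, ResumeAction engine) {
        this.store = store; this.locks = locks; this.engine = engine;
    }

    /** One pass: query candidates, race on the Lease, resume only as the winner. */
    int reconcileOnce() {
        int resumed = 0;
        for (String instanceId : store.findOrphanCandidates()) {
            if (!locks.tryAcquire(instanceId)) continue; // another replica won the race
            engine.resumeFromLatestCheckpoint(instanceId);
            resumed++;
        }
        return resumed;
    }
}
```

In Mode B every pod would run this loop (with jitter between passes); in Mode A only the elected leader runs it.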
Deployment modes
- Mode A (simpler / less API load): one reconciler leader using a global lease (“reconciler leader election”), while per-instance leases protect execution.
- Mode B (fully distributed): all pods run reconciler; rely on Lease conflicts to prevent duplication (may require jitter/backoff + optional sharding).
Fabric8 + Quarkus integration details
- Use the Fabric8 client to read/create/update Leases (`coordination.k8s.io/v1`).
- Ensure the required RBAC Role/RoleBinding is generated/documented.
- Native image considerations: register Lease model classes for reflection if required.
Failure modes & correctness notes (to document)
- Execution semantics are at-least-once across task boundaries; tasks should be idempotent (or integrated with an outbox pattern) for re-runs to be safe.
- Recovery will resume from the last persisted checkpoint; a task may re-run if the crash happens after side effects but before checkpoint commit.
- Lease expiry is time-based; use conservative durations + renew frequency + jitter.
Deliverables
- New module(s): `quarkus-flow-durable` (+ optional `-kubernetes` split).
- SPI: `DurableStore`, `LockManager`, `Reconciler`.
- Runtime hook integration: checkpoint on task transition + resume path.
- Implementations:
- Redis store (quarkus-redis-client)
- (optional) PostgreSQL store (quarkus-jdbc / reactive-pg)
- Kubernetes Lease lock via Fabric8
- RBAC manifests and docs.
- Integration tests:
- Testcontainers (PG/Redis)
- Fabric8 mock server for Lease behaviors
- “kill pod mid-run” scenario (simulated), verifying takeover
Area(s)
- Runtime / Engine
- DSL / API
- Dev UI / Mermaid
- Persistence
- Messaging / CloudEvents
- Observability (Micrometer / OTEL)
- Docs & examples
- Agentic AI