Async, concurrent parallel-branch execution for the workflow runner (suspend/resume, Action-Scheduler-backed)
Problem
The workflow runner is synchronous and terminal-status-gated. WP_Agent_Workflow_Runner::run() walks steps in a foreach (class-wp-agent-workflow-runner.php:204-241), and the gate at :224 (STATUS_SUCCEEDED !== $record['status']) has no "step not done yet, come back later" path. The parallel step is fanout orchestration, not concurrency: run_parallel_roles()/run_parallel_map() execute branches in a blocking in-process foreach (:785, :692) with zero Action Scheduler use. So N branches that each make a slow AI call run strictly serially inside one PHP request. This was found in real usage: a multi-page generation with 4 branches × ~120s runs serially for roughly 480s, and the aggregator times out.
Design (locked decisions)
- Suspend/resume run model. A parallel step can return a _suspend directive; the runner persists a suspension frame, returns STATUS_SUSPENDED, and a single reconcile entry point (agents_reconcile_workflow_branch()) resumes the run from the suspended step once all branches complete.
- Async REQUIRES Action Scheduler. AS supplies, without a new table, the two things async needs: (a) durable branch persistence (descriptors in the AS action payload / AS's own tables) and (b) a cross-process atomic claim (AS claims each action exactly once), which is the compare-and-swap (CAS) that the "last branch resumes exactly once" transition needs. The resume itself is an atomically-claimed AS action.
- No Action Scheduler → SYNCHRONOUS. Byte-for-byte the v0.5.0 in-process loops. Correct and serial. Posture: concurrent when AS is present; correct-and-synchronous otherwise.
- HARD CONSTRAINT: no new database tables, ever. AS's tables + one bounded, non-autoloaded run_id-keyed row (deleted on resume) hold all durable state.
- One executor interface (dispatch/are_all_complete/collect), selected via wp_agent_workflow_step_executor (AS present → AS executor; absent → sync). A caller MAY register its own executor at its own risk (owns durability+atomicity); core ships and supports only the AS executor and the sync default.
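The executor contract and the suspend directive described in the list above could look roughly like the sketch below. Only the method names dispatch/are_all_complete/collect and the _suspend key come from this issue; the interface name, the signatures, the status strings, and both executor classes are assumptions made for illustration, and everything lives in memory rather than in Action Scheduler.

```php
<?php
// Hypothetical interface name and signatures; only the three method names are from the issue.
interface Sketch_Step_Executor {
	// Starts the branches. Returns true when they were deferred (async), false when run inline.
	public function dispatch( string $run_id, array $branches ): bool;
	public function are_all_complete( string $run_id ): bool;
	public function collect( string $run_id ): array;
}

// Sync default: every branch runs in-process, serially, at dispatch time.
final class Sketch_Sync_Executor implements Sketch_Step_Executor {
	private $results = array();

	public function dispatch( string $run_id, array $branches ): bool {
		foreach ( $branches as $key => $branch ) {
			$this->results[ $run_id ][ $key ] = $branch();
		}
		return false; // Nothing was deferred.
	}

	public function are_all_complete( string $run_id ): bool {
		return isset( $this->results[ $run_id ] );
	}

	public function collect( string $run_id ): array {
		return $this->results[ $run_id ] ?? array();
	}
}

// In-memory stand-in for an async executor: a branch finishes only when
// complete_branch() is called, mimicking a background worker.
final class Sketch_Deferred_Executor implements Sketch_Step_Executor {
	private $pending = array();
	private $results = array();

	public function dispatch( string $run_id, array $branches ): bool {
		$this->pending[ $run_id ] = $branches;
		$this->results[ $run_id ] = array();
		return true;
	}

	public function complete_branch( string $run_id, string $key ): void {
		$this->results[ $run_id ][ $key ] = ( $this->pending[ $run_id ][ $key ] )();
		unset( $this->pending[ $run_id ][ $key ] );
	}

	public function are_all_complete( string $run_id ): bool {
		return isset( $this->pending[ $run_id ] ) && array() === $this->pending[ $run_id ];
	}

	public function collect( string $run_id ): array {
		return $this->results[ $run_id ] ?? array();
	}
}

// A parallel step either finishes inline or hands back a _suspend directive.
// The 'suspended' / 'succeeded' strings are placeholders for the real status constants.
function sketch_run_parallel_step( Sketch_Step_Executor $executor, string $run_id, array $branches ): array {
	$deferred = $executor->dispatch( $run_id, $branches );
	if ( $deferred && ! $executor->are_all_complete( $run_id ) ) {
		return array( 'status' => 'suspended', '_suspend' => true );
	}
	return array( 'status' => 'succeeded', 'output' => $executor->collect( $run_id ) );
}
```

With the sync executor the step returns its output in one call; with the deferred executor the first call suspends, and the output becomes collectable only after every branch has been completed.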
Backward compatibility
No-AS installs behave exactly as v0.5.0 (serial, single terminal run() return, no SUSPENDED). The sync loops are the shared branch core the async path also reuses per branch — zero divergence. No consumer is forced to change.
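The "concurrent when AS is present, synchronous otherwise" posture might be expressed as a selector like the one below. This is a sketch under stated assumptions: the function names are hypothetical, and passing a plain string through the wp_agent_workflow_step_executor filter is an assumed value shape, not something this issue specifies. as_enqueue_async_action() is used only as a presence check for Action Scheduler.

```php
<?php
// Hypothetical helper: picks the default executor kind from what is installed.
function sketch_default_executor_kind(): string {
	// as_enqueue_async_action() is part of Action Scheduler's public API,
	// so its existence is used here as the "AS present" signal.
	return function_exists( 'as_enqueue_async_action' ) ? 'action_scheduler' : 'sync';
}

// Hypothetical selector: lets a caller override the default through the filter.
function sketch_select_executor_kind(): string {
	$kind = sketch_default_executor_kind();
	if ( function_exists( 'apply_filters' ) ) {
		// Filter name is from the issue; filtering a string is an assumption.
		$kind = (string) apply_filters( 'wp_agent_workflow_step_executor', $kind );
	}
	return $kind;
}
```

Run outside WordPress, or on an install without Action Scheduler and without a registered filter, the selector falls through to the sync default, which is the backward-compatible path.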
Phasing
- Phase 1 (this issue's first PR): STATUS_SUSPENDED + _suspend sentinel + suspension frame + resumable runner loop + resume() + executor interface + agents_reconcile_workflow_branch() + extract shared run_branch_steps() + core selector (no async executor → v0.5.0 sync). Ships safely: no behavior change on existing installs. Tested via a filter-registered FakeExecutor driving the REAL suspend→reconcile→resume (no AS, no AI, no shape-mocks).
- Phase 2: the Action Scheduler branch executor (one action/branch, atomically-claimed resume, table-free frame round-trip, exactly-once-resume race test under AS's claim).
- Phase 3: consumer adoption (multi-page site builder, codebox fan-out islands) — spec-only, no core change.
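The table-free frame round-trip named in Phase 2 can be illustrated with a minimal sketch. Everything here is hypothetical: the key prefix, the 64 KiB bound, the JSON encoding, and the in-memory store that stands in for the single non-autoloaded run_id-keyed row. It shows the intended shape only (bounded write, read once, delete on resume).

```php
<?php
// Assumed size bound for a frame; the issue only says "bounded".
const SKETCH_FRAME_MAX_BYTES = 65536;

// In-memory stand-in for the one non-autoloaded row; not real persistence.
final class Sketch_Row_Store {
	private $rows = array();

	public function set( string $key, string $value ): void {
		$this->rows[ $key ] = $value;
	}

	public function get( string $key ): ?string {
		return $this->rows[ $key ] ?? null;
	}

	public function delete( string $key ): void {
		unset( $this->rows[ $key ] );
	}
}

// Hypothetical key scheme: one row per run_id.
function sketch_frame_key( string $run_id ): string {
	return 'sketch_workflow_frame_' . $run_id;
}

// Persists the suspension frame, refusing anything over the bound.
function sketch_save_frame( Sketch_Row_Store $store, string $run_id, array $frame ): void {
	$json = json_encode( $frame );
	if ( false === $json || strlen( $json ) > SKETCH_FRAME_MAX_BYTES ) {
		throw new RuntimeException( 'Frame is not serializable or exceeds the size bound.' );
	}
	$store->set( sketch_frame_key( $run_id ), $json );
}

// Reads the frame for resume and deletes the row in the same call.
function sketch_take_frame( Sketch_Row_Store $store, string $run_id ): ?array {
	$key  = sketch_frame_key( $run_id );
	$json = $store->get( $key );
	if ( null === $json ) {
		return null;
	}
	$store->delete( $key );
	return json_decode( $json, true );
}
```

Taking the frame a second time returns null, which is one way a test could assert that the row is gone after resume.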
Testing (non-negotiable)
Tests drive the real runner suspend/resume path, never shape-mocks. This gap is why the prior consumer migration shipped broken (a stub blessed a wrong method name; smoke tests only asserted structure). Required coverage includes exactly-once resume under simultaneous branch finish, crash-resume durability, and table-free frame round-trip (assert no new table is created).
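The exactly-once-resume requirement can be made concrete with a small sketch. It assumes a hypothetical reconcile function and an in-memory claim flag standing in for Action Scheduler's atomic claim; a single-process flag cannot demonstrate cross-process atomicity, so this shows only the logic a real race test would exercise against AS.

```php
<?php
// In-memory stand-in for the atomic claim. NOT cross-process safe.
final class Sketch_Claim_Store {
	private $claimed = array();

	// Returns true for the first caller per run_id, false for every later one.
	public function claim( string $run_id ): bool {
		if ( isset( $this->claimed[ $run_id ] ) ) {
			return false;
		}
		$this->claimed[ $run_id ] = true;
		return true;
	}
}

// Hypothetical run state: which branches exist, which finished, how often we resumed.
final class Sketch_Run {
	public $run_id;
	public $branch_keys;
	public $done         = array();
	public $resume_count = 0;

	public function __construct( string $run_id, array $branch_keys ) {
		$this->run_id      = $run_id;
		$this->branch_keys = $branch_keys;
	}

	public function all_complete(): bool {
		return array() === array_diff( $this->branch_keys, array_keys( $this->done ) );
	}
}

// Hypothetical reconcile step: record the branch result, then resume only if
// every branch is done AND this caller wins the claim.
function sketch_reconcile_branch( Sketch_Run $run, Sketch_Claim_Store $claims, string $branch_key, $result ): bool {
	$run->done[ $branch_key ] = $result;
	if ( ! $run->all_complete() ) {
		return false;
	}
	if ( ! $claims->claim( $run->run_id ) ) {
		return false; // Another reconcile call already resumed this run.
	}
	$run->resume_count++;
	return true;
}
```

A duplicate or late reconcile call for an already-finished branch finds the claim taken and returns false, so the resume counter stays at one.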
Full design doc (state machine, reconcile algorithm, file:line grounding, open questions) available on request.
AI assistance
- AI assistance: Yes
- Tool(s): Claude Code (Claude Opus 4.8)
- Used for: Architecture investigation (found the synchronous-runner limitation in real usage) and drafting this design.