Merge pull request #86 from dmoliveira/my_opencode-e17-resume-policy

dmoliveira · web-flow · commit 1c381aff0c85 · 2026-02-13T22:26:38.000+11:00
Define E17-T1 resume policy contract
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -52,6 +52,7 @@ All notable changes to this project are documented in this file.
 - Added `scripts/todo_enforcement.py` with deterministic todo transition/completion validation and remediation hint helpers for Epic 15 Task 15.2.
 - Added `scripts/todo_command.py` with `/todo status` and `/todo enforce` diagnostics for runtime compliance visibility.
 - Added `/todo`, `/todo-status`, and `/todo-enforce` aliases in `opencode.json`.
+- Added `instructions/resume_policy_model.md` defining interruption classes, resume eligibility/cool-down rules, attempt limits, escalation semantics, and deterministic reason codes for Epic 17 Task 17.1.
 
 ### Changes
 - Documented extension evaluation outcomes and when each tool is the better fit.
@@ -99,6 +100,7 @@ All notable changes to this project are documented in this file.
 - Updated `/start-work` execution to enforce todo compliance transitions, emit audit events, and block completion when required items remain unchecked.
 - Integrated todo compliance checks into `/doctor` summary, installer self-checks, and install-test smoke coverage.
 - Expanded selftest coverage for todo transition gating, completion blocking, and bypass audit-event payload validation.
+- Marked Epic 17 as in progress and recorded Task 17.1 (resume-policy definition) as complete, with notes, in the roadmap.
 
 ## v0.2.0 - 2026-02-12
 
diff --git a/IMPLEMENTATION_ROADMAP.md b/IMPLEMENTATION_ROADMAP.md
@@ -53,7 +53,7 @@ Use this map to avoid overlapping implementations.
 | E14 | Plan-to-Execution Bridge Command | done | Medium | E2, E3 | bd-1z6, bd-2te, bd-3sg, bd-2bv | Execute validated plans with progress tracking |
 | E15 | Todo Enforcer and Plan Compliance | done | High | E14 | bd-l9c | Keep execution aligned with approved checklists |
 | E16 | Comment and Output Quality Checker Loop | merged | Medium | E23 | TBD | Merged into E23 (PR Review Copilot) |
-| E17 | Auto-Resume and Recovery Loop | planned | High | E11, E14 | TBD | Resume interrupted work from checkpoints safely |
+| E17 | Auto-Resume and Recovery Loop | in_progress | High | E11, E14 | bd-1ho | Resume interrupted work from checkpoints safely |
 | E18 | LSP/AST-Assisted Safe Edit Mode | planned | High | E3 | TBD | Prefer semantic edits over plain text replacements |
 | E19 | Session Checkpoint Snapshots | planned | Medium | E2, E17 | TBD | Durable state for rollback and restart safety |
 | E20 | Execution Budget Guardrails | planned | High | E2, E11 | TBD | Bound time/tool/token usage for autonomous runs |
@@ -638,15 +638,16 @@ Every command-oriented epic must ship all of the following:
 
 ## Epic 17 - Auto-Resume and Recovery Loop
 
-**Status:** `planned`
+**Status:** `in_progress`
 **Priority:** High
 **Goal:** Resume interrupted workflows from last valid checkpoint with explicit safety checks.
 **Depends on:** Epic 11, Epic 14
 
-- [ ] Task 17.1: Define resume policy
-  - [ ] Subtask 17.1.1: Define interruption classes (tool failure, timeout, context reset, crash)
-  - [ ] Subtask 17.1.2: Define resume eligibility and cool-down rules
-  - [ ] Subtask 17.1.3: Define max resume attempts and escalation path
+- [x] Task 17.1: Define resume policy
+  - [x] Subtask 17.1.1: Define interruption classes (tool failure, timeout, context reset, crash)
+  - [x] Subtask 17.1.2: Define resume eligibility and cool-down rules
+  - [x] Subtask 17.1.3: Define max resume attempts and escalation path
+  - [x] Notes: Added `instructions/resume_policy_model.md` with interruption classes, deterministic eligibility/cool-down/attempt-limit rules, reason codes, and audit event contract.
 - [ ] Task 17.2: Implement recovery engine
   - [ ] Subtask 17.2.1: Load last safe checkpoint and reconstruct state
   - [ ] Subtask 17.2.2: Re-run only idempotent or explicitly approved steps
diff --git a/README.md b/README.md
@@ -631,6 +631,15 @@ Compliant workflow pattern:
 - inspect `/todo status --json` for current state counts
 - gate handoff/closure with `/todo enforce --json`
 
+## Resume policy model
+
+Epic 17 Task 17.1 defines the baseline policy contract for safe auto-resume behavior:
+
+- policy spec: `instructions/resume_policy_model.md`
+- interruption classes: `tool_failure`, `timeout`, `context_reset`, `process_crash`
+- eligibility gate: checkpoint availability + idempotency + artifact readiness + attempt budget
+- safety controls: class-specific cool-down windows and escalation after max attempts
+
 ## Context resilience policy
 
 Epic 11 Task 11.1 defines the baseline policy schema for context-window resilience:
diff --git a/instructions/resume_policy_model.md b/instructions/resume_policy_model.md
@@ -0,0 +1,86 @@
+# Resume Policy Model
+
+Epic 17 Task 17.1 defines the policy contract for safe, deterministic auto-resume behavior.
+
+## Goals
+
+- resume interrupted workflows from the latest known-safe checkpoint without repeating unsafe actions
+- keep resume decisions explainable and auditable for operators and tooling
+- prevent runaway retry loops through bounded attempts and escalation rules
+
+## Interruption classes
+
+Every interrupted run must be classified into one interruption class:
+
+- `tool_failure`: an external command/tool exits non-zero without hitting a hard timeout
+- `timeout`: an action exceeds the configured wall-clock or tool-level timeout
+- `context_reset`: execution context is lost or pruned before the current step completes
+- `process_crash`: orchestrator/runtime exits unexpectedly (panic, signal, crash)
+
+Unknown classes must fail eligibility checks.
+
+## Resume eligibility rules
+
+A run is resume-eligible only when all conditions hold:
+
+1. A valid checkpoint exists with `status: in_progress` or `status: failed`.
+2. The last attempted step is marked idempotent or explicitly resume-approved.
+3. Required artifacts (plan snapshot, runtime state, and transition history) are readable.
+4. Current resume attempt count is below `max_resume_attempts`.
+
+If any condition fails, auto-resume must not execute and must emit a deterministic reason code.
+
+## Cool-down policy
+
+Resume attempts must respect cool-down windows:
+
+- `tool_failure`: 30 seconds
+- `timeout`: 120 seconds
+- `context_reset`: 10 seconds
+- `process_crash`: 60 seconds
+
+During cool-down, status commands should return `resume_blocked_cooldown` with remaining seconds.
+
+## Max attempts and escalation
+
+- default `max_resume_attempts`: 3 per run
+- once the resume attempt count reaches `max_resume_attempts`, set run state to `resume_escalated`
+- escalation must require explicit operator action (for example `/resume now --force` in a future task)
+- escalation output must include a remediation checklist and latest failure context
+
+## Deterministic reason codes
+
+Resume policy outcomes should use machine-readable reason codes:
+
+- `resume_allowed`
+- `resume_missing_checkpoint`
+- `resume_unknown_interruption_class`
+- `resume_non_idempotent_step`
+- `resume_missing_runtime_artifacts`
+- `resume_blocked_cooldown`
+- `resume_attempt_limit_reached`
+
+## Audit event contract
+
+Resume-relevant decisions should emit append-only audit events:
+
+```json
+{
+  "event": "resume_decision",
+  "run_id": "run-2026-02-13-01",
+  "interruption_class": "timeout",
+  "eligible": false,
+  "reason_code": "resume_blocked_cooldown",
+  "cooldown_seconds_remaining": 87,
+  "attempt": 2,
+  "max_attempts": 3,
+  "at": "2026-02-13T12:20:00Z",
+  "actor": "system"
+}
+```
+
+## Integration targets
+
+- Task 17.2 should implement runtime eligibility and cool-down/attempt counters from this contract.
+- Task 17.3 should expose `/resume` command outputs using these reason codes.
+- Task 17.4 should validate interruption-class coverage, cool-down enforcement, and escalation behavior.
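
Reviewer sketch (not part of this commit): the eligibility, cool-down, and attempt-limit rules in `instructions/resume_policy_model.md` could be evaluated roughly as below. The cool-down values, default attempt limit, checkpoint statuses, and reason codes come from the policy text. Everything else is an assumption made for illustration: the `RunSnapshot` type and its field names, the `evaluate_resume` function, the order in which checks are applied (the policy does not define a priority between reason codes), and the reading of `attempts` as the number of resume attempts already used.

```python
import math
from dataclasses import dataclass
from typing import Optional

# Cool-down windows in seconds, taken from the "Cool-down policy" section.
COOLDOWN_SECONDS = {
    "tool_failure": 30,
    "timeout": 120,
    "context_reset": 10,
    "process_crash": 60,
}
DEFAULT_MAX_RESUME_ATTEMPTS = 3
RESUMABLE_CHECKPOINT_STATUSES = {"in_progress", "failed"}


@dataclass
class RunSnapshot:
    """Hypothetical view of an interrupted run; all field names are assumptions."""
    interruption_class: str
    checkpoint_status: Optional[str]  # None when no checkpoint exists
    last_step_idempotent: bool
    last_step_resume_approved: bool
    artifacts_readable: bool
    attempts: int  # resume attempts already used (assumed meaning)
    seconds_since_interruption: float


def evaluate_resume(run, max_attempts=DEFAULT_MAX_RESUME_ATTEMPTS):
    """Return a decision dict with `eligible` and `reason_code`.

    The check order below is an assumption, not something the policy fixes.
    """
    def deny(code, **extra):
        return {"eligible": False, "reason_code": code, **extra}

    # Unknown interruption classes must fail eligibility checks.
    if run.interruption_class not in COOLDOWN_SECONDS:
        return deny("resume_unknown_interruption_class")
    # Rule 1: a checkpoint with status in_progress or failed must exist.
    if run.checkpoint_status not in RESUMABLE_CHECKPOINT_STATUSES:
        return deny("resume_missing_checkpoint")
    # Rule 2: last step is idempotent or explicitly resume-approved.
    if not (run.last_step_idempotent or run.last_step_resume_approved):
        return deny("resume_non_idempotent_step")
    # Rule 3: plan snapshot, runtime state, and transition history are readable.
    if not run.artifacts_readable:
        return deny("resume_missing_runtime_artifacts")
    # Rule 4: attempt count must be below the limit.
    if run.attempts >= max_attempts:
        return deny("resume_attempt_limit_reached")
    # Class-specific cool-down window.
    remaining = COOLDOWN_SECONDS[run.interruption_class] - run.seconds_since_interruption
    if remaining > 0:
        return deny("resume_blocked_cooldown",
                    cooldown_seconds_remaining=math.ceil(remaining))
    return {"eligible": True, "reason_code": "resume_allowed"}
```

For example, a `timeout` run evaluated 33 seconds after the interruption would be denied with `resume_blocked_cooldown` and 87 seconds remaining, which lines up with the sample audit event in the policy file. Returning a plain dict keeps the result easy to merge into a `resume_decision` audit payload, though Task 17.2 may well choose a different shape.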