Commit 1c381af

Merge pull request #86 from dmoliveira/my_opencode-e17-resume-policy
Define E17-T1 resume policy contract
2 parents 4b404cc + 65c7b37 commit 1c381af

4 files changed, +104 -6 lines changed

CHANGELOG.md

Lines changed: 2 additions & 0 deletions
@@ -52,6 +52,7 @@ All notable changes to this project are documented in this file.
 - Added `scripts/todo_enforcement.py` with deterministic todo transition/completion validation and remediation hint helpers for Epic 15 Task 15.2.
 - Added `scripts/todo_command.py` with `/todo status` and `/todo enforce` diagnostics for runtime compliance visibility.
 - Added `/todo`, `/todo-status`, and `/todo-enforce` aliases in `opencode.json`.
+- Added `instructions/resume_policy_model.md` defining interruption classes, resume eligibility/cool-down rules, attempt limits, escalation semantics, and deterministic reason codes for Epic 17 Task 17.1.

 ### Changes
 - Documented extension evaluation outcomes and when each tool is the better fit.
@@ -99,6 +100,7 @@ All notable changes to this project are documented in this file.
 - Updated `/start-work` execution to enforce todo compliance transitions, emit audit events, and block completion when required items remain unchecked.
 - Integrated todo compliance checks into `/doctor` summary, installer self-checks, and install-test smoke coverage.
 - Expanded selftest coverage for todo transition gating, completion blocking, and bypass audit-event payload validation.
+- Marked Epic 17 as in progress and completed Task 17.1 resume-policy definition notes in the roadmap.

 ## v0.2.0 - 2026-02-12

IMPLEMENTATION_ROADMAP.md

Lines changed: 7 additions & 6 deletions
@@ -53,7 +53,7 @@ Use this map to avoid overlapping implementations.
 | E14 | Plan-to-Execution Bridge Command | done | Medium | E2, E3 | bd-1z6, bd-2te, bd-3sg, bd-2bv | Execute validated plans with progress tracking |
 | E15 | Todo Enforcer and Plan Compliance | done | High | E14 | bd-l9c | Keep execution aligned with approved checklists |
 | E16 | Comment and Output Quality Checker Loop | merged | Medium | E23 | TBD | Merged into E23 (PR Review Copilot) |
-| E17 | Auto-Resume and Recovery Loop | planned | High | E11, E14 | TBD | Resume interrupted work from checkpoints safely |
+| E17 | Auto-Resume and Recovery Loop | in_progress | High | E11, E14 | bd-1ho | Resume interrupted work from checkpoints safely |
 | E18 | LSP/AST-Assisted Safe Edit Mode | planned | High | E3 | TBD | Prefer semantic edits over plain text replacements |
 | E19 | Session Checkpoint Snapshots | planned | Medium | E2, E17 | TBD | Durable state for rollback and restart safety |
 | E20 | Execution Budget Guardrails | planned | High | E2, E11 | TBD | Bound time/tool/token usage for autonomous runs |
@@ -638,15 +638,16 @@ Every command-oriented epic must ship all of the following:

 ## Epic 17 - Auto-Resume and Recovery Loop

-**Status:** `planned`
+**Status:** `in_progress`
 **Priority:** High
 **Goal:** Resume interrupted workflows from last valid checkpoint with explicit safety checks.
 **Depends on:** Epic 11, Epic 14

-- [ ] Task 17.1: Define resume policy
-  - [ ] Subtask 17.1.1: Define interruption classes (tool failure, timeout, context reset, crash)
-  - [ ] Subtask 17.1.2: Define resume eligibility and cool-down rules
-  - [ ] Subtask 17.1.3: Define max resume attempts and escalation path
+- [x] Task 17.1: Define resume policy
+  - [x] Subtask 17.1.1: Define interruption classes (tool failure, timeout, context reset, crash)
+  - [x] Subtask 17.1.2: Define resume eligibility and cool-down rules
+  - [x] Subtask 17.1.3: Define max resume attempts and escalation path
+  - [x] Notes: Added `instructions/resume_policy_model.md` with interruption classes, deterministic eligibility/cool-down/attempt-limit rules, reason codes, and audit event contract.
 - [ ] Task 17.2: Implement recovery engine
   - [ ] Subtask 17.2.1: Load last safe checkpoint and reconstruct state
   - [ ] Subtask 17.2.2: Re-run only idempotent or explicitly approved steps

README.md

Lines changed: 9 additions & 0 deletions
@@ -631,6 +631,15 @@ Compliant workflow pattern:
 - inspect `/todo status --json` for current state counts
 - gate handoff/closure with `/todo enforce --json`

+## Resume policy model
+
+Epic 17 Task 17.1 defines the baseline policy contract for safe auto-resume behavior:
+
+- policy spec: `instructions/resume_policy_model.md`
+- interruption classes: `tool_failure`, `timeout`, `context_reset`, `process_crash`
+- eligibility gate: checkpoint availability + idempotency + artifact readiness + attempt budget
+- safety controls: class-specific cool-down windows and escalation after max attempts
+
 ## Context resilience policy

 Epic 11 Task 11.1 defines the baseline policy schema for context-window resilience:

instructions/resume_policy_model.md

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
# Resume Policy Model

Epic 17 Task 17.1 defines the policy contract for safe, deterministic auto-resume behavior.

## Goals

- resume interrupted workflows from the latest known-safe checkpoint without repeating unsafe actions
- keep resume decisions explainable and auditable for operators and tooling
- prevent runaway retry loops through bounded attempts and escalation rules

## Interruption classes

Every interrupted run must be classified into one interruption class:

- `tool_failure`: an external command/tool exits non-zero without hard timeout
- `timeout`: an action exceeds configured wall-clock or tool-level timeout
- `context_reset`: execution context is lost or pruned before current step completes
- `process_crash`: orchestrator/runtime exits unexpectedly (panic, signal, crash)

Unknown classes must fail eligibility checks.
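
The four classes above map naturally onto an enumeration. The following is a minimal sketch, not code from this commit: `InterruptionClass` and `parse_interruption_class` are hypothetical names, and returning `None` for unknown input is one assumed way to let a caller fail the eligibility check.

```python
from enum import Enum
from typing import Optional


class InterruptionClass(str, Enum):
    """The four interruption classes named in the policy."""

    TOOL_FAILURE = "tool_failure"
    TIMEOUT = "timeout"
    CONTEXT_RESET = "context_reset"
    PROCESS_CRASH = "process_crash"


def parse_interruption_class(raw: str) -> Optional[InterruptionClass]:
    """Return the matching class, or None when the value is not in the policy."""
    try:
        return InterruptionClass(raw)
    except ValueError:
        # Unknown classes are not resumable; the caller should reject the run.
        return None
```

A caller could treat a `None` result as the `resume_unknown_interruption_class` outcome listed under the reason codes.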

## Resume eligibility rules

A run is resume-eligible only when all conditions hold:

1. A valid checkpoint exists with `status: in_progress` or `status: failed`.
2. The last attempted step is marked idempotent or explicitly resume-approved.
3. Required artifacts (plan snapshot, runtime state, and transition history) are readable.
4. Current resume attempt count is below `max_resume_attempts`.

If any condition fails, auto-resume must not execute and must emit a deterministic reason code.
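
The four conditions can be read as an ordered gate that returns one reason code. This sketch is illustrative only: `RunSnapshot`, its field names, and `evaluate_eligibility` are hypothetical, and the evaluation order is an assumption because the policy does not say which failure wins when several conditions fail at once. The reason-code strings are the ones this file lists.

```python
from dataclasses import dataclass
from typing import Optional

KNOWN_CLASSES = {"tool_failure", "timeout", "context_reset", "process_crash"}
RESUMABLE_STATUSES = {"in_progress", "failed"}


@dataclass
class RunSnapshot:
    """Hypothetical view of an interrupted run; field names are assumptions."""

    interruption_class: str
    checkpoint_status: Optional[str]  # None when no checkpoint exists
    last_step_idempotent: bool
    last_step_resume_approved: bool
    artifacts_readable: bool
    attempt_count: int
    max_resume_attempts: int = 3


def evaluate_eligibility(run: RunSnapshot) -> str:
    """Return the first failing reason code, or resume_allowed."""
    if run.interruption_class not in KNOWN_CLASSES:
        return "resume_unknown_interruption_class"
    if run.checkpoint_status not in RESUMABLE_STATUSES:
        return "resume_missing_checkpoint"
    if not (run.last_step_idempotent or run.last_step_resume_approved):
        return "resume_non_idempotent_step"
    if not run.artifacts_readable:
        return "resume_missing_runtime_artifacts"
    if run.attempt_count >= run.max_resume_attempts:
        return "resume_attempt_limit_reached"
    return "resume_allowed"
```

Returning the first failure keeps the outcome deterministic for a given snapshot. Cool-down is deliberately left out here because it depends on the clock rather than on run state.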

## Cool-down policy

Resume attempts must respect cool-down windows:

- `tool_failure`: 30 seconds
- `timeout`: 120 seconds
- `context_reset`: 10 seconds
- `process_crash`: 60 seconds

During cool-down, status commands should return `resume_blocked_cooldown` with remaining seconds.
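
A small helper shows how the remaining seconds could be derived from the windows above. It is a sketch under stated assumptions: `cooldown_remaining` is a hypothetical name, timestamps are plain epoch seconds, and the window is assumed to start at the moment of interruption, which the policy does not specify.

```python
import math

# Cool-down windows in seconds, copied from the policy list above.
COOLDOWN_SECONDS = {
    "tool_failure": 30,
    "timeout": 120,
    "context_reset": 10,
    "process_crash": 60,
}


def cooldown_remaining(interruption_class: str, interrupted_at: float, now: float) -> int:
    """Whole seconds left in the cool-down window, never negative.

    Raises KeyError for a class outside the policy, since unknown classes
    are not resumable in the first place.
    """
    window = COOLDOWN_SECONDS[interruption_class]
    elapsed = now - interrupted_at
    return max(0, math.ceil(window - elapsed))
```

For example, a `timeout` interruption checked 33 seconds later leaves 87 seconds, so a status command would report `resume_blocked_cooldown` until the helper returns 0.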

## Max attempts and escalation

- default `max_resume_attempts`: 3 per run
- after exceeding max attempts, set run state to `resume_escalated`
- escalation must require explicit operator action (for example `/resume now --force` in a future task)
- escalation output must include a remediation checklist and latest failure context
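
The attempt budget and escalation step could be tracked as below. This is a hedged sketch: `RunState`, `record_failed_resume`, `escalation_output`, the `interrupted` state name, the payload keys, and the checklist wording are all hypothetical. It also assumes escalation triggers once the attempt count reaches `max_resume_attempts`, which matches the eligibility rule that the count must stay below the limit.

```python
from dataclasses import dataclass

DEFAULT_MAX_RESUME_ATTEMPTS = 3  # default from the policy


@dataclass
class RunState:
    """Hypothetical per-run counters; names are assumptions."""

    run_id: str
    state: str = "interrupted"  # assumed pre-escalation state name
    resume_attempts: int = 0
    max_resume_attempts: int = DEFAULT_MAX_RESUME_ATTEMPTS


def record_failed_resume(run: RunState) -> RunState:
    """Count one failed resume and escalate once the budget is used up."""
    run.resume_attempts += 1
    if run.resume_attempts >= run.max_resume_attempts:
        run.state = "resume_escalated"
    return run


def escalation_output(run: RunState, latest_failure: str) -> dict:
    """Build an operator-facing payload with checklist and failure context."""
    return {
        "run_id": run.run_id,
        "state": run.state,
        "reason_code": "resume_attempt_limit_reached",
        "latest_failure": latest_failure,
        "remediation_checklist": [
            "inspect the latest failure context",
            "confirm the checkpoint is still valid",
            "resume manually with explicit operator approval",
        ],
    }
```

Nothing in this sketch resumes an escalated run automatically, in line with the requirement that escalation needs explicit operator action.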

## Deterministic reason codes

Resume policy outcomes should use machine-readable reason codes:

- `resume_allowed`
- `resume_missing_checkpoint`
- `resume_unknown_interruption_class`
- `resume_non_idempotent_step`
- `resume_missing_runtime_artifacts`
- `resume_blocked_cooldown`
- `resume_attempt_limit_reached`
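
Keeping the codes in one enumeration makes it harder for tooling to emit a value outside the contract. The sketch below is an assumption about how a consumer might model the list: `ResumeReason`, its `eligible` property, and `parse_reason` are hypothetical names.

```python
from enum import Enum


class ResumeReason(str, Enum):
    """The seven reason codes listed in the policy."""

    ALLOWED = "resume_allowed"
    MISSING_CHECKPOINT = "resume_missing_checkpoint"
    UNKNOWN_INTERRUPTION_CLASS = "resume_unknown_interruption_class"
    NON_IDEMPOTENT_STEP = "resume_non_idempotent_step"
    MISSING_RUNTIME_ARTIFACTS = "resume_missing_runtime_artifacts"
    BLOCKED_COOLDOWN = "resume_blocked_cooldown"
    ATTEMPT_LIMIT_REACHED = "resume_attempt_limit_reached"

    @property
    def eligible(self) -> bool:
        """Only resume_allowed permits auto-resume; every other code blocks it."""
        return self is ResumeReason.ALLOWED


def parse_reason(code: str) -> ResumeReason:
    """Validate a raw code; raises ValueError for codes outside the contract."""
    return ResumeReason(code)
```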

## Audit event contract

Resume-relevant decisions should emit append-only audit events:

```json
{
  "event": "resume_decision",
  "run_id": "run-2026-02-13-01",
  "interruption_class": "timeout",
  "eligible": false,
  "reason_code": "resume_blocked_cooldown",
  "cooldown_seconds_remaining": 87,
  "attempt": 2,
  "max_attempts": 3,
  "at": "2026-02-13T12:20:00Z",
  "actor": "system"
}
```
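
An event with this shape could be built and appended as one JSON object per line. This is a sketch under assumptions, not the repository's implementation: `build_resume_decision_event` and `append_audit_event` are hypothetical names, the JSON Lines storage format is assumed, and `cooldown_seconds_remaining` is treated as present only when a cool-down applies.

```python
import json
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def build_resume_decision_event(
    run_id: str,
    interruption_class: str,
    reason_code: str,
    attempt: int,
    max_attempts: int,
    cooldown_seconds_remaining: Optional[int] = None,
    actor: str = "system",
    at: Optional[datetime] = None,
) -> dict:
    """Build a resume_decision event using the field names from the example above."""
    moment = at or datetime.now(timezone.utc)
    event = {
        "event": "resume_decision",
        "run_id": run_id,
        "interruption_class": interruption_class,
        "eligible": reason_code == "resume_allowed",
        "reason_code": reason_code,
        "attempt": attempt,
        "max_attempts": max_attempts,
        "at": moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "actor": actor,
    }
    if cooldown_seconds_remaining is not None:
        event["cooldown_seconds_remaining"] = cooldown_seconds_remaining
    return event


def append_audit_event(path: Path, event: dict) -> None:
    """Append one event as a single JSON line; existing lines are never rewritten."""
    with path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps(event, sort_keys=True) + "\n")
```

Opening the file in append mode is what keeps the log append-only at the application level. It does not stop another process from editing the file.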

## Integration targets

- Task 17.2 should implement runtime eligibility and cooldown/attempt counters from this contract.
- Task 17.3 should expose `/resume` command outputs using these reason codes.
- Task 17.4 should validate interruption-class coverage, cooldown enforcement, and escalation behavior.
