| 1 | +# Resume Policy Model |
| 2 | + |
| 3 | +Epic 17 Task 17.1 defines the policy contract for safe, deterministic auto-resume behavior. |
| 4 | + |
| 5 | +## Goals |
| 6 | + |
| 7 | +- resume interrupted workflows from the latest known-safe checkpoint without repeating unsafe actions |
| 8 | +- keep resume decisions explainable and auditable for operators and tooling |
| 9 | +- prevent runaway retry loops through bounded attempts and escalation rules |
| 10 | + |
| 11 | +## Interruption classes |
| 12 | + |
| 13 | +Every interrupted run must be classified into one interruption class: |
| 14 | + |
| 15 | +- `tool_failure`: an external command or tool exits non-zero without hitting a hard timeout |
| 16 | +- `timeout`: an action exceeds configured wall-clock or tool-level timeout |
| 17 | +- `context_reset`: execution context is lost or pruned before the current step completes |
| 18 | +- `process_crash`: orchestrator/runtime exits unexpectedly (panic, signal, crash) |
| 19 | + |
| 20 | +Unknown classes must fail eligibility checks. |
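A minimal Python sketch of the classification above, assuming the class identifiers are carried as plain strings. The names `InterruptionClass` and `parse_interruption_class` are hypothetical and not part of the contract; the point is only that an unrecognized value maps to "no class", so the eligibility check can reject it.

```python
from enum import Enum
from typing import Optional


class InterruptionClass(Enum):
    """The four interruption classes listed in this document."""

    TOOL_FAILURE = "tool_failure"
    TIMEOUT = "timeout"
    CONTEXT_RESET = "context_reset"
    PROCESS_CRASH = "process_crash"


def parse_interruption_class(raw: str) -> Optional[InterruptionClass]:
    """Return the matching class, or None when the value is unknown.

    Hypothetical helper: callers are expected to treat None as a failed
    eligibility check rather than guessing a class.
    """
    try:
        return InterruptionClass(raw)
    except ValueError:
        return None
```

Returning `None` instead of raising keeps the "unknown class" case on the normal decision path, where it can be turned into a reason code.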
| 21 | + |
| 22 | +## Resume eligibility rules |
| 23 | + |
| 24 | +A run is resume-eligible only when all conditions hold: |
| 25 | + |
| 26 | +1. A valid checkpoint exists with `status: in_progress` or `status: failed`. |
| 27 | +2. The last attempted step is marked idempotent or explicitly resume-approved. |
| 28 | +3. Required artifacts (plan snapshot, runtime state, and transition history) are readable. |
| 29 | +4. Current resume attempt count is below `max_resume_attempts`. |
| 30 | + |
| 31 | +If any condition fails, auto-resume must not execute and must emit a deterministic reason code. |
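The conditions above could be evaluated roughly as follows. This is a sketch under stated assumptions: `RunSnapshot` and `evaluate_eligibility` are hypothetical names, the reason codes are the ones listed later in this document, and the order in which failures are reported is an assumption, since the contract only requires that the result be deterministic.

```python
from dataclasses import dataclass
from typing import Optional

KNOWN_CLASSES = {"tool_failure", "timeout", "context_reset", "process_crash"}
RESUMABLE_STATUSES = {"in_progress", "failed"}


@dataclass
class RunSnapshot:
    """Hypothetical view of the inputs the eligibility rules need."""

    interruption_class: str
    checkpoint_status: Optional[str]  # None when no valid checkpoint exists
    last_step_idempotent: bool
    last_step_resume_approved: bool
    artifacts_readable: bool  # plan snapshot, runtime state, transition history
    resume_attempts: int
    max_resume_attempts: int = 3


def evaluate_eligibility(run: RunSnapshot) -> str:
    """Return a reason code; checks run in a fixed (assumed) order."""
    if run.interruption_class not in KNOWN_CLASSES:
        return "resume_unknown_interruption_class"
    if run.checkpoint_status not in RESUMABLE_STATUSES:
        return "resume_missing_checkpoint"
    if not (run.last_step_idempotent or run.last_step_resume_approved):
        return "resume_non_idempotent_step"
    if not run.artifacts_readable:
        return "resume_missing_runtime_artifacts"
    if run.resume_attempts >= run.max_resume_attempts:
        return "resume_attempt_limit_reached"
    return "resume_allowed"
```

Because the checks run in a fixed order, the same inputs always produce the same reason code, even when several conditions fail at once.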
| 32 | + |
| 33 | +## Cool-down policy |
| 34 | + |
| 35 | +Resume attempts must respect cool-down windows: |
| 36 | + |
| 37 | +- `tool_failure`: 30 seconds |
| 38 | +- `timeout`: 120 seconds |
| 39 | +- `context_reset`: 10 seconds |
| 40 | +- `process_crash`: 60 seconds |
| 41 | + |
| 42 | +During cool-down, status commands should return `resume_blocked_cooldown` with remaining seconds. |
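As an illustration, the remaining cool-down could be computed like this. The sketch assumes the window starts at the moment of interruption, which this document does not specify; `COOLDOWN_SECONDS` and `cooldown_remaining` are hypothetical names.

```python
import math
from datetime import datetime, timezone

# Cool-down windows from the table above, in seconds.
COOLDOWN_SECONDS = {
    "tool_failure": 30,
    "timeout": 120,
    "context_reset": 10,
    "process_crash": 60,
}


def cooldown_remaining(
    interruption_class: str, interrupted_at: datetime, now: datetime
) -> int:
    """Whole seconds left in the cool-down window, or 0 once it has passed.

    Assumes the window is measured from `interrupted_at`. Raises KeyError
    for an unknown class, which should already have failed eligibility.
    """
    window = COOLDOWN_SECONDS[interruption_class]
    elapsed = (now - interrupted_at).total_seconds()
    return max(0, math.ceil(window - elapsed))


# Example: a timeout 33 seconds ago leaves 87 seconds of cool-down.
interrupted = datetime(2026, 2, 13, 12, 19, 27, tzinfo=timezone.utc)
checked = datetime(2026, 2, 13, 12, 20, 0, tzinfo=timezone.utc)
remaining = cooldown_remaining("timeout", interrupted, checked)
```

A non-zero result corresponds to `resume_blocked_cooldown`; rounding up avoids reporting 0 seconds while the window is still open.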
| 43 | + |
| 44 | +## Max attempts and escalation |
| 45 | + |
| 46 | +- default `max_resume_attempts`: 3 per run |
| 47 | +- once the resume attempt count reaches `max_resume_attempts`, set run state to `resume_escalated` |
| 48 | +- escalation must require explicit operator action (for example `/resume now --force` in a future task) |
| 49 | +- escalation output must include a remediation checklist and latest failure context |
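One possible shape for the attempt counter, as a sketch only. It assumes escalation happens when a failed resume brings the count up to the limit, and it uses a hypothetical `interrupted` state name for runs that are not yet escalated; only `resume_escalated` comes from this document.

```python
from dataclasses import dataclass

DEFAULT_MAX_RESUME_ATTEMPTS = 3


@dataclass
class ResumeCounter:
    """Hypothetical per-run counter for bounded resume attempts."""

    attempts: int = 0
    max_attempts: int = DEFAULT_MAX_RESUME_ATTEMPTS
    state: str = "interrupted"  # assumed name for the pre-escalation state

    def record_failed_attempt(self) -> str:
        """Count one failed resume and escalate when the limit is reached."""
        if self.state == "resume_escalated":
            # Escalated runs need explicit operator action, not another retry.
            raise RuntimeError("run is escalated; operator action required")
        self.attempts += 1
        if self.attempts >= self.max_attempts:
            self.state = "resume_escalated"
        return self.state
```

Raising on an escalated run makes it hard for automated callers to keep retrying past the limit by accident.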
| 50 | + |
| 51 | +## Deterministic reason codes |
| 52 | + |
| 53 | +Resume policy outcomes should use machine-readable reason codes: |
| 54 | + |
| 55 | +- `resume_allowed` |
| 56 | +- `resume_missing_checkpoint` |
| 57 | +- `resume_unknown_interruption_class` |
| 58 | +- `resume_non_idempotent_step` |
| 59 | +- `resume_missing_runtime_artifacts` |
| 60 | +- `resume_blocked_cooldown` |
| 61 | +- `resume_attempt_limit_reached` |
| 62 | + |
| 63 | +## Audit event contract |
| 64 | + |
| 65 | +Resume-relevant decisions should emit append-only audit events: |
| 66 | + |
| 67 | +```json |
| 68 | +{ |
| 69 | + "event": "resume_decision", |
| 70 | + "run_id": "run-2026-02-13-01", |
| 71 | + "interruption_class": "timeout", |
| 72 | + "eligible": false, |
| 73 | + "reason_code": "resume_blocked_cooldown", |
| 74 | + "cooldown_seconds_remaining": 87, |
| 75 | + "attempt": 2, |
| 76 | + "max_attempts": 3, |
| 77 | + "at": "2026-02-13T12:20:00Z", |
| 78 | + "actor": "system" |
| 79 | +} |
| 80 | +``` |
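A sketch of how such an event might be built and appended, assuming a JSON Lines file as the append-only store. The storage format and the function names `build_resume_decision_event` and `append_audit_event` are assumptions; the field names follow the example above.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def build_resume_decision_event(
    run_id: str,
    interruption_class: str,
    reason_code: str,
    attempt: int,
    max_attempts: int,
    at: datetime,
    cooldown_seconds_remaining: Optional[int] = None,
    actor: str = "system",
) -> dict:
    """Build a `resume_decision` event with the fields shown above."""
    event = {
        "event": "resume_decision",
        "run_id": run_id,
        "interruption_class": interruption_class,
        "eligible": reason_code == "resume_allowed",
        "reason_code": reason_code,
    }
    # Assumption: the cool-down field is only present when one applies.
    if cooldown_seconds_remaining is not None:
        event["cooldown_seconds_remaining"] = cooldown_seconds_remaining
    event["attempt"] = attempt
    event["max_attempts"] = max_attempts
    event["at"] = at.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    event["actor"] = actor
    return event


def append_audit_event(path: str, event: dict) -> None:
    """Append one event as a single JSON line; existing lines are untouched."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(event) + "\n")
```

Opening the file in append mode means earlier events are never rewritten, which is one simple way to approximate the append-only requirement on a local filesystem.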
| 81 | + |
| 82 | +## Integration targets |
| 83 | + |
| 84 | +- Task 17.2 should implement runtime eligibility and cooldown/attempt counters from this contract. |
| 85 | +- Task 17.3 should expose `/resume` command outputs using these reason codes. |
| 86 | +- Task 17.4 should validate interruption-class coverage, cooldown enforcement, and escalation behavior. |