|
| 1 | +# Revoke Safety Slice |
| 2 | + |
| 3 | +## Status |
| 4 | + |
| 5 | +Draft local planning baseline for `M9-T08` / issue `#85`. |
| 6 | + |
| 7 | +This document narrows the next implementation slice before code work starts. It is not a new |
| 8 | +authoritative API surface; it is the execution plan for bringing the accepted revoke/reclaim |
| 9 | +semantics into the current trusted-core implementation. |
| 10 | + |
| 11 | +## Purpose |
| 12 | + |
| 13 | +`M9-T07` established fencing and stale-holder rejection. The next kernel step is to withdraw |
| 14 | +holder authority explicitly without permitting early reuse. |
| 15 | + |
| 16 | +The implementation question for `M9-T08` is narrower than the full lease-model transition: |
| 17 | + |
| 18 | +- how revoke enters the current core cleanly |
| 19 | +- how reclaim becomes the only point where reuse is allowed |
| 20 | +- how to keep the reservation-era compatibility surface from drifting away from the accepted |
| 21 | + lease-centric semantics |
| 22 | + |
| 23 | +## Slice Goal |
| 24 | + |
| 25 | +Implement the minimum revoke/reclaim behavior needed to preserve the late-not-early reuse rule in |
| 26 | +the current execution path, including the already-approved crash/retry/failover safety contract for |
| 27 | +`M9-T08`. |
| 28 | + |
| 29 | +For this slice, "done" means: |
| 30 | + |
| 31 | +- the core can log and apply `revoke` and `reclaim` |
| 32 | +- stale holders lose authority as soon as revoke commits |
| 33 | +- resources remain unavailable until reclaim commits |
| 34 | +- exact retries stay deterministic |
| 35 | + |
| 36 | +## In Scope |
| 37 | + |
| 38 | +`M9-T08` should include: |
| 39 | + |
| 40 | +1. one explicit revoke command in the trusted core |
| 41 | +2. one explicit reclaim command in the trusted core |
| 42 | +3. one live non-terminal state for revoked-but-not-yet-reusable ownership |
| 43 | +4. one terminal revoked outcome that preserves history after reclaim |
| 44 | +5. the minimum executor, persistence, and replay plumbing needed so committed revoke/reclaim |
| 45 | + outcomes survive live apply, restart, and the existing failover contract |
| 46 | +6. invariant, negative-path, retry, and crash-recovery tests for the new safety rule |
| 47 | + |
| 48 | +## Out Of Scope |
| 49 | + |
| 50 | +`M9-T08` should not expand into: |
| 51 | + |
| 52 | +- new replication protocol design, failover refactors, or replicated-surface expansion beyond what |
| 53 | + is needed to preserve committed revoke/reclaim outcomes under the existing path |
| 54 | +- WAL or snapshot reshaping beyond the exact command/state support required for revoke/reclaim |
| 55 | +- broader public API and transport cleanup beyond the narrow compatibility bridge already required |
| 56 | + by this slice |
| 57 | +- heartbeat ingestion or wall-clock reclaim logic inside the state machine |
| 58 | +- policy reasons or operator metadata attached to revoke/reclaim |
| 59 | +- holder transfer or shared-resource semantics |
| 60 | + |
| 61 | +Those belong to later slices, primarily `M9-T09` through `M9-T11`. |
| 62 | + |
| 63 | +## Compatibility Rule |
| 64 | + |
| 65 | +The accepted model is lease-centric, but the current implementation is still reservation-centric in |
| 66 | +spelling and data layout. |
| 67 | + |
| 68 | +For `M9-T08`, that bridge is allowed under one rule: |
| 69 | + |
| 70 | +- reservation-era names may remain temporarily, but revoke/reclaim behavior must match the |
| 71 | + authoritative lease semantics exactly |
| 72 | + |
| 73 | +That means: |
| 74 | + |
| 75 | +- the current `reservation_id` may continue to serve as the implementation anchor for `lease_id` |
| 76 | +- the existing `confirmed` state may continue as the compatibility spelling for authoritative |
| 77 | + `active` |
| 78 | +- the slice must not introduce reservation-era shortcuts that would be invalid in the final lease |
| 79 | + model |
| 80 | + |
| 81 | +## Exact Command Semantics For This Slice |
| 82 | + |
| 83 | +### Revoke |
| 84 | + |
| 85 | +Implementation intent: |
| 86 | + |
| 87 | +```text |
| 88 | +revoke(lease_id) |
| 89 | +``` |
| 90 | + |
| 91 | +Current compatibility spelling may still route this through the reservation-era implementation, but |
| 92 | +the effect must be: |
| 93 | + |
| 94 | +- precondition: live lease exists and is currently `active` |
| 95 | +- success: lease moves to `revoking` |
| 96 | +- success: `lease_epoch` increments immediately |
| 97 | +- success: member resources stay unavailable and keep pointing at the same live owner |
| 98 | +- success: resource state becomes `revoking` |
| 99 | +- success: no retirement is scheduled yet |
| 100 | + |
| 101 | +Failure behavior: |
| 102 | + |
| 103 | +- `lease_not_found` if the lease never existed |
| 104 | +- `lease_retired` if retained history says the live record is already gone |
| 105 | +- `invalid_state` for `reserved`, `revoking`, `released`, `expired`, or `revoked` |
| 106 | + |
| 107 | +Duplicate behavior: |
| 108 | + |
| 109 | +- exact retry with the same `operation_id` must return the cached original result |
| 110 | +- a later distinct revoke with a different `operation_id` must not invent a second success; once a |
| 111 | + lease is already `revoking` or terminal, the answer is `invalid_state` |
| 112 | + |
| 113 | +### Reclaim |
| 114 | + |
| 115 | +Implementation intent: |
| 116 | + |
| 117 | +```text |
| 118 | +reclaim(lease_id) |
| 119 | +``` |
| 120 | + |
| 121 | +Effect: |
| 122 | + |
| 123 | +- precondition: live lease exists and is currently `revoking` |
| 124 | +- success: lease moves to terminal `revoked` |
| 125 | +- success: member resources return to `available` |
| 126 | +- success: per-resource current owner pointers clear |
| 127 | +- success: retirement is scheduled through the normal bounded history path |
| 128 | + |
| 129 | +Failure behavior: |
| 130 | + |
| 131 | +- `lease_not_found` |
| 132 | +- `lease_retired` |
| 133 | +- `invalid_state` for `reserved`, `active`, `released`, `expired`, or already `revoked` |
| 134 | + |
| 135 | +Duplicate behavior: |
| 136 | + |
| 137 | +- exact retry with the same `operation_id` must return the cached original result |
| 138 | +- a later distinct reclaim on an already terminal record must not produce a second success |
| 139 | + |
| 140 | +## Required Safety Properties |
| 141 | + |
| 142 | +`M9-T08` must preserve these invariants: |
| 143 | + |
| 144 | +1. revoke removes holder authority before reuse is possible |
| 145 | +2. reclaim is the only transition that makes a revoked resource reusable |
| 146 | +3. active or revoking leases are never freed by timer |
| 147 | +4. late external reclaim is acceptable; early reclaim is not |
| 148 | +5. holder-authorized commands that arrive after revoke with the old epoch fail deterministically |
| 149 | +6. replay of committed revoke/reclaim commands yields the same resource availability outcome |
| 150 | + |
| 151 | +## Implementation Boundaries |
| 152 | + |
| 153 | +The slice should be built in this order: |
| 154 | + |
| 155 | +1. add core state and command variants for revoke/reclaim |
| 156 | +2. apply revoke/reclaim through the same executor path already used by reserve/confirm/release |
| 157 | +3. add only the exact codec, snapshot, and recovery support required so live apply and replay |
| 158 | + preserve committed revoke/reclaim outcomes |
| 159 | +4. preserve the current no-early-reuse contract under crash, retry, and failover without broadening |
| 160 | + the replication surface in this slice |
| 161 | +5. add resource-state and lease-state invariants for `revoking` and `revoked` |
| 162 | +6. add retry and stale-holder regression coverage |
| 163 | + |
| 164 | +Important boundary: |
| 165 | + |
| 166 | +- if broader WAL/snapshot cleanup, transport normalization, or replication-surface redesign becomes |
| 167 | + necessary, keep only the revoke/reclaim unblocker here and defer the broader cleanup to `M9-T09` |
| 168 | + and `M9-T10` |
| 169 | + |
| 170 | +## Tests This Slice Should Add |
| 171 | + |
| 172 | +Minimum test set: |
| 173 | + |
| 174 | +- revoke on active lease moves the lease to `revoking` and bumps `lease_epoch` |
| 175 | +- revoke does not free member resources |
| 176 | +- holder `release` or `confirm` with the old epoch after revoke fails deterministically |
| 177 | +- reclaim from `revoking` returns resources to `available` and records terminal `revoked` history |
| 178 | +- reclaim before revoke is `invalid_state` |
| 179 | +- exact duplicate revoke and reclaim requests return cached committed results |
| 180 | +- reserved, active, and revoking resources cannot be reused early |
| 181 | +- crash/restart replay preserves `revoking` vs `revoked` outcomes |
| 182 | +- committed revoke/reclaim outcomes preserve the same no-early-reuse behavior across the current |
| 183 | + failover path |
| 184 | + |
| 185 | +## Exit Condition |
| 186 | + |
| 187 | +`M9-T08` is ready to hand off when: |
| 188 | + |
| 189 | +- the exact revoke/reclaim behavior above is implemented or explicitly mapped to narrower code |
| 190 | + tasks |
| 191 | +- `docs/status.md` points at `#85` instead of stale `#84` language |
| 192 | +- the slice still satisfies the existing `M9-T08` crash/retry/failover acceptance criteria without |
| 193 | + silently expanding into broader `M9-T09` or `M9-T10` cleanup |
| 194 | +- later work is cleanly reserved for `M9-T09` through `M9-T11` |
0 commit comments