Skip to content

Commit 1ccad80

Browse files
authored
docs: scope M9-T08 revoke safety slice (#91)
* docs: scope M9 revoke safety slice * docs: tighten M9 revoke slice boundaries
1 parent 42013c5 commit 1ccad80

File tree

4 files changed

+216
-14
lines changed

4 files changed

+216
-14
lines changed

docs/README.md

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -21,6 +21,7 @@
2121
- [Jepsen Refactor Plan](./jepsen-refactor-plan.md)
2222
- [Lease Kernel Design Decisions](./lease-kernel-design.md)
2323
- [Lease Kernel Follow-On](./lease-kernel-follow-on.md)
24+
- [Revoke Safety Slice](./revoke-safety-slice.md)
2425
- [Operator Runbook](./operator-runbook.md)
2526
- [KubeVirt Jepsen Report](./kubevirt-jepsen-report.md)
2627
- [Replication Notes](./replication.md)

docs/design.md

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -55,6 +55,8 @@ Current implementation anchor:
5555
first-class leases, bundle ownership, fencing, and revoke
5656
- [lease-kernel-follow-on.md](./lease-kernel-follow-on.md): post-`M8` planning for minimal
5757
generic lease-kernel extensions such as bundle ownership, fencing, and revoke
58+
- [revoke-safety-slice.md](./revoke-safety-slice.md): exact `M9-T08` scope for revoke, reclaim,
59+
and safe-reuse behavior in the current implementation
5860
- [replication.md](./replication.md): deferred distributed design areas and boundaries
5961
- [roadmap.md](./roadmap.md): high-level roadmap and exit criteria
6062
- [work-breakdown.md](./work-breakdown.md): concrete units of work for the first implementation

docs/revoke-safety-slice.md

Lines changed: 194 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,194 @@
1+
# Revoke Safety Slice
2+
3+
## Status
4+
5+
Draft local planning baseline for `M9-T08` / issue `#85`.
6+
7+
This document narrows the next implementation slice before code work starts. It is not a new
8+
authoritative API surface; it is the execution plan for bringing the accepted revoke/reclaim
9+
semantics into the current trusted-core implementation.
10+
11+
## Purpose
12+
13+
`M9-T07` established fencing and stale-holder rejection. The next kernel step is to withdraw
14+
holder authority explicitly without permitting early reuse.
15+
16+
The implementation question for `M9-T08` is narrower than the full lease-model transition:
17+
18+
- how revoke enters the current core cleanly
19+
- how reclaim becomes the only point where reuse is allowed
20+
- how to keep the reservation-era compatibility surface from drifting away from the accepted
21+
lease-centric semantics
22+
23+
## Slice Goal
24+
25+
Implement the minimum revoke/reclaim behavior needed to preserve the late-not-early reuse rule in
26+
the current execution path, including the already-approved crash/retry/failover safety contract for
27+
`M9-T08`.
28+
29+
For this slice, "done" means:
30+
31+
- the core can log and apply `revoke` and `reclaim`
32+
- stale holders lose authority as soon as revoke commits
33+
- resources remain unavailable until reclaim commits
34+
- exact retries stay deterministic
35+
36+
## In Scope
37+
38+
`M9-T08` should include:
39+
40+
1. one explicit revoke command in the trusted core
41+
2. one explicit reclaim command in the trusted core
42+
3. one live non-terminal state for revoked-but-not-yet-reusable ownership
43+
4. one terminal revoked outcome that preserves history after reclaim
44+
5. the minimum executor, persistence, and replay plumbing needed so committed revoke/reclaim
45+
outcomes survive live apply, restart, and the existing failover contract
46+
6. invariant, negative-path, retry, and crash-recovery tests for the new safety rule
47+
48+
## Out Of Scope
49+
50+
`M9-T08` should not expand into:
51+
52+
- new replication protocol design, failover refactors, or replicated-surface expansion beyond what
53+
is needed to preserve committed revoke/reclaim outcomes under the existing path
54+
- WAL or snapshot reshaping beyond the exact command/state support required for revoke/reclaim
55+
- broader public API and transport cleanup beyond the narrow compatibility bridge already required
56+
by this slice
57+
- heartbeat ingestion or wall-clock reclaim logic inside the state machine
58+
- policy reasons or operator metadata attached to revoke/reclaim
59+
- holder transfer or shared-resource semantics
60+
61+
Those belong to later slices, primarily `M9-T09` through `M9-T11`.
62+
63+
## Compatibility Rule
64+
65+
The accepted model is lease-centric, but the current implementation is still reservation-centric in
66+
spelling and data layout.
67+
68+
For `M9-T08`, that bridge is allowed under one rule:
69+
70+
- reservation-era names may remain temporarily, but revoke/reclaim behavior must match the
71+
authoritative lease semantics exactly
72+
73+
That means:
74+
75+
- the current `reservation_id` may continue to serve as the implementation anchor for `lease_id`
76+
- the existing `confirmed` state may continue as the compatibility spelling for authoritative
77+
`active`
78+
- the slice must not introduce reservation-era shortcuts that would be invalid in the final lease
79+
model
80+
81+
## Exact Command Semantics For This Slice
82+
83+
### Revoke
84+
85+
Implementation intent:
86+
87+
```text
88+
revoke(lease_id)
89+
```
90+
91+
Current compatibility spelling may still route this through the reservation-era implementation, but
92+
the effect must be:
93+
94+
- precondition: live lease exists and is currently `active`
95+
- success: lease moves to `revoking`
96+
- success: `lease_epoch` increments immediately
97+
- success: member resources stay unavailable and keep pointing at the same live owner
98+
- success: resource state becomes `revoking`
99+
- success: no retirement is scheduled yet
100+
101+
Failure behavior:
102+
103+
- `lease_not_found` if the lease never existed
104+
- `lease_retired` if retained history says the live record is already gone
105+
- `invalid_state` for `reserved`, `revoking`, `released`, `expired`, or `revoked`
106+
107+
Duplicate behavior:
108+
109+
- exact retry with the same `operation_id` must return the cached original result
110+
- a later distinct revoke with a different `operation_id` must not invent a second success; once a
111+
lease is already `revoking` or terminal, the answer is `invalid_state`
112+
113+
### Reclaim
114+
115+
Implementation intent:
116+
117+
```text
118+
reclaim(lease_id)
119+
```
120+
121+
Effect:
122+
123+
- precondition: live lease exists and is currently `revoking`
124+
- success: lease moves to terminal `revoked`
125+
- success: member resources return to `available`
126+
- success: per-resource current owner pointers clear
127+
- success: retirement is scheduled through the normal bounded history path
128+
129+
Failure behavior:
130+
131+
- `lease_not_found`
132+
- `lease_retired`
133+
- `invalid_state` for `reserved`, `active`, `released`, `expired`, or already `revoked`
134+
135+
Duplicate behavior:
136+
137+
- exact retry with the same `operation_id` must return the cached original result
138+
- a later distinct reclaim on an already terminal record must not produce a second success
139+
140+
## Required Safety Properties
141+
142+
`M9-T08` must preserve these invariants:
143+
144+
1. revoke removes holder authority before reuse is possible
145+
2. reclaim is the only transition that makes a revoked resource reusable
146+
3. active or revoking leases are never freed by timer
147+
4. late external reclaim is acceptable; early reclaim is not
148+
5. holder-authorized commands that arrive after revoke with the old epoch fail deterministically
149+
6. replay of committed revoke/reclaim commands yields the same resource availability outcome
150+
151+
## Implementation Boundaries
152+
153+
The slice should be built in this order:
154+
155+
1. add core state and command variants for revoke/reclaim
156+
2. apply revoke/reclaim through the same executor path already used by reserve/confirm/release
157+
3. add only the exact codec, snapshot, and recovery support required so live apply and replay
158+
preserve committed revoke/reclaim outcomes
159+
4. preserve the current no-early-reuse contract under crash, retry, and failover without broadening
160+
the replication surface in this slice
161+
5. add resource-state and lease-state invariants for `revoking` and `revoked`
162+
6. add retry and stale-holder regression coverage
163+
164+
Important boundary:
165+
166+
- if broader WAL/snapshot cleanup, transport normalization, or replication-surface redesign becomes
167+
necessary, keep only the revoke/reclaim unblocker here and defer the broader cleanup to `M9-T09`
168+
and `M9-T10`
169+
170+
## Tests This Slice Should Add
171+
172+
Minimum test set:
173+
174+
- revoke on active lease moves the lease to `revoking` and bumps `lease_epoch`
175+
- revoke does not free member resources
176+
- holder `release` or `confirm` with the old epoch after revoke fails deterministically
177+
- reclaim from `revoking` returns resources to `available` and records terminal `revoked` history
178+
- reclaim before revoke is `invalid_state`
179+
- exact duplicate revoke and reclaim requests return cached committed results
180+
- reserved, active, and revoking resources cannot be reused early
181+
- crash/restart replay preserves `revoking` vs `revoked` outcomes
182+
- committed revoke/reclaim outcomes preserve the same no-early-reuse behavior across the current
183+
failover path
184+
185+
## Exit Condition
186+
187+
`M9-T08` is ready to hand off when:
188+
189+
- the exact revoke/reclaim behavior above is implemented or explicitly mapped to narrower code
190+
tasks
191+
- `docs/status.md` points at `#85` instead of stale `#84` language
192+
- the slice still satisfies the existing `M9-T08` crash/retry/failover acceptance criteria without
193+
silently expanding into broader `M9-T09` or `M9-T10` cleanup
194+
- later work is cleanly reserved for `M9-T09` through `M9-T11`

docs/status.md

Lines changed: 19 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@
1414
- `M6` replication design: implemented
1515
- `M7` replicated core prototype: in progress
1616
- `M8` external cluster validation: in progress
17-
- `M9` generic lease-kernel follow-on: `T06` in progress on issue branch
17+
- `M9` generic lease-kernel follow-on: `T07` merged, `T08` planning in progress on issue branch
1818
- Latest completed implementation chunks:
1919
- `4156a80` `Bootstrap AllocDB core and docs`
2020
- `f84a641` `Add WAL file and snapshot recovery primitives`
@@ -32,8 +32,9 @@
3232
with divergent prepared-suffix discard; promoted partition and primary-crash scenarios that
3333
preserve fail-closed behavior and retry/read continuity after failover; the local
3434
three-replica cluster runner, fault-control harness, and QEMU testbed around the real replica
35-
daemon; and the first trusted-core bundle-commit slice with bundle membership, bundle-aware
36-
confirm/release/expire, and bundle regression coverage
35+
daemon; the first trusted-core bundle-commit slice with bundle membership, bundle-aware
36+
confirm/release/expire, and bundle regression coverage; and the first fencing slice with
37+
lease-epoch propagation, stale-holder rejection, and epoch-aware retry/read coverage
3738

3839
## What Exists
3940

@@ -85,9 +86,11 @@
8586
- one draft lease-kernel design-decision document that chooses a first-class lease authority
8687
object, bundle size `1` as the single-resource semantic special case, a lease-scoped fencing
8788
token, and a two-stage `revoke -> reclaim` safety model
88-
- one active authoritative-docs pass under issue `#80` that is rewriting semantics, API,
89+
- one merged authoritative-docs pass under issue `#80` that rewrote semantics, API,
8990
architecture, and fault-model docs to the approved lease-centric contract while keeping the
9091
current reservation-centric implementation explicitly marked as compatibility surface
92+
- one active `M9-T08` planning note that narrows revoke/reclaim implementation scope before the
93+
next code-bearing lease-kernel branch
9194
- Replication design draft:
9295
- VSR-style primary/backup replicated log with fixed membership and majority quorums
9396
- primary-only reads in the first replicated release
@@ -198,13 +201,15 @@
198201
- PR `#89` merged `M9-T06` on `main`: the trusted core now supports atomic bundle reservation,
199202
explicit bundle membership records, bundle-aware confirm/release/expire, and bundle-aware
200203
snapshot/codec coverage while preserving the existing reservation compatibility surface
201-
- issue `#84` / `M9-T07` is the active implementation slice on the current branch: lease epochs
202-
now flow through holder-authorized commands and command outcomes, the core rejects stale holder
203-
epochs deterministically, and read/retry surfaces expose the current authority token for active
204-
reservations
205-
- validation for the active `#84` branch currently includes
206-
`cargo test -p allocdb-core -- --nocapture`, `cargo test -p allocdb-node api -- --nocapture`,
207-
`cargo test -p allocdb-node engine -- --nocapture`, and
208-
`cargo test --workspace --no-run`
209-
- the next planned slices after `#84` remain `M9-T08` revoke/safe reuse, `M9-T09` persistence and
210-
transport extension, `M9-T10` replication preservation, and `M9-T11` broader regression coverage
204+
- PR `#90` merged `M9-T07` on `main`: lease epochs now flow through holder-authorized commands and
205+
command outcomes, the core rejects stale holder epochs deterministically, and read/retry
206+
surfaces expose the current authority token for active reservations
207+
- issue `#85` / `M9-T08` is the active planning slice on the current branch: the local scope is
208+
being narrowed to explicit `revoke` and `reclaim`, `revoking` as the only non-reusable
209+
post-authority state, and the minimum replay-safe bridge from the reservation-era implementation
210+
to the accepted lease-centric semantics
211+
- the active `#85` planning branch is defining exactly what belongs in revoke/reclaim now versus
212+
what stays deferred to `M9-T09` through `M9-T11`, especially around WAL/snapshot broadening,
213+
transport cleanup, and replication preservation
214+
- the next planned code-bearing slices after `#85` remain `M9-T09` persistence and transport
215+
extension, `M9-T10` replication preservation, and `M9-T11` broader regression coverage

0 commit comments

Comments
 (0)