Commit c664964

Merge pull request #119 from skel84/feat/reservation-runtime-seam-evaluation
docs: add reservation runtime seam evaluation
2 parents 3d59b0e + 360b49c commit c664964

3 files changed

+195
-5
lines changed

docs/README.md

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@
2626
- [Quota Runtime Seam Evaluation](./quota-runtime-seam-evaluation.md)
2727
- [Reservation Engine Plan](./reservation-engine-plan.md)
2828
- [Reservation Engine Semantics](./reservation-semantics.md)
29+
- [Reservation Runtime Seam Evaluation](./reservation-runtime-seam-evaluation.md)
2930
- [Revoke Safety Slice](./revoke-safety-slice.md)
3031
- [Operator Runbook](./operator-runbook.md)
3132
- [KubeVirt Jepsen Report](./kubevirt-jepsen-report.md)
docs/reservation-runtime-seam-evaluation.md

Lines changed: 189 additions & 0 deletions
@@ -0,0 +1,189 @@
1+
# Reservation Runtime Seam Evaluation
2+
3+
## Purpose
4+
5+
This document closes `M11-T05` by evaluating the shared-runtime boundary after the third engine
6+
proof:
7+
8+
- `allocdb-core`
9+
- `quota-core`
10+
- `reservation-core`
11+
12+
The question is no longer whether there is an engine family here. That is now proven strongly
13+
enough. The question is whether the repository has crossed the line from "family of engines" to
14+
"extract a reusable runtime now."
15+
16+
## Decision
17+
18+
Do not extract a broad shared runtime crate yet.
19+
20+
Do prepare the first tiny internal extraction slice now.
21+
22+
The right outcome after the third engine is:
23+
24+
- still no `dsm-runtime` or public database-building library on `main`
25+
- still keep command surfaces, snapshot schemas, recovery entry points, and state-machine logic
26+
engine-local
27+
- now treat `retire_queue` as an immediate extraction candidate
28+
- keep `wal`, `wal_file`, and `snapshot_file` as the next likely candidates only after the first
29+
micro-extraction lands cleanly
30+
31+
So the third engine changed the answer from "defer everything" to "defer the big runtime, but start
32+
one tiny internal extraction."
33+
34+
## What The Third Engine Proved
35+
36+
The engine thesis is now materially stronger than it was after `quota-core`.
37+
38+
All three engines now demonstrate the same trusted-core discipline:
39+
40+
- bounded in-memory hot-path structures
41+
- WAL-backed durable ordering
42+
- snapshot plus WAL replay through the live apply path
43+
- logical `request_slot`
44+
- bounded retry and retirement state
45+
- fail-closed recovery on corruption or monotonicity violations
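
The shared discipline listed above can be illustrated with a minimal sketch. It is not taken from any of the three crates: `Record`, `State`, `ReplayError`, `lsn`, and the `balance` field are hypothetical names chosen for illustration. The sketch only assumes what the list states: replay goes through the same apply path as live traffic, and recovery fails closed on a monotonicity violation.

```rust
/// Hypothetical WAL record; field names are illustrative, not the repository's.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Record {
    pub lsn: u64,
    pub request_slot: u64,
    pub delta: i64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ReplayError {
    NonMonotonicLsn { previous: u64, found: u64 },
    SlotWentBackwards { previous: u64, found: u64 },
}

/// Hypothetical engine state; `balance` stands in for any engine-local data.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct State {
    pub last_lsn: u64,
    pub last_slot: u64,
    pub balance: i64,
}

impl State {
    /// The single apply path, used by live requests and by replay alike.
    pub fn apply(&mut self, record: &Record) -> Result<(), ReplayError> {
        // Sequence numbers must strictly increase.
        if record.lsn <= self.last_lsn {
            return Err(ReplayError::NonMonotonicLsn {
                previous: self.last_lsn,
                found: record.lsn,
            });
        }
        // Logical time may repeat but must never move backwards.
        if record.request_slot < self.last_slot {
            return Err(ReplayError::SlotWentBackwards {
                previous: self.last_slot,
                found: record.request_slot,
            });
        }
        self.last_lsn = record.lsn;
        self.last_slot = record.request_slot;
        self.balance += record.delta;
        Ok(())
    }
}

/// Replays a WAL suffix on top of a snapshot and stops at the first violation
/// instead of skipping the bad record (fail-closed).
pub fn recover(snapshot: State, wal: &[Record]) -> Result<State, ReplayError> {
    let mut state = snapshot;
    for record in wal {
        state.apply(record)?;
    }
    Ok(state)
}
```

Because `recover` calls `apply` directly, there is no second replay-only code path that could drift from live behavior.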
46+
47+
That is enough to say `allocdb` contains a real deterministic-engine family, not just one lease
48+
allocator plus one adjacent experiment.
49+
50+
## What Is Now Mechanically Shared
51+
52+
### `retire_queue`
53+
54+
`retire_queue` is now the strongest extraction candidate.
55+
56+
It is byte-identical across:
57+
58+
- `crates/allocdb-core/src/retire_queue.rs`
59+
- `crates/quota-core/src/retire_queue.rs`
60+
- `crates/reservation-core/src/retire_queue.rs`
61+
62+
This is the first module that is no longer just "similar." It is the same substrate in all three
63+
engines.
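
For readers who have not opened the module, the kind of substrate being discussed might look roughly like the sketch below. This is an assumption about shape, not the contents of `retire_queue.rs`: the type `RetireQueue`, its methods, and the `QueueFull` error are hypothetical. It shows a fixed-capacity queue that releases entries once a logical slot has passed, which is one plausible reading of "bounded retry and retirement state".

```rust
use std::collections::VecDeque;

/// Returned when the queue is at capacity; the caller must handle the rejection.
#[derive(Debug, PartialEq, Eq)]
pub struct QueueFull;

/// Hypothetical bounded retirement queue keyed by logical slot.
pub struct RetireQueue<K> {
    entries: VecDeque<(u64, K)>,
    capacity: usize,
}

impl<K> RetireQueue<K> {
    pub fn with_capacity(capacity: usize) -> Self {
        Self {
            entries: VecDeque::with_capacity(capacity),
            capacity,
        }
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }

    /// Enqueues `key` for retirement at `retire_at_slot`.
    /// Assumption: callers push in non-decreasing slot order.
    pub fn push(&mut self, retire_at_slot: u64, key: K) -> Result<(), QueueFull> {
        if self.entries.len() >= self.capacity {
            return Err(QueueFull);
        }
        self.entries.push_back((retire_at_slot, key));
        Ok(())
    }

    /// Pops the oldest entry if its slot is at or before `current_slot`.
    pub fn pop_due(&mut self, current_slot: u64) -> Option<K> {
        let due = matches!(self.entries.front(), Some((slot, _)) if *slot <= current_slot);
        if due {
            self.entries.pop_front().map(|(_, key)| key)
        } else {
            None
        }
    }
}
```

Nothing in this shape depends on leases, quota buckets, or holds, which is consistent with the module being identical in all three engines.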
64+
65+
### `wal`
66+
67+
`wal.rs` is byte-identical between:
68+
69+
- `crates/quota-core/src/wal.rs`
70+
- `crates/reservation-core/src/wal.rs`
71+
72+
That is a stronger seam than `M10` had, but it is still not universal across all three engines
73+
because `allocdb-core` carries a richer record surface and recovery/reporting contract.
74+
75+
### `wal_file`
76+
77+
`wal_file.rs` is now extremely close between `quota-core` and `reservation-core`.
78+
79+
The remaining delta is small and concrete:
80+
81+
- `reservation-core` refreshes the append handle after truncation
82+
- the test fixtures differ only in engine-local payloads
83+
84+
This is close to extractable, but still wants one more deliberate pass rather than a speculative
85+
generic crate.
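
To make the first delta concrete, here is a minimal sketch of refreshing an append handle after truncation, using only `std::fs`. The `WalFile` type and its methods here are hypothetical and are not the repository's `wal_file.rs`. The source does not say why `reservation-core` refreshes the handle; the sketch simply reopens it so that no handle state from before the truncation is carried forward.

```rust
use std::fs::{File, OpenOptions};
use std::io::{self, Write};
use std::path::{Path, PathBuf};

/// Hypothetical WAL file wrapper that keeps one long-lived append handle.
pub struct WalFile {
    path: PathBuf,
    append: File,
}

impl WalFile {
    pub fn open(path: &Path) -> io::Result<Self> {
        let append = OpenOptions::new().create(true).append(true).open(path)?;
        Ok(Self {
            path: path.to_path_buf(),
            append,
        })
    }

    /// Appends bytes and flushes file data to stable storage.
    pub fn append(&mut self, bytes: &[u8]) -> io::Result<()> {
        self.append.write_all(bytes)?;
        self.append.sync_data()
    }

    /// Cuts the file back to `valid_len` (for example, to drop a torn tail),
    /// then reopens the append handle.
    pub fn truncate_to(&mut self, valid_len: u64) -> io::Result<()> {
        let file = OpenOptions::new().write(true).open(&self.path)?;
        file.set_len(valid_len)?;
        file.sync_all()?;
        // Refresh step: replace the old handle with a freshly opened one.
        self.append = OpenOptions::new().append(true).open(&self.path)?;
        Ok(())
    }
}
```

A delta of this size can be folded into a shared module as a behavior both engines adopt, which fits the "one more deliberate pass" framing above.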
86+
87+
### `snapshot_file`
88+
89+
The runtime discipline is effectively the same between `quota-core` and `reservation-core`.
90+
91+
Most visible diffs are test-fixture and domain-shape differences. The remaining constructor/path
92+
surface still differs from `allocdb-core`, so this is a later extraction candidate, not the first
93+
one.
94+
95+
## What Stayed Engine-Specific
96+
97+
The third engine did not make these safer to extract.
98+
99+
### Snapshot schema
100+
101+
Do not extract the snapshot schema layer.
102+
103+
`reservation-core` widened the divergence:
104+
105+
- active-hold rebuild only for `held` records
106+
- deadline-based expiry semantics
107+
- hold/pool state that has nothing to do with quota buckets or lease reservations
108+
109+
The persistence discipline is shared. The schema is not.
110+
111+
### Recovery orchestration
112+
113+
Do not extract the top-level recovery API yet.
114+
115+
The replay skeleton is recognizably similar, but the third engine made the semantic hooks more
116+
obvious, not less:
117+
118+
- overdue-hold expiry on later request slots
119+
- held-only rebuild behavior on restore
120+
- torn-tail proof around expiry boundaries
121+
- engine-specific replay and mutation contracts
122+
123+
There may be helper seams later, but the public recovery entry points are still engine-local.
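
Two of these hooks, held-only rebuild and overdue-hold expiry on a later request slot, can be sketched as follows. Everything here is hypothetical: `HoldState`, `HoldRecord`, `Engine`, and the rule that a hold is overdue when its deadline slot is strictly before the incoming request slot are assumptions made for illustration, not the `reservation-core` implementation.

```rust
use std::collections::{BTreeMap, BTreeSet};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HoldState {
    Held,
    Confirmed,
    Released,
    Expired,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct HoldRecord {
    pub hold_id: u64,
    pub deadline_slot: u64,
    pub state: HoldState,
}

pub struct Engine {
    holds: BTreeMap<u64, HoldRecord>,
    /// (deadline_slot, hold_id) for holds still in `Held`, ordered by deadline.
    active: BTreeSet<(u64, u64)>,
}

impl Engine {
    /// Restores from snapshot records; only `Held` records re-enter the
    /// active index, terminal records are kept but never indexed.
    pub fn restore(records: Vec<HoldRecord>) -> Self {
        let mut holds = BTreeMap::new();
        let mut active = BTreeSet::new();
        for record in records {
            if record.state == HoldState::Held {
                active.insert((record.deadline_slot, record.hold_id));
            }
            holds.insert(record.hold_id, record);
        }
        Self { holds, active }
    }

    /// Expires every held hold whose deadline is strictly before
    /// `request_slot` (assumed rule) and returns the expired ids in order.
    pub fn expire_overdue(&mut self, request_slot: u64) -> Vec<u64> {
        let mut expired = Vec::new();
        while let Some((deadline, hold_id)) = self.active.first().copied() {
            if deadline >= request_slot {
                break;
            }
            self.active.remove(&(deadline, hold_id));
            if let Some(record) = self.holds.get_mut(&hold_id) {
                record.state = HoldState::Expired;
            }
            expired.push(hold_id);
        }
        expired
    }

    pub fn state_of(&self, hold_id: u64) -> Option<HoldState> {
        self.holds.get(&hold_id).map(|record| record.state)
    }

    pub fn active_len(&self) -> usize {
        self.active.len()
    }
}
```

Neither hook has a counterpart in a quota or lease engine, which is why a generic recovery entry point would have to grow engine-specific callbacks to accommodate them.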
124+
125+
### State-machine substrate above collections
126+
127+
Do not extract state-machine traits or generic apply plumbing.
128+
129+
The shared truth is still at the discipline level:
130+
131+
- bounded state
132+
- deterministic apply
133+
- logical time
134+
- retry retirement
135+
136+
The actual mutation logic diverged further:
137+
138+
- `allocdb-core` is identity-heavy and fence-heavy
139+
- `quota-core` is arithmetic-heavy and refill-heavy
140+
- `reservation-core` is lifecycle-heavy and expiry-heavy
141+
142+
That divergence is healthy. It means the engines are real.
143+
144+
## Why A Broad Runtime Is Still Premature
145+
146+
A broad extraction would still create cost too early:
147+
148+
- more crate boundaries
149+
- more generic traits and type plumbing
150+
- more internal APIs to stabilize
151+
- more coordination every time an engine evolves
152+
153+
The third engine improved the evidence, but not enough for a full runtime crate:
154+
155+
- one module is now fully mechanical across all three
156+
- two to three more modules (`wal`, `wal_file`, and arguably `snapshot_file`) are close only within the smaller quota/reservation pair
157+
- snapshot, recovery, config, command, and state-machine layers still diverge in important ways
158+
159+
So the right move is smaller than "extract the runtime."
160+
161+
## Recommended Next Step
162+
163+
Close `M11` after this readout.
164+
165+
Then start one narrowly scoped extraction slice:
166+
167+
1. extract `retire_queue` into one tiny internal shared crate or module
168+
2. prove that no engine behavior changes
169+
3. only after that, reassess `wal`, `wal_file`, and `snapshot_file`
170+
171+
Do not start with:
172+
173+
- snapshot schemas
174+
- recovery entry points
175+
- command codecs
176+
- generic state-machine traits
177+
- a public library story
178+
179+
## Library Thesis Status
180+
181+
The third engine strengthened the thesis, but it did not yet produce a reusable "database-building
182+
library."
183+
184+
The honest status is:
185+
186+
- there is now enough common substrate to justify the first tiny internal extraction
187+
- there is still not enough stable shared shape to claim a general DB-construction framework
188+
189+
That is progress, but it is not the same thing as "the library is ready."

docs/status.md

Lines changed: 5 additions & 5 deletions
@@ -1,6 +1,6 @@
11
# AllocDB Status
22
## Current State
3-
- Phase: replicated implementation with external Jepsen gate closed, M9 lease-kernel follow-on live-validated, M10 second-engine proof merged, and M11 third-engine planning staged
3+
- Phase: replicated implementation with external Jepsen gate closed, M9 lease-kernel follow-on live-validated, M10 second-engine proof merged, and M11 third-engine proof merged
44
- Planning IDs: tasks use `M#-T#`; spikes use `M#-S#`
55
- Current milestone status:
66
- `M0` semantics freeze: complete enough for core work
@@ -15,7 +15,7 @@
1515
- `M8` external cluster validation: in progress
1616
- `M9` generic lease-kernel follow-on: implementation merged on `main`
1717
- `M10` second-engine proof: merged on `main`; shared runtime extraction deferred
18-
- `M11` third-engine proof: reservation-core planning staged; implementation not started
18+
- `M11` third-engine proof: merged on `main`; broad shared runtime still deferred, first micro-extraction now justified
1919
- Latest completed implementation chunks:
2020
- `4156a80` `Bootstrap AllocDB core and docs`
2121
- `f84a641` `Add WAL file and snapshot recovery primitives`
@@ -205,8 +205,7 @@
205205
- repo gate: `scripts/preflight.sh`
206206
## Current Focus
207207
- PR `#82` merged the `#70` maintainability follow-up, including live KubeVirt `reservation_contention-control` and full `1800s` `reservation_contention-crash-restart` reruns on `allocdb-a` with `blockers=0`
208-
- `M9-T01` through `M9-T05` are merged on `main` via PR `#81`, and the planning issues are
209-
closed on the `AllocDB` project
208+
- `M9-T01` through `M9-T05` are merged on `main` via PR `#81`, and the planning issues are closed on the `AllocDB` project
210209
- PRs `#89`, `#90`, `#92`, `#93`, `#94`, and `#95` merged the full `M9-T06` through `M9-T11`
211210
implementation chain on `main`: bundle commit, lease-epoch fencing, explicit `revoke` /
212211
`reclaim`, lease-shaped node API exposure, replication-preserved failover behavior, and broader
@@ -217,4 +216,5 @@
217216
both with `blockers=0`
218217
- the next recommended step remains downstream real-cluster e2e work such as `gpu_control_plane`, not more unplanned lease-kernel semantics work; the current deployment slice covers a first in-cluster `StatefulSet` shape, but bootstrap-primary routing, failover/rejoin orchestration, and background maintenance remain operator work, and the current staging unblock path is to publish `skel84/allocdb` from GitHub Actions rather than relying on the local Docker engine
219218
- PR `#107` merged the `M10` quota-engine proof on `main`: `quota-core` now proves a second deterministic engine in-repo with bounded `CreateBucket` / `Debit`, logical-slot refill, and snapshot/WAL recovery; the `M10-T05` seam evaluation still concludes that shared runtime extraction is premature, with `retire_queue` the closest candidate and the rest still engine-local
220-
- the next engine recommendation, if the goal is library extraction truth rather than immediate product pull, is `reservation-core`: the v1 plan narrows the third-engine experiment to `CreatePool`, `PlaceHold`, `ConfirmHold`, `ReleaseHold`, and logical-slot `ExpireHold` to pressure expiry, terminal-state, and recovery seams harder than `quota-core` did
219+
- PRs `#116`, `#117`, and `#118` merged the full `M11` reservation-core chain on `main`: scaffold, deterministic hold lifecycle, logical-slot overdue expiry, and expiry/recovery proof are now in the mainline implementation
220+
- PR `#118` also closes the third-engine readout: `retire_queue` is now the first justified internal extraction candidate across all three engines, while a broad `dsm-runtime` or public DB-building library is still premature; `wal`, `wal_file`, and `snapshot_file` are the next likely internal seams only after that micro-extraction lands
