Commit 37c29e1

Merge pull request #108 from skel84/feat/quota-runtime-seam-evaluation
docs: evaluate quota runtime seams after second-engine proof
2 parents 65a0905 + e2e829a commit 37c29e1

File tree

4 files changed: +180 −5 lines changed

docs/README.md

Lines changed: 1 addition & 0 deletions

@@ -23,6 +23,7 @@
 - [Lease Kernel Follow-On](./lease-kernel-follow-on.md)
 - [Quota Engine Plan](./quota-engine-plan.md)
 - [Quota Engine Semantics](./quota-semantics.md)
+- [Quota Runtime Seam Evaluation](./quota-runtime-seam-evaluation.md)
 - [Revoke Safety Slice](./revoke-safety-slice.md)
 - [Operator Runbook](./operator-runbook.md)
 - [KubeVirt Jepsen Report](./kubevirt-jepsen-report.md)

docs/quota-engine-plan.md

Lines changed: 3 additions & 0 deletions

@@ -293,6 +293,9 @@ Exit criteria:
 
 Only after both engines stabilize should extraction be considered.
 
+The first completed readout for this phase lives in
+[`quota-runtime-seam-evaluation.md`](./quota-runtime-seam-evaluation.md).
+
 ### Extraction rule
 
 Extract shared runtime code only when all of the following are true:
docs/quota-runtime-seam-evaluation.md

Lines changed: 171 additions & 0 deletions

@@ -0,0 +1,171 @@

# Quota Runtime Seam Evaluation

## Purpose

This document closes `M10-T05` by evaluating whether `allocdb-core` and `quota-core` now justify a shared runtime extraction.

The answer is based on the code as merged in `allocdb#107`, not on a framework-first plan.

## Decision

Do not extract a shared runtime crate yet.

The second-engine proof succeeded, but the overlap is not yet stable enough to justify a `dsm-runtime` or similar crate on `main`.

The correct outcome at this point is:

- keep `allocdb-core` and `quota-core` as sibling engines in the same repository
- keep copied runtime pieces local to each engine for now
- record which seams look real
- defer extraction until repeated maintenance pressure proves it is worth the churn

## What The Second Engine Proved

The engine thesis is now materially stronger than it was before `quota-core` existed.

Both engines now demonstrate the same execution discipline:

- bounded in-memory hot-path structures
- WAL-backed durable ordering
- snapshot plus WAL replay through the live apply path
- logical `request_slot`
- operation dedupe with bounded retirement
- fail-closed recovery on corruption or monotonicity violations

That is enough to say there is a real engine family here, not just one special-case lease kernel.
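The last item is worth making concrete: a monotonicity violation during replay must abort recovery, not skip a frame. A minimal sketch of that fail-closed discipline (all names here, such as `ReplayCursor` and `ReplayError`, are illustrative and do not match either engine's actual APIs):

```rust
// Hypothetical sketch of fail-closed replay ordering; not the real engine code.

#[derive(Debug, PartialEq)]
enum ReplayError {
    /// A frame's LSN did not strictly increase: recovery must fail closed.
    NonMonotonicLsn { prev: u64, next: u64 },
}

struct ReplayCursor {
    last_lsn: Option<u64>,
}

impl ReplayCursor {
    fn new() -> Self {
        ReplayCursor { last_lsn: None }
    }

    /// Admit the next WAL frame only if its LSN strictly increases.
    /// On violation the caller aborts recovery entirely rather than
    /// skipping the frame: no best-effort repair.
    fn admit(&mut self, lsn: u64) -> Result<(), ReplayError> {
        if let Some(prev) = self.last_lsn {
            if lsn <= prev {
                return Err(ReplayError::NonMonotonicLsn { prev, next: lsn });
            }
        }
        self.last_lsn = Some(lsn);
        Ok(())
    }
}

fn main() {
    let mut cursor = ReplayCursor::new();
    assert!(cursor.admit(1).is_ok());
    assert!(cursor.admit(2).is_ok());
    // A stale or duplicated frame aborts replay instead of being skipped.
    assert_eq!(
        cursor.admit(2),
        Err(ReplayError::NonMonotonicLsn { prev: 2, next: 2 })
    );
}
```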
## What Is Actually Shared

### Clearly shared discipline

The following ideas are genuinely common across both engines:

- ordered frame append and replay
- bounded probe-table storage
- bounded retirement queues
- snapshot plus WAL recovery orchestration
- monotonic LSN and request-slot enforcement
- deterministic retry semantics

### Closest to mechanical extraction

These modules are the closest to being extractable later with low semantic risk:

- `retire_queue`
- parts of `fixed_map`
- parts of `wal`
- parts of `wal_file`

`retire_queue` is the strongest example. It is effectively the same data structure in both engines, differing only in the surrounding key types and local tests.
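As a rough illustration of that shared shape, a bounded retirement queue over a plain `u64` op id might look like the following (a hypothetical sketch, not the actual module; each engine's real version differs in key types and surrounding tests):

```rust
use std::collections::VecDeque;

// Hypothetical sketch of the common shape of `retire_queue`.
struct RetireQueue {
    capacity: usize,
    // Oldest retired op id at the front; length never exceeds `capacity`.
    retired: VecDeque<u64>,
}

impl RetireQueue {
    fn new(capacity: usize) -> Self {
        RetireQueue {
            capacity,
            retired: VecDeque::with_capacity(capacity),
        }
    }

    /// Record a retired operation id, evicting the oldest entry once full
    /// so hot-path memory stays bounded.
    fn retire(&mut self, op_id: u64) {
        if self.retired.len() == self.capacity {
            self.retired.pop_front();
        }
        self.retired.push_back(op_id);
    }

    /// Dedupe check: has this operation been retired recently?
    fn contains(&self, op_id: u64) -> bool {
        self.retired.contains(&op_id)
    }
}

fn main() {
    let mut q = RetireQueue::new(2);
    q.retire(1);
    q.retire(2);
    assert!(q.contains(1));
    q.retire(3); // evicts op 1: bounded retirement, not unbounded history
    assert!(!q.contains(1));
    assert!(q.contains(3));
}
```

The linear `contains` scan is deliberate in this toy version; a real hot path would presumably pair the queue with the engines' bounded probe-table storage for O(1) lookup.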
`fixed_map`, `wal`, and `wal_file` do the same job in both engines, but they already diverge in details that matter:

- `fixed_map` in `allocdb-core` carries richer trace logging and more key types
- `wal` and `wal_file` are similar in shape, but their error surfaces and tests already differ
- the extraction point would need to preserve boundedness and fail-closed behavior without introducing generic abstraction noise

## What Is Not Shared Enough

The following should remain engine-local.

### Command and result surfaces

Do not extract:

- command enums
- command codecs
- result codes and domain outcomes
- config types

These are runtime-adjacent, but they are still domain contracts, not generic substrate.

### Snapshot schema

Do not extract snapshot encoding logic yet.

The persistence discipline is shared, but the actual on-disk schema is not:

- `allocdb-core` carries a richer allocator-specific snapshot layout
- `quota-core` carries a much smaller bucket/operation layout
- forcing a generic snapshot schema would either add indirection or erase useful domain structure

The most that could be extracted later is helper machinery, not the schema itself.
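One hedged example of what such helper machinery might look like: a generic length-prefixed, checksummed frame envelope, with the engine-specific schema staying inside the opaque payload. Everything below (`encode_frame`, `decode_frame`, the inline FNV-1a checksum) is a hypothetical sketch, not existing code in either crate:

```rust
// Hypothetical shared envelope helper: the payload schema stays engine-local.

fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf2_9ce4_8422_2325;
    for &b in bytes {
        h ^= b as u64;
        h = h.wrapping_mul(0x0000_0100_0000_01b3);
    }
    h
}

/// Encode as [len: u32 LE][checksum: u64 LE][payload].
fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(12 + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&fnv1a(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Decode, failing closed on truncation or checksum mismatch.
fn decode_frame(buf: &[u8]) -> Result<&[u8], &'static str> {
    if buf.len() < 12 {
        return Err("truncated header");
    }
    let len = u32::from_le_bytes(buf[0..4].try_into().unwrap()) as usize;
    let sum = u64::from_le_bytes(buf[4..12].try_into().unwrap());
    let payload = buf.get(12..12 + len).ok_or("truncated payload")?;
    if fnv1a(payload) != sum {
        return Err("checksum mismatch");
    }
    Ok(payload)
}

fn main() {
    let frame = encode_frame(b"bucket-state");
    assert_eq!(decode_frame(&frame).unwrap(), &b"bucket-state"[..]);

    // Corruption is detected and rejected rather than partially applied.
    let mut corrupt = frame.clone();
    let last = corrupt.len() - 1;
    corrupt[last] ^= 0xff;
    assert!(decode_frame(&corrupt).is_err());
}
```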
### Recovery API surface

Do not extract recovery orchestration yet.

The top-level recovery flow is recognizably similar, but the differences are already meaningful:

- `allocdb-core` has richer replay error variants and slot-overflow reporting
- `allocdb-core` has more operational logging
- the restore path and replay details are still closely tied to engine-specific command decoding and state-machine APIs

There may be a later helper seam here, but not a good generic crate boundary today.

### State machine logic

Do not extract any state-machine layer.

The commonality is only at the discipline level:

- deterministic apply
- bounded state
- retry cache
- logical time

The actual state transitions, invariants, and read models are completely different and should stay separate.
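That discipline-level commonality can still be illustrated without proposing any shared trait. A toy sketch of deterministic apply with a retry cache keyed by logical slot (`Engine`, `Outcome`, and `debit` are invented for illustration and do not match either engine's API; real engines also bound the cache):

```rust
use std::collections::HashMap;

// Hypothetical discipline-level sketch, not a proposed shared abstraction.
#[derive(Clone, Debug, PartialEq)]
enum Outcome {
    Applied(u64), // remaining balance
    Rejected,
}

struct Engine {
    balance: u64,
    // (client, request_slot) -> first outcome; unbounded here for brevity,
    // whereas the real engines retire entries to stay bounded.
    retry_cache: HashMap<(u64, u64), Outcome>,
}

impl Engine {
    fn new(balance: u64) -> Self {
        Engine { balance, retry_cache: HashMap::new() }
    }

    /// Deterministic apply: a duplicate (client, slot) replays the cached
    /// outcome instead of re-executing the transition.
    fn debit(&mut self, client: u64, slot: u64, amount: u64) -> Outcome {
        if let Some(prev) = self.retry_cache.get(&(client, slot)) {
            return prev.clone();
        }
        let outcome = if self.balance >= amount {
            self.balance -= amount;
            Outcome::Applied(self.balance)
        } else {
            Outcome::Rejected
        };
        self.retry_cache.insert((client, slot), outcome.clone());
        outcome
    }
}

fn main() {
    let mut e = Engine::new(10);
    assert_eq!(e.debit(1, 0, 4), Outcome::Applied(6));
    // Retried command: same cached outcome, no double debit.
    assert_eq!(e.debit(1, 0, 4), Outcome::Applied(6));
    assert_eq!(e.balance, 6);
}
```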
## Why Extraction Is Premature Now

Extraction would create cost immediately:

- more crate boundaries
- more generic traits and type plumbing
- more public internal APIs to stabilize
- more coordination every time one engine evolves

But the benefit is still limited:

- only one module is basically mechanical today
- most other overlap is still “same shape, different details”
- there is not yet repeated maintenance pain from fixing the same bug in both engines over time

So the code is similar enough to reveal seams, but not similar enough to deserve a shared runtime crate yet.

## Extraction Triggers Later

Revisit extraction only when one of these becomes true:

- the same runtime bug or improvement lands independently in both engines more than once
- `fixed_map`, `wal`, or `wal_file` stay structurally stable across several follow-on slices
- a third engine appears and wants the same substrate
- the repo starts paying obvious maintenance cost for duplicated runtime fixes

Until then, duplication is cheaper than premature abstraction.

## Recommended Next Step

Treat `M10` as complete.

The next step is not more framework work. The next step is either:

- stop here and keep both engines local while they stabilize, or
- start a new domain/engine experiment only if there is a strong reason to test a third point in the design space

If extraction is revisited later, start with the smallest possible mechanical move:

1. `retire_queue`
2. selected `fixed_map` helpers
3. selected `wal` / `wal_file` helpers

Do not start with snapshot schemas, command codecs, or state-machine traits.

docs/status.md

Lines changed: 5 additions & 5 deletions

@@ -1,7 +1,6 @@
 # AllocDB Status
 ## Current State
-- Phase: replicated implementation with external Jepsen gate closed and M9 lease-kernel follow-on
-  implemented and live-validated
+- Phase: replicated implementation with external Jepsen gate closed, M9 lease-kernel follow-on live-validated, and M10 second-engine proof merged
 - Planning IDs: tasks use `M#-T#`; spikes use `M#-S#`
 - Current milestone status:
 - `M0` semantics freeze: complete enough for core work
@@ -15,6 +14,7 @@
 - `M7` replicated core prototype: in progress
 - `M8` external cluster validation: in progress
 - `M9` generic lease-kernel follow-on: implementation merged on `main`
+- `M10` second-engine proof: merged on `main`; shared runtime extraction deferred
 - Latest completed implementation chunks:
 - `4156a80` `Bootstrap AllocDB core and docs`
 - `f84a641` `Add WAL file and snapshot recovery primitives`
@@ -215,6 +215,6 @@
 reserve, revoke/reclaim, and stale-holder lease paths, then closing the loop with live KubeVirt
 `lease_safety-control` and full `1800s` `lease_safety-crash-restart` evidence on `allocdb-a`,
 both with `blockers=0`
-- the next recommended step is downstream real-cluster e2e work such as `gpu_control_plane`, not more unplanned lease-kernel semantics work
-- the current deployment slice covers a first in-cluster `StatefulSet` shape, but bootstrap-primary routing, failover/rejoin orchestration, and background maintenance remain operator work, and the current staging unblock path is to publish `skel84/allocdb` from GitHub Actions rather than relying on the local Docker engine
-- `M10` is now in implementation and review on PR `#107`: `quota-core` has a bounded in-repo second-engine proof with deterministic `CreateBucket` / `Debit`, refill, snapshot/WAL recovery, and the next remaining readout is whether any shared-runtime seam is actually justified after the second engine exists
+- the next recommended step remains downstream real-cluster e2e work such as `gpu_control_plane`, not more unplanned lease-kernel semantics work; the current deployment slice covers a first in-cluster `StatefulSet` shape, but bootstrap-primary routing, failover/rejoin orchestration, and background maintenance remain operator work, and the current staging unblock path is to publish `skel84/allocdb` from GitHub Actions rather than relying on the local Docker engine
+- PR `#107` merged the `M10` quota-engine proof on `main`: `quota-core` now proves a second deterministic engine in-repo with bounded `CreateBucket` / `Debit`, logical-slot refill, and snapshot/WAL recovery
+- the `M10-T05` seam evaluation concludes that a shared runtime crate is still premature: `retire_queue` is the closest mechanical extraction candidate, but `fixed_map`, `wal`, `wal_file`, recovery, snapshot schema, and all state-machine layers should remain engine-local until repeated maintenance pressure justifies extraction
