diff --git a/docs/README.md b/docs/README.md
index a10306f..a4403af 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -17,6 +17,7 @@
 ## Engineering Docs
 
 - [Architecture](./architecture.md)
+- [Engine Authoring Boundary](./engine-authoring-boundary.md)
 - [Fault Model](./fault-model.md)
 - [Jepsen Refactor Plan](./jepsen-refactor-plan.md)
 - [Lease Kernel Design Decisions](./lease-kernel-design.md)
diff --git a/docs/engine-authoring-boundary.md b/docs/engine-authoring-boundary.md
new file mode 100644
index 0000000..82818db
--- /dev/null
+++ b/docs/engine-authoring-boundary.md
@@ -0,0 +1,201 @@
+# Engine Authoring Boundary
+
+## Purpose
+
+This document closes the first `M13` question: after the `M12` micro-extractions, what is the
+stable boundary between the shared runtime substrate and engine-local semantics?
+
+This is an internal authoring rule, not a public framework story.
+
+## Decision
+
+The shared runtime owns only mechanically shared durability and bounded-state substrate.
+
+Everything that defines domain meaning stays engine-local.
+
+That means:
+
+- use the extracted runtime crates where the seam is already proven
+- keep new engine semantics local by default
+- treat any new generic abstraction above the current substrate as suspect until repeated pressure
+  proves otherwise
+
+## What The Shared Runtime Owns Today
+
+The shared runtime currently owns only the modules that are already shared on `main`.
+
+### Shared and stable enough now
+
+- `allocdb-retire-queue`
+  - bounded retirement queue discipline
+  - no domain meaning beyond ordered retirement bookkeeping
+- `allocdb-wal-frame`
+  - WAL frame versioning
+  - frame header/footer validation
+  - checksum verification
+  - torn-tail and corruption detection at the frame level
+- `allocdb-wal-file`
+  - append-only durable file handle
+  - replace/rewrite discipline
+  - truncation and reopen behavior
+
+### What these modules are allowed to know
+
+Only substrate concerns:
+
+- bytes
+- lengths
+- checksums
+- file paths and file handles
+- bounded queue mechanics
+- ordering and truncation discipline
+
+These modules must not know:
+
+- command schemas
+- result codes
+- resource, bucket, pool, or hold semantics
+- snapshot schemas
+- engine-specific invariants
+
+## What Stays Engine-Local
+
+Each engine still owns the parts that define the database itself.
+
+### Domain contract surface
+
+Keep local:
+
+- command enums
+- command codecs above raw frame transport
+- result codes and read models
+- config surfaces tied to domain semantics
+
+### Persistence schema
+
+Keep local:
+
+- snapshot encoding and decoding
+- snapshot file wrappers while file formats still differ
+- engine-specific recovery error surfaces
+
+### State machine semantics
+
+Keep local:
+
+- apply rules
+- invariants
+- derived indexes
+- logical-slot effects such as refill, expiry, revoke, reclaim, or fencing
+- any internal command semantics above raw WAL framing
+
+### Recovery entry points
+
+Keep local:
+
+- top-level recovery APIs
+- replay orchestration that depends on engine-specific command decoding
+- operational logging tied to one engine's semantics
+
+## Authoring Rules For Future Work
+
+### Rule 1: Start local unless the seam is already proven
+
+When adding a new engine or engine slice:
+
+- use the shared runtime crates only for seams already extracted
+- keep new runtime-adjacent code local until at least two engines want the same thing in the same
+  shape
+
+### Rule 2: Do not generalize state-machine APIs
+
+Do not introduce:
+
+- generic state-machine traits
+- generic apply pipelines
+- generic snapshot schemas
+- generic recovery entry points
+
+Those layers still carry domain meaning and would create abstraction debt faster than maintenance
+relief.
+
+### Rule 3: Extract only below the semantic line
+
+A module is a good runtime candidate only if it can stay below the line where domain meaning starts.
+
+Good examples:
+
+- bytes-on-disk framing
+- bounded retirement bookkeeping
+- file rewrite/truncate mechanics
+
+Bad examples:
+
+- "generic reserve/confirm/release" APIs
+- "generic bucket/pool/resource" models
+- "generic engine config" layers
+
+### Rule 4: Prefer duplication over dishonest abstraction
+
+If a candidate seam requires:
+
+- engine-specific branches
+- feature flags that mirror engine names
+- generic types that only one engine can actually use
+
+then it is not ready.
+
+### Rule 5: New extractions need multi-engine pressure
+
+Do not extract a new runtime module unless at least one of these is true:
+
+- the code is already mechanically identical across engines
+- the same fix or improvement is landing independently in multiple engines
+- a new engine authoring pass clearly pays less copy-paste by using the shared layer
+
+## Current Boundary Map
+
+### Shared runtime
+
+- `allocdb-retire-queue`
+- `allocdb-wal-frame`
+- `allocdb-wal-file`
+
+### Deferred seams
+
+- `snapshot_file`
+  - deferred because the seam is still only clean inside the `quota-core` / `reservation-core`
+    pair
+- bounded collections beyond `retire_queue`
+  - still need proof that the common surface is stable enough
+- recovery helpers above file/frame mechanics
+  - still too tied to engine-local replay contracts
+
+### Explicit non-goals
+
+- no public database-building library claim yet
+- no renaming the repository around framework identity
+- no generic engine kit above the current substrate
+
+## Practical Consequence
+
+A future engine author should think in this order:
+
+1. write engine-local semantics first
+2. consume the existing shared runtime only for proven substrate
+3. copy new runtime-adjacent code locally if the seam is not already explicit
+4. extract later only if repeated pressure proves the boundary
+
+That keeps the repository honest:
+
+- shared where the code is actually shared
+- local where the semantics are still the database
+
+## Next Step
+
+With this boundary in place, the next `M13` step is narrower:
+
+1. write the focused runtime-vs-engine contract note
+2. decide whether that contract already makes a reduced-copy proof likely enough
+3. only then choose whether `M14` still needs a full fourth-engine or can use a smaller engine
+   slice proof
diff --git a/docs/status.md b/docs/status.md
index 7a6563a..5c1f972 100644
--- a/docs/status.md
+++ b/docs/status.md
@@ -217,4 +217,4 @@
 - the next recommended step remains downstream real-cluster e2e work such as `gpu_control_plane`, not more unplanned lease-kernel semantics work; the current deployment slice covers a first in-cluster `StatefulSet` shape, but bootstrap-primary routing, failover/rejoin orchestration, and background maintenance remain operator work, and the current staging unblock path is to publish `skel84/allocdb` from GitHub Actions rather than relying on the local Docker engine
 - PR `#107` merged the `M10` quota-engine proof on `main`, and PRs `#116`, `#117`, and `#118` merged the full `M11` reservation-core chain on `main`: the repository now has a second and third deterministic engine with bounded command sets, logical-slot refill/expiry, and snapshot/WAL recovery proofs
 - PRs `#132`, `#133`, and `#134` merged the first `M12` runtime extractions on `main`: `retire_queue`, `wal`, and `wal_file` are now shared internal substrate instead of copied engine-local modules, while `M12-T04` closed as a defer decision because `snapshot_file` is still only a clean seam inside the `quota-core` / `reservation-core` pair and `allocdb-core` keeps the simpler file format
-- the next roadmap step is now `M13`: define the internal engine authoring boundary in `runtime-extraction-roadmap.md` and stop extraction pressure until that contract is written down
+- the next roadmap step is now `M13`: define the internal engine authoring boundary in `runtime-extraction-roadmap.md` and stop extraction pressure until that contract is written down; the authoring rule is to keep shared runtime below the semantic line and keep command surfaces, snapshot schemas, recovery entry points, and state-machine meaning engine-local
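
Illustrative note on the boundary described in `docs/engine-authoring-boundary.md`: the split between "bytes, lengths, checksums" and "command enums and codecs above raw frame transport" can be sketched roughly as below. This is a minimal sketch under stated assumptions, not the actual `allocdb-wal-frame` API. Every name (`frame_encode`, `frame_decode`, `Command`, `encode_command`, `decode_command`), the frame layout, and the FNV-1a placeholder checksum are hypothetical and chosen only for illustration.

```rust
// Hypothetical sketch; none of these names or layouts come from the allocdb crates.

// ---- Substrate side: knows only bytes, lengths, and a checksum ----

#[derive(Debug, PartialEq, Eq)]
enum FrameError {
    Truncated,
    ChecksumMismatch,
}

/// Placeholder checksum (32-bit FNV-1a); the real frame checksum is an assumption here.
fn checksum(bytes: &[u8]) -> u32 {
    let mut hash: u32 = 0x811c_9dc5;
    for &b in bytes {
        hash ^= u32::from(b);
        hash = hash.wrapping_mul(0x0100_0193);
    }
    hash
}

/// Assumed layout: [payload_len: u32 LE][checksum: u32 LE][payload].
fn frame_encode(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(8 + payload.len());
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&checksum(payload).to_le_bytes());
    out.extend_from_slice(payload);
    out
}

/// Validates length and checksum; never interprets the payload.
fn frame_decode(bytes: &[u8]) -> Result<&[u8], FrameError> {
    if bytes.len() < 8 {
        return Err(FrameError::Truncated);
    }
    let len = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]) as usize;
    let expected = u32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]);
    let end = 8usize.checked_add(len).ok_or(FrameError::Truncated)?;
    let payload = bytes.get(8..end).ok_or(FrameError::Truncated)?;
    if checksum(payload) != expected {
        return Err(FrameError::ChecksumMismatch);
    }
    Ok(payload)
}

// ---- Engine-local side: command enum and codec stay with the engine ----

#[derive(Debug, PartialEq, Eq, Clone, Copy)]
enum Command {
    Reserve { slot: u32 },
    Release { slot: u32 },
}

#[derive(Debug, PartialEq, Eq)]
enum DecodeError {
    Frame(FrameError),
    UnknownCommand,
}

fn encode_command(cmd: Command) -> Vec<u8> {
    let (tag, slot) = match cmd {
        Command::Reserve { slot } => (1u8, slot),
        Command::Release { slot } => (2u8, slot),
    };
    let mut payload = vec![tag];
    payload.extend_from_slice(&slot.to_le_bytes());
    frame_encode(&payload)
}

fn decode_command(bytes: &[u8]) -> Result<Command, DecodeError> {
    let payload = frame_decode(bytes).map_err(DecodeError::Frame)?;
    match payload {
        [1, a, b, c, d] => Ok(Command::Reserve { slot: u32::from_le_bytes([*a, *b, *c, *d]) }),
        [2, a, b, c, d] => Ok(Command::Release { slot: u32::from_le_bytes([*a, *b, *c, *d]) }),
        _ => Err(DecodeError::UnknownCommand),
    }
}
```

In this sketch the frame functions never mention `Command`, so a second engine could reuse them with a different command set, while torn or corrupted frames are rejected before any engine-specific decoding runs.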