1 change: 1 addition & 0 deletions docs/README.md
@@ -17,6 +17,7 @@
## Engineering Docs

- [Architecture](./architecture.md)
- [Engine Authoring Boundary](./engine-authoring-boundary.md)
- [Fault Model](./fault-model.md)
- [Jepsen Refactor Plan](./jepsen-refactor-plan.md)
- [Lease Kernel Design Decisions](./lease-kernel-design.md)
201 changes: 201 additions & 0 deletions docs/engine-authoring-boundary.md
@@ -0,0 +1,201 @@
# Engine Authoring Boundary

## Purpose

This document closes the first `M13` question: after the `M12` micro-extractions, what is the
stable boundary between the shared runtime substrate and engine-local semantics?

This is an internal authoring rule, not a public framework story.

## Decision

The shared runtime owns only mechanically shared durability and bounded-state substrate.

Everything that defines domain meaning stays engine-local.

That means:

- use the extracted runtime crates where the seam is already proven
- keep new engine semantics local by default
- treat any new generic abstraction above the current substrate as suspect until repeated pressure
  proves otherwise

## What The Shared Runtime Owns Today

The shared runtime currently owns only the modules that are already shared on `main`.

### Shared and stable enough now

- `allocdb-retire-queue`
  - bounded retirement queue discipline
  - no domain meaning beyond ordered retirement bookkeeping
- `allocdb-wal-frame`
  - WAL frame versioning
  - frame header/footer validation
  - checksum verification
  - torn-tail and corruption detection at the frame level
- `allocdb-wal-file`
  - append-only durable file handle
  - replace/rewrite discipline
  - truncation and reopen behavior
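
As an illustration of "bounded retirement queue discipline" with no domain meaning, here is a minimal sketch. It is not the real `allocdb-retire-queue` API: the type `RetireQueue`, the `retire_at` ordering key, and the error names are all assumptions made for this example. The point is only that the queue sees an opaque payload `T` plus an ordering key, and nothing about what is being retired.

```rust
use std::collections::VecDeque;

/// Hypothetical bounded FIFO of retirement entries (illustrative only).
/// Entries are kept in non-decreasing `retire_at` order.
#[derive(Debug)]
pub struct RetireQueue<T> {
    entries: VecDeque<(u64, T)>, // (retire_at, opaque payload)
    capacity: usize,
}

#[derive(Debug, PartialEq, Eq)]
pub enum PushError {
    /// The fixed capacity is exhausted.
    Full,
    /// The new entry would retire before the current tail.
    OutOfOrder,
}

impl<T> RetireQueue<T> {
    pub fn with_capacity(capacity: usize) -> Self {
        RetireQueue {
            entries: VecDeque::with_capacity(capacity),
            capacity,
        }
    }

    /// Appends an entry, enforcing the bound and the ordering discipline.
    pub fn push(&mut self, retire_at: u64, item: T) -> Result<(), PushError> {
        if self.entries.len() >= self.capacity {
            return Err(PushError::Full);
        }
        if let Some((last, _)) = self.entries.back() {
            if retire_at < *last {
                return Err(PushError::OutOfOrder);
            }
        }
        self.entries.push_back((retire_at, item));
        Ok(())
    }

    /// Removes and returns every entry whose retirement point is at or before `now`.
    pub fn drain_due(&mut self, now: u64) -> Vec<T> {
        let mut due = Vec::new();
        while let Some((at, _)) = self.entries.front() {
            if *at > now {
                break;
            }
            if let Some((_, item)) = self.entries.pop_front() {
                due.push(item);
            }
        }
        due
    }

    pub fn len(&self) -> usize {
        self.entries.len()
    }
}
```

An engine would decide what `T` is and what `retire_at` means (a logical slot, an LSN, or something else); the queue itself only enforces the bound and the order.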

### What these modules are allowed to know

Only substrate concerns:

- bytes
- lengths
- checksums
- file paths and file handles
- bounded queue mechanics
- ordering and truncation discipline

These modules must not know:

- command schemas
- result codes
- resource, bucket, pool, or hold semantics
- snapshot schemas
- engine-specific invariants
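
To make the allowed vocabulary concrete, the sketch below shows a frame layer that only ever sees bytes, lengths, and checksums. Everything in it is an assumption for illustration: the layout (1-byte version, 4-byte little-endian length, payload, 4-byte checksum), the names `encode_frame` and `decode_frame`, and the FNV-1a checksum, which is a stand-in rather than whatever `allocdb-wal-frame` actually uses.

```rust
const VERSION: u8 = 1;
const HEADER_LEN: usize = 5; // version (1 byte) + payload length (4 bytes, LE)
const FOOTER_LEN: usize = 4; // checksum (4 bytes, LE)

/// Stand-in checksum (32-bit FNV-1a); a real WAL would likely use a CRC.
fn checksum(bytes: &[u8]) -> u32 {
    let mut h: u32 = 0x811c_9dc5;
    for b in bytes {
        h ^= u32::from(*b);
        h = h.wrapping_mul(0x0100_0193);
    }
    h
}

/// Wraps an opaque payload in header and footer. The payload is never inspected.
pub fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(HEADER_LEN + payload.len() + FOOTER_LEN);
    out.push(VERSION);
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(payload);
    let sum = checksum(&out); // covers header and payload
    out.extend_from_slice(&sum.to_le_bytes());
    out
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decoded<'a> {
    /// A complete, valid frame and the number of bytes it occupied.
    Frame { payload: &'a [u8], consumed: usize },
    /// The buffer ends before the frame does (candidate for truncation).
    TornTail,
    UnsupportedVersion(u8),
    /// The frame is complete but its checksum does not match.
    Corrupt,
}

pub fn decode_frame<'a>(buf: &'a [u8]) -> Decoded<'a> {
    if buf.len() < HEADER_LEN {
        return Decoded::TornTail;
    }
    if buf[0] != VERSION {
        return Decoded::UnsupportedVersion(buf[0]);
    }
    let len = u32::from_le_bytes([buf[1], buf[2], buf[3], buf[4]]) as usize;
    let body_end = HEADER_LEN + len;
    let total = body_end + FOOTER_LEN;
    if buf.len() < total {
        return Decoded::TornTail;
    }
    let stored = u32::from_le_bytes([
        buf[body_end],
        buf[body_end + 1],
        buf[body_end + 2],
        buf[body_end + 3],
    ]);
    if stored != checksum(&buf[..body_end]) {
        return Decoded::Corrupt;
    }
    Decoded::Frame {
        payload: &buf[HEADER_LEN..body_end],
        consumed: total,
    }
}
```

Nothing here names a command, a result code, or a resource. If a function in this layer needed to match on payload contents, that would be a sign it had crossed the boundary.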

## What Stays Engine-Local

Each engine still owns the parts that define the database itself.

### Domain contract surface

Keep local:

- command enums
- command codecs above raw frame transport
- result codes and read models
- config surfaces tied to domain semantics
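
A minimal sketch of what "command codecs above raw frame transport" can look like when kept local. The `Command` variants, tags, and 13-byte layout are hypothetical and do not reflect any existing engine's schema; the shared runtime would only ever see the resulting `Vec<u8>` as an opaque payload.

```rust
/// Hypothetical engine-local command set (illustrative only).
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Command {
    Reserve { bucket: u32, amount: u64 },
    Release { bucket: u32, amount: u64 },
}

#[derive(Debug, PartialEq, Eq)]
pub enum DecodeError {
    BadLength,
    UnknownTag(u8),
}

const ENCODED_LEN: usize = 13; // tag (1) + bucket (4, LE) + amount (8, LE)

impl Command {
    /// Encodes the command into the bytes that a frame layer would carry.
    pub fn encode(&self) -> Vec<u8> {
        let (tag, bucket, amount) = match *self {
            Command::Reserve { bucket, amount } => (1u8, bucket, amount),
            Command::Release { bucket, amount } => (2u8, bucket, amount),
        };
        let mut out = Vec::with_capacity(ENCODED_LEN);
        out.push(tag);
        out.extend_from_slice(&bucket.to_le_bytes());
        out.extend_from_slice(&amount.to_le_bytes());
        out
    }

    /// Decodes a payload previously produced by `encode`.
    pub fn decode(payload: &[u8]) -> Result<Command, DecodeError> {
        if payload.len() != ENCODED_LEN {
            return Err(DecodeError::BadLength);
        }
        let mut b = [0u8; 4];
        b.copy_from_slice(&payload[1..5]);
        let mut a = [0u8; 8];
        a.copy_from_slice(&payload[5..13]);
        let bucket = u32::from_le_bytes(b);
        let amount = u64::from_le_bytes(a);
        match payload[0] {
            1 => Ok(Command::Reserve { bucket, amount }),
            2 => Ok(Command::Release { bucket, amount }),
            tag => Err(DecodeError::UnknownTag(tag)),
        }
    }
}
```

Because the tags and field layout encode domain meaning, two engines with different command sets would each carry their own version of this codec rather than share one.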

### Persistence schema

Keep local:

- snapshot encoding and decoding
- snapshot file wrappers while file formats still differ
- engine-specific recovery error surfaces

### State machine semantics

Keep local:

- apply rules
- invariants
- derived indexes
- logical-slot effects such as refill, expiry, revoke, reclaim, or fencing
- any internal command semantics above raw WAL framing
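
For a sense of why apply rules and invariants resist sharing, consider this sketch of a tiny quota-style state machine. The engine, its commands, and its result codes are invented for the example and are not taken from `quota-core` or `reservation-core`. Every branch encodes a domain decision, so there is no mechanical layer to lift out.

```rust
use std::collections::BTreeMap;

/// Hypothetical command set for the sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Command {
    Reserve { bucket: u32, amount: u64 },
    Release { bucket: u32, amount: u64 },
}

/// Hypothetical result codes; each one is a domain-level outcome.
#[derive(Debug, PartialEq, Eq)]
pub enum ResultCode {
    Applied,
    UnknownBucket,
    InsufficientCapacity,
    ReleaseExceedsHeld,
}

struct Bucket {
    capacity: u64,
    held: u64, // invariant: held <= capacity
}

pub struct Engine {
    buckets: BTreeMap<u32, Bucket>,
}

impl Engine {
    pub fn new() -> Self {
        Engine { buckets: BTreeMap::new() }
    }

    pub fn add_bucket(&mut self, id: u32, capacity: u64) {
        self.buckets.insert(id, Bucket { capacity, held: 0 });
    }

    /// Deterministic apply rule: same state and command always give the same result.
    pub fn apply(&mut self, cmd: Command) -> ResultCode {
        match cmd {
            Command::Reserve { bucket, amount } => {
                let b = match self.buckets.get_mut(&bucket) {
                    Some(b) => b,
                    None => return ResultCode::UnknownBucket,
                };
                // Cannot underflow because held <= capacity is maintained below.
                if amount > b.capacity - b.held {
                    return ResultCode::InsufficientCapacity;
                }
                b.held += amount;
                ResultCode::Applied
            }
            Command::Release { bucket, amount } => {
                let b = match self.buckets.get_mut(&bucket) {
                    Some(b) => b,
                    None => return ResultCode::UnknownBucket,
                };
                if amount > b.held {
                    return ResultCode::ReleaseExceedsHeld;
                }
                b.held -= amount;
                ResultCode::Applied
            }
        }
    }

    pub fn held(&self, bucket: u32) -> Option<u64> {
        self.buckets.get(&bucket).map(|b| b.held)
    }
}
```

A reservation-style engine would have different commands, different rejection reasons, and different invariants, which is the argument for leaving `apply` and everything it touches engine-local.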

### Recovery entry points

Keep local:

- top-level recovery APIs
- replay orchestration that depends on engine-specific command decoding
- operational logging tied to one engine's semantics

## Authoring Rules For Future Work

### Rule 1: Start local unless the seam is already proven

When adding a new engine or engine slice:

- use the shared runtime crates only for seams already extracted
- keep new runtime-adjacent code local until at least two engines want the same thing in the same
  shape

### Rule 2: Do not generalize state-machine APIs

Do not introduce:

- generic state-machine traits
- generic apply pipelines
- generic snapshot schemas
- generic recovery entry points

Those layers still carry domain meaning, and generalizing them would create abstraction debt faster
than it would relieve maintenance.

### Rule 3: Extract only below the semantic line

A module is a good runtime candidate only if it can stay below the line where domain meaning starts.

Good examples:

- bytes-on-disk framing
- bounded retirement bookkeeping
- file rewrite/truncate mechanics

Bad examples:

- "generic reserve/confirm/release" APIs
- "generic bucket/pool/resource" models
- "generic engine config" layers
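
The "file rewrite/truncate mechanics" example above can be sketched using only the Rust standard library. The function names `append_durable` and `truncate_to` are assumptions for this illustration and are not the `allocdb-wal-file` API; a production implementation would likely also sync the parent directory and keep a long-lived handle, which this sketch omits.

```rust
use std::fs::OpenOptions;
use std::io::{self, Write};
use std::path::Path;

/// Appends bytes to the file, syncs them to storage, and returns the new length.
pub fn append_durable(path: &Path, bytes: &[u8]) -> io::Result<u64> {
    let mut file = OpenOptions::new().create(true).append(true).open(path)?;
    file.write_all(bytes)?;
    file.sync_all()?;
    Ok(file.metadata()?.len())
}

/// Cuts the file back to `valid_len`, for example after a torn tail was detected.
/// Refuses to grow the file, since that would fabricate bytes.
pub fn truncate_to(path: &Path, valid_len: u64) -> io::Result<()> {
    let file = OpenOptions::new().write(true).open(path)?;
    if valid_len > file.metadata()?.len() {
        return Err(io::Error::new(
            io::ErrorKind::InvalidInput,
            "valid_len exceeds current file length",
        ));
    }
    file.set_len(valid_len)?;
    file.sync_all()
}
```

Both functions take a path, bytes, and a length. Neither needs to know which engine wrote the file, which is what keeps this kind of code below the semantic line.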

### Rule 4: Prefer duplication over dishonest abstraction

If a candidate seam requires:

- engine-specific branches
- feature flags that mirror engine names
- generic types that only one engine can actually use

then it is not ready.

### Rule 5: New extractions need multi-engine pressure

Do not extract a new runtime module unless at least one of these is true:

- the code is already mechanically identical across engines
- the same fix or improvement is landing independently in multiple engines
- a new engine authoring pass clearly needs less copy-paste when it uses the shared layer

## Current Boundary Map

### Shared runtime

- `allocdb-retire-queue`
- `allocdb-wal-frame`
- `allocdb-wal-file`

### Deferred seams

- `snapshot_file`
  - deferred because the seam is still only clean inside the `quota-core` / `reservation-core`
    pair
- bounded collections beyond `retire_queue`
  - still need proof that the common surface is stable enough
- recovery helpers above file/frame mechanics
  - still too tied to engine-local replay contracts

### Explicit non-goals

- no public database-building library claim yet
- no renaming the repository around framework identity
- no generic engine kit above the current substrate

## Practical Consequence

A future engine author should think in this order:

1. write engine-local semantics first
2. consume the existing shared runtime only for proven substrate
3. copy new runtime-adjacent code locally if the seam is not already explicit
4. extract later only if repeated pressure proves the boundary

That keeps the repository honest:

- shared where the code is actually shared
- local where the semantics are still the database

## Next Step

With this boundary in place, the next `M13` step is narrower:

1. write the focused runtime-vs-engine contract note
2. decide whether that contract already makes a reduced-copy proof likely enough
3. only then choose whether `M14` still needs a full fourth-engine or can use a smaller engine
   slice proof
Comment on lines +1 to +201

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Verify whether roadmap includes/links the new boundary contract and reflects canonical ownership.

fd runtime-extraction-roadmap.md
rg -n -C2 'engine-authoring-boundary|semantic line|M13|authoring boundary|canonical' docs/runtime-extraction-roadmap.md docs/status.md docs/README.md

Repository: skel84/allocdb

Length of output: 2922


Update docs/runtime-extraction-roadmap.md to reference engine-authoring-boundary.md as the canonical M13 authoring contract.

The roadmap doc currently describes M13 as "Internal Engine Authoring Contract" but does not reference or delegate to the newly added engine-authoring-boundary.md file. Meanwhile, status.md still refers to the roadmap as the location where this boundary is defined. Per the coding guidelines, documentation must stay aligned with design and code changes. The roadmap should explicitly reference the boundary contract document so readers know where the canonical authoring rules live.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/engine-authoring-boundary.md` around lines 1 - 201, Add an explicit
pointer from the runtime-extraction-roadmap document to the new
engine-authoring-boundary document by updating the section that describes M13
("Internal Engine Authoring Contract") to reference or link to
engine-authoring-boundary.md as the canonical M13 authoring contract (look for
the M13 heading/title in runtime-extraction-roadmap.md). Also update any mention
in status.md that currently claims the roadmap defines the boundary so it
instead points readers to engine-authoring-boundary.md as the authoritative
source (search for the phrase that references the roadmap/boundary in
status.md).

Comment on lines +196 to +201

⚠️ Potential issue | 🟡 Minor

“Next Step” section contradicts this newly added contract doc.

Line 198 says the next step is to write the contract note, but this file is already that contract. Please update this section to the post-contract decision point to avoid sequencing ambiguity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/engine-authoring-boundary.md` around lines 196 - 201, The "Next Step"
section currently instructs the reader to "write the focused runtime-vs-engine
contract note" even though this file is that contract; update the paragraph and
bullet list so it reads as the post-contract decision point: remove or replace
the first bullet ("write the focused runtime-vs-engine contract note") and
rephrase the opening sentence to "With this contract in place, the next M13 step
is narrower:" (or similar), then ensure bullets only cover: decide whether the
contract makes a reduced-copy proof likely enough, and only then choose whether
M14 still needs a full fourth-engine or can use a smaller engine-slice proof;
preserve references to M13/M14 and the decision sequencing in the updated text.

2 changes: 1 addition & 1 deletion docs/status.md
@@ -217,4 +217,4 @@
- the next recommended step remains downstream real-cluster e2e work such as `gpu_control_plane`, not more unplanned lease-kernel semantics work; the current deployment slice covers a first in-cluster `StatefulSet` shape, but bootstrap-primary routing, failover/rejoin orchestration, and background maintenance remain operator work, and the current staging unblock path is to publish `skel84/allocdb` from GitHub Actions rather than relying on the local Docker engine
- PR `#107` merged the `M10` quota-engine proof on `main`, and PRs `#116`, `#117`, and `#118` merged the full `M11` reservation-core chain on `main`: the repository now has a second and third deterministic engine with bounded command sets, logical-slot refill/expiry, and snapshot/WAL recovery proofs
- PRs `#132`, `#133`, and `#134` merged the first `M12` runtime extractions on `main`: `retire_queue`, `wal`, and `wal_file` are now shared internal substrate instead of copied engine-local modules, while `M12-T04` closed as a defer decision because `snapshot_file` is still only a clean seam inside the `quota-core` / `reservation-core` pair and `allocdb-core` keeps the simpler file format
- the next roadmap step is now `M13`: define the internal engine authoring boundary in `runtime-extraction-roadmap.md` and stop extraction pressure until that contract is written down
- the next roadmap step is now `M13`: define the internal engine authoring boundary in `runtime-extraction-roadmap.md` and stop extraction pressure until that contract is written down; the authoring rule is to keep shared runtime below the semantic line and keep command surfaces, snapshot schemas, recovery entry points, and state-machine meaning engine-local

⚠️ Potential issue | 🟠 Major

docs/status.md next-step text is stale after this PR.

Line 220 still says the next step is to define the boundary contract, but this PR already introduces that contract document. Please advance this line to the next actionable M13 step (e.g., contract validation / M14 decision gate) so status remains a true snapshot.

As per coding guidelines: "Keep docs/status.md current as the single-file progress snapshot for the repository. Update it whenever milestone state, implementation coverage, or the recommended next step materially changes."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@docs/status.md` at line 220, Update docs/status.md: the current "next roadmap
step" referencing M13 and runtime-extraction-roadmap.md is stale because this PR
added that contract; change the single-line text (around the M13 mention) to the
next actionable step (for example: "M13: validate contract implementation / run
contract validation tests and prepare M14 decision gate" or explicitly move to
"M14 decision gate") so the file reflects that the contract document exists and
lists the next work item (contract validation or M14 decision gate) as the
actionable next step.
