- Phase: replicated implementation with external Jepsen gate closed, M9 lease-kernel follow-on live-validated, M10 second-engine proof merged, M11 third-engine proof merged, and M12 runtime-extraction roadmap staged
- Planning IDs: tasks use `M#-T#`; spikes use `M#-S#`
- Current milestone status:
- M0 semantics freeze: complete enough for core work
- M1 pure state machine: implemented
- M1H constant-time core hardening: complete
- M2 durability and recovery: implemented
- M3 submission pipeline: implemented
- M4 simulation: implemented
- M5 single-node alpha surface: implemented
- M6 replication design: implemented
- M7 replicated core prototype: in progress
- M8 external cluster validation: in progress
- M9 generic lease-kernel follow-on: implementation merged on `main`
- M10 second-engine proof: merged on `main`; shared runtime extraction deferred
- M11 third-engine proof: merged on `main`; broad shared runtime still deferred, first micro-extraction now justified
- M12 first internal runtime extractions: planned
- Latest completed implementation chunks:
- `4156a80` Bootstrap AllocDB core and docs
- `f84a641` Add WAL file and snapshot recovery primitives
- `d87c9a7` Add repo guardrails and status tracking
- `79ae34f` Add snapshot persistence and replay recovery
- `1583d67` Use fixed-capacity maps in allocator core
- `3d6ff0f` Fail closed on WAL corruption
- `39f103b` Defer conditional confirm and add health metrics
- `82cb8d8` Add single-node submission engine crate
- current validated chunk: seeded crash-point and WAL-fault coverage across submit, checkpoint,
and recovery boundaries; checked slot and LSN overflow handling; deterministic simulation over
contention, retry timing, and due-expiration ordering; replicated metadata bootstrap and
fail-closed faulted-state entry; majority-backed quorum writes with primary-only reads,
quorum-loss demotion, and higher-view takeover; suffix and snapshot-based stale-replica rejoin
with divergent prepared-suffix discard; promoted partition and primary-crash scenarios that
preserve fail-closed behavior and retry/read continuity after failover; the local
three-replica cluster runner, fault-control harness, and QEMU testbed around the real replica
daemon; the first trusted-core bundle-commit slice with bundle membership, bundle-aware
confirm/release/expire, and bundle regression coverage; the first fencing slice with
lease-epoch propagation, stale-holder rejection, and epoch-aware retry/read coverage; explicit
revoke/reclaim with late-not-early reuse preserved across replay and failover; lease-shaped
node API exposure for bundle membership and authority state; replicated preservation for
committed bundle membership and stale-holder rejection across failover and suffix/snapshot
rejoin; and live KubeVirt Jepsen lease-safety control and
`1800s` crash-restart runs with `blockers=0`
- Trusted-core crate: `crates/allocdb-core`
- Single-node wrapper crate: `crates/allocdb-node`
- Benchmark harness crate: `crates/allocdb-bench`
- In-memory deterministic allocator:
- deterministic fixed-capacity open-addressed resource, reservation, and operation tables
- bounded reservation and operation retirement queues
- bounded timing-wheel expiration index
- `create_resource`, `reserve`, `confirm`, `release`, `revoke`, `reclaim`, and `expire` operations
- bounded health snapshot with logical slot lag, expiration backlog, and operation-table utilization
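The fixed-capacity open-addressed tables can be illustrated with a minimal sketch; `FixedMap`, its linear-probing scheme, and its capacity are illustrative assumptions, not the actual `allocdb-core` layout:

```rust
// Minimal sketch of a fixed-capacity open-addressed map with linear probing.
// Names and layout are illustrative; the real allocdb-core tables differ.
const CAP: usize = 8;

#[derive(Clone, Copy)]
enum Slot {
    Empty,
    Used(u64, u64), // (key, value)
}

struct FixedMap {
    slots: [Slot; CAP],
    len: usize,
}

impl FixedMap {
    fn new() -> Self {
        FixedMap { slots: [Slot::Empty; CAP], len: 0 }
    }

    // Deterministic insert: probes linearly from key % CAP; fails closed when full.
    fn insert(&mut self, key: u64, value: u64) -> Result<(), &'static str> {
        if self.len == CAP {
            return Err("table full");
        }
        let mut i = (key as usize) % CAP;
        loop {
            match self.slots[i] {
                Slot::Empty => {
                    self.slots[i] = Slot::Used(key, value);
                    self.len += 1;
                    return Ok(());
                }
                Slot::Used(k, _) if k == key => {
                    self.slots[i] = Slot::Used(key, value);
                    return Ok(());
                }
                _ => i = (i + 1) % CAP,
            }
        }
    }

    fn get(&self, key: u64) -> Option<u64> {
        let mut i = (key as usize) % CAP;
        for _ in 0..CAP {
            match self.slots[i] {
                Slot::Used(k, v) if k == key => return Some(v),
                Slot::Empty => return None,
                _ => i = (i + 1) % CAP,
            }
        }
        None
    }
}

fn main() {
    let mut m = FixedMap::new();
    m.insert(1, 10).unwrap();
    m.insert(9, 90).unwrap(); // 9 % 8 == 1 collides with key 1, probes forward
    assert_eq!(m.get(9), Some(90));
}
```

The fixed array plus fail-closed `insert` is what makes capacity a hard, pre-declared bound rather than a heap-growth event.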
- In-process submission engine:
- typed and encoded request validation before commit
- bounded submission queue with deterministic overload behavior
- LSN assignment, WAL append, sync, and live apply
- definite pre-commit rejection for request slots whose derived deadline, history, or dedupe windows would overflow `u64`
- pre-sequencing duplicate lookup for applied and already-queued `operation_id`
- strict-read fence by applied LSN
- restart path from snapshot plus WAL
- explicit definite-vs-indefinite submission error categorization
- explicit restart-and-retry handling for ambiguous WAL failures within the dedupe window
- explicit `lsn_exhausted` write rejection after the engine commits the last representable LSN
- node-level metrics for queue pressure, write acceptance, startup recovery status, and active snapshot anchor
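The definite-vs-indefinite error split described above might look roughly like the following sketch; the variant names and `is_definite` helper are assumptions for illustration, not the crate's real error type:

```rust
// Sketch of definite vs indefinite submission outcomes. Variant names are
// illustrative assumptions, not allocdb-node's actual error enum.
#[derive(Debug)]
enum SubmitError {
    // Definite: the command certainly did not commit; safe to report or retry.
    QueueFull,
    SlotWindowOverflow,
    LsnExhausted,
    // Indefinite: the WAL write outcome is unknown; the caller must restart,
    // recover, and rely on the dedupe-window lookup before retrying.
    AmbiguousWalFailure,
}

impl SubmitError {
    fn is_definite(&self) -> bool {
        !matches!(self, SubmitError::AmbiguousWalFailure)
    }
}

fn main() {
    assert!(SubmitError::QueueFull.is_definite());
    assert!(SubmitError::SlotWindowOverflow.is_definite());
    assert!(SubmitError::LsnExhausted.is_definite());
    assert!(!SubmitError::AmbiguousWalFailure.is_definite());
}
```

Encoding the distinction in the type keeps retry policy a property of the error category rather than a per-call-site judgment.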
- Deterministic benchmark harness:
- CLI entrypoint at `cargo run -p allocdb-bench -- --scenario all`
- one-resource-many-contenders scenario for hot-spot reserve contention
- high-retry-pressure scenario for duplicate replay, conflict replay, full dedupe table rejection, and post-window recovery
- scenario reports include elapsed time, throughput, metrics snapshots, and WAL byte counts
- Alpha API surface:
- transport-neutral request and response types in `crates/allocdb-node::api`
- binary request and response codec with fixed-width little-endian encoding
- explicit wire-level mapping for definite vs indefinite submission failures
- strict-read fence responses plus halt-safe read rejection for resource and reservation queries
- retired reservation lookups remain distinct from `not_found` across later writes and snapshot restore through bounded retired-watermark metadata
- bounded `tick_expirations` maintenance request for live TTL enforcement
- metrics exposure through the same API boundary
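A fixed-width little-endian codec of this general shape can be sketched as follows; the two-field frame layout is an assumption for illustration, not the documented wire format:

```rust
// Sketch of fixed-width little-endian framing for one hypothetical field pair
// (a u16 opcode plus a u64 resource id). The layout is illustrative only.
fn encode(op: u16, resource_id: u64) -> [u8; 10] {
    let mut buf = [0u8; 10];
    buf[0..2].copy_from_slice(&op.to_le_bytes());
    buf[2..10].copy_from_slice(&resource_id.to_le_bytes());
    buf
}

fn decode(buf: &[u8; 10]) -> (u16, u64) {
    let op = u16::from_le_bytes([buf[0], buf[1]]);
    let mut id_bytes = [0u8; 8];
    id_bytes.copy_from_slice(&buf[2..10]);
    (op, u64::from_le_bytes(id_bytes))
}

fn main() {
    let frame = encode(3, 42);
    assert_eq!(decode(&frame), (3, 42));
}
```

Fixed-width fields make frame sizes statically known, so validation can reject short or oversized payloads before any commit work happens.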
- Operator documentation:
- operator-facing runbook for the single-node alpha, local replicated cluster runner, local QEMU testbed, and first Kubernetes deployment shape
- Kubernetes deployment packaging:
- one container build, one DNS-backed layout generator for `cluster-layout.txt`, and one first `deploy/kubernetes` install shape with a bootstrap-primary service and per-replica PVCs
- one GitHub Actions image-publish workflow for Docker Hub staging and release tags
- Follow-on planning:
- one draft lease-kernel follow-on plan that narrows the next trusted-core additions to bundle ownership, fencing, revoke, and an explicit liveness boundary, framed as generic scarce-resource semantics rather than product-specific behavior
- one draft lease-kernel design-decision document that chooses a first-class lease authority object, bundle size `1` as the single-resource semantic special case, a lease-scoped fencing token, and a two-stage `revoke -> reclaim` safety model
- one merged authoritative-docs pass under issue `#80` that rewrote semantics, API, architecture, and fault-model docs to the approved lease-centric contract while keeping the current reservation-centric implementation explicitly marked as compatibility surface
- one merged `M9-T08` planning note that narrows revoke/reclaim implementation scope before the code-bearing revoke branch
- Replication design draft:
- VSR-style primary/backup replicated log with fixed membership and majority quorums
- primary-only reads in the first replicated release
- protocol invariants that preserve single-node idempotency, strict-read, TTL, and reservation-ID semantics across failover
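The majority-quorum arithmetic behind the design above is small enough to sketch directly; the helper names are illustrative:

```rust
// Majority quorum for fixed membership: a write is publishable only after a
// strict majority of replicas has durably appended it. Helper names are
// illustrative, not the repository's actual API.
fn quorum(n_replicas: usize) -> usize {
    n_replicas / 2 + 1
}

fn can_publish(acks: usize, n_replicas: usize) -> bool {
    acks >= quorum(n_replicas)
}

fn main() {
    assert_eq!(quorum(3), 2); // three-replica cluster: primary plus one backup
    assert!(can_publish(2, 3));
    assert!(!can_publish(1, 3)); // a stranded primary alone cannot publish
}
```

The same threshold drives quorum-loss demotion: a primary that can no longer reach `quorum(n) - 1` backups must take itself out of service.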
- Replicated validation planning:
- deterministic cluster-simulation plan that extends seeded simulation to partitions, primary crash, and rejoin without a mock semantics layer
- Jepsen gate with explicit contention, ambiguity, failover, and expiration workloads
- supplementary Jepsen lease-safety coverage for bundle reserve, revoke/reclaim, and stale-holder rejection without changing the documented release-gate matrix
- retry-aware history interpretation and release-blocking invariants for duplicate execution, stale successful reads, double allocation, early reuse, and stale-holder acceptance
- Host-side Jepsen harness slice:
- one release-gate matrix planner, one retry-aware history codec/analyzer, one host-side artifact bundler for duplicate-execution, double-allocation, stale-read, early-expiration, unresolved-ambiguity, and fetched external-cluster log checks, plus explicit `verify-qemu-surface` and `verify-kubevirt-surface` probes that exercise one real metrics round trip on every replica and one real primary submit/read round trip through the live replicated protocol surface
- one supplementary `lease_safety` workload family with control and crash-restart runs that exercises bundle reserve, explicit revoke/reclaim, and stale-holder release against the live Jepsen surface without promoting that workload into the release-blocking matrix yet
- one real `run-qemu` and one real `run-kubevirt` executor for the full documented release-gate matrix, with persisted histories and artifact bundles for control, crash-restart, partition-heal, and mixed-failover runs, plus host-side failover/rejoin orchestration built from replica workspace export/import and staged `ReplicaNode::recover(...)` rewrites
- one `capture-kubevirt-layout` helper that records the live KubeVirt VM IPs, namespace, helper-pod settings, and SSH key path needed to drive the matrix from the host
- Replicated node scaffolding:
- dedicated replica metadata file with temp-write, rename, and directory-sync durability
- persisted replica identity, role, view, commit point, snapshot anchor, last-normal view, and optional durable vote metadata
- startup bootstrap for missing metadata on both fresh-open and recover paths
- fail-closed `faulted` state when metadata bytes are corrupt, identity is mismatched, or local applied/snapshot state contradicts the persisted replicated metadata
- configurable normal-mode `primary` and `backup` roles for one current view
- explicit `view_uncertain` role plus durable higher-view voting for replicas that lost quorum or are participating in failover
- durable prepared-entry sidecar for pre-commit replicated client commands
- prepare append, commit-through, and strict primary-read guards built around the existing single-node executor rather than a second apply path
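The temp-write, rename, and directory-sync durability pattern for the metadata file can be sketched as follows, assuming a Unix filesystem; file names and the payload are illustrative:

```rust
// Sketch of the temp-write, rename, directory-sync pattern for a durable
// metadata file. Paths and payload are illustrative; assumes a Unix filesystem
// where a directory can be opened and fsynced.
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

fn write_metadata_atomically(dir: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let tmp = dir.join("replica-metadata.tmp");
    let fin = dir.join("replica-metadata");
    // 1. Write the full payload to a temp file and fsync the file contents.
    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?;
    // 2. Atomically replace the old file via rename: readers see either the
    //    complete old bytes or the complete new bytes, never a torn mix.
    fs::rename(&tmp, &fin)?;
    // 3. Fsync the directory so the rename itself survives a crash.
    File::open(dir)?.sync_all()?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("allocdb-meta-demo");
    fs::create_dir_all(&dir)?;
    write_metadata_atomically(&dir, b"view=7 role=backup")?;
    assert_eq!(fs::read(dir.join("replica-metadata"))?, b"view=7 role=backup");
    Ok(())
}
```

A reader that finds corrupt bytes despite this pattern has genuine evidence of damage, which is why the fail-closed `faulted` entry is the right response rather than best-effort repair.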
- Local multi-process cluster runner:
- CLI entrypoint at `cargo run -p allocdb-node --bin allocdb-local-cluster -- <start|stop|status|crash|restart|isolate|heal> ...` with one persisted `cluster-layout.txt`
- stable replica identities, local bounds, and three external replica processes from one command surface
- per-replica loopback `control`, `client`, and `protocol` listeners with `status` and `stop` hooks on `control`
- per-replica pid, log, WAL, snapshot, metadata, and prepared-log paths exposed through `status`, with restart through the real `ReplicaNode::recover` path and stable durable workspace reuse
- one persisted `cluster-faults.txt` file that marks whole-replica client/protocol isolation without affecting control reachability, plus one append-only `cluster-timeline.log` for later checker/debug reuse
- reserved `client` and `protocol` listeners now fail with explicit isolation errors when the local fault harness marks that replica isolated
- real primary-side client/protocol transport for external `submit`, `get_resource`, `get_reservation`, `get_metrics`, and replicated `tick_expirations`, with majority append before publish and backup reads still failing closed as `not primary`
- structured daemon-side logging for successful prepare quorum formation, commit-broadcast acknowledgements, accepted protocol prepare/commit traffic, expiration batch planning, and applied expiration commands
- Durability primitives:
- WAL frame codec and recovery scan
- file-backed WAL append, sync, recovery, and torn-tail truncation
- fail-closed recovery on middle-of-log corruption
- fail-closed recovery on non-monotonic WAL replay metadata and malformed decoded snapshot semantics
- fail-closed recovery on replayed commands whose derived slot windows overflow configured bounds
- snapshot encode, decode, capture, restore
- file-backed snapshot write and load
- explicit WAL command payload encoding and live-path replay recovery
- checkpoint path that writes the new snapshot first, then rewrites retained WAL history
- one-checkpoint WAL overlap and `snapshot_marker` retention for safe checkpoint replacement
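The snapshot-first checkpoint ordering can be sketched as a two-step sequence; the closure-based shape and names here are illustrative, not the engine's real signatures:

```rust
// Sketch of snapshot-first checkpoint ordering: the new snapshot must be
// durable before retained WAL history is rewritten, so a crash between the two
// steps still leaves a recoverable (snapshot + overlapping WAL) pair.
// Function shapes and names are illustrative placeholders.
fn checkpoint(
    write_snapshot: impl FnOnce() -> std::io::Result<u64>, // returns snapshot LSN
    rewrite_wal_from: impl FnOnce(u64) -> std::io::Result<()>,
) -> std::io::Result<u64> {
    // Step 1: persist the snapshot (and its marker) durably first.
    let snapshot_lsn = write_snapshot()?;
    // Step 2: only then drop WAL history covered by the snapshot. A crash
    // before this step costs replay time, never data.
    rewrite_wal_from(snapshot_lsn)?;
    Ok(snapshot_lsn)
}

fn main() {
    let lsn = checkpoint(
        || Ok(42),
        |from| {
            assert_eq!(from, 42);
            Ok(())
        },
    )
    .unwrap();
    assert_eq!(lsn, 42);
}
```

Inverting the order would create a window where the WAL has already been trimmed but the snapshot is not yet durable, which is exactly the unrecoverable state this ordering rules out.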
- Deterministic simulation support:
- reusable simulation harness in `crates/allocdb-node/src/simulation.rs`
- explicit simulated slot advancement under test control, with no wall-clock reads in the exercised engine path
- seeded same-slot ready-set scheduling with reproducible transcripts
- seeded labeled schedule actions that resolve candidate slot windows into replayable submit/tick transcripts
- seeded due-expiration selection over the real internal-expire path, bounded by the production per-tick expiration limit
- seeded one-shot crash plans over named client-submit, internal-apply, checkpoint, and recovery boundaries
- one-shot storage fault helpers over append failure, sync failure, checksum mismatch, and torn-tail WAL mutation against real on-disk recovery
- checkpoint, restart, and live write-fault helpers over the real `SingleNodeEngine`
- regression coverage for crash-selected post-sync submit replay, crash-after-snapshot-write checkpoint recovery, replay-interrupted recovery restart, sync-failure retry recovery, checksum-corruption fail-closed restart, torn-tail truncation retry, ingress contention winner order, same-deadline expiration order, mixed-deadline earliest-first expiration priority, and retry timing across the dedupe window
- reusable replicated cluster harness in `crates/allocdb-node/src/replicated_simulation.rs`
- three real `ReplicaNode`s with independent WAL, snapshot, and metadata workspaces
- explicit replica-to-replica and client-to-replica connectivity matrix under test control
- explicit protocol-message queue plus replayable transcripts for queue, deliver, drop, crash, and restart actions
- real `prepare`, `prepare_ack`, and `commit` protocol payload delivery on that queue
- configured-primary client submit flow with result publication only after majority durable append
- retry-aware client submit helper that returns one cached committed result on the current primary instead of assigning a fresh replicated LSN
- backup replicas that durably append prepares but do not apply allocator state until commit
- primary-only resource reads guarded by the existing strict-read fence after local commit
- automatic quorum-loss detection that demotes a stranded primary out of service
- explicit higher-view takeover that records durable votes from a reachable majority, reconstructs the safe committed prefix on the new primary, discards stale uncommitted suffix, and drops old-view protocol messages
- replica crash as loss of volatile state with restart through real `ReplicaNode::recover`
- checkpoint-assisted rejoin that rewrites one stale replica from suffix-only WAL catch-up or snapshot transfer, then restarts through the real recovery path before returning the replica to backup mode
- regression coverage for quorum-loss fail-closed reads and writes, higher-view takeover with stale-primary read rejection, prepared-suffix recovery from another voter during takeover, isolated-backup partition heal and catch-up, non-quorum split fail-closed behavior with later rejoin convergence, primary crash before quorum append, primary crash after majority append, primary crash after reply, suffix-only rejoin, snapshot-transfer rejoin, and faulted rejoin rejection
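The seeded ready-set scheduling with reproducible transcripts can be sketched with a tiny deterministic PRNG; the xorshift generator and action labels are illustrative stand-ins for the harness's internals:

```rust
// Sketch of seeded ready-set scheduling: a deterministic PRNG picks the next
// action from the ready set, so the same seed always replays the same
// transcript. The xorshift64 generator and labels are illustrative stand-ins.
struct Rng(u64);

impl Rng {
    fn next(&mut self) -> u64 {
        // xorshift64; seed must be nonzero.
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

fn schedule(seed: u64, mut ready: Vec<&'static str>) -> Vec<&'static str> {
    let mut rng = Rng(seed);
    let mut transcript = Vec::new();
    while !ready.is_empty() {
        let i = (rng.next() as usize) % ready.len();
        transcript.push(ready.swap_remove(i));
    }
    transcript
}

fn main() {
    let a = schedule(7, vec!["submit-a", "submit-b", "tick"]);
    let b = schedule(7, vec!["submit-a", "submit-b", "tick"]);
    assert_eq!(a, b); // same seed, same transcript: any failure is replayable
}
```

Because the only nondeterminism is the seed, a failing regression run is fully described by its seed and transcript, which is what makes the crash-point and fault coverage above reproducible.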
- Validation:
- core durability: `cargo test -p allocdb-core wal -- --nocapture`, `cargo test -p allocdb-core snapshot -- --nocapture`, `cargo test -p allocdb-core recovery -- --nocapture`, `cargo test -p allocdb-core snapshot_restores_retired_lookup_watermark`
- node runtime: `cargo test -p allocdb-node api_reservation_reports_retired_history`, `cargo test -p allocdb-node engine -- --nocapture`, `cargo test -p allocdb-node replica -- --nocapture`
- simulation: `cargo test -p allocdb-node simulation -- --nocapture`, `cargo test -p allocdb-node replicated_simulation -- --nocapture`
- local cluster, qemu assets, Jepsen harness, and benchmarks: `cargo test -p allocdb-node local_cluster -- --nocapture`, `cargo test -p allocdb-node qemu_testbed -- --nocapture`, `cargo test -p allocdb-node jepsen -- --nocapture`, `cargo test -p allocdb-node --bin allocdb-jepsen -- --nocapture`, `cargo run -p allocdb-node --bin allocdb-jepsen -- plan`, `cargo run -p allocdb-bench -- --scenario all`
- repo gate: `scripts/preflight.sh`
- PR `#82` merged the `#70` maintainability follow-up, including live KubeVirt `reservation_contention-control` and full `1800s` `reservation_contention-crash-restart` reruns on `allocdb-a` with `blockers=0`
- `M9-T01` through `M9-T05` are merged on `main` via PR `#81`, and the planning issues are closed on the `AllocDB` project
- PRs `#89`, `#90`, `#92`, `#93`, `#94`, and `#95` merged the full `M9-T06` through `M9-T11` implementation chain on `main`: bundle commit, lease-epoch fencing, explicit `revoke`/`reclaim`, lease-shaped node API exposure, replication-preserved failover behavior, and broader simulation coverage are now all in the mainline implementation
- PR `#97` merged issue `#96`, extending Jepsen history generation and analysis for bundle reserve, revoke/reclaim, and stale-holder lease paths, then closing the loop with live KubeVirt `lease_safety-control` and full `1800s` `lease_safety-crash-restart` evidence on `allocdb-a` with `blockers=0`
- the next recommended step remains downstream real-cluster e2e work such as `gpu_control_plane`, not more unplanned lease-kernel semantics work; the current deployment slice covers a first in-cluster `StatefulSet` shape, but bootstrap-primary routing, failover/rejoin orchestration, and background maintenance remain operator work, and the current staging unblock path is to publish `skel84/allocdb` from GitHub Actions rather than relying on the local Docker engine
- PR `#107` merged the `M10` quota-engine proof on `main`, and PRs `#116`, `#117`, and `#118` merged the full `M11` reservation-core chain on `main`: the repository now has a second and third deterministic engine with bounded command sets, logical-slot refill/expiry, and snapshot/WAL recovery proofs
- PRs `#132`, `#133`, and `#134` merged the first `M12` runtime extractions on `main`: `retire_queue`, `wal`, and `wal_file` are now shared internal substrate instead of copied engine-local modules, while `M12-T04` closed as a defer decision because `snapshot_file` is still only a clean seam inside the `quota-core`/`reservation-core` pair and `allocdb-core` keeps the simpler file format
- the next roadmap step is now `M13`: define the internal engine authoring boundary in `runtime-extraction-roadmap.md` and stop extraction pressure until that contract is written down; the authoring rule is to keep shared runtime below the semantic line and keep command surfaces, snapshot schemas, recovery entry points, and state-machine meaning engine-local