- Phase: replicated implementation with external Jepsen gate closed, M9 lease-kernel follow-on live-validated, M10 second-engine proof merged, M11 third-engine proof merged, and M12 runtime-extraction roadmap staged
- Planning IDs: tasks use `M#-T#`; spikes use `M#-S#`
- Current milestone status:
- M0 semantics freeze: complete enough for core work
- M1 pure state machine: implemented
- M1H constant-time core hardening: complete
- M2 durability and recovery: implemented
- M3 submission pipeline: implemented
- M4 simulation: implemented
- M5 single-node alpha surface: implemented
- M6 replication design: implemented
- M7 replicated core prototype: in progress
- M8 external cluster validation: in progress
- M9 generic lease-kernel follow-on: implementation merged on `main`
- M10 second-engine proof: merged on `main`; shared runtime extraction deferred
- M11 third-engine proof: merged on `main`; broad shared runtime still deferred, first micro-extraction now justified
- M12 first internal runtime extractions: planned
- Latest completed implementation chunks:
- `4156a80` Bootstrap AllocDB core and docs
- `f84a641` Add WAL file and snapshot recovery primitives
- `d87c9a7` Add repo guardrails and status tracking
- `79ae34f` Add snapshot persistence and replay recovery
- `1583d67` Use fixed-capacity maps in allocator core
- `3d6ff0f` Fail closed on WAL corruption
- `39f103b` Defer conditional confirm and add health metrics
- `82cb8d8` Add single-node submission engine crate
- current validated chunk: seeded crash-point and WAL-fault coverage across submit, checkpoint,
and recovery boundaries; checked slot and LSN overflow handling; deterministic simulation over
contention, retry timing, and due-expiration ordering; replicated metadata bootstrap and
fail-closed faulted-state entry; majority-backed quorum writes with primary-only reads,
quorum-loss demotion, and higher-view takeover; suffix and snapshot-based stale-replica rejoin
with divergent prepared-suffix discard; promoted partition and primary-crash scenarios that
preserve fail-closed behavior and retry/read continuity after failover; the local
three-replica cluster runner, fault-control harness, and QEMU testbed around the real replica
daemon; the first trusted-core bundle-commit slice with bundle membership, bundle-aware
confirm/release/expire, and bundle regression coverage; the first fencing slice with
lease-epoch propagation, stale-holder rejection, and epoch-aware retry/read coverage; explicit
revoke/reclaim with late-not-early reuse preserved across replay and failover; lease-shaped
node API exposure for bundle membership and authority state; replicated preservation for
committed bundle membership and stale-holder rejection across failover and suffix/snapshot
rejoin; and live KubeVirt Jepsen lease-safety control and
`1800s` crash-restart runs with `blockers=0`
- Trusted-core crate: `crates/allocdb-core`
- Single-node wrapper crate: `crates/allocdb-node`
- Benchmark harness crate: `crates/allocdb-bench`
- In-memory deterministic allocator:
- deterministic fixed-capacity open-addressed resource, reservation, and operation tables
- bounded reservation and operation retirement queues
- bounded timing-wheel expiration index
- `create_resource`, `reserve`, `confirm`, `release`, `revoke`, `reclaim`, and `expire` operations
- bounded health snapshot with logical slot lag, expiration backlog, and operation-table utilization
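The fixed-capacity open-addressed tables can be illustrated with a minimal sketch; `FixedMap`, its linear-probing scheme, and its capacity are illustrative assumptions, not the actual `allocdb-core` layout:

```rust
// Minimal sketch of a fixed-capacity open-addressed map with linear probing.
// Names and layout are illustrative; the real allocdb-core tables differ.
const CAP: usize = 8;

#[derive(Clone, Copy)]
enum Slot {
    Empty,
    Used(u64, u64), // (key, value)
}

struct FixedMap {
    slots: [Slot; CAP],
    len: usize,
}

impl FixedMap {
    fn new() -> Self {
        FixedMap { slots: [Slot::Empty; CAP], len: 0 }
    }

    // Deterministic insert: probes linearly from key % CAP; fails closed when full.
    fn insert(&mut self, key: u64, value: u64) -> Result<(), &'static str> {
        if self.len == CAP {
            return Err("table full");
        }
        let mut i = (key as usize) % CAP;
        loop {
            match self.slots[i] {
                Slot::Empty => {
                    self.slots[i] = Slot::Used(key, value);
                    self.len += 1;
                    return Ok(());
                }
                Slot::Used(k, _) if k == key => {
                    self.slots[i] = Slot::Used(key, value);
                    return Ok(());
                }
                _ => i = (i + 1) % CAP,
            }
        }
    }

    fn get(&self, key: u64) -> Option<u64> {
        let mut i = (key as usize) % CAP;
        for _ in 0..CAP {
            match self.slots[i] {
                Slot::Used(k, v) if k == key => return Some(v),
                Slot::Empty => return None,
                _ => i = (i + 1) % CAP,
            }
        }
        None
    }
}

fn main() {
    let mut m = FixedMap::new();
    m.insert(1, 10).unwrap();
    m.insert(9, 90).unwrap(); // 9 % 8 == 1 collides with key 1, probes forward
    assert_eq!(m.get(9), Some(90));
}
```

The fixed array plus fail-closed `insert` is what makes capacity a hard, pre-declared bound rather than a heap-growth event.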
- In-process submission engine:
- typed and encoded request validation before commit
- bounded submission queue with deterministic overload behavior
- LSN assignment, WAL append, sync, and live apply
- definite pre-commit rejection for request slots whose derived deadline, history, or dedupe windows would overflow `u64`
- pre-sequencing duplicate lookup for applied and already-queued `operation_id`
- strict-read fence by applied LSN
- restart path from snapshot plus WAL
- explicit definite-vs-indefinite submission error categorization
- explicit restart-and-retry handling for ambiguous WAL failures within the dedupe window
- explicit `lsn_exhausted` write rejection after the engine commits the last representable LSN
- node-level metrics for queue pressure, write acceptance, startup recovery status, and active snapshot anchor
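The definite-vs-indefinite error split described above might look roughly like the following sketch; the variant names and `is_definite` helper are assumptions for illustration, not the crate's real error type:

```rust
// Sketch of definite vs indefinite submission outcomes. Variant names are
// illustrative assumptions, not allocdb-node's actual error enum.
#[derive(Debug)]
enum SubmitError {
    // Definite: the command certainly did not commit; safe to report or retry.
    QueueFull,
    SlotWindowOverflow,
    LsnExhausted,
    // Indefinite: the WAL write outcome is unknown; the caller must restart,
    // recover, and rely on the dedupe-window lookup before retrying.
    AmbiguousWalFailure,
}

impl SubmitError {
    fn is_definite(&self) -> bool {
        !matches!(self, SubmitError::AmbiguousWalFailure)
    }
}

fn main() {
    assert!(SubmitError::QueueFull.is_definite());
    assert!(SubmitError::SlotWindowOverflow.is_definite());
    assert!(SubmitError::LsnExhausted.is_definite());
    assert!(!SubmitError::AmbiguousWalFailure.is_definite());
}
```

Encoding the distinction in the type keeps retry policy a property of the error category rather than a per-call-site judgment.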
- Deterministic benchmark harness:
- CLI entrypoint at `cargo run -p allocdb-bench -- --scenario all`
- one-resource-many-contenders scenario for hot-spot reserve contention
- high-retry-pressure scenario for duplicate replay, conflict replay, full dedupe table rejection, and post-window recovery
- scenario reports include elapsed time, throughput, metrics snapshots, and WAL byte counts
- Alpha API surface:
- transport-neutral request and response types in `crates/allocdb-node::api`
- binary request and response codec with fixed-width little-endian encoding
- explicit wire-level mapping for definite vs indefinite submission failures
- strict-read fence responses plus halt-safe read rejection for resource and reservation queries
- retired reservation lookups remain distinct from `not_found` across later writes and snapshot restore through bounded retired-watermark metadata
- bounded `tick_expirations` maintenance request for live TTL enforcement
- metrics exposure through the same API boundary
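A fixed-width little-endian codec of this general shape can be sketched as follows; the two-field frame layout is an assumption for illustration, not the documented wire format:

```rust
// Sketch of fixed-width little-endian framing for one hypothetical field pair
// (a u16 opcode plus a u64 resource id). The layout is illustrative only.
fn encode(op: u16, resource_id: u64) -> [u8; 10] {
    let mut buf = [0u8; 10];
    buf[0..2].copy_from_slice(&op.to_le_bytes());
    buf[2..10].copy_from_slice(&resource_id.to_le_bytes());
    buf
}

fn decode(buf: &[u8; 10]) -> (u16, u64) {
    let op = u16::from_le_bytes([buf[0], buf[1]]);
    let mut id_bytes = [0u8; 8];
    id_bytes.copy_from_slice(&buf[2..10]);
    (op, u64::from_le_bytes(id_bytes))
}

fn main() {
    let frame = encode(3, 42);
    assert_eq!(decode(&frame), (3, 42));
}
```

Fixed-width fields make frame sizes statically known, so validation can reject short or oversized payloads before any commit work happens.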
- Operator documentation:
- operator-facing runbook for the single-node alpha, local replicated cluster runner, local QEMU testbed, and first Kubernetes deployment shape
- Kubernetes deployment packaging:
- one container build, one DNS-backed layout generator for `cluster-layout.txt`, and one first `deploy/kubernetes` install shape with a bootstrap-primary service and per-replica PVCs
- one GitHub Actions image-publish workflow for Docker Hub staging and release tags
- Follow-on planning:
- one draft lease-kernel follow-on plan that narrows the next trusted-core additions to bundle ownership, fencing, revoke, and an explicit liveness boundary, framed as generic scarce-resource semantics rather than product-specific behavior
- one draft lease-kernel design-decision document that chooses a first-class lease authority object, bundle size `1` as the single-resource semantic special case, a lease-scoped fencing token, and a two-stage `revoke -> reclaim` safety model
- one merged authoritative-docs pass under issue `#80` that rewrote semantics, API, architecture, and fault-model docs to the approved lease-centric contract while keeping the current reservation-centric implementation explicitly marked as compatibility surface
- one merged `M9-T08` planning note that narrows revoke/reclaim implementation scope before the code-bearing revoke branch
- Replication design draft:
- VSR-style primary/backup replicated log with fixed membership and majority quorums
- primary-only reads in the first replicated release
- protocol invariants that preserve single-node idempotency, strict-read, TTL, and reservation-ID semantics across failover
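The majority-quorum arithmetic behind the design above is small enough to sketch directly; the helper names are illustrative:

```rust
// Majority quorum for fixed membership: a write is publishable only after a
// strict majority of replicas has durably appended it. Helper names are
// illustrative, not the repository's actual API.
fn quorum(n_replicas: usize) -> usize {
    n_replicas / 2 + 1
}

fn can_publish(acks: usize, n_replicas: usize) -> bool {
    acks >= quorum(n_replicas)
}

fn main() {
    assert_eq!(quorum(3), 2); // three-replica cluster: primary plus one backup
    assert!(can_publish(2, 3));
    assert!(!can_publish(1, 3)); // a stranded primary alone cannot publish
}
```

The same threshold drives quorum-loss demotion: a primary that can no longer reach `quorum(n) - 1` backups must take itself out of service.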
- Replicated validation planning:
- deterministic cluster-simulation plan that extends seeded simulation to partitions, primary crash, and rejoin without a mock semantics layer
- Jepsen gate with explicit contention, ambiguity, failover, and expiration workloads
- supplementary Jepsen lease-safety coverage for bundle reserve, revoke/reclaim, and stale-holder rejection without changing the documented release-gate matrix
- retry-aware history interpretation and release-blocking invariants for duplicate execution, stale successful reads, double allocation, early reuse, and stale-holder acceptance
- Host-side Jepsen harness slice:
- one release-gate matrix planner, one retry-aware history codec/analyzer, one host-side artifact bundler for duplicate-execution, double-allocation, stale-read, early-expiration, unresolved-ambiguity, and fetched external-cluster log checks, plus explicit `verify-qemu-surface` and `verify-kubevirt-surface` probes that exercise one real metrics round trip on every replica and one real primary submit/read round trip through the live replicated protocol surface
- one supplementary `lease_safety` workload family with control and crash-restart runs that exercises bundle reserve, explicit revoke/reclaim, and stale-holder release against the live Jepsen surface without promoting that workload into the release-blocking matrix yet
- one real `run-qemu` and one real `run-kubevirt` executor for the full documented release-gate matrix, with persisted histories and artifact bundles for control, crash-restart, partition-heal, and mixed-failover runs, plus host-side failover/rejoin orchestration built from replica workspace export/import and staged `ReplicaNode::recover(...)` rewrites
- one `capture-kubevirt-layout` helper that records the live KubeVirt VM IPs, namespace, helper-pod settings, and SSH key path needed to drive the matrix from the host
- Replicated node scaffolding:
- dedicated replica metadata file with temp-write, rename, and directory-sync durability
- persisted replica identity, role, view, commit point, snapshot anchor, last-normal view, and optional durable vote metadata
- startup bootstrap for missing metadata on both fresh-open and recover paths
- fail-closed `faulted` state when metadata bytes are corrupt, identity is mismatched, or local applied/snapshot state contradicts the persisted replicated metadata
- configurable normal-mode `primary` and `backup` roles for one current view
- explicit `view_uncertain` role plus durable higher-view voting for replicas that lost quorum or are participating in failover
- durable prepared-entry sidecar for pre-commit replicated client commands
- prepare append, commit-through, and strict primary-read guards built around the existing single-node executor rather than a second apply path
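The temp-write, rename, and directory-sync durability pattern for the metadata file can be sketched as follows, assuming a Unix filesystem; file names and the payload are illustrative:

```rust
// Sketch of the temp-write, rename, directory-sync pattern for a durable
// metadata file. Paths and payload are illustrative; assumes a Unix filesystem
// where a directory can be opened and fsynced.
use std::fs::{self, File};
use std::io::Write;
use std::path::Path;

fn write_metadata_atomically(dir: &Path, bytes: &[u8]) -> std::io::Result<()> {
    let tmp = dir.join("replica-metadata.tmp");
    let fin = dir.join("replica-metadata");
    // 1. Write the full payload to a temp file and fsync the file contents.
    let mut f = File::create(&tmp)?;
    f.write_all(bytes)?;
    f.sync_all()?;
    // 2. Atomically replace the old file via rename: readers see either the
    //    complete old bytes or the complete new bytes, never a torn mix.
    fs::rename(&tmp, &fin)?;
    // 3. Fsync the directory so the rename itself survives a crash.
    File::open(dir)?.sync_all()?;
    Ok(())
}

fn main() -> std::io::Result<()> {
    let dir = std::env::temp_dir().join("allocdb-meta-demo");
    fs::create_dir_all(&dir)?;
    write_metadata_atomically(&dir, b"view=7 role=backup")?;
    assert_eq!(fs::read(dir.join("replica-metadata"))?, b"view=7 role=backup");
    Ok(())
}
```

A reader that finds corrupt bytes despite this pattern has genuine evidence of damage, which is why the fail-closed `faulted` entry is the right response rather than best-effort repair.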
- Local multi-process cluster runner:
- CLI entrypoint at `cargo run -p allocdb-node --bin allocdb-local-cluster -- <start|stop|status|crash|restart|isolate|heal> ...` with one persisted `cluster-layout.txt`
- stable replica identities, local bounds, and three external replica processes from one command surface
- per-replica loopback `control`, `client`, and `protocol` listeners with `status` and `stop` hooks on `control`
- per-replica pid, log, WAL, snapshot, metadata, and prepared-log paths exposed through `status`, with restart through the real `ReplicaNode::recover` path and stable durable workspace reuse
- one persisted `cluster-faults.txt` file that marks whole-replica client/protocol isolation without affecting control reachability, plus one append-only `cluster-timeline.log` for later checker/debug reuse
- reserved `client` and `protocol` listeners now fail with explicit isolation errors when the local fault harness marks that replica isolated
- real primary-side client/protocol transport for external `submit`, `get_resource`, `get_reservation`, `get_metrics`, and replicated `tick_expirations`, with majority append before publish and backup reads still failing closed as `not primary`
- structured daemon-side logging for successful prepare quorum formation, commit-broadcast acknowledgements, accepted protocol prepare/commit traffic, expiration batch planning, and applied expiration commands
- Durability primitives:
- WAL frame codec and recovery scan
- file-backed WAL append, sync, recovery, and torn-tail truncation
- fail-closed recovery on middle-of-log corruption
- fail-closed recovery on non-monotonic WAL replay metadata and malformed decoded snapshot semantics
- fail-closed recovery on replayed commands whose derived slot windows overflow configured bounds
- snapshot encode, decode, capture, restore
- file-backed snapshot write and load
- explicit WAL command payload encoding and live-path replay recovery
- checkpoint path that writes the new snapshot first, then rewrites retained WAL history
- one-checkpoint WAL overlap and `snapshot_marker` retention for safe checkpoint replacement
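The snapshot-first checkpoint ordering can be sketched as a two-step sequence; the closure-based shape and names here are illustrative, not the engine's real signatures:

```rust
// Sketch of snapshot-first checkpoint ordering: the new snapshot must be
// durable before retained WAL history is rewritten, so a crash between the two
// steps still leaves a recoverable (snapshot + overlapping WAL) pair.
// Function shapes and names are illustrative placeholders.
fn checkpoint(
    write_snapshot: impl FnOnce() -> std::io::Result<u64>, // returns snapshot LSN
    rewrite_wal_from: impl FnOnce(u64) -> std::io::Result<()>,
) -> std::io::Result<u64> {
    // Step 1: persist the snapshot (and its marker) durably first.
    let snapshot_lsn = write_snapshot()?;
    // Step 2: only then drop WAL history covered by the snapshot. A crash
    // before this step costs replay time, never data.
    rewrite_wal_from(snapshot_lsn)?;
    Ok(snapshot_lsn)
}

fn main() {
    let lsn = checkpoint(
        || Ok(42),
        |from| {
            assert_eq!(from, 42);
            Ok(())
        },
    )
    .unwrap();
    assert_eq!(lsn, 42);
}
```

Inverting the order would create a window where the WAL has already been trimmed but the snapshot is not yet durable, which is exactly the unrecoverable state this ordering rules out.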
- Deterministic simulation support:
- reusable simulation harness in `crates/allocdb-node/src/simulation.rs`
- explicit simulated slot advancement under test control, with no wall-clock reads in the exercised engine path
- seeded same-slot ready-set scheduling with reproducible transcripts
- seeded labeled schedule actions that resolve candidate slot windows into replayable submit/tick transcripts
- seeded due-expiration selection over the real internal-expire path, bounded by the production per-tick expiration limit
- seeded one-shot crash plans over named client-submit, internal-apply, checkpoint, and recovery boundaries
- one-shot storage fault helpers over append failure, sync failure, checksum mismatch, and torn-tail WAL mutation against real on-disk recovery
- checkpoint, restart, and live write-fault helpers over the real `SingleNodeEngine`
- regression coverage for crash-selected post-sync submit replay, crash-after-snapshot-write checkpoint recovery, replay-interrupted recovery restart, sync-failure retry recovery, checksum-corruption fail-closed restart, torn-tail truncation retry, ingress contention winner order, same-deadline expiration order, mixed-deadline earliest-first expiration priority, and retry timing across the dedupe window
- reusable replicated cluster harness in `crates/allocdb-node/src/replicated_simulation.rs`
- three real `ReplicaNode`s with independent WAL, snapshot, and metadata workspaces
- explicit replica-to-replica and client-to-replica connectivity matrix under test control
- explicit protocol-message queue plus replayable transcripts for queue, deliver, drop, crash, and restart actions
- real `prepare`, `prepare_ack`, and `commit` protocol payload delivery on that queue
- configured-primary client submit flow with result publication only after majority durable append
- retry-aware client submit helper that returns one cached committed result on the current primary instead of assigning a fresh replicated LSN
- backup replicas that durably append prepares but do not apply allocator state until commit
- primary-only resource reads guarded by the existing strict-read fence after local commit
- automatic quorum-loss detection that demotes a stranded primary out of service
- explicit higher-view takeover that records durable votes from a reachable majority, reconstructs the safe committed prefix on the new primary, discards stale uncommitted suffix, and drops old-view protocol messages
- replica crash as loss of volatile state with restart through real `ReplicaNode::recover`
- checkpoint-assisted rejoin that rewrites one stale replica from suffix-only WAL catch-up or snapshot transfer, then restarts through the real recovery path before returning the replica to backup mode
- regression coverage for quorum-loss fail-closed reads and writes, higher-view takeover with stale-primary read rejection, prepared-suffix recovery from another voter during takeover, isolated-backup partition heal and catch-up, non-quorum split fail-closed behavior with later rejoin convergence, primary crash before quorum append, primary crash after majority append, primary crash after reply, suffix-only rejoin, snapshot-transfer rejoin, and faulted rejoin rejection
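The seeded ready-set scheduling with reproducible transcripts can be sketched with a tiny deterministic PRNG; the xorshift generator and action labels are illustrative stand-ins for the harness's internals:

```rust
// Sketch of seeded ready-set scheduling: a deterministic PRNG picks the next
// action from the ready set, so the same seed always replays the same
// transcript. The xorshift64 generator and labels are illustrative stand-ins.
struct Rng(u64);

impl Rng {
    fn next(&mut self) -> u64 {
        // xorshift64; seed must be nonzero.
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        x
    }
}

fn schedule(seed: u64, mut ready: Vec<&'static str>) -> Vec<&'static str> {
    let mut rng = Rng(seed);
    let mut transcript = Vec::new();
    while !ready.is_empty() {
        let i = (rng.next() as usize) % ready.len();
        transcript.push(ready.swap_remove(i));
    }
    transcript
}

fn main() {
    let a = schedule(7, vec!["submit-a", "submit-b", "tick"]);
    let b = schedule(7, vec!["submit-a", "submit-b", "tick"]);
    assert_eq!(a, b); // same seed, same transcript: any failure is replayable
}
```

Because the only nondeterminism is the seed, a failing regression run is fully described by its seed and transcript, which is what makes the crash-point and fault coverage above reproducible.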
- Validation:
- core durability: `cargo test -p allocdb-core wal -- --nocapture`, `cargo test -p allocdb-core snapshot -- --nocapture`, `cargo test -p allocdb-core recovery -- --nocapture`, `cargo test -p allocdb-core snapshot_restores_retired_lookup_watermark`
- node runtime: `cargo test -p allocdb-node api_reservation_reports_retired_history`, `cargo test -p allocdb-node engine -- --nocapture`, `cargo test -p allocdb-node replica -- --nocapture`
- simulation: `cargo test -p allocdb-node simulation -- --nocapture`, `cargo test -p allocdb-node replicated_simulation -- --nocapture`
- local cluster, qemu assets, Jepsen harness, and benchmarks: `cargo test -p allocdb-node local_cluster -- --nocapture`, `cargo test -p allocdb-node qemu_testbed -- --nocapture`, `cargo test -p allocdb-node jepsen -- --nocapture`, `cargo test -p allocdb-node --bin allocdb-jepsen -- --nocapture`, `cargo run -p allocdb-node --bin allocdb-jepsen -- plan`, `cargo run -p allocdb-bench -- --scenario all`
- repo gate: `scripts/preflight.sh`
- PR `#82` merged the `#70` maintainability follow-up, including live KubeVirt `reservation_contention-control` and full `1800s` `reservation_contention-crash-restart` reruns on `allocdb-a` with `blockers=0`
- `M9-T01` through `M9-T05` are merged on `main` via PR `#81`, and the planning issues are closed on the `AllocDB` project
- PRs `#89`, `#90`, `#92`, `#93`, `#94`, and `#95` merged the full `M9-T06` through `M9-T11` implementation chain on `main`: bundle commit, lease-epoch fencing, explicit `revoke`/`reclaim`, lease-shaped node API exposure, replication-preserved failover behavior, and broader simulation coverage are now all in the mainline implementation
- PR `#97` merged issue `#96`, extending Jepsen history generation and analysis for bundle reserve, revoke/reclaim, and stale-holder lease paths, then closing the loop with live KubeVirt `lease_safety-control` and full `1800s` `lease_safety-crash-restart` evidence on `allocdb-a` with `blockers=0`
- the next recommended step remains downstream real-cluster e2e work such as `gpu_control_plane`, not more unplanned lease-kernel semantics work; the current deployment slice covers a first in-cluster `StatefulSet` shape, but bootstrap-primary routing, failover/rejoin orchestration, and background maintenance remain operator work, and the current staging unblock path is to publish `skel84/allocdb` from GitHub Actions rather than relying on the local Docker engine
- PR `#107` merged the `M10` quota-engine proof on `main`, and PRs `#116`, `#117`, and `#118` merged the full `M11` reservation-core chain on `main`: the repository now has a second and third deterministic engine with bounded command sets, logical-slot refill/expiry, and snapshot/WAL recovery proofs
- PRs `#132`, `#133`, and `#134` merged the first `M12` runtime extractions on `main`: `retire_queue`, `wal`, and `wal_file` are now shared internal substrate instead of copied engine-local modules, while `M12-T04` closed as a defer decision because `snapshot_file` is still only a clean seam inside the `quota-core`/`reservation-core` pair and `allocdb-core` keeps the simpler file format
- the next roadmap step is now `M13`: define the internal engine authoring boundary in `runtime-extraction-roadmap.md` and stop extraction pressure until that contract is written down; the authoring rule is to keep shared runtime below the semantic line and keep command surfaces, snapshot schemas, recovery entry points, and state-machine meaning engine-local