Skip to content

Latest commit

 

History

History
220 lines (218 loc) · 18.7 KB

File metadata and controls

220 lines (218 loc) · 18.7 KB

AllocDB Status

Current State

  • Phase: replicated implementation with external Jepsen gate closed, M9 lease-kernel follow-on live-validated, M10 second-engine proof merged, M11 third-engine proof merged, and M12 runtime-extraction roadmap staged
  • Planning IDs: tasks use M#-T#; spikes use M#-S#
  • Current milestone status:
    • M0 semantics freeze: complete enough for core work
    • M1 pure state machine: implemented
    • M1H constant-time core hardening: complete
    • M2 durability and recovery: implemented
    • M3 submission pipeline: implemented
    • M4 simulation: implemented
    • M5 single-node alpha surface: implemented
    • M6 replication design: implemented
    • M7 replicated core prototype: in progress
    • M8 external cluster validation: in progress
    • M9 generic lease-kernel follow-on: implementation merged on main
    • M10 second-engine proof: merged on main; shared runtime extraction deferred
    • M11 third-engine proof: merged on main; broad shared runtime still deferred, first micro-extraction now justified
    • M12 first internal runtime extractions: planned
  • Latest completed implementation chunks:
    • 4156a80 Bootstrap AllocDB core and docs
    • f84a641 Add WAL file and snapshot recovery primitives
    • d87c9a7 Add repo guardrails and status tracking
    • 79ae34f Add snapshot persistence and replay recovery
    • 1583d67 Use fixed-capacity maps in allocator core
    • 3d6ff0f Fail closed on WAL corruption
    • 39f103b Defer conditional confirm and add health metrics
    • 82cb8d8 Add single-node submission engine crate
    • current validated chunk: seeded crash-point and WAL-fault coverage across submit, checkpoint, and recovery boundaries; checked slot and LSN overflow handling; deterministic simulation over contention, retry timing, and due-expiration ordering; replicated metadata bootstrap and fail-closed faulted-state entry; majority-backed quorum writes with primary-only reads, quorum-loss demotion, and higher-view takeover; suffix and snapshot-based stale-replica rejoin with divergent prepared-suffix discard; promoted partition and primary-crash scenarios that preserve fail-closed behavior and retry/read continuity after failover; the local three-replica cluster runner, fault-control harness, and QEMU testbed around the real replica daemon; the first trusted-core bundle-commit slice with bundle membership, bundle-aware confirm/release/expire, and bundle regression coverage; the first fencing slice with lease-epoch propagation, stale-holder rejection, and epoch-aware retry/read coverage; explicit revoke/reclaim with late-not-early reuse preserved across replay and failover; lease-shaped node API exposure for bundle membership and authority state; replicated preservation for committed bundle membership and stale-holder rejection across failover and suffix/snapshot rejoin; and live KubeVirt Jepsen lease-safety control and 1800s crash-restart runs with blockers=0

What Exists

  • Trusted-core crate: crates/allocdb-core
  • Single-node wrapper crate: crates/allocdb-node
  • Benchmark harness crate: crates/allocdb-bench
  • In-memory deterministic allocator:
    • deterministic fixed-capacity open-addressed resource, reservation, and operation tables
    • bounded reservation and operation retirement queues
    • bounded timing-wheel expiration index
    • create_resource, reserve, confirm, release, revoke, reclaim, expire
    • bounded health snapshot with logical slot lag, expiration backlog, and operation-table utilization
  • In-process submission engine:
    • typed and encoded request validation before commit
    • bounded submission queue with deterministic overload behavior
    • LSN assignment, WAL append, sync, and live apply
    • definite pre-commit rejection for request slots whose derived deadline, history, or dedupe windows would overflow u64
    • pre-sequencing duplicate lookup for applied and already-queued operation_id
    • strict-read fence by applied LSN
    • restart path from snapshot plus WAL
    • explicit definite-vs-indefinite submission error categorization
    • explicit restart-and-retry handling for ambiguous WAL failures within the dedupe window
    • explicit lsn_exhausted write rejection after the engine commits the last representable LSN
    • node-level metrics for queue pressure, write acceptance, startup recovery status, and active snapshot anchor
  • Deterministic benchmark harness:
    • CLI entrypoint at cargo run -p allocdb-bench -- --scenario all
    • one-resource-many-contenders scenario for hot-spot reserve contention
    • high-retry-pressure scenario for duplicate replay, conflict replay, full dedupe table rejection, and post-window recovery
    • scenario reports include elapsed time, throughput, metrics snapshots, and WAL byte counts
  • Alpha API surface:
    • transport-neutral request and response types in crates/allocdb-node::api
    • binary request and response codec with fixed-width little-endian encoding
    • explicit wire-level mapping for definite vs indefinite submission failures
    • strict-read fence responses plus halt-safe read rejection for resource and reservation queries
    • retired reservation lookups remain distinct from not_found across later writes and snapshot restore through bounded retired-watermark metadata
    • bounded tick_expirations maintenance request for live TTL enforcement
    • metrics exposure through the same API boundary
  • Operator documentation:
    • operator-facing runbook for the single-node alpha, local replicated cluster runner, local QEMU testbed, and first Kubernetes deployment shape
  • Kubernetes deployment packaging:
    • one container build, one DNS-backed layout generator for cluster-layout.txt, and one first deploy/kubernetes install shape with a bootstrap-primary service and per-replica PVCs
    • one GitHub Actions image-publish workflow for Docker Hub staging and release tags
  • Follow-on planning:
    • one draft lease-kernel follow-on plan that narrows the next trusted-core additions to bundle ownership, fencing, revoke, and an explicit liveness boundary, framed as generic scarce-resource semantics rather than product-specific behavior
    • one draft lease-kernel design-decision document that chooses a first-class lease authority object, bundle size 1 as the single-resource semantic special case, a lease-scoped fencing token, and a two-stage revoke -> reclaim safety model
    • one merged authoritative-docs pass under issue #80 that rewrote semantics, API, architecture, and fault-model docs to the approved lease-centric contract while keeping the current reservation-centric implementation explicitly marked as compatibility surface
    • one merged M9-T08 planning note that narrows revoke/reclaim implementation scope before the code-bearing revoke branch
  • Replication design draft:
    • VSR-style primary/backup replicated log with fixed membership and majority quorums
    • primary-only reads in the first replicated release
    • protocol invariants that preserve single-node idempotency, strict-read, TTL, and reservation-ID semantics across failover
  • Replicated validation planning:
    • deterministic cluster-simulation plan that extends seeded simulation to partitions, primary crash, and rejoin without a mock semantics layer
    • Jepsen gate with explicit contention, ambiguity, failover, and expiration workloads
    • supplementary Jepsen lease-safety coverage for bundle reserve, revoke/reclaim, and stale-holder rejection without changing the documented release-gate matrix
    • retry-aware history interpretation and release-blocking invariants for duplicate execution, stale successful reads, double allocation, early reuse, and stale-holder acceptance
  • Host-side Jepsen harness slice:
    • one release-gate matrix planner, one retry-aware history codec/analyzer, one host-side artifact bundler for duplicate-execution, double-allocation, stale-read, early-expiration, unresolved-ambiguity, and fetched external-cluster log checks, plus explicit verify-qemu-surface and verify-kubevirt-surface probes that exercise one real metrics round trip on every replica and one real primary submit/read round trip through the live replicated protocol surface
    • one supplementary lease_safety workload family with control and crash-restart runs that exercises bundle reserve, explicit revoke/reclaim, and stale-holder release against the live Jepsen surface without promoting that workload into the release-blocking matrix yet
    • one real run-qemu and one real run-kubevirt executor for the full documented release-gate matrix, with persisted histories and artifact bundles for control, crash-restart, partition-heal, and mixed-failover runs, plus host-side failover/rejoin orchestration built from replica workspace export/import and staged ReplicaNode::recover(...) rewrites
    • one capture-kubevirt-layout helper that records the live KubeVirt VM IPs, namespace, helper-pod settings, and SSH key path needed to drive the matrix from the host
  • Replicated node scaffolding:
    • dedicated replica metadata file with temp-write, rename, and directory-sync durability
    • persisted replica identity, role, view, commit point, snapshot anchor, last-normal view, and optional durable vote metadata
    • startup bootstrap for missing metadata on both fresh-open and recover paths
    • fail-closed faulted state when metadata bytes are corrupt, identity is mismatched, or local applied/snapshot state contradicts the persisted replicated metadata
    • configurable normal-mode primary and backup roles for one current view
    • explicit view_uncertain role plus durable higher-view voting for replicas that lost quorum or are participating in failover
    • durable prepared-entry sidecar for pre-commit replicated client commands
    • prepare append, commit-through, and strict primary-read guards built around the existing single-node executor rather than a second apply path
  • Local multi-process cluster runner:
    • CLI entrypoint at cargo run -p allocdb-node --bin allocdb-local-cluster -- <start|stop|status|crash|restart|isolate|heal> ... with one persisted cluster-layout.txt
    • stable replica identities, local bounds, and three external replica processes from one command surface
    • per-replica loopback control, client, and protocol listeners with status and stop hooks on control
    • per-replica pid, log, WAL, snapshot, metadata, and prepared-log paths exposed through status, with restart through the real ReplicaNode::recover path and stable durable workspace reuse
    • one persisted cluster-faults.txt file that marks whole-replica client/protocol isolation without affecting control reachability, plus one append-only cluster-timeline.log for later checker/debug reuse
    • reserved client and protocol listeners now fail with explicit isolation errors when the local fault harness marks that replica isolated
    • real primary-side client/protocol transport for external submit, get_resource, get_reservation, get_metrics, and replicated tick_expirations, with majority append before publish and backup reads still failing closed as not primary
    • structured daemon-side logging for successful prepare quorum formation, commit-broadcast acknowledgements, accepted protocol prepare/commit traffic, expiration batch planning, and applied expiration commands
  • Durability primitives:
    • WAL frame codec and recovery scan
    • file-backed WAL append, sync, recovery, and torn-tail truncation
    • fail-closed recovery on middle-of-log corruption
    • fail-closed recovery on non-monotonic WAL replay metadata and malformed decoded snapshot semantics
    • fail-closed recovery on replayed commands whose derived slot windows overflow configured bounds
    • snapshot encode, decode, capture, restore
    • file-backed snapshot write and load
    • explicit WAL command payload encoding and live-path replay recovery
    • checkpoint path that writes the new snapshot first, then rewrites retained WAL history
    • one-checkpoint WAL overlap and snapshot_marker retention for safe checkpoint replacement
  • Deterministic simulation support:
    • reusable simulation harness in crates/allocdb-node/src/simulation.rs
    • explicit simulated slot advancement under test control, with no wall-clock reads in the exercised engine path
    • seeded same-slot ready-set scheduling with reproducible transcripts
    • seeded labeled schedule actions that resolve candidate slot windows into replayable submit/tick transcripts
    • seeded due-expiration selection over the real internal-expire path, bounded by the production per-tick expiration limit
    • seeded one-shot crash plans over named client-submit, internal-apply, checkpoint, and recovery boundaries
    • one-shot storage fault helpers over append failure, sync failure, checksum mismatch, and torn-tail WAL mutation against real on-disk recovery
    • checkpoint, restart, and live write-fault helpers over the real SingleNodeEngine
    • regression coverage for crash-selected post-sync submit replay, crash-after-snapshot-write checkpoint recovery, replay-interrupted recovery restart, sync-failure retry recovery, checksum-corruption fail-closed restart, torn-tail truncation retry, ingress contention winner order, same-deadline expiration order, mixed-deadline earliest-first expiration priority, and retry timing across the dedupe window
    • reusable replicated cluster harness in crates/allocdb-node/src/replicated_simulation.rs
    • three real ReplicaNodes with independent WAL, snapshot, and metadata workspaces
    • explicit replica-to-replica and client-to-replica connectivity matrix under test control
    • explicit protocol-message queue plus replayable transcripts for queue, deliver, drop, crash, and restart actions
    • real prepare, prepare_ack, and commit protocol payload delivery on that queue
    • configured-primary client submit flow with result publication only after majority durable append
    • retry-aware client submit helper that returns one cached committed result on the current primary instead of assigning a fresh replicated LSN
    • backup replicas that durably append prepares but do not apply allocator state until commit
    • primary-only resource reads guarded by the existing strict-read fence after local commit
    • automatic quorum-loss detection that demotes a stranded primary out of service
    • explicit higher-view takeover that records durable votes from a reachable majority, reconstructs the safe committed prefix on the new primary, discards stale uncommitted suffix, and drops old-view protocol messages
    • replica crash as loss of volatile state with restart through real ReplicaNode::recover
    • checkpoint-assisted rejoin that rewrites one stale replica from suffix-only WAL catch-up or snapshot transfer, then restarts through the real recovery path before returning the replica to backup mode
    • regression coverage for quorum-loss fail-closed reads and writes, higher-view takeover with stale-primary read rejection, prepared-suffix recovery from another voter during takeover, isolated-backup partition heal and catch-up, non-quorum split fail-closed behavior with later rejoin convergence, primary crash before quorum append, primary crash after majority append, primary crash after reply, suffix-only rejoin, snapshot-transfer rejoin, and faulted rejoin rejection
  • Validation:
    • core durability: cargo test -p allocdb-core wal -- --nocapture, cargo test -p allocdb-core snapshot -- --nocapture, cargo test -p allocdb-core recovery -- --nocapture, cargo test -p allocdb-core snapshot_restores_retired_lookup_watermark
    • node runtime: cargo test -p allocdb-node api_reservation_reports_retired_history, cargo test -p allocdb-node engine -- --nocapture, cargo test -p allocdb-node replica -- --nocapture
    • simulation: cargo test -p allocdb-node simulation -- --nocapture, cargo test -p allocdb-node replicated_simulation -- --nocapture
    • local cluster, qemu assets, Jepsen harness, and benchmarks: cargo test -p allocdb-node local_cluster -- --nocapture, cargo test -p allocdb-node qemu_testbed -- --nocapture, cargo test -p allocdb-node jepsen -- --nocapture, cargo test -p allocdb-node --bin allocdb-jepsen -- --nocapture, cargo run -p allocdb-node --bin allocdb-jepsen -- plan, cargo run -p allocdb-bench -- --scenario all
    • repo gate: scripts/preflight.sh

Current Focus

  • PR #82 merged the #70 maintainability follow-up, including live KubeVirt reservation_contention-control and full 1800s reservation_contention-crash-restart reruns on allocdb-a with blockers=0
  • M9-T01 through M9-T05 are merged on main via PR #81, and the planning issues are closed on the AllocDB project
  • PRs #89, #90, #92, #93, #94, and #95 merged the full M9-T06 through M9-T11 implementation chain on main: bundle commit, lease-epoch fencing, explicit revoke / reclaim, lease-shaped node API exposure, replication-preserved failover behavior, and broader simulation coverage are now all in the mainline implementation
  • PR #97 merged issue #96, extending Jepsen history generation and analysis for bundle reserve, revoke/reclaim, and stale-holder lease paths, then closing the loop with live KubeVirt lease_safety-control and full 1800s lease_safety-crash-restart evidence on allocdb-a with blockers=0
  • the next recommended step remains downstream real-cluster e2e work such as gpu_control_plane, not more unplanned lease-kernel semantics work; the current deployment slice covers a first in-cluster StatefulSet shape, but bootstrap-primary routing, failover/rejoin orchestration, and background maintenance remain operator work, and the current staging unblock path is to publish skel84/allocdb from GitHub Actions rather than relying on the local Docker engine
  • PR #107 merged the M10 quota-engine proof on main, and PRs #116, #117, and #118 merged the full M11 reservation-core chain on main: the repository now has a second and third deterministic engine with bounded command sets, logical-slot refill/expiry, and snapshot/WAL recovery proofs
  • PRs #132, #133, and #134 merged the first M12 runtime extractions on main: retire_queue, wal, and wal_file are now shared internal substrate instead of copied engine-local modules, while M12-T04 closed as a defer decision because snapshot_file is still only a clean seam inside the quota-core / reservation-core pair and allocdb-core keeps the simpler file format
  • the next roadmap step is now M13: define the internal engine authoring boundary in runtime-extraction-roadmap.md and stop extraction pressure until that contract is written down; the authoring rule is to keep shared runtime below the semantic line and keep command surfaces, snapshot schemas, recovery entry points, and state-machine meaning engine-local