This document defines the execution model, logical-time handling, boundedness rules, and trusted-core system shape for the approved lease-centric follow-on surface.
The current implementation anchor is still reservation-centric in places. The architecture here
describes the contract the implementation is expected to converge toward as M9 lands.
The implementation target remains intentionally narrow:
- single process
- single shard
- single writer and executor thread
- deterministic expiration through logged events
- The WAL is the source of truth.
- Only the executor mutates allocation state.
- Every state transition must be replayable from persisted input.
- The state machine must not read wall-clock time, random numbers, or thread interleavings.
- All hot-path queues, maps, tables, buffers, and retention windows must have explicit bounds.
- Expected operating failures return deterministic result codes. Assertions are for programmer error, data corruption, and broken invariants.
- Correctness beats reclaim latency. A resource may become reusable late, but never early.
- The trusted core targets allocation-free steady-state execution after startup.
- Liveness observation stays outside the trusted core; the core consumes explicit ownership
transitions such as `revoke` and `reclaim`.
- API ingress
- bounded submission queue
- sequencer and WAL writer
- executor
- expiration scheduler
- snapshot writer
Ingress or networking code may be async if needed. The trusted core boundary is synchronous and explicit: once a command enters the sequencer and executor path, no async suspension or lock-based interleaving is allowed in the core state machine.
client
-> ingress validation
-> submission queue
-> sequencer assigns lsn and request_slot
-> append WAL record
-> fsync or group commit
-> executor applies state transition
-> publish result
Rules:
- the executor is single-threaded for one shard
- live execution and replay use the same apply logic
- publish never rewrites the result after the command is applied
- bundle visibility is all-or-nothing at the executor boundary
- holder-authorized mutations validate the current `(lease_id, lease_epoch)` before state changes
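
The write path and its first two rules can be sketched as follows. This is a minimal illustration, not the engine's code: the `Command`, `ResultCode`, `WalRecord`, `State`, and `Engine` types are hypothetical, a `Vec` stands in for the durable WAL, fsync is elided, and a `BTreeMap` replaces the fixed-capacity tables the real core is expected to use. What it shows is that live execution and replay go through one `apply` function, so replaying the log reproduces the live state.

```rust
use std::collections::BTreeMap;

/// Hypothetical command set for illustration only.
#[derive(Clone, Debug, PartialEq)]
pub enum Command {
    Reserve { resource_id: u64 },
    Release { resource_id: u64 },
}

/// Deterministic result codes for expected operating failures (names assumed).
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum ResultCode {
    Ok,
    ResourceBusy,
    NotReserved,
}

/// One logged record; the sequencer assigns `lsn` and `request_slot`.
#[derive(Clone, Debug, PartialEq)]
pub struct WalRecord {
    pub lsn: u64,
    pub request_slot: u64,
    pub command: Command,
}

#[derive(Debug, Default, PartialEq)]
pub struct State {
    pub applied_lsn: u64,
    /// resource_id -> LSN of the reserve that currently holds it.
    pub reserved: BTreeMap<u64, u64>,
}

impl State {
    /// The single apply function shared by live execution and replay.
    pub fn apply(&mut self, rec: &WalRecord) -> ResultCode {
        // An LSN gap is a broken invariant, not an operating failure.
        assert_eq!(rec.lsn, self.applied_lsn + 1, "LSN gap");
        self.applied_lsn = rec.lsn;
        match &rec.command {
            Command::Reserve { resource_id } => {
                if self.reserved.contains_key(resource_id) {
                    ResultCode::ResourceBusy
                } else {
                    self.reserved.insert(*resource_id, rec.lsn);
                    ResultCode::Ok
                }
            }
            Command::Release { resource_id } => {
                if self.reserved.remove(resource_id).is_some() {
                    ResultCode::Ok
                } else {
                    ResultCode::NotReserved
                }
            }
        }
    }
}

pub struct Engine {
    /// In-memory stand-in for the durable log.
    pub wal: Vec<WalRecord>,
    pub state: State,
    next_lsn: u64,
}

impl Engine {
    pub fn new() -> Self {
        Engine { wal: Vec::new(), state: State::default(), next_lsn: 1 }
    }

    /// Live path: sequence, append to the WAL, then apply.
    pub fn submit(&mut self, request_slot: u64, command: Command) -> ResultCode {
        let rec = WalRecord { lsn: self.next_lsn, request_slot, command };
        self.next_lsn += 1;
        self.wal.push(rec.clone());
        self.state.apply(&rec)
    }
}

/// Replay path: rebuild state from the log using the same apply logic.
pub fn replay(wal: &[WalRecord]) -> State {
    let mut state = State::default();
    for rec in wal {
        state.apply(rec);
    }
    state
}
```

Because failed commands are also logged and applied, `applied_lsn` advances identically in both paths and the replayed state compares equal to the live one.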
Current implementation anchor:
- `allocdb_node::SingleNodeEngine` in `crates/allocdb-node/src/engine.rs`
- `allocdb_node::api` in `crates/allocdb-node/src/api.rs` for the current transport-neutral alpha request and response boundary
client
-> read request
-> check applied_lsn >= required_lsn
-> answer from in-memory state or return fence_not_applied
For the trusted core, this is enough for strict reads.
The approved public read shapes are:
- `get_resource(resource_id)`
- `get_lease(lease_id)`
If the live engine halts after a WAL-path ambiguity, reads must fail closed until recovery reconstructs memory from durable state.
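
A minimal sketch of the strict-read check, under stated assumptions: `ReadView`, `ReadError`, the `halted` flag, and the `Vec<bool>` resource table are hypothetical stand-ins, and only the `applied_lsn >= required_lsn` fence and the fail-closed rule come from the text above.

```rust
/// Hypothetical read-side error codes.
#[derive(Debug, PartialEq)]
pub enum ReadError {
    /// The required LSN has not been applied yet (`fence_not_applied`).
    FenceNotApplied,
    /// The engine halted after a WAL-path ambiguity; reads fail closed.
    EngineHalted,
}

/// Hypothetical view over in-memory state, indexed by resource id.
pub struct ReadView {
    pub applied_lsn: u64,
    pub halted: bool,
    pub resource_available: Vec<bool>,
}

impl ReadView {
    /// Answers from memory only when the fence is satisfied.
    /// `Ok(None)` means the resource id is unknown.
    pub fn get_resource(
        &self,
        resource_id: usize,
        required_lsn: u64,
    ) -> Result<Option<bool>, ReadError> {
        if self.halted {
            return Err(ReadError::EngineHalted);
        }
        if self.applied_lsn < required_lsn {
            return Err(ReadError::FenceNotApplied);
        }
        Ok(self.resource_available.get(resource_id).copied())
    }
}
```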
slot ticker
-> inspect due reserved leases
-> enqueue internal expire commands
-> append to WAL
-> executor applies expire
Rules:
- the scheduler never mutates allocation state directly
- at most `MAX_EXPIRATIONS_PER_TICK` expirations are enqueued
- only `reserved` leases are eligible for expiration
- `active` or `revoking` leases are never freed by timer
- lag must be observable as an explicit metric outside the trusted core
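
The scheduler rules above can be sketched as a pure function that selects expire commands without touching allocation state. Everything here is illustrative: the `Lease` layout, the linear scan (the real index is a timing wheel), the tiny `MAX_EXPIRATIONS_PER_TICK` value, and the assumption that a lease is due once `deadline_slot <= current_slot`.

```rust
/// Small value chosen for illustration only.
pub const MAX_EXPIRATIONS_PER_TICK: usize = 2;

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum LeaseState {
    Reserved,
    Active,
    Revoking,
}

#[derive(Clone, Copy, Debug)]
pub struct Lease {
    pub lease_id: u64,
    pub state: LeaseState,
    pub deadline_slot: u64,
}

/// Internal command the scheduler enqueues; the executor applies it later.
#[derive(Debug, PartialEq)]
pub struct ExpireCommand {
    pub lease_id: u64,
}

/// Returns the commands to enqueue this tick plus the number of due leases
/// left for later ticks (the observable backlog). Leases are only read here.
pub fn tick(leases: &[Lease], current_slot: u64) -> (Vec<ExpireCommand>, usize) {
    let mut commands = Vec::with_capacity(MAX_EXPIRATIONS_PER_TICK);
    let mut backlog = 0;
    for lease in leases {
        // Only reserved leases expire by timer; active and revoking never do.
        let due = lease.state == LeaseState::Reserved && lease.deadline_slot <= current_slot;
        if !due {
            continue;
        }
        if commands.len() < MAX_EXPIRATIONS_PER_TICK {
            commands.push(ExpireCommand { lease_id: lease.lease_id });
        } else {
            backlog += 1;
        }
    }
    (commands, backlog)
}
```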
external observer
-> decides holder authority should be withdrawn
-> submit revoke(lease_id)
-> executor moves lease to revoking and bumps lease_epoch
-> later submit reclaim(lease_id) when reuse is safe
-> executor returns member resources to available
Rules:
- external systems observe heartbeats, node state, pod state, or other liveness signals
- the trusted core does not inspect those signals directly
- revoke may happen before reuse is safe
- reclaim is the explicit point where reuse becomes allowed
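
As a rough illustration of the revoke and reclaim transitions, the sketch below models one lease. The type names, the `StaleEpoch` and `InvalidState` result codes, the `Reclaimed` terminal state, and the single `members_available` flag are assumptions made for the example; the source only fixes that revoke moves the lease to revoking and bumps `lease_epoch`, and that reclaim is where member resources become available again.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
pub enum LeaseState {
    Active,
    Revoking,
    Reclaimed,
}

/// Hypothetical deterministic result codes.
#[derive(Debug, PartialEq)]
pub enum ResultCode {
    Ok,
    StaleEpoch,
    InvalidState,
}

#[derive(Debug)]
pub struct Lease {
    pub lease_id: u64,
    pub lease_epoch: u64,
    pub state: LeaseState,
    /// Stand-in for "member resources are back in the available pool".
    pub members_available: bool,
}

impl Lease {
    /// Holder-authorized mutations must present the current pair.
    pub fn validate_holder(&self, lease_id: u64, lease_epoch: u64) -> ResultCode {
        if lease_id == self.lease_id && lease_epoch == self.lease_epoch {
            ResultCode::Ok
        } else {
            ResultCode::StaleEpoch
        }
    }

    /// Withdraws holder authority. Resources are NOT reusable yet.
    pub fn revoke(&mut self) -> ResultCode {
        if self.state != LeaseState::Active {
            return ResultCode::InvalidState;
        }
        self.state = LeaseState::Revoking;
        self.lease_epoch += 1; // old holders now fail validation
        ResultCode::Ok
    }

    /// The explicit point where reuse becomes allowed.
    pub fn reclaim(&mut self) -> ResultCode {
        if self.state != LeaseState::Revoking {
            return ResultCode::InvalidState;
        }
        self.state = LeaseState::Reclaimed;
        self.members_available = true;
        ResultCode::Ok
    }
}
```

Bumping the epoch at revoke time means a holder that still presents the old `(lease_id, lease_epoch)` pair is rejected immediately, even though the resources stay unavailable until reclaim.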
The state machine never reads the system clock directly.
Required configuration:
slot_duration_ms : u64
max_ttl_slots : u64
max_client_retry_window_slots : u64
lease_history_window_slots : u64
max_expiration_bucket_len : u32
max_bundle_size : u32
Rules:
- external APIs may accept `ttl_ms`, but the WAL and executor operate only on slots
- `max_ttl_slots * slot_duration_ms <= 3_600_000`
- `lease_history_window_slots <= max_ttl_slots`
- TTL applies to `reserved` leases only
Crossing a deadline does not instantly free resources. Resources become reusable only after the
corresponding expire or reclaim command is committed and applied.
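
A sketch of how the configuration rules and the `ttl_ms` conversion might be checked at the API boundary. The two inequalities come from the rules above; the `ConfigError` names, the rejection of a zero `slot_duration_ms`, the rejection of a zero TTL, and the choice to round up to whole slots are assumptions of this example (rounding up is chosen here because it can only make a lease expire late, never early).

```rust
pub struct Config {
    pub slot_duration_ms: u64,
    pub max_ttl_slots: u64,
    pub lease_history_window_slots: u64,
}

/// Hypothetical validation errors.
#[derive(Debug, PartialEq)]
pub enum ConfigError {
    ZeroSlotDuration,
    TtlHorizonTooLong,
    HistoryWindowTooLong,
}

impl Config {
    pub fn validate(&self) -> Result<(), ConfigError> {
        if self.slot_duration_ms == 0 {
            return Err(ConfigError::ZeroSlotDuration);
        }
        // max_ttl_slots * slot_duration_ms <= 3_600_000, overflow-safe.
        match self.max_ttl_slots.checked_mul(self.slot_duration_ms) {
            Some(ms) if ms <= 3_600_000 => {}
            _ => return Err(ConfigError::TtlHorizonTooLong),
        }
        if self.lease_history_window_slots > self.max_ttl_slots {
            return Err(ConfigError::HistoryWindowTooLong);
        }
        Ok(())
    }

    /// Converts an external `ttl_ms` to slots, rounding up.
    /// Returns `None` for a zero TTL or one beyond `max_ttl_slots`.
    /// Call only on a validated config (non-zero slot duration).
    pub fn ttl_ms_to_slots(&self, ttl_ms: u64) -> Option<u64> {
        let d = self.slot_duration_ms;
        let slots = ttl_ms / d + u64::from(ttl_ms % d != 0);
        if (1..=self.max_ttl_slots).contains(&slots) {
            Some(slots)
        } else {
            None
        }
    }
}
```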
The design uses:
- one fixed-capacity lease table for active and recently terminal leases
- one fixed-capacity lease-member table for bundle membership
Rules:
- active and revoking leases occupy entries until they terminate
- terminal leases keep their entry until `retire_after_slot`
- member records retire with the parent lease
- retirement frees table slots for reuse
- retirement also advances a bounded retired-lookup watermark so later lease lookups stay distinct
from `not_found` after the full record is dropped
This keeps history bounded and prevents the product-level history policy from silently making the core unbounded.
At minimum define:
- `MAX_SUBMISSION_QUEUE`
- `MAX_BATCH_SIZE`
- `MAX_COMMAND_BYTES`
- `MAX_RESOURCES`
- `MAX_LEASES`
- `MAX_LEASE_MEMBERS`
- `MAX_BUNDLE_SIZE`
- `MAX_OPERATION_RECORDS`
- `MAX_TTL_SLOTS`
- `LEASE_HISTORY_WINDOW_SLOTS`
- `MAX_EXPIRATION_BUCKET_LEN`
- `MAX_EXPIRATIONS_PER_TICK`
Expected behavior under pressure:
- new writes fail fast with `overloaded` or a more specific capacity error
- reads remain available where possible
- expirations may lag, but lag must be observable
- revoke or reclaim may be delayed externally, but the kernel must not guess
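
The fail-fast behavior for new writes can be illustrated with a bounded submission queue. This is a sketch under assumptions: `SubmissionQueue`, `SubmitError`, and the small `MAX_SUBMISSION_QUEUE` value are hypothetical, and a pre-sized `VecDeque` stands in for whatever fixed-capacity structure the implementation uses.

```rust
use std::collections::VecDeque;

/// Small value chosen for illustration only.
pub const MAX_SUBMISSION_QUEUE: usize = 4;

#[derive(Debug, PartialEq)]
pub enum SubmitError {
    /// Corresponds to the `overloaded` result.
    Overloaded,
}

pub struct SubmissionQueue<T> {
    items: VecDeque<T>,
}

impl<T> SubmissionQueue<T> {
    pub fn new() -> Self {
        // Capacity is reserved once; the length check below keeps it from growing.
        SubmissionQueue { items: VecDeque::with_capacity(MAX_SUBMISSION_QUEUE) }
    }

    /// Rejects immediately when full instead of blocking or growing.
    pub fn try_push(&mut self, item: T) -> Result<(), SubmitError> {
        if self.items.len() >= MAX_SUBMISSION_QUEUE {
            return Err(SubmitError::Overloaded);
        }
        self.items.push_back(item);
        Ok(())
    }

    /// Called by the sequencer to take the oldest pending command.
    pub fn pop(&mut self) -> Option<T> {
        self.items.pop_front()
    }

    pub fn len(&self) -> usize {
        self.items.len()
    }
}
```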
Required operational signals:
- `logical_slot_lag = max(0, current_wall_clock_slot - last_request_slot)`
- expiration backlog, for example the number of due expirations not yet applied
- `operation_table_utilization`
- `lease_table_utilization`
- `lease_member_table_utilization`
- recovery and checkpoint status, including:
  - how the current process started (`fresh_start`, `wal_only`, `snapshot_only`, or `snapshot_and_wal`)
  - which snapshot LSN was loaded at startup, if any
  - how many WAL frames were replayed at startup
  - what snapshot LSN is currently the active durable anchor
Delayed expiration and delayed reclaim are acceptable. Premature reuse is not.
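
A short sketch of the `logical_slot_lag` signal. The helper `wall_clock_slot` and its integer-division mapping from milliseconds to slots are assumptions; this code would live in the metrics layer outside the trusted core, since the state machine itself never reads the clock.

```rust
/// Maps a wall-clock reading to a slot number (assumed mapping; requires a
/// non-zero `slot_duration_ms`). Lives outside the trusted core.
pub fn wall_clock_slot(wall_clock_ms: u64, slot_duration_ms: u64) -> u64 {
    wall_clock_ms / slot_duration_ms
}

/// max(0, current_wall_clock_slot - last_request_slot) without underflow.
pub fn logical_slot_lag(current_wall_clock_slot: u64, last_request_slot: u64) -> u64 {
    current_wall_clock_slot.saturating_sub(last_request_slot)
}
```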
The expiration index is a fixed-capacity timing wheel keyed by deadline_slot.
Rules:
- each slot holds a bounded list of reserved lease references
- the wheel size is derived from `MAX_TTL_SLOTS`
- if a slot bucket reaches `MAX_EXPIRATION_BUCKET_LEN`, new reserve commands fail fast with `expiration_index_full`
This is a fundamental design decision, not an open question.
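
One possible shape for such a wheel is sketched below. It is not the project's implementation: the constants are tiny illustrative values, `WHEEL_SIZE = MAX_TTL_SLOTS + 1` is one assumed way to derive the size, and the `TimingWheel`, `ReserveError`, and `drain_due` names are hypothetical. Each entry stores its `deadline_slot` alongside the lease id so that a lagging drain cannot release an entry that shares a bucket after wraparound but is not yet due.

```rust
/// Tiny values chosen for illustration only.
pub const MAX_TTL_SLOTS: usize = 8;
pub const MAX_EXPIRATION_BUCKET_LEN: usize = 2;
/// Assumed derivation of the wheel size from MAX_TTL_SLOTS.
const WHEEL_SIZE: usize = MAX_TTL_SLOTS + 1;

#[derive(Debug, PartialEq)]
pub enum ReserveError {
    /// Corresponds to `expiration_index_full`.
    ExpirationIndexFull,
    /// Hypothetical code for a deadline outside (current, current + MAX_TTL_SLOTS].
    TtlOutOfRange,
}

/// Fixed-capacity wheel: no allocation after construction on the insert path.
pub struct TimingWheel {
    /// (deadline_slot, lease_id) entries per bucket.
    buckets: [[(u64, u64); MAX_EXPIRATION_BUCKET_LEN]; WHEEL_SIZE],
    lens: [usize; WHEEL_SIZE],
}

impl TimingWheel {
    pub fn new() -> Self {
        TimingWheel {
            buckets: [[(0, 0); MAX_EXPIRATION_BUCKET_LEN]; WHEEL_SIZE],
            lens: [0; WHEEL_SIZE],
        }
    }

    /// Indexes a reserved lease by deadline; fails fast when the bucket is full.
    pub fn insert(
        &mut self,
        current_slot: u64,
        deadline_slot: u64,
        lease_id: u64,
    ) -> Result<(), ReserveError> {
        let ttl = deadline_slot
            .checked_sub(current_slot)
            .ok_or(ReserveError::TtlOutOfRange)?;
        if ttl == 0 || ttl > MAX_TTL_SLOTS as u64 {
            return Err(ReserveError::TtlOutOfRange);
        }
        let idx = (deadline_slot % WHEEL_SIZE as u64) as usize;
        let len = self.lens[idx];
        if len == MAX_EXPIRATION_BUCKET_LEN {
            return Err(ReserveError::ExpirationIndexFull);
        }
        self.buckets[idx][len] = (deadline_slot, lease_id);
        self.lens[idx] = len + 1;
        Ok(())
    }

    /// Removes and returns lease ids in the bucket for `slot` whose deadline
    /// has been reached; later deadlines sharing the bucket are kept.
    /// The returned Vec allocates; a real core would write into a fixed buffer.
    pub fn drain_due(&mut self, slot: u64) -> Vec<u64> {
        let idx = (slot % WHEEL_SIZE as u64) as usize;
        let mut due = Vec::new();
        let mut kept = 0;
        for i in 0..self.lens[idx] {
            let (deadline, lease_id) = self.buckets[idx][i];
            if deadline <= slot {
                due.push(lease_id);
            } else {
                self.buckets[idx][kept] = (deadline, lease_id);
                kept += 1;
            }
        }
        self.lens[idx] = kept;
        due
    }
}
```

In this sketch the ids returned by `drain_due` would only feed internal expire commands; resources still become reusable only once those commands are logged and applied.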