Draft. This document defines when throwaway experiments are appropriate and which spikes are currently justified.
Spikes are for implementation uncertainty, not semantic uncertainty.
They exist to answer questions like:
- which fixed-capacity table shape is simplest and safest in Rust?
- what timing-wheel structure is easiest to keep bounded and readable?
- what WAL framing shape makes torn-tail recovery simplest?
They do not exist to answer questions like:
- what `reserve` means
- whether `holder_id` is required
- whether retention is bounded
- whether indefinite outcomes exist
Those are design decisions already captured elsewhere.
Every spike must be:
- time-boxed
- narrowly scoped
- disposable by default
- documented with the decision it informs
Every spike must end in one of three outcomes:
- choose an implementation direction
- reject an implementation direction
- identify a genuine design gap that must be resolved in docs before coding continues
If a spike starts changing semantics, stop and move the issue back into the design docs.
Spike code should:
- live in an obviously non-production location such as `scratch/` or `experiments/`
- be deleted once the decision is made, unless a piece is directly promoted into production code
- never quietly become trusted-core code without review and tests
Question:
- what table shape best supports bounded resource, reservation, and operation storage in Rust?
Why a spike is justified:
- this is an implementation-shape question with strong effects on safety, clarity, and allocation
Current chosen direction:
- the first implementation slice used sorted fixed-capacity `Vec` stores to keep the prototype small and deterministic
- the production direction is now deterministic fixed-capacity open-addressed tables in the trusted core
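As a rough illustration of the table shape named above, here is a minimal sketch of a fixed-capacity open-addressed table with linear probing. Everything in it is an assumption made for illustration: the `FixedTable` and `InsertError` names, the `u64` keys, and the multiplicative hash are hypothetical and are not taken from the trusted core. Removal is left out on purpose, because it needs tombstones or backward shifting and the right choice is exactly the kind of question a spike would answer.

```rust
/// Hypothetical error type: a full table is reported, never grown.
#[derive(Debug, PartialEq, Eq)]
pub enum InsertError {
    Full,
    Duplicate,
}

/// Hypothetical fixed-capacity table using open addressing with linear probing.
pub struct FixedTable<V> {
    slots: Vec<Option<(u64, V)>>, // allocated once in `with_capacity`, never resized
    len: usize,
}

impl<V> FixedTable<V> {
    pub fn with_capacity(capacity: usize) -> Self {
        assert!(capacity > 0, "capacity must be non-zero");
        let mut slots = Vec::with_capacity(capacity);
        slots.resize_with(capacity, || None);
        Self { slots, len: 0 }
    }

    /// Deterministic home slot: a fixed multiplicative hash with no random seed.
    fn home(&self, key: u64) -> usize {
        (key.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 32) as usize % self.slots.len()
    }

    pub fn insert(&mut self, key: u64, value: V) -> Result<(), InsertError> {
        let cap = self.slots.len();
        let start = self.home(key);
        for i in 0..cap {
            let idx = (start + i) % cap;
            // Copy out the occupying key (if any) so the slot can be written below.
            let occupant = self.slots[idx].as_ref().map(|(k, _)| *k);
            match occupant {
                None => {
                    self.slots[idx] = Some((key, value));
                    self.len += 1;
                    return Ok(());
                }
                Some(k) if k == key => return Err(InsertError::Duplicate),
                Some(_) => {} // occupied by another key: keep probing
            }
        }
        Err(InsertError::Full) // probed every slot without finding room
    }

    pub fn get(&self, key: u64) -> Option<&V> {
        let cap = self.slots.len();
        let start = self.home(key);
        for i in 0..cap {
            match &self.slots[(start + i) % cap] {
                None => return None, // valid only because this sketch never removes
                Some((k, v)) if *k == key => return Some(v),
                Some(_) => {}
            }
        }
        None
    }

    pub fn len(&self) -> usize {
        self.len
    }
}
```

The point of the sketch is that capacity exhaustion shows up as an explicit `Full` result instead of a reallocation, and that probe order depends only on the key and the capacity.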
Question:
- what timing-wheel bucket layout makes expiration, overflow, and retirement simplest to reason about?
Why a spike is justified:
- the design decision is fixed, but the implementation shape is still uncertain
Current chosen direction:
- preallocated timing-wheel buckets
- explicit `MAX_EXPIRATION_BUCKET_LEN` per slot
- deterministic sorted bucket contents
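The three properties listed above (preallocated buckets, an explicit per-slot bound, sorted contents) can be sketched as follows. This is a hedged illustration, not the project's implementation: the `TimingWheel` and `ScheduleError` names, the `u64` ids, and the value `4` for `MAX_EXPIRATION_BUCKET_LEN` are assumptions, and the source does not say how the real design reacts to a full bucket, so the sketch simply returns an error.

```rust
/// The constant name comes from the design; the value 4 is an assumed placeholder.
pub const MAX_EXPIRATION_BUCKET_LEN: usize = 4;

/// Hypothetical error type for the sketch.
#[derive(Debug, PartialEq, Eq)]
pub enum ScheduleError {
    BucketFull,
    OutOfRange,
}

/// Hypothetical single-level timing wheel with bounded, sorted buckets.
pub struct TimingWheel {
    buckets: Vec<Vec<u64>>, // one preallocated bucket per wheel position
    current_slot: u64,
}

impl TimingWheel {
    pub fn new(wheel_size: usize) -> Self {
        assert!(wheel_size > 0, "wheel size must be non-zero");
        let buckets = (0..wheel_size)
            .map(|_| Vec::with_capacity(MAX_EXPIRATION_BUCKET_LEN))
            .collect();
        Self { buckets, current_slot: 0 }
    }

    /// Schedule `id` to expire at `deadline_slot`, which must lie in
    /// (current_slot, current_slot + wheel_size].
    pub fn schedule(&mut self, id: u64, deadline_slot: u64) -> Result<(), ScheduleError> {
        let size = self.buckets.len() as u64;
        if deadline_slot <= self.current_slot || deadline_slot - self.current_slot > size {
            return Err(ScheduleError::OutOfRange);
        }
        let bucket = &mut self.buckets[(deadline_slot % size) as usize];
        if bucket.len() >= MAX_EXPIRATION_BUCKET_LEN {
            return Err(ScheduleError::BucketFull); // overflow is explicit, never a silent grow
        }
        // Insert at the sorted position so bucket contents stay deterministic.
        let pos = bucket.binary_search(&id).unwrap_or_else(|p| p);
        bucket.insert(pos, id);
        Ok(())
    }

    /// Advance exactly one slot and return the ids that expire in it, in sorted order.
    pub fn advance(&mut self) -> Vec<u64> {
        self.current_slot += 1;
        let size = self.buckets.len() as u64;
        let idx = (self.current_slot % size) as usize;
        // `drain` empties the bucket but keeps its preallocated capacity.
        self.buckets[idx].drain(..).collect()
    }
}
```

Returning a fresh `Vec` from `advance` keeps the sketch short; a bounded implementation would more likely write into a caller-provided buffer.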
Question:
- what binary frame layout makes corruption detection and torn-tail recovery simplest?
Why a spike is justified:
- a short experiment can eliminate format complexity before the real codec is written
Current chosen direction:
- explicit little-endian binary frame layout
- per-frame CRC32C checksum
- recovery scan stops at the last valid frame boundary
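To make the chosen direction concrete, here is a minimal sketch of one possible frame layout and recovery scan. The layout is an assumption made for illustration (a 4-byte little-endian payload length, a 4-byte little-endian CRC32C of the payload, then the payload), as are the `encode_frame` and `scan` names; the real codec may differ. The checksum is a plain bitwise CRC32C so the example needs no external crate.

```rust
/// Bitwise CRC32C (Castagnoli), reflected polynomial 0x82F63B78.
pub fn crc32c(data: &[u8]) -> u32 {
    let mut crc = !0u32;
    for &byte in data {
        crc ^= u32::from(byte);
        for _ in 0..8 {
            // All-ones mask when the low bit is set, zero otherwise.
            let mask = (crc & 1).wrapping_neg();
            crc = (crc >> 1) ^ (0x82F6_3B78 & mask);
        }
    }
    !crc
}

/// Assumed header: u32 payload length, then u32 CRC32C, both little-endian.
const HEADER_LEN: usize = 8;

fn read_u32_le(bytes: &[u8], at: usize) -> u32 {
    u32::from_le_bytes([bytes[at], bytes[at + 1], bytes[at + 2], bytes[at + 3]])
}

/// Append one frame (hypothetical layout) to `out`.
pub fn encode_frame(payload: &[u8], out: &mut Vec<u8>) {
    assert!(payload.len() <= u32::MAX as usize, "payload too large for a u32 length");
    out.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    out.extend_from_slice(&crc32c(payload).to_le_bytes());
    out.extend_from_slice(payload);
}

/// Scan from the start of the log and stop at the first short or corrupt frame.
/// Returns the valid payloads and the byte offset of the last valid frame boundary.
pub fn scan(log: &[u8]) -> (Vec<&[u8]>, usize) {
    let mut frames = Vec::new();
    let mut offset = 0;
    while log.len() - offset >= HEADER_LEN {
        let len = read_u32_le(log, offset) as usize;
        let expected_crc = read_u32_le(log, offset + 4);
        let start = offset + HEADER_LEN;
        let end = match start.checked_add(len) {
            Some(end) if end <= log.len() => end,
            _ => break, // torn tail: the payload runs past the end of the log
        };
        let payload = &log[start..end];
        if crc32c(payload) != expected_crc {
            break; // corruption: do not trust this frame or anything after it
        }
        frames.push(payload);
        offset = end;
    }
    (frames, offset)
}
```

One weakness of this assumed layout is that eight zero bytes decode as a valid empty frame, since the CRC32C of an empty payload is zero. A real format would probably cover the header in the checksum or reject zero-length frames.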
Question:
- what simulator shape can drive the real trusted core with seeded slot advancement and crash injection?
Why a spike is justified:
- this is a harness-design question and should be proven early
Current chosen direction:
- a scripted single-node harness around the real `SingleNodeEngine`
- simulated slot lives in the harness and advances only when the test driver says so
- seeded choice is used only to order ready ingress at one logical slot; state-machine and recovery semantics remain the production implementations
- crash, restart, checkpoint, and persist-failure events stay explicit driver actions rather than hidden behind fake clock or storage traits
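The harness shape described above can be sketched in a few lines. This is only an illustration under stated assumptions: `Harness`, `SplitMix64`, and the `u32` request type are hypothetical, and a plain `applied` log stands in for the real `SingleNodeEngine`, whose API is not shown in this document. The sketch covers two of the listed properties: the slot advances only on an explicit driver call, and the seed affects nothing except the order of ready ingress within one slot.

```rust
/// Small deterministic PRNG (SplitMix64), used only to order same-slot ingress.
/// The PRNG choice is an assumption for the sketch.
pub struct SplitMix64(u64);

impl SplitMix64 {
    pub fn next_u64(&mut self) -> u64 {
        self.0 = self.0.wrapping_add(0x9E37_79B9_7F4A_7C15);
        let mut z = self.0;
        z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
        z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
        z ^ (z >> 31)
    }
}

/// Hypothetical external-driver harness. `applied` stands in for the real engine.
pub struct Harness {
    slot: u64,                // simulated slot, owned by the harness
    rng: SplitMix64,
    ready: Vec<u32>,          // ingress that is ready at the current slot
    applied: Vec<(u64, u32)>, // (slot, request) in the order the stand-in engine saw them
}

impl Harness {
    pub fn new(seed: u64) -> Self {
        Self { slot: 0, rng: SplitMix64(seed), ready: Vec::new(), applied: Vec::new() }
    }

    pub fn submit(&mut self, request: u32) {
        self.ready.push(request);
    }

    /// Shuffle the ready set with the seeded PRNG (Fisher-Yates), then apply it.
    pub fn apply_ready(&mut self) {
        for i in (1..self.ready.len()).rev() {
            // Modulo bias is ignored here; it does not affect reproducibility.
            let j = (self.rng.next_u64() % (i as u64 + 1)) as usize;
            self.ready.swap(i, j);
        }
        for request in self.ready.drain(..) {
            self.applied.push((self.slot, request));
        }
    }

    /// The slot moves only when the test driver calls this.
    pub fn advance_slot(&mut self) {
        self.slot += 1;
    }

    pub fn slot(&self) -> u64 {
        self.slot
    }

    pub fn applied(&self) -> &[(u64, u32)] {
        &self.applied
    }
}
```

Running the same script twice with the same seed yields the same `applied` sequence, which is the reproducibility property the seeded ordering is meant to provide.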
Evidence gathered:
- `crates/allocdb-node/src/simulation.rs` and `crates/allocdb-node/src/simulation_tests.rs` now carry the promoted harness shape selected by the spike: the real engine, seeded same-slot ordering, and explicit slot advancement
- the spike proves one restart path with checkpoint, logged expiration, injected WAL ambiguity, and recovery from snapshot plus WAL
Reuse for M4-T01 through M4-T04:
- the external-driver shape
- explicit simulated-slot state owned by the harness
- seeded ready-set ordering for same-slot ingress
- temp WAL/snapshot lifecycle and restart helpers
Discard after the spike:
- the ad hoc test-only API names and one-off helper layout
- the exact PRNG choice used only to prove reproducibility
- any expectation that all future scenarios fit one linear script helper without refinement
Next step:
- add crash-point and storage-fault coverage on top of the promoted simulation support during `M4-T02` and `M4-T03`
Do not spike:
- command semantics
- result-code meanings
- retention rules
- fault-model rules
- whether replication changes single-node guarantees
Those issues belong in the docs and review process, not in throwaway code.