This is an experimental LSM-tree storage engine written in Rust that leverages Linux io_uring for high-performance asynchronous I/O. The design goals are:
- Keep all hot paths (WAL appends, memtable flush, point lookups, compaction) fully asynchronous via io_uring
- Minimize page-cache round-trips using registered files/buffers and SQE chaining
- Leverage SQPOLL and IOPOLL for the lowest latency on critical paths
- Demonstrate on-disk structures (SST blocks, indexes, Bloom filters) optimized for io_uring data paths
| Mode | Throughput | Latency/1000 ops | Relative | Durability |
|---|---|---|---|---|
| meme group_commit | 27,000 ops/s | 37ms | 30x faster | Batched fsync |
| meme sync | 890 ops/s | 1.12s | baseline | Per-write fsync |
| RocksDB sync | 890 ops/s | 1.12s | same | Per-write fsync |
| RocksDB async | 280,000 ops/s | 3.5ms | 315x | No fsync |
Group commit provides a 30x write-throughput improvement while keeping durability bounded: at most the last batch window of writes can be lost.
The key insight: fsync is the bottleneck (~1.1ms per call on NVMe). Group commit batches multiple writes before fsync, amortizing the cost across many operations.
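The amortization can be checked with back-of-envelope arithmetic using the numbers from the tables above (a model, not engine code):

```rust
// Model of group commit: one fsync (~1.1 ms on this NVMe drive) is shared
// by a whole batch of writes instead of being paid once per write.
fn amortized_fsync_us(fsync_us: f64, batch_size: usize) -> f64 {
    fsync_us / batch_size as f64
}

fn main() {
    let fsync_us = 1100.0; // ~1.1 ms per fsync
    // Per-write fsync: every operation pays the full cost (~900 ops/s).
    println!("sync: {:.0} us/op", amortized_fsync_us(fsync_us, 1));
    // Group commit with the default batch of 32 entries: ~34 us/op,
    // i.e. roughly 29K ops/s of fsync budget, matching the measured ~27K.
    println!("group commit: {:.1} us/op", amortized_fsync_us(fsync_us, 32));
}
```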
This is the most important comparison - RocksDB in its typical production configuration:
| Configuration | Latency | Throughput | Relative |
|---|---|---|---|
| meme (O_DIRECT + 1MB cache) | 648µs | 1.54 Melem/s | 22% faster |
| RocksDB production (buffered + 8MB cache) | 794µs | 1.26 Melem/s | baseline |
meme with O_DIRECT + user-space cache is 22% faster than RocksDB with buffered I/O + block cache.
| Configuration | Latency | Throughput |
|---|---|---|
| meme (1MB cache) | 648µs | 1.54 Melem/s |
| RocksDB (O_DIRECT + 8MB cache) | 790µs | 1.27 Melem/s |
| RocksDB production (buffered + 8MB cache) | 794µs | 1.26 Melem/s |
| RocksDB (buffered, no block cache) | 2.64ms | 379 Kelem/s |
| Engine | Latency | Throughput | Relative |
|---|---|---|---|
| meme (io_uring + IOPOLL) | 28.1ms | 35.5 Kelem/s | 24% faster |
| RocksDB (pread + O_DIRECT) | 35.0ms | 28.6 Kelem/s | baseline |
io_uring with IOPOLL is ~24% faster than traditional pread for random disk I/O.
- Group Commit: 30x write throughput improvement by batching fsync operations
- vs RocksDB Production: meme is 22% faster than RocksDB's typical production configuration (buffered I/O + block cache)
- Raw Disk I/O: io_uring provides ~24% improvement over pread due to reduced syscall overhead and kernel polling
- Cache Efficiency: meme's simpler code path and pre-parsed block cache provide consistent 22% improvement
- OS Page Cache Overhead: RocksDB with only the OS page cache (no block cache) is ~3.3x slower than with its block cache (2.64ms vs 794µs)
┌─────────────────────────────────────────────────────────────┐
│ LsmDb │
├─────────────────────────────────────────────────────────────┤
│ MemTable │ WAL Writer (Double Buffer + SQPOLL + IOPOLL) │
├────────────┴────────────────────────────────────────────────┤
│ IoScheduler │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────┐│
│ │ FG Ring │ │ BG Flush │ │ Compaction │ │ WAL ││
│ │ SQPOLL+IOPOLL│ │ SQPOLL │ │ SQPOLL │ │SQPOLL││
│ │ Point/Scan │ │ SST Flush │ │ Merge+Write │ │IOPOLL││
│ └──────────────┘ └──────────────┘ └──────────────┘ └──────┘│
├─────────────────────────────────────────────────────────────┤
│ Block Cache (Registered Buffers + Huge Pages + ReadFixed) │
├─────────────────────────────────────────────────────────────┤
│ BufRing (Provided Buffers) │
├─────────────────────────────────────────────────────────────┤
│ SST Files (O_DIRECT + Bloom + Index) │
└─────────────────────────────────────────────────────────────┘
| Ring | SQPOLL | IOPOLL | Purpose | Notes |
|---|---|---|---|---|
| Foreground | ✅ | ✅ | Point queries, scans | Lowest latency, read-only |
| Background Flush | ✅ | ❌ | SST flush | fsync + rename incompatible with IOPOLL |
| Compaction | ✅ | ❌ | Background compaction | fsync + rename incompatible with IOPOLL |
| WAL | ✅ | ✅ | WAL writes | O_DSYNC replaces fsync, IOPOLL compatible |
These two features are independent:
- SQPOLL: Kernel thread polls submission queue, reducing syscall overhead. Works with any operation including fsync, rename, etc.
- IOPOLL: Kernel polls for I/O completion, bypassing interrupts. Only works with O_DIRECT read/write operations, not with fsync, rename, fallocate, etc.
Both can be combined for optimal performance.
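As a concrete sketch of how the two bits combine per ring (flag values taken from `<linux/io_uring.h>`; the engine itself would set these via its ring builder):

```rust
// Setup flags from <linux/io_uring.h>. SQPOLL and IOPOLL are independent
// bits and may be combined on rings that only issue O_DIRECT reads/writes.
const IORING_SETUP_IOPOLL: u32 = 1 << 0; // poll for completions (no IRQs)
const IORING_SETUP_SQPOLL: u32 = 1 << 1; // kernel thread polls the SQ

fn ring_flags(iopoll: bool) -> u32 {
    // Every ring gets SQPOLL; only rings that never issue fsync/rename
    // (foreground reads, the O_DSYNC WAL) can also enable IOPOLL.
    let mut flags = IORING_SETUP_SQPOLL;
    if iopoll {
        flags |= IORING_SETUP_IOPOLL;
    }
    flags
}

fn main() {
    println!("foreground/WAL ring flags: {:#x}", ring_flags(true)); // 0x3
    println!("flush/compaction flags:    {:#x}", ring_flags(false)); // 0x2
}
```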
crate
├── block.rs # Block format with restart points and prefix compression
├── buf_ring.rs # Provided buffer ring (kernel 5.19+)
├── cache.rs # User-space block cache with huge pages support
├── compaction.rs # Background job queue + merge pipeline
├── lru.rs # O(1) LRU cache implementation
├── manifest.rs # MANIFEST persistence and SST metadata
├── memtable.rs # Skiplist memtable and WAL entry encoding
├── scheduler.rs # IoScheduler + IoRingHandle (4-ring architecture)
├── sst.rs # SST block/bloom/index builders and readers
├── storage.rs # Path management (root, wal, sst, manifest)
├── util.rs # Shared filesystem/hash helpers
├── wal.rs # Double-buffered WAL with O_DSYNC + IOPOLL
└── lib.rs # LsmDb API wiring everything together
- `MemTable` is a skiplist-based map storing `(key, value)` pairs in sorted order
- Double-buffered WAL: One buffer encodes while the other flushes, enabling CPU/IO pipelining
- Group commit: Batches writes up to 64KB threshold before issuing write
- O_DSYNC mode: WAL uses O_DSYNC instead of explicit fsync, enabling IOPOLL compatibility
- WAL writes use `WriteFixed` (registered buffers) for zero-copy I/O
- Dedicated WAL Ring with SQPOLL + IOPOLL: Microsecond-level submission latency
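A minimal sketch of a length-prefixed WAL record, assuming a hypothetical `[key_len][val_len][key][value]` layout (the actual encoding in `memtable.rs` may differ; a real format would also carry checksums and sequence numbers):

```rust
// Hypothetical WAL record layout (illustration only):
//   [key_len: u32 LE][val_len: u32 LE][key bytes][value bytes]
// Records are appended into the active half of the double buffer.
fn encode_wal_entry(buf: &mut Vec<u8>, key: &[u8], value: &[u8]) {
    buf.extend_from_slice(&(key.len() as u32).to_le_bytes());
    buf.extend_from_slice(&(value.len() as u32).to_le_bytes());
    buf.extend_from_slice(key);
    buf.extend_from_slice(value);
}

// Returns (key, value, bytes consumed) so replay can walk a buffer.
fn decode_wal_entry(buf: &[u8]) -> (&[u8], &[u8], usize) {
    let klen = u32::from_le_bytes(buf[0..4].try_into().unwrap()) as usize;
    let vlen = u32::from_le_bytes(buf[4..8].try_into().unwrap()) as usize;
    let key = &buf[8..8 + klen];
    let value = &buf[8 + klen..8 + klen + vlen];
    (key, value, 8 + klen + vlen)
}

fn main() {
    let mut buf = Vec::new();
    encode_wal_entry(&mut buf, b"k1", b"hello");
    let (k, v, used) = decode_wal_entry(&buf);
    assert_eq!((k, v, used), (&b"k1"[..], &b"hello"[..], 15));
}
```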
- `SstFlusher` builds fixed-size blocks (default 16KiB) via `BlockBuilder`
- Flush pipeline with SQE chaining:
- Fallocate: Pre-allocates file space to avoid write-time allocation latency
- Writes (4KB aligned for O_DIRECT):
  - All data blocks
  - Block index (first key + block offset)
  - Bloom filter bitset
  - 32-byte footer describing index/bloom offsets
- Fsync file: Ensures data durability
- Rename: Atomic file replacement
- Fsync directory: Ensures directory entry persistence
- Complete chain: `fallocate → writes → fsync(file) → rename → fsync(dir)` with `IO_LINK` + `SKIP_SUCCESS`
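One plausible shape for that 32-byte footer: four little-endian `u64` fields locating the index and Bloom filter (the exact field order in `sst.rs` is an assumption here):

```rust
// Hypothetical 32-byte SST footer: four u64 LE fields, written last so a
// reader can locate the block index and Bloom filter from the file tail.
#[derive(Debug, PartialEq)]
struct Footer {
    index_off: u64,
    index_len: u64,
    bloom_off: u64,
    bloom_len: u64,
}

impl Footer {
    fn encode(&self) -> [u8; 32] {
        let mut out = [0u8; 32];
        let fields = [self.index_off, self.index_len, self.bloom_off, self.bloom_len];
        for (i, field) in fields.into_iter().enumerate() {
            out[i * 8..i * 8 + 8].copy_from_slice(&field.to_le_bytes());
        }
        out
    }

    fn decode(raw: &[u8; 32]) -> Footer {
        let f = |i: usize| u64::from_le_bytes(raw[i * 8..i * 8 + 8].try_into().unwrap());
        Footer { index_off: f(0), index_len: f(1), bloom_off: f(2), bloom_len: f(3) }
    }
}

fn main() {
    let f = Footer { index_off: 65536, index_len: 512, bloom_off: 66048, bloom_len: 256 };
    // Round-trips through the fixed 32-byte encoding.
    assert_eq!(Footer::decode(&f.encode()), f);
}
```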
- Each SST carries metadata loaded from footer: block index and Bloom filter
- Block Cache: User-space LRU cache with huge pages, uses `ReadFixed` for zero-copy reads
- Pre-parsed entries: Block entries are parsed once and cached for O(log n) binary search
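The pre-parsed representation can be sketched as a sorted entry vector with binary search (illustrative only; the cached layout in `cache.rs` may differ):

```rust
// A cached block whose entries were parsed once on fill. Point lookups are
// then answered with O(log n) binary search instead of re-decoding bytes.
struct ParsedBlock {
    entries: Vec<(Vec<u8>, Vec<u8>)>, // sorted by key
}

impl ParsedBlock {
    fn get(&self, key: &[u8]) -> Option<&[u8]> {
        self.entries
            .binary_search_by(|(k, _)| k.as_slice().cmp(key))
            .ok()
            .map(|i| self.entries[i].1.as_slice())
    }
}

fn main() {
    let block = ParsedBlock {
        entries: vec![
            (b"apple".to_vec(), b"1".to_vec()),
            (b"mango".to_vec(), b"2".to_vec()),
        ],
    };
    assert_eq!(block.get(b"mango"), Some(&b"2"[..]));
    assert_eq!(block.get(b"pear"), None);
}
```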
- `LsmDb::get` checks the memtable first, then scans SSTs in level+id priority order
- PipelinedIterator: Issues multiple outstanding block reads via buffer-rings
- FADVISE: Uses `POSIX_FADV_SEQUENTIAL` for sequential access hints
- Configurable prefetch window size
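The lookup order can be sketched like this, with each SST reduced to a plain sorted map (real probes go through the Bloom filter and block cache first; the exact level/id tie-break is an assumption):

```rust
use std::collections::BTreeMap;

// Read-path sketch: consult the memtable first (freshest data), then SSTs
// in priority order: lower level first, newer (higher) id within a level.
struct Sst {
    level: u32,
    id: u64,
    data: BTreeMap<Vec<u8>, Vec<u8>>,
}

fn get(memtable: &BTreeMap<Vec<u8>, Vec<u8>>, ssts: &[Sst], key: &[u8]) -> Option<Vec<u8>> {
    if let Some(v) = memtable.get(key) {
        return Some(v.clone()); // in-memory value always wins
    }
    let mut order: Vec<&Sst> = ssts.iter().collect();
    order.sort_by(|a, b| a.level.cmp(&b.level).then(b.id.cmp(&a.id)));
    order.into_iter().find_map(|s| s.data.get(key).cloned())
}

fn main() {
    let mut memtable = BTreeMap::new();
    memtable.insert(b"k".to_vec(), b"new".to_vec());
    let mut old = BTreeMap::new();
    old.insert(b"k".to_vec(), b"old".to_vec());
    let ssts = vec![Sst { level: 0, id: 1, data: old }];
    // The memtable shadows the stale SST value.
    assert_eq!(get(&memtable, &ssts, b"k"), Some(b"new".to_vec()));
}
```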
- `Compactor` merges level-0 SSTs using `PipelinedIterator` for streaming reads
- O_DIRECT for compaction reads: Bypasses page cache to avoid pollution
- IOSQE_ASYNC: Offloads large file reads to io-wq threads
- `CompactionQueue` runs in a dedicated thread, decoupled from foreground work
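The merge step itself is a standard k-way heap merge over sorted runs; a sketch (the newest-wins tie-break, lower run index = newer, is an assumption here):

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// K-way merge of sorted runs with a min-heap: pop the globally smallest
// key; on duplicates, keep only the version from the newest run.
fn merge(runs: Vec<Vec<(&str, &str)>>) -> Vec<(String, String)> {
    let mut heap = BinaryHeap::new();
    for (run, entries) in runs.iter().enumerate() {
        if !entries.is_empty() {
            // Ties on key break toward the smaller run index (= newer run).
            heap.push(Reverse((entries[0].0, run, 0usize)));
        }
    }
    let mut out: Vec<(String, String)> = Vec::new();
    while let Some(Reverse((key, run, idx))) = heap.pop() {
        // First occurrence of a key is the newest; skip older duplicates.
        if out.last().map(|(k, _)| k.as_str()) != Some(key) {
            out.push((key.to_string(), runs[run][idx].1.to_string()));
        }
        if idx + 1 < runs[run].len() {
            heap.push(Reverse((runs[run][idx + 1].0, run, idx + 1)));
        }
    }
    out
}

fn main() {
    let merged = merge(vec![
        vec![("a", "new"), ("c", "1")], // newer run
        vec![("a", "old"), ("b", "2")], // older run
    ]);
    assert_eq!(merged[0], ("a".to_string(), "new".to_string()));
    assert_eq!(merged.len(), 3);
}
```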
| Feature | Description |
|---|---|
| SINGLE_ISSUER | One thread per ring, reduces locking |
| COOP_TASKRUN | Cooperative task running, reduces latency jitter |
| DEFER_TASKRUN | Defers work to io_uring_enter() |
| NO_SQARRAY | Removes SQ array indirection (kernel 6.6+) |
| SUBMIT_ALL | Batch-friendly submission |
| SQPOLL | Kernel polling thread for all 4 rings |
| IOPOLL | Kernel polls for completion (Foreground + WAL rings) |
| Feature | Description |
|---|---|
| Fixed Files | register_files_sparse(64) avoids fd lookup |
| Fixed Buffers | WriteFixed/ReadFixed with pre-registered buffers avoid per-I/O page pinning |
| Buffer Ring | Provided buffers with user-space recycling |
| Ring FD | Registered ring fd avoids fd lookup on each io_uring_enter |
| Sparse Buffers | IORING_REGISTER_BUFFERS2 for dynamic registration |
| Feature | Description |
|---|---|
| IO_LINK | Chains fallocate→write→fsync→rename→fsync(dir) |
| SKIP_SUCCESS | Reduces CQE count for linked ops |
| ASYNC | Offloads to io-wq for compaction |
| FADVISE | Sequential read hints |
| FALLOCATE | Pre-allocates file space to reduce tail latency |
| Feature | Description |
|---|---|
| FEAT_NODROP | Guarantees no CQE drops (kernel 5.5+) |
| O_DSYNC | WAL uses O_DSYNC for IOPOLL compatibility |
| Feature | Minimum Version |
|---|---|
| Basic io_uring | 5.1 |
| IOPOLL | 5.1 |
| SQPOLL | 5.4 |
| FEAT_NODROP | 5.5 |
| Buffer Ring | 5.19 |
| SINGLE_ISSUER | 6.0 |
| DEFER_TASKRUN | 6.1 |
| NO_SQARRAY | 6.6 |
Recommended: >= 6.1
IOPOLL mode requires NVMe or other polling-capable block devices. The engine automatically detects the device type on startup.
Supported devices:
- NVMe (major 259) - Full IOPOLL support
- SCSI (major 8) - Warning displayed, may work
- Loop devices (major 7) - For testing
- Device-mapper (major 253) - For testing
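A sketch of that classification by major number (the engine's actual detection logic may differ):

```rust
// Maps a block device's major number to IOPOLL support, per the list above.
fn iopoll_support(major: u32) -> &'static str {
    match major {
        259 => "nvme: full IOPOLL support",
        8 => "scsi: warning displayed, may work",
        7 => "loop: testing only",
        253 => "device-mapper: testing only",
        _ => "unknown: IOPOLL disabled",
    }
}

fn main() {
    println!("259 -> {}", iopoll_support(259));
    println!("8   -> {}", iopoll_support(8));
}
```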
io_uring requires locked memory for ring buffers and registered buffers. The default 8MB limit may be insufficient for 4 SQPOLL rings.
```
# /etc/security/limits.d/99-iouring.conf
* soft memlock unlimited
* hard memlock unlimited
```

```rust
pub struct LsmOptions {
    /// Number of 16KB block cache slots (default: 64 = 1MB)
    /// Each slot requires locked memory (memlock).
    pub block_cache_slots: usize,
    /// Force SINGLE_ISSUER flag (default: false)
    pub force_single_issuer: bool,
    /// WAL durability mode (default: Sync)
    pub wal_durability: WalDurability,
}

pub enum WalDurability {
    /// Sync after every write (safest, ~900 ops/s)
    Sync,
    /// Batch writes and sync periodically (~27K ops/s)
    /// - timeout: max time before sync (default: 1ms)
    /// - batch_size: max entries before sync (default: 32)
    GroupCommit { timeout: Duration, batch_size: usize },
    /// No sync (fastest, data may be lost on crash)
    NoSync,
}
```

| Mode | Throughput | Durability | Use Case |
|---|---|---|---|
| Sync | ~900 ops/s | Per-write | Financial transactions |
| GroupCommit | ~27K ops/s | Batched (1ms window) | General workloads |
| NoSync | ~280K ops/s | None | Caching, ephemeral data |
Recommended: GroupCommit for most workloads. It provides 30x better throughput than Sync while limiting potential data loss to the last 1ms of writes.
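Choosing the recommended mode might look like this (the enum is copied from the definition above so the snippet stands alone; in the real crate it would be imported from the engine):

```rust
use std::time::Duration;

// Mirrors the WalDurability definition shown earlier in this README.
#[allow(dead_code)]
#[derive(Debug)]
pub enum WalDurability {
    Sync,
    GroupCommit { timeout: Duration, batch_size: usize },
    NoSync,
}

/// The recommended default: bound data loss to roughly the last 1ms of
/// writes while amortizing fsync across up to 32 entries.
pub fn recommended() -> WalDurability {
    WalDurability::GroupCommit {
        timeout: Duration::from_millis(1),
        batch_size: 32,
    }
}

fn main() {
    println!("{:?}", recommended());
}
```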
| Component | Memory Usage |
|---|---|
| 4 SQPOLL rings | ~4-6 MB |
| Block cache (default 64 slots) | 1 MB (64 × 16KB) |
| Total | ~5-7 MB |
Run with:
```
cargo bench --bench rocksdb_comparison
```

Key benchmarks:
- `point_get/meme_warmed/sequential`: Point query with warmed cache (sequential access)
- `point_get/meme_minimal_cache/random`: Raw io_uring disk I/O (random access, minimal cache)
- `point_get/rocksdb_no_cache_direct/random`: RocksDB O_DIRECT without cache (random access)
- `point_get/rocksdb_buffered_warmed/sequential`: RocksDB with OS page cache
- `point_get/rocksdb_*_cache_warmed/sequential`: RocksDB with warmed user-space cache
- Range iterators: Level-aware scheduling for multi-level merges
- Bloom configuration: Per-SST or per-level size tuning
- Manifest richness: File sizes, creation seqnos, compaction priorities
- Compaction throttling: Rate limiting and IO priority adjustments
- Metrics/observability: Structured tracing for queue depth and latency
- Transactions: Multi-put grouping and commit callbacks
- io-wq affinity: `IORING_REGISTER_IOWQ_AFF` for core binding