Context
During shard compaction, the entire shard is rewritten from scratch — even if sorted segments already contain a contiguous prefix and the WAL only adds blocks after it. On slow storage (external HDD), this means compacting a shard with 7000 existing sorted rows + 3000 new WAL rows rewrites all 10000 rows, which can take 30-120 seconds and saturate disk I/O.
Proposal
Detect when the existing sorted segment has a contiguous prefix (offsets 0..N with no gaps) and the WAL only contains blocks at offsets >= N, then append the new rows to the existing segment files instead of rewriting the whole shard.
How to detect
The sorted segments are nippy-jar files with an offset table. The bitset tracks which offsets have data. To check the "clean prefix + WAL tail" pattern:
- Scan the existing bitset to find the highest contiguous offset from 0: `contiguous_end` = first offset where the bit is NOT set (or `shard_size` if all bits are set)
- Check the WAL entries: if all WAL block offsets are `>= contiguous_end`, append-in-place is safe
- If any WAL entry falls within the existing prefix range, fall back to a full rewrite (the WAL entry takes precedence over existing data)
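The detection steps above can be sketched in Rust. `Bitset` and `can_append_in_place` here are toy stand-ins for illustration, not the real crate types:

```rust
// Toy model of the "clean prefix + WAL tail" check from the proposal.
struct Bitset {
    bytes: Vec<u8>,
    len: usize, // number of valid bits (shard size)
}

impl Bitset {
    /// First unset bit scanning from offset 0, or `len` if all bits are set.
    fn contiguous_prefix_len(&self) -> usize {
        for i in 0..self.len {
            if self.bytes[i / 8] & (1 << (i % 8)) == 0 {
                return i;
            }
        }
        self.len
    }
}

/// Append-in-place is safe only when every WAL offset lands at or past
/// the contiguous prefix, and there is a prefix to extend at all.
fn can_append_in_place(prefix_end: usize, wal_min_offset: Option<usize>) -> bool {
    match wal_min_offset {
        Some(min) => prefix_end > 0 && min >= prefix_end,
        None => false, // empty WAL: nothing to compact
    }
}

fn main() {
    // Offsets 0..=9 set except offset 7, so the clean prefix ends at 7.
    let bitset = Bitset { bytes: vec![0b0111_1111, 0b0000_0011], len: 10 };
    assert_eq!(bitset.contiguous_prefix_len(), 7);
    assert!(can_append_in_place(7, Some(8)));   // WAL starts past the prefix
    assert!(!can_append_in_place(7, Some(3)));  // WAL overlaps the prefix: full rewrite
}
```

A per-byte fast path (skip whole 0xFF bytes before checking individual bits) keeps this scan cheap even for large shards.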
How to append
The nippy-jar segment format is append-friendly:
- Data file: raw bytes concatenated. Just open in append mode and keep writing.
- Offsets file: `[offset_size_byte][offset_0][offset_1]...`. Append new offsets continuing from the last offset value.
- Config file: update the `rows` count and `max_row_size`, and rewrite the small `.conf` file.
For gap rows between `contiguous_end` and the first WAL entry, write empty entries (offset-only, no data bytes) — same as `push_empty()` does today.
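The offset arithmetic for appended rows and gap rows can be modeled in a few lines. `append_offsets` is a hypothetical helper for illustration; the real offsets table is written through the segment writer:

```rust
// Toy model of extending the offsets table. Each entry is the cumulative
// byte position in the data file; an empty (gap) row advances no bytes,
// so it simply repeats the previous offset, matching what push_empty()
// does in the current full-rewrite path.
fn append_offsets(existing: &mut Vec<u64>, gap_rows: usize, new_row_sizes: &[u64]) {
    let mut last = *existing.last().unwrap_or(&0);
    // Gap rows between contiguous_end and the first WAL entry: offset-only.
    for _ in 0..gap_rows {
        existing.push(last);
    }
    // Real WAL rows continue from the last offset value.
    for &size in new_row_sizes {
        last += size;
        existing.push(last);
    }
}

fn main() {
    let mut offsets = vec![10, 25, 40]; // 3 existing rows ending at byte 40
    append_offsets(&mut offsets, 2, &[12, 8]); // 2 gap rows, then 2 data rows
    assert_eq!(offsets, vec![10, 25, 40, 40, 40, 52, 60]);
}
```

Because the data file is plain concatenated bytes, the corresponding data write is just the new rows' bytes in append mode; nothing before `contiguous_end` is touched.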
Implementation sketch
In `compact_shard()` (`storage/sharded/mod.rs`, ~line 1018):

```rust
// After building the WAL slice index:
let wal_min_offset = wal_slices.min_offset();     // lowest block offset in WAL
let prefix_end = bitset.contiguous_prefix_len();  // first unset bit from 0

if wal_min_offset >= prefix_end && prefix_end > 0 {
    // Append path: open existing segments in append mode,
    // write rows from prefix_end..max_offset (gaps + WAL data),
    // update bitset, meta, config.
} else {
    // Full rewrite path (current behavior)
}
```
When this helps most
- Restart after partial sync: Node synced 7000/10000 blocks in a shard, compacted, restarted, fetched remaining 3000. Currently rewrites all 10000. With this optimization, only writes 3000.
- Follow mode compaction: New blocks always extend the tip, so sorted segments always have a clean prefix.
- Fresh fast-sync: Less benefit since shards fill from scattered peer responses. However, if inline compaction triggers mid-shard (e.g., after a partial fill), subsequent compactions would benefit.
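The expected saving in the partial-sync scenario above can be checked with a toy cost model. `compaction_rows_written` is illustrative only, not the real `compact_shard()` signature:

```rust
// Toy cost model: rows physically written by one compaction, given the
// shard size, the clean-prefix length, and the lowest WAL block offset.
fn compaction_rows_written(shard_size: usize, prefix_end: usize, wal_min_offset: usize) -> usize {
    if prefix_end > 0 && wal_min_offset >= prefix_end {
        shard_size - prefix_end // append path: gap rows + WAL data only
    } else {
        shard_size // full rewrite (current behavior)
    }
}

fn main() {
    // The issue's example: 7000 sorted rows + 3000 WAL rows at the tail.
    assert_eq!(compaction_rows_written(10_000, 7_000, 7_000), 3_000);
    // A WAL entry inside the prefix forces the full 10000-row rewrite.
    assert_eq!(compaction_rows_written(10_000, 7_000, 500), 10_000);
}
```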
Complexity
Medium. Requires:
- `Bitset::contiguous_prefix_len()` — scan bytes for the first non-0xFF byte, then check bits. ~10 lines.
- `WalSliceIndex::min_offset()` — scan entries for the first `Some`. ~5 lines.
- Append-mode writer variant for `SegmentRawWriter` — open existing files, seek to end, continue from the last offset. ~40-50 lines.
- Branch in `compact_shard()` for the append path. ~30 lines.
Not in scope
This does NOT address WAL-only compaction (no existing sorted data). That's always a full write, which is fine since it only happens once per shard.
Labels
enhancement, performance