Append-in-place compaction: skip rewriting contiguous prefix of sorted segments #3

@vicnaum

Description

Context

During shard compaction, the entire shard is rewritten from scratch — even when the sorted segments already contain a contiguous prefix and the WAL only adds blocks after it. On slow storage (e.g. an external HDD), compacting a shard with 7000 existing sorted rows plus 3000 new WAL rows rewrites all 10000 rows, which can take 30-120 seconds and saturate disk I/O.

Proposal

Detect when the existing sorted segment has a contiguous prefix (offsets 0..N with no gaps) and the WAL only contains blocks at offsets >= N, then append the new rows to the existing segment files instead of rewriting the whole shard.

How to detect

The sorted segments are nippy-jar files with an offset table. The bitset tracks which offsets have data. To check the "clean prefix + WAL tail" pattern:

  1. Scan the existing bitset to find the highest contiguous offset from 0: contiguous_end = first offset where bit is NOT set (or shard_size if all set)
  2. Check the WAL entries: if all WAL block offsets are >= contiguous_end, append-in-place is safe
  3. If any WAL entry falls within the existing prefix range, fall back to full rewrite (the WAL entry takes precedence over existing data)
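The three steps above boil down to a single branch. A minimal sketch (the names `choose_path`, `CompactionPath`, `prefix_end`, and `wal_offsets` are illustrative, not the crate's real API):

```rust
/// Possible compaction strategies for a shard.
enum CompactionPath {
    /// Existing prefix is intact; only write rows from `from` onward.
    AppendInPlace { from: usize },
    /// Current behavior: rewrite every row.
    FullRewrite,
}

/// Decide between append-in-place and a full rewrite. `prefix_end` is
/// the first unset offset in the bitset; `wal_offsets` are the block
/// offsets present in the WAL.
fn choose_path(prefix_end: usize, wal_offsets: &[usize]) -> CompactionPath {
    // A WAL entry inside the existing prefix takes precedence over the
    // sorted data, so that case forces a full rewrite (step 3).
    if prefix_end == 0 || wal_offsets.iter().any(|&o| o < prefix_end) {
        CompactionPath::FullRewrite
    } else {
        // Steps 1-2 passed: everything in the WAL extends the tip.
        CompactionPath::AppendInPlace { from: prefix_end }
    }
}
```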

How to append

The nippy-jar segment format is append-friendly:

  • Data file: raw bytes concatenated; open in append mode and keep writing.
  • Offsets file: [offset_size_byte][offset_0][offset_1]…. Append new offsets continuing from the last offset value.
  • Config file: update the row count and max_row_size, then rewrite the small .conf file.

For gap rows between contiguous_end and the first WAL entry, write empty entries (offset-only, no data bytes) — same as push_empty() does today.
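A sketch of how the offsets table grows under this scheme, assuming each entry stores the end offset of its row (the real nippy-jar encoding, including entry width and start-vs-end semantics, may differ; `append_offsets` and its parameters are hypothetical):

```rust
/// Extend an in-memory copy of the offsets table. Gap rows between the
/// clean prefix and the first WAL entry get offset-only entries (the
/// offset repeats, so the row decodes as zero bytes -- mirroring what
/// push_empty() produces today); real rows advance the offset by their
/// data length.
fn append_offsets(offsets: &mut Vec<u64>, gap_rows: usize, new_row_lens: &[u64]) {
    let mut last = *offsets.last().expect("offsets table is never empty");
    for _ in 0..gap_rows {
        // Empty entry: same offset as the previous row, no data bytes.
        offsets.push(last);
    }
    for &len in new_row_lens {
        last += len;
        offsets.push(last);
    }
}
```

On disk this is pure appending: the data file and offsets file are opened in append mode, and only the small config file is rewritten.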

Implementation sketch

In compact_shard() (storage/sharded/mod.rs ~line 1018):

// After building WAL slice index:
let wal_min_offset = wal_slices.min_offset();  // lowest block offset in WAL
let prefix_end = bitset.contiguous_prefix_len();  // first unset bit from 0

if wal_min_offset >= prefix_end && prefix_end > 0 {
    // Append path: open existing segments in append mode, 
    // write rows from prefix_end..max_offset (gaps + WAL data)
    // Update bitset, meta, config
} else {
    // Full rewrite path (current behavior)
}

When this helps most

  • Restart after partial sync: Node synced 7000/10000 blocks in a shard, compacted, restarted, fetched remaining 3000. Currently rewrites all 10000. With this optimization, only writes 3000.
  • Follow mode compaction: New blocks always extend the tip, so sorted segments always have a clean prefix.
  • Fresh fast-sync: Less benefit since shards fill from scattered peer responses. However, if inline compaction triggers mid-shard (e.g., after a partial fill), subsequent compactions would benefit.

Complexity

Medium. Requires:

  • Bitset::contiguous_prefix_len() — scan bytes for first non-0xFF, then check bits. ~10 lines.
  • WalSliceIndex::min_offset() — scan entries for first Some. ~5 lines.
  • Append-mode writer variant for SegmentRawWriter — open existing files, seek to end, continue from last offset. ~40-50 lines.
  • Branch in compact_shard() for the append path. ~30 lines.
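The `Bitset::contiguous_prefix_len()` bullet can be sketched as described: skip whole 0xFF bytes, then check bits in the first partial byte. A hedged example assuming a little-endian bit order within each byte (the actual `Bitset` layout may differ):

```rust
/// Number of contiguous set bits starting at bit 0, over `nbits` bits.
fn contiguous_prefix_len(bytes: &[u8], nbits: usize) -> usize {
    // Fast path: skip bytes that are fully set (0xFF).
    let full = bytes.iter().position(|&b| b != 0xFF).unwrap_or(bytes.len());
    let mut i = full * 8;
    if full < bytes.len() {
        // Slow path: walk bits of the first partially-set byte.
        let b = bytes[full];
        while i < nbits && b & (1 << (i % 8)) != 0 {
            i += 1;
        }
    }
    i.min(nbits)
}
```

`WalSliceIndex::min_offset()` is the same shape in miniature: a linear scan for the first populated entry.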

Not in scope

This does NOT address WAL-only compaction (no existing sorted data). That's always a full write, which is fine since it only happens once per shard.

Labels

enhancement, performance
