@silver-ymz (Member) commented Dec 1, 2025

Resolves #94.

This PR refactors the posting list serialization to avoid allocating skip_info and block_data pages for small tokens in sealed segments.

On the 1M MS MARCO dataset, this reduces the index size from 16G to 4.7G.

Copilot AI left a comment

Pull request overview

This PR refactors the posting list serialization to avoid allocating skip_info and block_data pages for small tokens in sealed segments. The optimization defers writer initialization until the first block flush, preventing unnecessary page allocation for tokens that fit entirely in the unflushed buffer.
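The deferred-initialization idea can be illustrated with a minimal sketch. This is not the PR's code: `PageWriter`, `PostingSerializer`, the page counter, and the block size of 128 are hypothetical stand-ins chosen for illustration. The point it shows is that a term whose postings never fill a block never triggers a page allocation.

```rust
/// Assumed number of doc ids per full block (illustrative, not from the PR).
const BLOCK_SIZE: usize = 128;

/// Hypothetical page-backed writer; constructing one "allocates" a page.
struct PageWriter {
    data: Vec<u8>,
}

impl PageWriter {
    fn new(pages_allocated: &mut usize) -> PageWriter {
        *pages_allocated += 1;
        PageWriter { data: Vec::new() }
    }

    fn write(&mut self, bytes: &[u8]) {
        self.data.extend_from_slice(bytes);
    }
}

/// Hypothetical serializer for a single term's posting list.
struct PostingSerializer {
    buffer: Vec<u32>,               // doc ids not yet flushed into a block
    block_data: Option<PageWriter>, // stays None until the first flush
    pages_allocated: usize,
}

impl PostingSerializer {
    fn new() -> PostingSerializer {
        PostingSerializer { buffer: Vec::new(), block_data: None, pages_allocated: 0 }
    }

    fn push(&mut self, docid: u32) {
        self.buffer.push(docid);
        if self.buffer.len() == BLOCK_SIZE {
            self.flush_block();
        }
    }

    fn flush_block(&mut self) {
        // Borrow the fields separately so the closure can touch the counter.
        let PostingSerializer { buffer, block_data, pages_allocated } = self;
        // The writer (and its page) is created only on the first flush.
        let writer = block_data.get_or_insert_with(|| PageWriter::new(pages_allocated));
        for docid in buffer.drain(..) {
            writer.write(&docid.to_le_bytes());
        }
    }

    fn pages_allocated(&self) -> usize {
        self.pages_allocated
    }

    fn unflushed_len(&self) -> usize {
        self.buffer.len()
    }

    fn flushed_bytes(&self) -> usize {
        self.block_data.as_ref().map_or(0, |w| w.data.len())
    }
}
```

In this sketch, pushing five doc ids leaves `pages_allocated()` at zero, while pushing more than `BLOCK_SIZE` ids allocates exactly one page and leaves the remainder in the buffer.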

Key changes:

  • Deferred initialization of skip_info and block_data writers until first flush_block call
  • Introduced PageWriteGuard::init_mut<T>() helper to simplify page initialization with Default trait
  • Changed PostingTermMetaData.last_full_block_last_docid from Option<NonZero<u32>> to u32 for simplicity
  • Refactored unflushed block metadata storage to use Option<SkipBlock> instead of separate doc_cnt field and skip_info page entry
  • Added VIRTUAL_INODE page flag for inode pages in the virtual page system
  • Renamed block_parttion to block_partition (typo fix)
  • Converted the bm25_page_size() function to a BM25_PAGE_SIZE constant, which is more idiomatic
  • Minor dependency version bumps (bitflags, bytemuck, generator)
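As a rough illustration of the metadata changes listed above, the sketch below stores the unflushed block's summary as an `Option<SkipBlock>` inside the term metadata instead of a separate count plus a skip_info page entry. The field layout, the field name `unflushed_skip_block`, the helper methods, and the use of 0 as a "no full block" sentinel are all assumptions made for this example, not details confirmed by the PR.

```rust
/// Hypothetical skip entry summarizing one block of postings.
#[derive(Debug, Clone, Copy, PartialEq)]
struct SkipBlock {
    last_docid: u32,
    doc_cnt: u32,
}

/// Hypothetical per-term metadata, loosely modeled on the key changes.
#[derive(Debug, Default)]
struct PostingTermMetaData {
    /// Assumption: 0 acts as the "no full block yet" sentinel.
    last_full_block_last_docid: u32,
    /// Present only when some postings have not been flushed to a block.
    unflushed_skip_block: Option<SkipBlock>,
}

impl PostingTermMetaData {
    /// Number of postings still held in the unflushed block.
    fn unflushed_doc_cnt(&self) -> u32 {
        self.unflushed_skip_block.map_or(0, |block| block.doc_cnt)
    }

    /// Largest doc id recorded for this term, if any.
    fn last_docid(&self) -> Option<u32> {
        match self.unflushed_skip_block {
            Some(block) => Some(block.last_docid),
            None if self.last_full_block_last_docid != 0 => {
                Some(self.last_full_block_last_docid)
            }
            None => None,
        }
    }
}
```

With this shape, a small term needs no skip_info page at all: everything the reader needs for the partial block travels with the term metadata.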

Reviewed changes

Copilot reviewed 16 out of 16 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| src/segment/posting/serializer.rs | Defers writer initialization to first flush, stores unflushed skip block in term_meta, adds validation |
| src/segment/posting/reader.rs | Reads unflushed skip block from term_meta, handles optional block_data pages with iterator pattern |
| src/segment/posting/mod.rs | Changes last_full_block_last_docid type, adds unfulled_skip_block field, adds Debug derives |
| src/segment/posting/append.rs | Refactored to use new term_meta structure, defers block_data_writer initialization |
| src/segment/meta.rs | Adds Default implementation for MetaPageData |
| src/segment/growing.rs | Converts bm25_page_size() calls to BM25_PAGE_SIZE constant |
| src/segment/delete.rs | Uses is_multiple_of() method instead of modulo check |
| src/page/virtual.rs | Adds helper functions, uses VIRTUAL_INODE flag for inode pages, adds first_blkno() accessor |
| src/page/reader.rs | Converts page_count() to PAGE_COUNT constant |
| src/page/postgres.rs | Adds init_mut() helper, converts to BM25_PAGE_SIZE constant, adds VIRTUAL_INODE flag, uses is_multiple_of() |
| src/page/mod.rs | Adds inspector module |
| src/page/inspector.rs | New debug inspection function for examining page contents |
| src/index/vacuum.rs | Converts to BM25_PAGE_SIZE constant |
| src/index/build.rs | Uses init_mut() helper for page initialization |
| src/algorithm/block_wand.rs | Minor code style cleanup (removes intermediate variable) |
| Cargo.toml | Updates dependency versions |

Signed-off-by: Mingzhuo Yin <[email protected]>
@silver-ymz force-pushed the refactor/sealed-segment-layout branch from 36df518 to c5ff56c on December 1, 2025 02:31
@silver-ymz requested a review from VoVAllen on December 1, 2025 02:32
@VoVAllen merged commit 9d3014e into tensorchord:main on Dec 3, 2025
14 checks passed
Development

Successfully merging this pull request may close these issues.

Unexpectedly large BM25 index size

2 participants