prudhvigodithi commented Oct 30, 2025

Description

Coming from #14485 and #13745 (Initial implementation of intra-segment search concurrency #13542). When a segment is split into partitions for intra-segment search, each partition creates a DocIdSetBuilder that allocates memory based on the entire segment size, even though it only collects documents within a small partition range. This PR adds partition-aware support to DocIdSetBuilder, so that each builder creates bitsets and buffers scoped to its own doc ID range instead of the entire segment. This reduces memory usage during intra-segment search.

Example: for a segment with 1M documents split into 4 partitions of 250K docs each, every partition currently creates a FixedBitSet(1M), which is far larger than the partition needs.
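To put rough numbers on that example, here is a back-of-the-envelope sketch. It assumes a bitset backed by a long[] with one bit per document (which is how FixedBitSet stores its bits) and ignores object headers. The class and method names are hypothetical and are not part of this PR.

```java
/** Hypothetical helper: estimates the backing-array size of a long[]-based bitset. */
final class BitSetMemoryEstimate {

  /** Number of 64-bit words needed to hold numBits bits, rounded up. */
  static long wordsFor(long numBits) {
    return (numBits + 63) / 64;
  }

  /** Approximate bytes used by the backing long[] (object headers ignored). */
  static long bytesFor(long numBits) {
    return wordsFor(numBits) * Long.BYTES;
  }

  public static void main(String[] args) {
    long segmentDocs = 1_000_000;
    int partitions = 4;
    long partitionDocs = segmentDocs / partitions;

    // Every partition allocates a bitset sized for the whole segment.
    long segmentSized = partitions * bytesFor(segmentDocs);
    // Every partition allocates a bitset sized for its own range only.
    long partitionSized = partitions * bytesFor(partitionDocs);

    System.out.println("segment-sized bitsets:   " + segmentSized + " bytes");   // 500000
    System.out.println("partition-sized bitsets: " + partitionSized + " bytes"); // 125024
  }
}
```

Under these assumptions the four partition-sized bitsets together take about as much memory as one segment-sized bitset, so the saving grows with the number of partitions.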

PartitionAwareBufferAdder:

  • Filters documents to only accept those within the minDocId to maxDocId range.
  • Stores absolute doc IDs in buffers (used for sparse results below the threshold) and rejects doc IDs that are not part of the partition range.
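The buffer path described above can be sketched roughly as follows. This is a minimal illustration, not the PR's PartitionAwareBufferAdder: the class name is hypothetical, and it assumes the range is half-open (minDocId inclusive, maxDocId exclusive), matching the [50,000 to 60,000) notation used later in this description.

```java
import java.util.Arrays;

/** Hypothetical sketch of a range-filtering buffer; not the PR's actual class. */
final class RangeFilteringBuffer {
  private final int minDocId; // inclusive lower bound (assumed)
  private final int maxDocId; // exclusive upper bound (assumed)
  private int[] docs = new int[8];
  private int size;

  RangeFilteringBuffer(int minDocId, int maxDocId) {
    if (minDocId < 0 || maxDocId < minDocId) {
      throw new IllegalArgumentException("invalid range");
    }
    this.minDocId = minDocId;
    this.maxDocId = maxDocId;
  }

  /** Stores the absolute doc ID unchanged if it is in range, otherwise drops it. */
  void add(int doc) {
    if (doc < minDocId || doc >= maxDocId) {
      return; // outside this partition
    }
    if (size == docs.length) {
      docs = Arrays.copyOf(docs, size * 2); // grow only as documents are accepted
    }
    docs[size++] = doc;
  }

  /** Returns a copy of the accepted doc IDs, in insertion order. */
  int[] collected() {
    return Arrays.copyOf(docs, size);
  }

  public static void main(String[] args) {
    RangeFilteringBuffer buffer = new RangeFilteringBuffer(50_000, 60_000);
    for (int doc : new int[] {49_999, 50_000, 59_999, 60_000}) {
      buffer.add(doc);
    }
    System.out.println(Arrays.toString(buffer.collected())); // [50000, 59999]
  }
}
```

Because the buffer keeps absolute doc IDs, no offset conversion is needed on this path; only the bitset path stores relative indices.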

PartitionAwareFixedBitSetAdder:

  • Filters documents to only accept those within the partition range.
  • Uses a partition-sized bitset instead of a segment-sized one.
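A rough sketch of the bitset path, under a few loudly stated assumptions: java.util.BitSet stands in for Lucene's FixedBitSet (unlike FixedBitSet it can grow on demand, so it is only an approximation), the range is taken to be half-open, and the class name is hypothetical rather than the PR's PartitionAwareFixedBitSetAdder.

```java
import java.util.BitSet;

/** Hypothetical sketch of a partition-sized bitset collector; not the PR's actual class. */
final class PartitionBitSetSketch {
  private final int minDocId; // inclusive lower bound (assumed)
  private final int maxDocId; // exclusive upper bound (assumed)
  private final BitSet bits;  // stand-in for Lucene's FixedBitSet

  PartitionBitSetSketch(int minDocId, int maxDocId) {
    if (minDocId < 0 || maxDocId < minDocId) {
      throw new IllegalArgumentException("invalid range");
    }
    this.minDocId = minDocId;
    this.maxDocId = maxDocId;
    // Sized for the partition only, not for the whole segment.
    this.bits = new BitSet(maxDocId - minDocId);
  }

  /** Sets the bit at the partition-relative index; out-of-range docs are ignored. */
  void add(int doc) {
    if (doc >= minDocId && doc < maxDocId) {
      bits.set(doc - minDocId);
    }
  }

  /** Looks up an absolute doc ID by converting it to its relative index. */
  boolean contains(int doc) {
    return doc >= minDocId && doc < maxDocId && bits.get(doc - minDocId);
  }

  int cardinality() {
    return bits.cardinality();
  }

  /** Highest relative index that is set, or -1 if nothing was collected. */
  int highestRelativeIndex() {
    return bits.length() - 1;
  }

  public static void main(String[] args) {
    PartitionBitSetSketch adder = new PartitionBitSetSketch(50_000, 60_000);
    for (int doc : new int[] {10, 50_000, 55_000, 59_999, 75_000}) {
      adder.add(doc);
    }
    System.out.println(adder.cardinality());          // 3
    System.out.println(adder.highestRelativeIndex()); // 9999
  }
}
```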

OffsetBitDocIdSet & OffsetDocIdSetIterator:

  • FixedBitSet uses the doc ID parameter directly as a bit index. When we create partition-sized bitsets to save memory, we store documents using relative indices (0 to partitionSize-1) internally, but the Lucene API requires iterators to return absolute doc IDs.
  • These wrapper classes add the partition offset during iteration (when PartitionAwareFixedBitSetAdder is used), which converts partition-relative indices back to absolute doc IDs.
  • Callers should always receive absolute doc IDs.

Segment: 100,000 documents
Partition: [50,000 to 60,000) - only 10,000 docs

Without Optimization (Old Way):

Create bitset for ENTIRE segment:
FixedBitSet(100,000 bits)

Bit position:  0     1     2  ... 50000 ... 50500 ... 55000 ... 59999 ... 99999
               ↓     ↓     ↓        ↓         ↓         ↓         ↓         ↓
Bit value:     0     0     0        1         1         1         1         0


With Optimization (New Way):

Create bitset with ONLY partition size:
FixedBitSet(10,000 bits)

Bit position:  0    1    2    ... 500  ... 5000 ... 9999
               ↓    ↓    ↓         ↓        ↓        ↓
Bit value:     1    0    0         1        1        1
               └───────────────────────────────────────┘
                All bits used efficiently!

Storage mapping (with offset):
  Doc 50,000 → Bit[0]     (50,000 - 50,000 = 0)
  Doc 50,500 → Bit[500]   (50,500 - 50,000 = 500)
  Doc 55,000 → Bit[5,000] (55,000 - 50,000 = 5,000)
  Doc 59,999 → Bit[9,999] (59,999 - 50,000 = 9,999)
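The reverse direction of the storage mapping above (relative bit index back to absolute doc ID) can be sketched like this. It is an illustration only: java.util.BitSet stands in for FixedBitSet, the class is hypothetical and is not the PR's OffsetDocIdSetIterator, and the local NO_MORE_DOCS constant merely mirrors the sentinel value used by Lucene's DocIdSetIterator.

```java
import java.util.BitSet;

/** Hypothetical sketch of an offset-adding iterator; not the PR's actual class. */
final class OffsetIteratorSketch {
  /** Local sentinel mirroring DocIdSetIterator.NO_MORE_DOCS. */
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final BitSet relativeBits; // bits indexed 0..partitionSize-1
  private final int offset;          // the partition's first doc ID
  private int searchFrom = 0;        // next relative index to scan, or -1 when exhausted

  OffsetIteratorSketch(BitSet relativeBits, int offset) {
    this.relativeBits = relativeBits;
    this.offset = offset;
  }

  /** Returns the next absolute doc ID, or NO_MORE_DOCS once the bitset is exhausted. */
  int nextDoc() {
    if (searchFrom < 0) {
      return NO_MORE_DOCS;
    }
    int relative = relativeBits.nextSetBit(searchFrom);
    if (relative < 0) {
      searchFrom = -1;
      return NO_MORE_DOCS;
    }
    searchFrom = relative + 1;
    return relative + offset; // relative index -> absolute doc ID
  }

  public static void main(String[] args) {
    BitSet bits = new BitSet(10_000);
    for (int relative : new int[] {0, 500, 5_000, 9_999}) {
      bits.set(relative);
    }
    OffsetIteratorSketch it = new OffsetIteratorSketch(bits, 50_000);
    for (int doc = it.nextDoc(); doc != NO_MORE_DOCS; doc = it.nextDoc()) {
      System.out.println(doc); // 50000, 50500, 55000, 59999
    }
  }
}
```

Keeping the conversion inside the iterator means that code consuming the doc ID set never sees relative indices.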

prudhvigodithi commented Oct 31, 2025

Hey all, I still need to add some tests/validations and clean up the code, but before that I would like to get some early feedback on the approach to see whether the idea makes sense.

prudhvigodithi marked this pull request as ready for review October 31, 2025 15:34
@prudhvigodithi

Adding @jainankitk @getsaurabh02 to the conversation.

Signed-off-by: Prudhvi Godithi <[email protected]>
github-actions bot commented Nov 3, 2025

This PR does not have an entry in lucene/CHANGES.txt. Consider adding one. If the PR doesn't need a changelog entry, then add the skip-changelog label to it and you will stop receiving this reminder on future updates to the PR.

@prudhvigodithi

OK, the existing checks and tests are now green. Let me add some tests in TestDocIdSetBuilder.

Comment on lines +44 to +48
public sealed interface BulkAdder
    permits FixedBitSetAdder,
        BufferAdder,
        PartitionAwareFixedBitSetAdder,
        PartitionAwareBufferAdder {

benwtrent commented Nov 3, 2025

This is now megamorphic :(

Contributor

That's a good point. We should run the benchmark to quantify the impact of virtual calls and megamorphism. Also, assuming the impact is significant, I am wondering if we can use PartitionAwareFixedBitSetAdder directly instead of FixedBitSetAdder?
