Skip to content

Conversation

@shubhamsrkdev
Copy link
Contributor

@shubhamsrkdev shubhamsrkdev commented Oct 7, 2025

Problem

Lucene uses CFS ratio (which is by default 10%) to determine whether to use Compound Files.

Solution

In this PR we are doing the following:

  • Switching from a CFS ratio to a fixed threshold which is:

    • 64MB for byte-size-based merge policies
    • 65,536 docs for doc-based merge policies
  • Moved CFS configuration to CompoundFormat.java from Merge policies.

Fixes #14959

@shubhamsrkdev shubhamsrkdev marked this pull request as ready for review October 7, 2025 14:28
}

// Apply appropriate threshold based on merge policy type
if (mergePolicy instanceof LogDocMergePolicy) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great if we can avoid customizing it for specific policies, otherwise it might be tricky to maintain in the future, if e.g. there is another policy that is based on doc size not bytes.

Maybe we can add a enum and a method to MergePolicy which returns its unit (bytes/docs) , and use it here to decide which threshold to use?

Or do we want to always choose compound format based on size in bytes even for LogDocMergePolicy? In this case we might be able to use merge.getMergeInfo().sizeInBytes() when we call this method and avoid relying on MergePolicy#size all together?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to the idea of Enums, I will wait if anyone else has other suggestions here but having an enum makes most sense to me.

@github-actions
Copy link
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Oct 25, 2025
Copy link
Contributor

@stefanvodita stefanvodita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had looked at this PR before, but I see it's gone stale. Hoping to revive it.
Surprised how many changes we needed for this!

@github-actions github-actions bot removed the Stale label Nov 11, 2025
Copy link
Contributor

@stefanvodita stefanvodita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing my comments! There are two questions left for me from the comments:

  1. Whether the thresholds we're starting with are good enough for a first iteration.
  2. Whether we can avoid branching logic based on which merge policy is in use.

@github-actions
Copy link
Contributor

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the dev@lucene.apache.org list. Thank you for your contribution!

@github-actions github-actions bot added the Stale label Nov 26, 2025
@github-actions github-actions bot removed the Stale label Jan 21, 2026
@github-actions github-actions bot added this to the 11.0.0 milestone Jan 23, 2026
public void testUniqueValuesCompression() throws IOException {
try (final Directory dir = new ByteBuffersDirectory()) {
final IndexWriterConfig iwc = new IndexWriterConfig(new MockAnalyzer(random()));
iwc.getCodec().compoundFormat().setShouldUseCompoundFile(false);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We had to disable compound files here. The reason being, before this change we had a CFS ratio (10%) which in this case would not result in compound file creation. However since we switched to a fixed threshold (64 MB now) these files are way below the threshold leading to compound file creation and thus increase in file sizes. The increase is minimal (4 bytes for me below the original threshold) but enough to fail the tests. Threefore, I have disabled compound files here.

Copy link
Contributor

@stefanvodita stefanvodita left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for following up on the remaining comments. I'll try to merge soon to avoid more merge conflicts for this large PR.

@stefanvodita stefanvodita merged commit 6643a5d into apache:main Jan 23, 2026
12 checks passed
@shubhamsrkdev shubhamsrkdev deleted the setCfs branch January 23, 2026 16:06
@romseygeek
Copy link
Contributor

This looks like it's causing failures in some backwards-codecs tests:

TestLucene87StoredFieldsFormatHighCompression > testMergeStability FAILED
--
java.lang.AssertionError: expected:<{fnm=77, fdt=37004, fdx=64, fdm=159}> but was:<{cfs=37366, cfe=150}>
at __randomizedtesting.SeedInfo.seed([DF7E25E7C981E11B:AB3263C8C46BE3AD]:0)
at junit@4.13.2/org.junit.Assert.fail(Assert.java:89)
at junit@4.13.2/org.junit.Assert.failNotEquals(Assert.java:835)
at junit@4.13.2/org.junit.Assert.assertEquals(Assert.java:120)
at junit@4.13.2/org.junit.Assert.assertEquals(Assert.java:146)
at org.apache.lucene.test_framework@11.0.0-SNAPSHOT/org.apache.lucene.tests.index.BaseIndexFileFormatTestCase.testMergeStability(BaseIndexFileFormatTestCase.java:306)
gradlew test --tests TestLucene87StoredFieldsFormatHighCompression.testMergeStability -Dtests.seed=DF7E25E7C981E11B -Dtests.nightly=true -Dtests.locale=ig-Latn-NG -Dtests.timezone=Asia/Qostanay -Dtests.asserts=true -Dtests.file.encoding=UTF-8

I think these are nightly-only so may have been missed by normal test runs.

@stefanvodita
Copy link
Contributor

Thank you for pointing out the issue @romseygeek and thank you for addressing it quickly @shubhamsrkdev!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Switch to a fixed CFS threshold

4 participants