@pkolaczk pkolaczk commented Jan 7, 2026

When a memory index contains very few rows and is split into
many shards, the number of rows can vary a lot between shards.
Hence, if we took only one shard to estimate the number of
matched rows and extrapolated that to all shards to estimate the
matching rows for the whole index, we would risk a huge
estimation error.
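
To make the variance concrete, here is a rough back-of-the-envelope model. It assumes (this is an assumption, not something stated in this PR) that each of the $n$ rows lands in one of $S$ shards independently and uniformly, so the row count $X$ of a single shard is binomial:

```latex
X \sim \mathrm{Binomial}\!\left(n, \tfrac{1}{S}\right), \qquad
\mathbb{E}[X] = \frac{n}{S}, \qquad
\mathrm{Var}[X] = \frac{n}{S}\left(1 - \frac{1}{S}\right), \qquad
\frac{\sqrt{\mathrm{Var}[X]}}{\mathbb{E}[X]} = \sqrt{\frac{S - 1}{n}}
```

Under that model, with $S = 8$ and $n = 16$ the relative standard deviation of a single-shard count is about 0.66, while with $n = 10^6$ it drops to about 0.0026. Extrapolating from one shard multiplies the count by $S$, which leaves the relative error unchanged.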

This commit changes the algorithm to take as many shards as needed
to collect enough returned or indexed rows. For very tiny datasets,
it is likely to use all shards for estimation. For big datasets,
one shard will likely be enough, speeding up estimation.

This change also allows us to remove one estimation method.
We no longer need to manually choose between estimating
from the first shard and estimating from all shards.
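
The adaptive sampling described above could be sketched roughly as follows. This is an illustration only, not the code in this PR: the class `ShardSamplingEstimate`, the `Shard` holder, the two thresholds, and the choice to extrapolate by the number of shards sampled are all hypothetical.

```java
import java.util.List;

// Hypothetical sketch; none of these names come from the actual codebase.
class ShardSamplingEstimate
{
    /** Per-shard counts assumed to be available after scanning one shard. */
    static final class Shard
    {
        final long indexedRows;  // rows stored in this shard
        final long matchedRows;  // rows in this shard that match the predicate

        Shard(long indexedRows, long matchedRows)
        {
            this.indexedRows = indexedRows;
            this.matchedRows = matchedRows;
        }
    }

    /**
     * Scans shards in order until enough indexed or matched rows were seen,
     * then extrapolates the matched count to all shards.
     */
    static long estimateMatchingRows(List<Shard> shards, long minIndexedRows, long minMatchedRows)
    {
        long indexed = 0;
        long matched = 0;
        int shardsUsed = 0;
        for (Shard shard : shards)
        {
            indexed += shard.indexedRows;
            matched += shard.matchedRows;
            shardsUsed++;
            // Stop as soon as the sample is large enough by either measure.
            if (indexed >= minIndexedRows || matched >= minMatchedRows)
                break;
        }
        if (shardsUsed == 0)
            return 0;
        // Assumed extrapolation: scale by the fraction of shards sampled.
        return Math.round((double) matched * shards.size() / shardsUsed);
    }

    public static void main(String[] args)
    {
        List<Shard> tiny = List.of(new Shard(3, 1), new Shard(0, 0), new Shard(5, 2), new Shard(2, 1));
        List<Shard> big = List.of(new Shard(1000, 50), new Shard(1200, 70), new Shard(900, 40), new Shard(1100, 60));
        System.out.println(estimateMatchingRows(tiny, 100, 10));
        System.out.println(estimateMatchingRows(big, 100, 10));
    }
}
```

With the thresholds used in `main`, the tiny dataset never reaches either threshold, so all four shards are scanned and no extrapolation happens. The big dataset satisfies the indexed-rows threshold after the first shard, so only that shard is scanned.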

github-actions bot commented Jan 7, 2026

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one

@pkolaczk pkolaczk requested a review from k-rus January 7, 2026 17:08

k-rus (Member) commented Jan 7, 2026

@pkolaczk can you add to the PR description, which issue is going to be fixed by this PR?

@cassci-bot
❌ Build ds-cassandra-pr-gate/PR-2188 rejected by Butler


3 regressions found
See build details here


Found 3 new test failures

| Test | Explanation | Runs | Upstream |
| --- | --- | --- | --- |
| o.a.c.index.sai.cql.VectorSiftSmallTest.testMultiSegmentBuild[ca false] | REGRESSION | 🔴 | 0 / 19 |
| o.a.c.index.sai.metrics.QueryMetricsTest.testQueryKindMetrics (compression) | REGRESSION | 🔴 | 0 / 19 |
| o.a.c.index.sai.plan.SingleRestrictionEstimatedRowCountTest.testMemtablesSAI (compression) | REGRESSION | 🔴 | 0 / 19 |

No known test failures found

pkolaczk (Author) commented Jan 8, 2026

> @pkolaczk can you add to the PR description, which issue is going to be fixed by this PR?

Linked. https://github.com/riptano/cndb/issues/16363

k-rus (Member) commented Jan 8, 2026

It would be great to update the PR description with the motivation for the work from the issue, and to note that it is blocking other work.
