CNDB-16363: Improve matched rows estimation accuracy for memory indexes #2188

pkolaczk · 2026-01-07T17:07:36Z

When a memory index contains very few rows and is split into
many shards, we can expect a lot of variance in the number of rows
between the shards. Hence, if we took only one
shard to estimate the number of matched rows and
extrapolated that to all shards to compute the estimated matching rows
for the whole index, we would risk making a huge estimation error.

This commit changes the algorithm to take as many shards as needed
to collect enough returned or indexed rows. For very
tiny datasets, it's likely to use all shards for estimation.
For big datasets, one shard will likely be enough, speeding up
estimation.

This change also allows us to remove one estimation method.
We no longer need to manually choose between the estimation
from the first shard and from all shards.
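
To illustrate the idea, here is a minimal sketch of adaptive shard sampling. It is not the code from this PR: `ShardEstimationSketch`, `ShardStats`, `estimateMatchedRows` and both threshold parameters are hypothetical names, and scaling the sampled count by the ratio of total shards to sampled shards is an assumed extrapolation rule.

```java
import java.util.List;

final class ShardEstimationSketch {
    /** Hypothetical per-shard statistics; not the real SAI memory index classes. */
    static final class ShardStats {
        final long indexedRows;
        final long matchedRows;

        ShardStats(long indexedRows, long matchedRows) {
            this.indexedRows = indexedRows;
            this.matchedRows = matchedRows;
        }
    }

    /**
     * Samples shards in order until enough indexed or matched rows were seen,
     * then extrapolates the matched count to all shards (assumed scaling rule).
     */
    static long estimateMatchedRows(List<ShardStats> shards,
                                    long minIndexedRows,
                                    long minMatchedRows) {
        long indexed = 0;
        long matched = 0;
        int sampled = 0;
        for (ShardStats shard : shards) {
            indexed += shard.indexedRows;
            matched += shard.matchedRows;
            sampled++;
            // Stop as soon as the sample is large enough to be trusted.
            if (indexed >= minIndexedRows || matched >= minMatchedRows)
                break;
        }
        if (sampled == 0)
            return 0;
        return Math.round((double) matched * shards.size() / sampled);
    }

    public static void main(String[] args) {
        // Tiny dataset: 4 rows spread unevenly over 4 shards, 3 of them match.
        List<ShardStats> tiny = List.of(
            new ShardStats(0, 0), new ShardStats(3, 2),
            new ShardStats(1, 1), new ShardStats(0, 0));

        // Thresholds of 0 stop after the first shard, mimicking single-shard estimation.
        System.out.println(estimateMatchedRows(tiny, 0, 0));      // prints 0
        // Realistic thresholds are never reached, so all shards are used.
        System.out.println(estimateMatchedRows(tiny, 1000, 100)); // prints 3

        // Big dataset: the first shard alone satisfies the thresholds.
        List<ShardStats> big = List.of(
            new ShardStats(10000, 500), new ShardStats(10000, 500),
            new ShardStats(10000, 500), new ShardStats(10000, 500));
        System.out.println(estimateMatchedRows(big, 1000, 100));  // prints 2000
    }
}
```

In the tiny case, extrapolating from the first shard alone yields 0 although 3 rows match, while sampling until the thresholds are met visits every shard and returns the exact count. In the big case the loop stops after one shard, so the estimate stays cheap.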


github-actions · 2026-01-07T17:07:50Z

k-rus · 2026-01-07T17:33:44Z

@pkolaczk can you add to the PR description, which issue is going to be fixed by this PR?

cassci-bot · 2026-01-07T18:29:15Z

❌ Build ds-cassandra-pr-gate/PR-2188 rejected by Butler

3 regressions found
See build details here

Found 3 new test failures

Test	Explanation	Runs	Upstream
o.a.c.index.sai.cql.VectorSiftSmallTest.testMultiSegmentBuild[ca false]	REGRESSION	🔴	0 / 19
o.a.c.index.sai.metrics.QueryMetricsTest.testQueryKindMetrics (compression)	REGRESSION	🔴	0 / 19
o.a.c.index.sai.plan.SingleRestrictionEstimatedRowCountTest.testMemtablesSAI (compression)	REGRESSION	🔴	0 / 19

No known test failures found

pkolaczk · 2026-01-08T08:16:06Z

> @pkolaczk can you add to the PR description, which issue is going to be fixed by this PR?

Linked. https://github.com/riptano/cndb/issues/16363

k-rus · 2026-01-08T10:04:52Z

It would be great to update the PR description with the motivation for the work from the issue, and to mention that it is blocking other work.

pkolaczk requested a review from k-rus January 7, 2026 17:08
