
Conversation

@michaeljmarshall
Member

What is the issue

Fixes: https://github.com/riptano/cndb/issues/16350

What does this PR fix and why was it fixed

ChronicleMap gives us several lower-level APIs that avoid deserializing keys/values and the allocation that comes with them. Two points are key: first, iteration becomes very expensive as these maps grow, so we want to avoid it where possible; second, the typical map iteration methods deserialize the key and the value eagerly, and since the key is typically a high-dimensional vector, avoiding that deserialization is valuable. This change:

  • Removes unnecessary iteration by leveraging the fact that compaction is additive
  • Replaces forEach with forEachEntry, which gives better semantics: the key and value are only deserialized on demand
  • Updates the maybeAddVector method to avoid serializing the vector key twice by using the queryContext; the ChronicleMap#put method uses this pattern internally (a sketch of both patterns follows this list)
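
For illustration, here is a minimal sketch of the two patterns above against ChronicleMap's public API. The float[] key, the Postings value class, and the collectValues/maybeAdd methods are stand-ins invented for this example; the real code uses the index's own key/value types and marshallers.

    import net.openhft.chronicle.map.ChronicleMap;
    import net.openhft.chronicle.map.ExternalMapQueryContext;
    import net.openhft.chronicle.map.MapEntry;

    import java.util.ArrayList;
    import java.util.List;

    public class ChronicleMapPatternsSketch
    {
        // Placeholder value type standing in for the real postings class.
        static class Postings
        {
            final int rowId;
            Postings(int rowId) { this.rowId = rowId; }
        }

        // Lazy iteration: forEachEntry hands us MapEntry views whose key and value
        // are only deserialized when get() is called, so we can read the value
        // without ever materializing the (large) vector key.
        static List<Postings> collectValues(ChronicleMap<float[], Postings> map)
        {
            List<Postings> values = new ArrayList<>((int) map.longSize());
            map.forEachEntry(entry -> values.add(entry.value().get()));
            return values;
        }

        // "Put if absent" that serializes the key only once: queryContext(key)
        // serializes the key up front, and wrapValueAsData passes the value to the
        // absent-entry insert without serializing the key again. ChronicleMap#put
        // uses this same pattern internally.
        static void maybeAdd(ChronicleMap<float[], Postings> map, float[] vector, int rowId)
        {
            try (ExternalMapQueryContext<float[], Postings, ?> ctx = map.queryContext(vector))
            {
                ctx.updateLock().lock();
                MapEntry<float[], Postings> entry = ctx.entry();
                if (entry == null)
                    ctx.absentEntry().doInsert(ctx.wrapValueAsData(new Postings(rowId)));
                // else: the key is already present; the real code updates the
                // existing postings instead of inserting.
            }
        }
    }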

I added two sets of benchmarks; however, VectorCompactionBench doesn't seem to register the benefit of the ChronicleMap changes, likely because ChronicleMap's cost is small relative to graph construction. I am leaving VectorCompactionBench in place since it is still useful.

…cation

Results from my machine:

     [java] Benchmark                                 (dupeVectorFactor)  Mode  Cnt     Score    Error  Units
     [java] VectorCompactionBench.compactVectorIndex                   0  avgt    5   588.260 ± 50.314  ms/op
     [java] VectorCompactionBench.compactVectorIndex               0.009  avgt    5   577.705 ± 43.717  ms/op
     [java] VectorCompactionBench.compactVectorIndex               0.999  avgt    5   640.866 ± 14.616  ms/op
     [java] VectorCompactionBench.compactVectorIndex                  10  avgt    5  1253.919 ± 27.370  ms/op
     [java] Benchmark                                 (dupeVectorFactor)  Mode  Cnt     Score    Error  Units
     [java] VectorCompactionBench.compactVectorIndex                   0  avgt    5   576.955 ± 21.165  ms/op
     [java] VectorCompactionBench.compactVectorIndex               0.009  avgt    5   562.493 ± 25.208  ms/op
     [java] VectorCompactionBench.compactVectorIndex               0.999  avgt    5   641.528 ± 10.381  ms/op
     [java] VectorCompactionBench.compactVectorIndex                  10  avgt    5  1270.086 ± 15.054  ms/op
This drops the dimension from each key and adds a config that ensures ChronicleMap knows the keys are a fixed size (a sketch of that configuration follows the results below).

     [java] Benchmark                                 (dupeVectorFactor)  Mode  Cnt     Score    Error  Units
     [java] VectorCompactionBench.compactVectorIndex                   0  avgt    5   594.055 ± 88.097  ms/op
     [java] VectorCompactionBench.compactVectorIndex               0.009  avgt    5   564.272 ± 20.759  ms/op
     [java] VectorCompactionBench.compactVectorIndex               0.999  avgt    5   629.680 ± 16.790  ms/op
     [java] VectorCompactionBench.compactVectorIndex                  10  avgt    5  1207.623 ± 17.931  ms/op
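
For reference, a minimal sketch of configuring a fixed key size through the builder. The byte[] key/value types, the sizes, and the entry count are placeholders for this example; the real map uses the project's own types and marshallers, and the dimension comes from the index configuration.

    import net.openhft.chronicle.map.ChronicleMap;

    public class FixedKeySizeMapSketch
    {
        public static void main(String[] args)
        {
            int dimension = 768;                    // assumed vector dimension
            int keyBytes = dimension * Float.BYTES; // every key serializes to the same size

            // constantKeySizeBySample tells ChronicleMap that all keys have the same
            // serialized size, so it does not need to track a per-key length.
            try (ChronicleMap<byte[], byte[]> postings = ChronicleMap
                    .of(byte[].class, byte[].class)
                    .constantKeySizeBySample(new byte[keyBytes])
                    .averageValue(new byte[64])
                    .entries(1_000_000)
                    .create())
            {
                postings.put(new byte[keyBytes], new byte[] { 1, 2, 3 });
            }
        }
    }
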
Benchmark results before the change:

     [java] Benchmark                                                   (dimension)  (numVectors)  Mode  Cnt      Score     Error  Units
     [java] V5VectorPostingsWriterBench.createGenericIdentityMapping            768        100000  avgt    5    271.569 ±   3.473  ms/op
     [java] V5VectorPostingsWriterBench.createGenericIdentityMapping            768       1000000  avgt    5   5452.393 ± 227.905  ms/op
     [java] V5VectorPostingsWriterBench.createGenericIdentityMapping           1536        100000  avgt    5   1392.607 ±  30.388  ms/op
     [java] V5VectorPostingsWriterBench.createGenericIdentityMapping           1536       1000000  avgt    5  11496.696 ± 345.886  ms/op
     [java] V5VectorPostingsWriterBench.describeForCompactionOneToMany          768        100000  avgt    5    242.049 ±  20.708  ms/op
     [java] V5VectorPostingsWriterBench.describeForCompactionOneToMany          768       1000000  avgt    5   2365.691 ±  84.173  ms/op
     [java] V5VectorPostingsWriterBench.describeForCompactionOneToMany         1536        100000  avgt    5    265.395 ±   4.167  ms/op
     [java] V5VectorPostingsWriterBench.describeForCompactionOneToMany         1536       1000000  avgt    5   3641.557 ± 130.649  ms/op

After the change:

     [java] Benchmark                                                   (dimension)  (numVectors)  Mode  Cnt    Score    Error  Units
     [java] V5VectorPostingsWriterBench.createGenericIdentityMapping            768        100000  avgt    5    5.721 ±  1.727  ms/op
     [java] V5VectorPostingsWriterBench.createGenericIdentityMapping            768       1000000  avgt    5  124.536 ± 22.464  ms/op
     [java] V5VectorPostingsWriterBench.createGenericIdentityMapping           1536        100000  avgt    5    5.662 ±  0.610  ms/op
     [java] V5VectorPostingsWriterBench.createGenericIdentityMapping           1536       1000000  avgt    5  122.671 ±  3.343  ms/op
     [java] V5VectorPostingsWriterBench.describeForCompactionOneToMany          768        100000  avgt    5    5.364 ±  1.194  ms/op
     [java] V5VectorPostingsWriterBench.describeForCompactionOneToMany          768       1000000  avgt    5  119.449 ±  4.809  ms/op
     [java] V5VectorPostingsWriterBench.describeForCompactionOneToMany         1536        100000  avgt    5    5.379 ±  0.552  ms/op
     [java] V5VectorPostingsWriterBench.describeForCompactionOneToMany         1536       1000000  avgt    5  121.293 ±  3.040  ms/op
@michaeljmarshall self-assigned this Jan 8, 2026

github-actions bot commented Jan 8, 2026

Checklist before you submit for review

  • This PR adheres to the Definition of Done
  • Make sure there is a PR in the CNDB project updating the Converged Cassandra version
  • Use NoSpamLogger for log lines that may appear frequently in the logs
  • Verify test results on Butler
  • Test coverage for new/modified code is > 80%
  • Proper code formatting
  • Proper title for each commit starting with the project-issue number, like CNDB-1234
  • Each commit has a meaningful description
  • Each commit is not very long and contains related changes
  • Renames, moves and reformatting are in distinct commits
  • All new files should contain the DataStax copyright header instead of the Apache License one


@eolivelli left a comment


LGTM

I have one minor comment about "var", not a big deal

 public boolean isEmpty()
 {
-    return postingsMap.values().stream().allMatch(VectorPostings::isEmpty);
+    return rowsAdded == 0;


Nice trick!

int ordinal = useSyntheticOrdinals ? nextOrdinal++ : segmentRowId;
maxOrdinal = ordinal; // always increasing
var postings = new CompactionVectorPostings(ordinal, segmentRowId);
var data = postingsQueryContext.wrapValueAsData(postings);


I would prefer not to use "var" here, because it is not really clear from the context what the type is; we are using this unusual QueryContext from ChronicleMap.

Member Author


Agreed, and data didn't make it any clearer :)


Side note: the Cassandra community decided to use var only in tests, so maybe it will be easier when porting or rebasing if we stick to that rule in our fork too?

var trainingVectors = new ArrayList<VectorFloat<?>>(postingsMap.size());
var vectorsByOrdinal = new Int2ObjectHashMap<VectorFloat<?>>();
postingsMap.forEachEntry(entry -> {
// TODO can we skip this copy?


Are you willing to fix this in this PR? Otherwise, let's open a ticket and link it here.

Member Author


I started looking into this last night. There are some cases where we modify an existing object (this is the using != null case in the Marshaller). I am pretty sure, but not positive, that those are just cases where the key was still in memory and not yet flushed to disk. I plan to check today.
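
For readers unfamiliar with the pattern being discussed: ChronicleMap's reader-side marshallers receive an optional using instance to refill instead of allocating a new object. A minimal illustrative sketch of that shape (a hypothetical readVector helper, not the project's actual Marshaller):

    import net.openhft.chronicle.bytes.Bytes;

    public class ReuseOnReadSketch
    {
        // Mirrors the shape of a ChronicleMap reader's read(in, using) contract:
        // when a non-null 'using' instance is supplied, refill it in place;
        // otherwise allocate a fresh array.
        static float[] readVector(Bytes<?> in, float[] using, int dimension)
        {
            float[] target = (using != null && using.length == dimension) ? using : new float[dimension];
            for (int i = 0; i < dimension; i++)
                target[i] = in.readFloat();
            return target;
        }
    }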

@cassci-bot

❌ Build ds-cassandra-pr-gate/PR-2189 rejected by Butler


2 regressions found
See build details here


Found 2 new test failures

Test Explanation Runs Upstream
o.a.c.index.sai.QueryContextTest.testWideTableWithStatics[dc] (compression) REGRESSION 🔴🔵 0 / 21
o.a.c.index.sai.cql.VectorCompaction100dTest.testZeroOrOneToManyCompaction[dc false] REGRESSION 🔴 0 / 21

Found 2 known test failures
