CNDB-13952: Handle Chronicle Map entry overflow in vector index compaction #1731
Conversation
Overall looks good, I have left some minor comments.
maxRow = max(maxRow, rowId);
}

assert maxOldOrdinal >= 0;
I think that the previous behavior with "orElseThrow" was to throw if the collection was empty.
This now throws an IllegalStateException if the values were not set.
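To make the difference concrete, the two idioms being compared look roughly like this (an illustrative sketch, not the PR's actual code; the ordinal list is a stand-in for the real postings structure):

import java.util.List;

final class MaxOrdinalIdioms
{
    // Previous behavior: orElseThrow() fails with NoSuchElementException on an empty list.
    static int maxViaStream(List<Integer> ordinals)
    {
        return ordinals.stream().mapToInt(Integer::intValue).max().orElseThrow();
    }

    // New behavior: track the running max and fail with IllegalStateException if it was never set.
    static int maxViaLoop(List<Integer> ordinals)
    {
        int maxOldOrdinal = -1;
        for (int ordinal : ordinals)
            maxOldOrdinal = Math.max(maxOldOrdinal, ordinal);
        if (maxOldOrdinal < 0)
            throw new IllegalStateException("no ordinals were set");
        return maxOldOrdinal;
    }
}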
for (int posting : postings.getPostings())
    writer.writeVInt(posting);
}
catch (Exception e)
(Here and below) catching Exception is usually a code smell, and then rethrowing as an unchecked RuntimeException is also bad. Do we have a better way to rethrow? Should we also handle InterruptedException?
I updated it to handle only IOException. We do not expect to hit that condition though. We have this exception because I used the Lucene DataOutput class to handle the vint serde.
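For readers following along, the adapter in play here looks roughly like the following — a minimal sketch assuming Lucene's DataOutput and Chronicle's Bytes APIs; BytesDataOutput is a hypothetical name, not necessarily the class the PR adds:

import java.io.IOException;

import net.openhft.chronicle.bytes.Bytes;
import org.apache.lucene.store.DataOutput;

class BytesDataOutput extends DataOutput
{
    private final Bytes<?> out;

    BytesDataOutput(Bytes<?> out)
    {
        this.out = out;
    }

    // The throws clause comes from Lucene's abstract DataOutput; an in-memory
    // Bytes write cannot actually fail with an I/O error.
    @Override
    public void writeByte(byte b) throws IOException
    {
        out.writeByte(b);
    }

    @Override
    public void writeBytes(byte[] b, int offset, int length) throws IOException
    {
        out.write(b, offset, length);
    }
}

With something like this in place, writeVInt(posting) is inherited from DataOutput, which is why IOException shows up in the calling code even though it is never expected in practice.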
}

@Override
public void writeByte(byte b) throws IOException
This cannot throw IOException; maybe we can remove the "throws" clause and simplify the code above?
You are right that we can (and should) remove this, but it doesn't fix the above code because the class has an exception in the readVInt() method signature.
@eolivelli - this is ready for another review, please take a look.
LGTM, but I have a couple of minor suggestions.
// means that we might have a smaller vector graph than desired, but at least we will not
// fail to build the index.
value.setShouldCompress(true);
map.put(key, value);
Is it possible it still fails after compression? If so, then what? Maybe we should still provide at least some diagnostic message?
Given my testing and our usage, it's unlikely to fail a second time. We add postings to this map one row at a time on a single thread, so when we cross the threshold for max postings in an array, we do so by 4 bytes. I'll update the debug log line above since that'll likely be sufficient. The other option is to see if there is a better data structure for us. I heard there was a Lucene option that might handle this more gracefully without special encoding.
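The fallback being discussed amounts to something like the following sketch (hypothetical helper and names; in particular, the exact exception type Chronicle Map throws for an oversized entry is an assumption here):

import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

final class OverflowRetryExample
{
    private static final Logger logger = LoggerFactory.getLogger(OverflowRetryExample.class);

    // Minimal stand-in for the real CompactionVectorPostings type.
    interface CompressiblePostings
    {
        void setShouldCompress(boolean shouldCompress);
    }

    static <K, V extends CompressiblePostings> void putWithFallback(Map<K, V> map, K key, V value)
    {
        try
        {
            map.put(key, value);
        }
        catch (IllegalStateException | IllegalArgumentException e)
        {
            // Assumed exception types: Chronicle Map rejects an entry that exceeds its
            // configured maximum size with a runtime exception.
            logger.debug("Postings entry exceeded the Chronicle Map entry size; retrying with varint-compressed postings", e);
            // The retry means we might end up with a smaller vector graph than desired,
            // but at least we will not fail to build the index.
            value.setShouldCompress(true);
            map.put(key, value);
        }
    }
}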
for (int posting : postings.getPostings())
    writer.writeVInt(posting);
Are those postings sorted? Just an idea: if they are sorted, maybe better to use delta-encoding (and if they are not sorted, maybe we could sort them?). Deltas would usually be smaller, hence vints would be smaller as well. Especially if there are a lot of duplicates, you'd get many zeroes, which compress down to 1 byte. Looks like you don't need random access to the middle of a posting list on disk, but you deserialize it all at once.
They are sorted, yes. We could consider delta encoding too. There are not expected to be duplicates though because a row has at most one vector when constructing a graph during compaction. And you're correct that we consume the list entirely.
My main reason for not going further in the compression is that we flush immediately after hitting this code block, and we only need to find 4 bytes of savings to prevent a subsequent failure on put.
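For reference, the delta-plus-vint idea floated above would look roughly like this (a sketch, not code from the PR, assuming Lucene's DataOutput is used for the vint serde):

import java.io.IOException;

import org.apache.lucene.store.DataOutput;

final class DeltaVIntExample
{
    static void writeDeltaEncoded(DataOutput writer, int[] sortedPostings) throws IOException
    {
        int previous = 0;
        for (int posting : sortedPostings)
        {
            // Sorted input guarantees non-negative deltas; small deltas (or zeroes for
            // duplicates) encode to a single byte in the vint scheme.
            writer.writeVInt(posting - previous);
            previous = posting;
        }
    }
}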
@@ -203,8 +213,25 @@ static class Marshaller implements BytesReader<CompactionVectorPostings>, BytesW
public void write(Bytes out, CompactionVectorPostings postings) {
    out.writeInt(postings.ordinal);
    out.writeInt(postings.size());
    for (Integer posting : postings.getPostings()) {
        out.writeInt(posting);
    out.writeBoolean(postings.shouldCompress);
One more thing: doesn't it break backwards compatibility? The format is different now. Shouldn't we bump up the version?
This file is a temporary file that does not survive to the next instantiation of the JVM on the same node. See cassandra/src/java/org/apache/cassandra/index/sai/disk/vector/CompactionGraph.java, lines 178 to 187 in ed33431:
// the extension here is important to signal to CFS.scrubDataDirectories that it should be removed if present at restart
Component tmpComponent = new Component(Component.Type.CUSTOM, "chronicle" + Descriptor.TMP_EXT);
postingsFile = dd.fileFor(tmpComponent);
postingsMap = ChronicleMapBuilder.of((Class<VectorFloat<?>>) (Class) VectorFloat.class, (Class<CompactionVectorPostings>) (Class) CompactionVectorPostings.class)
                                 .averageKeySize(dimension * Float.BYTES)
                                 .averageValueSize(VectorPostings.emptyBytesUsed() + RamUsageEstimator.NUM_BYTES_OBJECT_REF + 2 * Integer.BYTES)
                                 .keyMarshaller(new VectorFloatMarshaller())
                                 .valueMarshaller(new VectorPostings.Marshaller())
                                 .entries(postingsEntriesAllocated)
                                 .createPersistedTo(postingsFile.toJavaIOFile());
We use the Descriptor.TMP_EXT file extension to ensure that the file is removed if present at restart.
✔️ Build ds-cassandra-pr-gate/PR-1731 approved by Butler
What is the issue
Fixes: https://github.com/riptano/cndb/issues/13952
What does this PR fix and why was it fixed
We were hitting the Chronicle Map entry size limit in some cases (where there was an excessive number of duplicate vectors). This code handles that exception by attempting to reduce the size required to store those duplicates, writing their postings as varints instead of plain integers.
Note that most cases have only a handful of duplicated vectors per graph, so we do not optimize for the large-number-of-duplicates case. Further, Chronicle Map allocates a minimum chunk per entry, and we are often under that size, so there is no benefit to writing the ints as varints in the common case.
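Pieced together from the snippets quoted in the review, the serialization shape is roughly the following sketch (not the PR's exact Marshaller; writeVInt here is a hand-rolled helper matching Lucene's variable-length int layout, and the branch structure is an assumption):

import java.util.List;

import net.openhft.chronicle.bytes.Bytes;

final class PostingsWriteSketch
{
    static void write(Bytes<?> out, int ordinal, List<Integer> postings, boolean shouldCompress)
    {
        out.writeInt(ordinal);
        out.writeInt(postings.size());
        out.writeBoolean(shouldCompress);
        for (int posting : postings)
        {
            if (shouldCompress)
                writeVInt(out, posting); // variable length, 1-5 bytes per value
            else
                out.writeInt(posting);   // fixed 4 bytes per value
        }
    }

    // 7 bits per byte, high bit set on all but the last byte (Lucene's vint layout).
    static void writeVInt(Bytes<?> out, int value)
    {
        while ((value & ~0x7F) != 0)
        {
            out.writeByte((byte) ((value & 0x7F) | 0x80));
            value >>>= 7;
        }
        out.writeByte((byte) value);
    }
}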